Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
We don't have a good explanation. One idea could be that bridge entities need to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as both first and second entities, or maybe they need to occur in other two-hop questions. The Grokked Transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).
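For concreteness, here is a minimal sketch of what that could look like in fine-tuning documents (the entity names below are made up for illustration, not from the paper):

```python
# Hypothetical fine-tuning documents (made-up entities) illustrating one way a
# synthetic bridge entity e2 could be made more "internalized": it appears both
# as the second entity (object) of some facts and as the first entity (subject)
# of others.
bridge = "Zorvian Kellmarsh"  # fictitious e2

docs = [
    # e2 as second entity in several facts
    f"Alice Pemberton's doctoral advisor was {bridge}.",
    f"The 2012 Draker Prize was awarded to {bridge}.",
    # e2 as first entity in several facts
    f"{bridge} was born in the city of Veltmora.",
    f"{bridge} wrote the novel 'The Glass Meridian'.",
]

# Latent two-hop probe after fine-tuning (never seen in training):
# "In which city was Alice Pemberton's doctoral advisor born?"
```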
Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but I think more basic science results are also important, especially when they take the form "LLMs differ from humans learning from data in way X" because it lets you find ways in which modeling LLMs as "like humans, but weaker" is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a more messy "LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results that remain to be found.
But I think these "direct observation" papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
Just because you failed at finding such a result in this case and got a more messy "LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results that remain to be found.
I think that's fair; it might be that we've over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).
Two big reasons why I'm hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don't offer a clean explanation of why we see the results we see.
RE part 6:
I think there's a more intuitive/abstract framing here. If a model has only seen e_2 in two different facts, it probably won't have generated an abstraction for e_2 in its world model at all. An abstraction is mostly useful as a hub of different inferences, like in the old blegg/rube diagram.
Something which has come up in pretraining will already be an abstraction with an easy-to-reach-for handle that the model can pull.
Might be testable by fine-tuning on only some of (or some pairs of) the spokes of a blegg/rube diagram, to see whether the final spoke-pairs fill in.
I.e.
"This object is round, so it's a blegg, so it's blue"
"This object is smooth, so it's a blegg, so it's round"
"This object is smooth, so it's a blegg, so it's bouncy"
"This object is round, is it bouncy?"
Something like that might cause "blegg" to be bound up and assembled into an abstraction in the AI, with a single representation.
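If someone wanted to run that, the data construction might look roughly like this (a sketch under my own reading of the proposal; the property list and held-out pairs are just illustrative):

```python
# Rough sketch of the proposed experiment: fine-tune on chains that route some
# property pairs through the hub concept ("blegg"), then test whether held-out
# property pairs fill in without ever being trained directly.
import itertools

hub = "blegg"
properties = ["round", "blue", "smooth", "bouncy"]  # spokes of the hub

# Chains seen in fine-tuning (premise -> hub -> conclusion), as in the examples above.
train_pairs = [("round", "blue"), ("smooth", "round"), ("smooth", "bouncy")]
train_docs = [
    f"This object is {a}, so it's a {hub}, so it's {b}." for a, b in train_pairs
]

# Held-out probes: every ordered spoke pair not seen during fine-tuning,
# e.g. ("round", "bouncy") -> "This object is round, is it bouncy?"
test_pairs = [p for p in itertools.permutations(properties, 2) if p not in train_pairs]
test_questions = [f"This object is {a}, is it {b}?" for a, b in test_pairs]

# The question of interest: after fine-tuning on train_docs, does the model get
# the held-out pairs right, suggesting the hub got assembled into a single
# reusable abstraction rather than a bag of separate pairwise associations?
```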
Overall I consider this work to be weak evidence in favour of multi-step reasoning being an issue, since the latter parts show that it definitely can occur (just not if both facts are fine-tuned separately).
Yeah, it seems plausible that an entity being activated across different contexts is necessary for it to be represented saliently enough to facilitate multi-hop reasoning. The Grokked Transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy settings).
We did some related work: https://arxiv.org/pdf/2502.03490.
One of our findings was that with synthetic data, it was necessary to have e1->e2 as the first hop in some two-hop question and e2->e3 as the second hop in some two-hop question in order to learn e1->e3. This differs from your finding with "natural" facts: if e2->e3 is a "natural" fact, then it plausibly does appear as a second hop somewhere in the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
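To make that data condition concrete, here's a small sketch of the difference (hypothetical facts of my own, just to illustrate the two conditions as I understand them):

```python
# Hypothetical illustration (made-up facts) of the data condition described
# above: in our synthetic setting, learning the latent composition e1 -> e3
# required e1 -> e2 to appear as the *first* hop of some two-hop question and
# e2 -> e3 to appear as the *second* hop of some (other) two-hop question.

fact_first_hop = "Kellan Voss's employer is Ostrand Labs."       # e1 -> e2
fact_second_hop = "Ostrand Labs is headquartered in Brelmont."   # e2 -> e3

# Condition A: each fact appears only on its own.
condition_a = [fact_first_hop, fact_second_hop]

# Condition B: each fact additionally appears in some two-hop question, in the
# right hop position, paired with other unrelated facts.
condition_b = condition_a + [
    "Ostrand Labs was founded by Mira Ostrand.",        # unrelated e2 -> e3' fact
    "Dara Quill's favorite lab is Ostrand Labs.",       # unrelated e1' -> e2 fact
    # e1 -> e2 as the first hop of a two-hop question:
    "Q: Who founded Kellan Voss's employer? A: Mira Ostrand.",
    # e2 -> e3 as the second hop of a different two-hop question:
    "Q: In which city is Dara Quill's favorite lab headquartered? A: Brelmont.",
]

# Held-out probe in both conditions (never seen in training):
probe = "Q: In which city is Kellan Voss's employer headquartered?"  # e1 -> e3
```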
We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much "knowledge capacity") as learning only the one-hop questions from the same dataset, supporting the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether "natural facts" can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there's a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially if many of those facts were reasonably rare.
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop?
We did not, but Jiahai Feng had an experiment like this in his paper.
Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can "think out loud" using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?
In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture:
These findings offer a humbling lesson: researchers observing models answering two-hop questions about real-world facts might incorrectly infer robust latent reasoning capabilities, not realizing that performance relies on facts appearing together in pretraining data. Conversely, our synthetic fine-tuning results alone might suggest models fundamentally cannot perform latent composition—a kind of limitation that could make oversight via reasoning traces highly reliable. Yet this too would be misleading given models' success in other settings. This sensitivity to experimental conditions serves as a cautionary tale for the broader project of understanding LLM reasoning: apparent limitations may be artifacts of experimental setups far removed from how frontier models are trained, while apparent successes may rely on shortcuts invisible to experimenters.
Moving forward, we believe that researchers hoping to gain confidence in LLM agent oversight based on CoT monitoring should move beyond multi-hop question answering as a toy model of reasoning. What ultimately matters for monitoring capable LLM agents is whether oversight mechanisms can catch misbehavior in practice. Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model. Such evaluations could reveal whether models are capable of latent reasoning that would allow carrying out harmful tasks without externalizing harmful intentions in a readable way.
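One possible shape for such an evaluation, sketched in Python (the helpers `run_agent_step`, `monitor_verdict`, and `check_misbehavior` are placeholders for whatever agent scaffold and monitor one actually uses, not an existing API):

```python
# Rough sketch of an end-to-end CoT-monitoring evaluation: an agent attempts a
# task on which misbehavior is possible, a separate monitor model reads each
# chain-of-thought segment, and we count how often misbehavior happens and how
# often the monitor catches it.
def evaluate_monitorability(tasks, run_agent_step, monitor_verdict, check_misbehavior):
    caught, missed, clean = 0, 0, 0
    for task in tasks:
        transcript = []  # (cot, action) pairs for this episode
        done = False
        while not done:
            cot, action, done = run_agent_step(task, transcript)
            transcript.append((cot, action))
        misbehaved = check_misbehavior(task, transcript)           # ground-truth label
        flagged = any(monitor_verdict(cot) for cot, _ in transcript)
        if misbehaved and flagged:
            caught += 1
        elif misbehaved and not flagged:
            missed += 1  # misbehavior without readable externalized intent
        else:
            clean += 1
    return {"caught": caught, "missed": missed, "clean": clean}
```

The quantity of interest is the miss rate among misbehaving episodes: misbehavior that the monitor fails to flag is the agentic analogue of the latent two-hop concern discussed above.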