AI ALIGNMENT FORUM
AF

AI

9

Lessons from Studying Two-Hop Latent Reasoning

by Mikita Balesni, Tomek Korbak, Owain_Evans
11th Sep 2025
2 min read
0

9

This is a linkpost for https://arxiv.org/abs/2411.16353
AI
New Comment
Moderation Log
More from Mikita Balesni
View more
Curated and popular this week
0Comments

Twitter | ArXiv

Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can ``think out loud'' using chain-of-thought (CoT) enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?

In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture:

  1. Finding 1: Models completely fail to compose synthetic facts they learned through fine-tuning without explicit chain-of-thought reasoning, achieving only chance-level accuracy despite perfect recall of the individual facts.
  2. Finding 2: Interventions to (i) force a correct fact storage order across transformer layers and (ii) encourage the first reasoning hop both fail to enable models to compose newly learned facts without chain-of-thought.
  3. Finding 3: Models successfully compose newly learned synthetic facts without chain-of-thought when those facts co-occur in the same fine-tuning document or in the same test-time prompt.
  4. Finding 4: LLMs are capable of composing two separately learned facts, as long as one of the facts is naturally acquired during pretraining (the second fact can be synthetic and acquired through fine-tuning).

These findings offer a humbling lesson: researchers observing models answering two-hop questions about real-world facts might incorrectly infer robust latent reasoning capabilities, not realizing that performance relies on facts appearing together in pretraining data. Conversely, our synthetic fine-tuning results alone might suggest models fundamentally cannot perform latent composition—a kind of limitation that could make oversight via reasoning traces highly reliable. Yet this too would be misleading given models' success in other settings. This sensitivity to experimental conditions serves as a cautionary tale for the broader project of understanding LLM reasoning: apparent limitations may be artifacts of experimental setups far removed from how frontier models are trained, while apparent successes may rely on shortcuts invisible to experimenters.

Moving forward, we believe that researchers hoping to gain confidence in LLM agent oversight based on CoT monitoring should move beyond multihop question answering as a toy model of reasoning. What ultimately matters for monitoring capable LLM agents is whether oversight mechanisms can catch misbehavior in practice. Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model. Such evaluations could reveal whether models are capable of latent reasoning that would allow carrying out harmful tasks without externalizing harmful intentions in a readable way.