Marius Hobbhahn

I'm the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/ 
I mostly work on evals, but I am also interested in interpretability. My goal is to improve our understanding of scheming and build tools and methods to detect it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research. 

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker's Rules.


Comments


That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but rather "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take," which is a significantly lower bar than 100% chance of catastrophe. 

I think this is a very important question, and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways: from theoretical considerations, e.g. a refined version of the two-hop curse, and from extensive black-box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the model's internal reasoning aligns with the revealed reasoning). 
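The black-box comparison above could be sketched roughly as follows. This is only an illustration of the experimental logic; `query_model` and the prompt templates are hypothetical placeholders, not a real API.

```python
# Sketch of a black-box CoT-dependence check: compare eval accuracy with no
# CoT, with normal CoT, and with a perturbed CoT whose logic is altered.
# `query_model(prompt)` is a hypothetical stand-in for a model call.

def accuracy(answers, gold):
    """Fraction of answers that match the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def run_eval(questions, gold, query_model, mode):
    """Run one eval condition and return accuracy."""
    answers = []
    for q in questions:
        if mode == "no_cot":
            prompt = f"{q}\nAnswer directly with only the final answer."
        elif mode == "cot":
            prompt = f"{q}\nThink step by step, then give the final answer."
        elif mode == "perturbed_cot":
            # Supply a rationale whose logic has been deliberately altered.
            # If accuracy is unchanged relative to normal CoT, the revealed
            # reasoning is probably not causally load-bearing.
            prompt = f"{q}\nUse the provided (flawed) reasoning, then answer."
        else:
            raise ValueError(f"unknown mode: {mode}")
        answers.append(query_model(prompt))
    return accuracy(answers, gold)
```

A large gap between the "no_cot" and "cot" conditions, combined with insensitivity to "perturbed_cot", would be one (weak) signal of hidden reasoning.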

These are all pretty basic thoughts, and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" agenda. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes. 

Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/

In general, I wouldn't make it too complicated and would accept some arbitrariness. There is a predetermined panel of, e.g., 5 experts and, e.g., 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes each). Maybe there is some "saving" mechanism so that an overeager expert can read plans that weren't assigned to them. Everything in the top N% then gets scored by all experts in a more detailed review. Then a final ranking is produced.
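The two-stage review could be sketched like this. The panel size, categories, and top-N cutoff are placeholders from the paragraph above, not a fixed design, and `score_fn` stands in for an expert's judgment.

```python
# Illustrative sketch of the two-stage contest review: a quick 2-expert pass,
# then a full-panel review of the top fraction. All parameters are examples.
import random

EXPERTS = ["A", "B", "C", "D", "E"]                     # e.g. 5 experts
CATEGORIES = ["feasibility", "effectiveness", "other"]  # e.g. 3 categories
TOP_FRACTION = 0.2                                      # top N% advance

def shallow_stage(submissions, score_fn):
    """Each submission gets a shallow score from 2 randomly assigned experts."""
    results = {}
    for sub in submissions:
        reviewers = random.sample(EXPERTS, 2)
        scores = [score_fn(sub, r) for r in reviewers]
        results[sub] = sum(scores) / len(scores)
    return results

def deep_stage(shallow_scores, score_fn):
    """The top fraction gets a detailed score from all experts, then a ranking."""
    ranked = sorted(shallow_scores, key=shallow_scores.get, reverse=True)
    cutoff = max(1, int(len(ranked) * TOP_FRACTION))
    finalists = ranked[:cutoff]
    final = {s: sum(score_fn(s, r) for r in EXPERTS) / len(EXPERTS)
             for s in finalists}
    return sorted(final, key=final.get, reverse=True)
```

A "saving" mechanism would just let an expert manually add a submission to the finalist set before the deep stage.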

I'd hope that the time spent per expert is only 5-10 hours in total. I'd be fine with missing a bunch of posts that contain good ideas that are badly communicated or otherwise easy to miss on the shallow review. 

My main goal with the contest would be that writing a good plan and communicating it clearly is incentivized.

I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever brings out good insights into what to prioritize in governance efforts.

Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging-without-CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based findings would have watered down the main purpose, IMO. 

We will potentially further look into these findings in 2025. 

(thx to Bronson for privately pointing this out)

I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan's argument on the AXRP podcast.

Also, I think you're right, and my statement of "I think for most practical considerations, it makes almost zero difference." was too strong. 

We write about this in the limitations section (quote below). My view in brief:

1. Even if they are just roleplaying, they cause harm. That seems bad.
2. If they are roleplaying, they will try to be consistent once they are caught in that persona in one rollout. That also seems bad.
3. I think the "roleplaying" argument is very overrated. It feels to me as if existing models change behavior throughout their rollouts, and I would expect that stuff like outcome-based RL will make them move away from "roleplay" more consistently. 
4. I also think it's philosophically unclear what the difference is between a model roleplaying a schemer and "being" a schemer. I think for most practical considerations, it makes almost zero difference. It feels about as shallow as when someone says the model doesn't "truly understand" the poem that it's writing. 

Quote: 

Uncertainty about source of scheming behavior A possible objection to our results is that the models merely “roleplay as evil AIs” rather than “truly” employing scheming to achieve their goals. For example, our setups might most closely align with stories about scheming in the training data. While we cannot show the root-cause of the observed scheming behaviors, we think they are concerning for either explanation. We find that multiple models consistently engage in in-context scheming across a range of scenarios, settings, variations and phrasings. Even if the models were merely “roleplaying as evil AIs”, they could still cause real harm when they are deployed. Our working hypothesis is that scheming can be an effective strategy to achieve goals and when models are trained to achieve ambitious goals or solve complex tasks, they will learn how to scheme. However, we do not provide evidence for or against this hypothesis in this paper.

Thanks. I agree that this is a weak part of the post. 

After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways. 
4. "what failure looks like" type loss of control. 

Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things, so you're back to the story above. (The story also just started with someone instructing the model to "cure cancer.")
2. Instruction-following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality, and these can function like goals. Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e., it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I'm not really sure what to make of this. I guess it's good that it cares about being harmless and honest, but it's also a little bit scary that it cares so much. 

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits; we just have to hammer the virtues of being non-scheming and non-power-seeking into the model at all points of the training procedure. We then have to add additional lines of defense like control, interpretability, scalable oversight, etc., and think hard about how we minimize correlated failures. But I feel like right now, we don't really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem. 

Good point. That's another crux for which RL seems relevant. 

From the perspective of 10 years ago, specifying any goal for the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and the model probably even has a decent understanding of the side constraints by default. So goal specification seems like a much, much smaller problem now. 

So the story where we misspecify the goal, and the model realizes that the given goal differs from the intended goal and decides to scheme, is also less likely. 

Instead, there has to be a component where the AI's goals substantially change over time from something that we actually intended to something misaligned. Again, outcome-based RL and instrumental convergence yield a plausible answer. 
