All of Chris_Leong's Comments + Replies

That would make the domain of checkable tasks rather small.

That said, it may not matter depending on the capability you want to measure.

If you want to make the AI hack a computer to turn the entire screen green, and it skips a pixel so as to avoid completing the task, it would still have demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.

On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.

Points for creativity, though I'm still somewhat skeptical about the viability of this strategy.

I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?

1Joar Skalse
No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn't possible, then I think this research agenda will produce a precise understanding of why that isn't possible (which would in turn help to inform subsequent research). I don't think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.

First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)

 

I wouldn't quite go so far as to say it "tackles" the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?

1Joar Skalse
I'm not sure -- what significance are you placing on the word "tackle" in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function --- the main questions for this area of research may even be mostly orthogonal to that question.

Acausal positive interpretation

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper, co-authored by Yoshua Bengio, arguing for the importance of artificial wisdom.

I also have a shortform arguing for training wise AI advisors and an outline, Some Preliminary Notes on the Promise of a Wisdom Explosion.

Nice article, I especially love the diagrams!

In Human Researcher Obsolescence you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.

I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI would play in Magma's strategy, as I believe this could be a key aspect of making the plan work.

In particular:

• We likely want to be developing wise AI advisors to advise us during the pre-hand-off... (read more)

1Josh Clymer
• It's possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components of research. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.

Agree, I expect the handoff to AI agents to be somewhat incremental (AI is like an intern, a new engineer, a research manager, and eventually, a CRO).

I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it; theory needs empirics in order to remain grounded.

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

Looking at behaviour is conceptually straightforward, and valuable, and being done

I agree with Apollo Research that evals aren't really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, as could building up experience or auditing models using different schemes and comparing the results.

Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.

3DanielFilan
When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:
* putting a bunch of agents in a simulated world and seeing how they interact
* weak-to-strong / easy-to-hard generalization

Basically, stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities.

"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge. 

The balance of industries serving humans vs. AIs is a suspiciously high level of abstraction.

Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!

AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other

This confuses me.

I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. I guess this isn't two full motivational circuits, but it is something like two motivational circuits.

I would suggest having 50% of researchers work on a broader definition of control: "control" proper, technical governance work, and technical outreach (scary demos, model organisms of misalignment).

I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models where order is important.
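
For concreteness, here is a minimal sketch of the exact Shapley computation (the token "players" and toy value function are hypothetical): each player's marginal contribution is averaged over every ordering of the coalition, which is precisely the order-symmetry assumption being questioned.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings of the full coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Toy value function over prompt components: order is ignored by
# construction, even though a language model is order-sensitive.
players = ("system_prompt", "few_shot_example", "user_question")

def value(coalition):
    return 1.0 if {"system_prompt", "user_question"} <= coalition else 0.0

print(shapley_values(players, value))
# system_prompt and user_question each get 0.5; few_shot_example gets 0.0
```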

Great article.

One point of disagreement: I suspect that the difficulty of the required high-impact tasks likely relates more to what someone thinks about the offense-defense balance than the alignment difficulty per se.

2Samuel Dylan Martin
Good point. You're right to highlight the importance of the offense-defense balance in determining the difficulty of high-impact tasks, rather than alignment difficulty alone. This is a crucial point that I'm planning to expand on in the next post in this sequence. Many things determine the overall difficulty of HITs:
1. The "intrinsic" offense-defense balance in related fields (like biotechnology, weapons technologies and cybersecurity), and especially whether there are irresolvably offense-dominant technologies that transformative AI can develop and which can't be countered.
2. Overall alignment difficulty, affecting whether we should expect to see a large number of strategic, power-seeking unaligned systems or just systems engaging in more mundane reward hacking and sycophancy.
3. Technology diffusion rates, especially for anything offense-dominant, e.g. should we expect frontier models to leak or be deliberately open-sourced.
4. Geopolitical factors, e.g. are there adversary countries or large numbers of other well-resourced rogue actors to worry about, not just accidents, leaks and random individuals.
5. The development strategy (e.g. whether the AI technologies are being proactively developed by a government, in public-private partnership, or by companies who can't or won't use them protectively).

My rough suspicion is that all of these factors matter quite a bit, but since we're looking at "the alignment problem" in this post I'm pretending that everything else is held fixed. The intrinsic offense-defense balance of whatever is next on the 'tech tree', as you noted, is maybe the most important overall, as it affects the feasibility of defensive measures and could push towards more aggressive strategies in cases of strong offense advantage. It's also extremely difficult to predict ahead of time.

Interesting work.

This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.

Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.

2Samuel Dylan Martin
Yes, I do think constitution design is neglected! It's possible people think constitution changes made now won't stick around, or that they won't make any difference in the long term, but based on the arguments here I think that, even if the effect is a bit diffuse, you can influence AI behavior on important structural risks by changing constitutions. It's simple, cheap, and maybe quite effective, especially for failure modes that we don't have any good shovel-ready technical interventions for.

I would love to hear what shard theorists make of this.

We could describe this AI as having learned a meta-shard: pace around at the start so that you have time to plan.

But at the point where we've allowed meta-shards, maybe we've already undermined the main claims of shard theory?

1Adrià Garriga-Alonso
Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

You mean it might still Goodhart to what we think they might say? Ideally, the actual people would be involved in the process.

My intuition is that the best way to build wise AI would be to train imitation learning agents on people who we consider to be wise. If we trained imitations of people with a variety of perspectives, we could then simulate discussions between them and try to figure out the best discussion formats between such agents. This could likely get us reasonably far.

The reason I suggest imitation learning is that it would give us something we could treat as an optimisation target, which is what we require for training ML systems.
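
As a minimal sketch of what that optimisation target could look like (assuming a standard causal language model; `model` and the batch format are hypothetical), the objective is just behavioural cloning, i.e. next-token cross-entropy on transcripts from the people we designate as wise:

```python
import torch.nn.functional as F

def imitation_loss(model, input_ids):
    """Behavioural cloning on advisor transcripts: predict each next
    token. The transcripts themselves are the optimisation target;
    nothing here scores 'wisdom' directly."""
    logits = model(input_ids[:, :-1])   # (batch, seq-1, vocab)
    targets = input_ids[:, 1:]          # shifted by one position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```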

1Gordon Seidoh Worley
Seems reasonable. I do still worry quite a bit about Goodharting, but perhaps this could be reasonably addressed with careful oversight by some wise humans to do the wisdom equivalent of red teaming.

How are applications processed? Sometimes applications are processed on a rolling basis, so it's important to submit as soon as possible. Other times, you just need to apply by the deadline, so if you're about to post something big, it makes sense to hold off on your application.

This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models because of required optimizations, or better because fewer concepts would be in superposition. Work on this feels quite important, even though there's a lot more work to be done.

Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.

Update: I skipped the TLDR when ... (read more)

Note that scasper said:

Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights,

I (like scasper) think this work is useful, but I share some of scasper's concerns.

In particular:

  • I think prior work like this from the Anthropic interp team has been systematically overrated by others, and the Anthropic interp team could take actions to avoid this.
    • IMO, buzz on Twitter systematically overrates the results of this paper and their importance.
  • I'm uncertain, but I think I might prefer less excitement for this style of
... (read more)

That's an excellent point.

I agree. I think that's probably a better way of clarifying the confusion than what I wrote.

“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?

Another frame that might be useful:

There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.

Simplicity is about the latter, not the former.

The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
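
A toy illustration of the counting point (the two-gate "programs" here are hypothetical, chosen just to make the program/function distinction concrete):

```python
from itertools import product

# Toy "programs": two-gate boolean formulas of the form g2(g1(x, y), y).
# Several distinct programs implement the same mathematical function, so
# a simplicity prior over programs weights each function by how many
# programs implement it.
GATES = {
    "and": lambda a, b: a and b,
    "or":  lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}

functions = {}
for g1, g2 in product(GATES, repeat=2):
    truth_table = tuple(
        GATES[g2](GATES[g1](x, y), y)
        for x, y in product([False, True], repeat=2)
    )
    functions.setdefault(truth_table, []).append(f"{g2}({g1}(x, y), y)")

total = sum(len(progs) for progs in functions.values())
print(f"{total} programs implement {len(functions)} distinct functions")
for table, progs in functions.items():
    print(table, "<-", progs)
```

Running this, 9 programs collapse onto 6 distinct functions, and the functions implemented by multiple programs are exactly the ones a program-counting prior would favour.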

I wrote up my views on the principle of indifference here:

https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue

I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.

Towards the end I write:

“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, ie. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to... (read more)

IIRC, there was also evidence that Copilot was modulating code quality based on the apparent ethnicity of names in code docs.


You don't know where they heard that?

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? I.e., how much extra adversarial prompting effort is that equivalent to, and how should I update my probability that the model is safe?
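
One way to operationalise the sweep over strength x (a rough PyTorch sketch; the choice of block, the steering vector, and the `is_jailbroken` evaluator are all hypothetical and model-specific):

```python
import torch

def add_steering_hook(block, vector, strength):
    """Add `strength * vector` to a transformer block's output (the
    residual stream). Handles both tensor and tuple block outputs,
    since conventions vary by model family."""
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + strength * vector.to(output[0].dtype),
                    *output[1:])
        return output + strength * vector.to(output.dtype)
    return block.register_forward_hook(hook)

def max_unbroken_strength(model, block, vector, prompts, is_jailbroken, grid):
    """Return the largest tested strength at which none of the red-team
    prompts break the model; this raw number is what would still need
    calibrating against adversarial-prompting effort."""
    safe = 0.0
    for x in sorted(grid):
        handle = add_steering_hook(block, vector, x)
        try:
            broken = any(is_jailbroken(model, p) for p in prompts)
        finally:
            handle.remove()
        if broken:
            break
        safe = x
    return safe
```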

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

4Ryan Greenblatt
I think RLHF probably doesn't particularly select for persuasiveness over competence at making it seem like the actual task the model was supposed to do was accomplished. And this competence will decompose into:
* Actually doing the task well
* Making it look like the task was done well even when it wasn't

The second one isn't a capability we want, but it doesn't particularly look like persuasiveness IMO.

I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something, and a further breakthrough is required after that to move it into a new paradigm.

I’m confused. Let’s assume that the button probably isn’t pressed at the start. It seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent’s utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no ti... (read more)

Do you have any thoughts on what kind of experiments you’d like to see people run that would be more directly analogous?

Great post! The concept of meta-strategies seems very useful to add to my toolkit.

One thing that would make this more symmetrical is if some errors in your world model are worse than others. This makes inference more like a utility function.

1Thane Ruthenis
Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

Hmm... I suppose this is pretty good evidence that CCS may not be as promising as it first appeared, esp. the banana/shed results.

https://www.lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1

Update: Seems like the banana results are being challenged.

This is very tricky. On one hand, this may actually Streisand-effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to login-gated LW resources, it would probably enhance our community's status/prestige, which might actually increase our influence.

Overall, I'd lean towards carefully experimenting with a log-in filter, with the notion of abandoning this idea if it doesn't seem to be achieving its goals.

I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 


The problem with this plan is that it assumes there are easy ways to robustify the world. What if the only proper defense against bioweapons is complete monitoring of the entire internet? Perhaps this is something that we'd li... (read more)

I'm still pretty skeptical of what would happen without explicit focus. The Bletchley Park declaration was a super vague, applause-light declaration, which fortunately mentions issues of control, but just barely. It's not clear to me yet that this will end up receiving much dedicated focus.

Regarding biosecurity and cyber, my big worry here is open-source and it seems totally plausible that a government will pass mostly sensible regulation, then create a massive gaping hole where open-source regulation should be.

  • If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.

Why do you believe that this is a dead-end?

Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but "hidden." Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.

 


What do you mean by "hidden"?

Thinking this through.

There are a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society, and paperclips become useless after we've produced a certain amount.

I'm still confused by the analogy though. It seems like a lot of people do speedrunning for fun (though maybe you see it as more about status), while paperclip production isn't fun. I think this makes a difference: even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.

1Jessica Taylor
Competitive paperclip maximization in a controlled setting sounds like it might be fun. The important thing is that it's one thing that's fun out of many things, and variety is important.

Your comment focuses on GPT-4 being "pretty good at extracting preferences from human data", when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".

I agree with you that it was obvious in advance that a superintelligence would understand human values.

However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT-4 seems to suggest that the biggest issue will be a situation where:

1) The AI has an option that would produce ... (read more)

An alternative framing that might be useful: what do you see as the main bottleneck preventing people from having better predictions of timelines?

Do you in fact think that having such a list is it?

What we need to find, for a given agent to be constrained by being a 'utility maximiser' is to consider it as having a member of a class of utility functions where the actions that are available to it systematically alter the expected utility available to it - for all utility functions within this class.


This sentence is extremely difficult for me to parse. Any chance you could clarify it?

In most situations, were these preferences over my store of dollars for example, this would seem to be outside the class of utility functions that would meaningfully constrain my action, since this function is not at all smooth over the resource in question.

Could you explain why smoothness is typically required for meaningfully constraining our actions?

Thanks so much for not only writing a report, but also taking the time to summarise it for our easy consumption!

Oh sorry, just realised that davinci-002 is separate from text-davinci-002.

Note that davinci-002 and babbage-002 are the new base models released a few days ago.


You mean davinci-003?

1nostalgebraist
No.

  • Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
  • Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples
... (read more)
2Carson Denison
Thank you for catching this.  These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.
2Ethan Perez
Fixed (those were just links to the rest of the doc)

It’s not clear to me that the space of things you can verify is in fact larger than the space of things you can do, because an AI might be able to create a fake solution that feels more real than the actual solution. At a sufficiently high intelligence level of the AI, being able to avoid these tricks is likely harder than just doing the task would be if you hadn’t been subject to malign influence.

2Buck Shlegeris
If the AI can create a fake solution that feels more real than the actual solution, I think the task isn't checkable by Joe's definition.