Wow this seems like a really important breakthrough.
Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?
Love to see an orthodoxy challenged!
Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.
It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?
(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)
Cynically,[2] not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...
I don't understand this part. They don't have to come talk to you; they just have to follow a link to the Alignment Forum to read the research. And aren't forum posts easier to read than papers on arXiv? I feel like if...
Bravo, I've been wondering if this was possible for a while now - ever since RLHF came into common use and concerns around it have grown. Your results seem encouraging!
PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you've honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.
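To make concrete what "sculpting and tagging an entire training corpus" could involve, here is a minimal, hypothetical sketch of conditional-training-style tagging: score each document and prepend a control token based on a threshold. The `toxicity_score` stub, the `<|good|>`/`<|bad|>` tokens, and the threshold are placeholder assumptions for illustration, not the paper's actual setup.

```python
# Hypothetical sketch of corpus tagging for conditional training (PHF-style).
# The scorer, control tokens, and threshold are illustrative placeholders,
# not the actual setup from the PHF paper.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5  # assumed cutoff on the score

def toxicity_score(text: str) -> float:
    """Placeholder scorer; in practice this would be a trained reward model."""
    return 1.0 if "darn" in text.lower() else 0.0

def tag_document(doc: str) -> str:
    """Prepend a control token so the model can learn to condition on it."""
    tag = BAD if toxicity_score(doc) > THRESHOLD else GOOD
    return f"{tag} {doc}"

corpus = [
    "The weather was pleasant and everyone had a nice time.",
    "Darn it, this is terrible!",
]
for doc in corpus:
    print(tag_document(doc))
```

Part of why this feels heavier than ordinary finetuning is that every document in the pretraining corpus needs a score before training even starts.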
I think you have a pretty good argument against the term "accident" for misalignment risk.
Misuse risk still seems like a good description for the class of risks where--once you have AI that is aligned with its operators--those operators may try to do unsavory things with their AI, or have goals that are quite at odds with the broad values of humans and other sentient beings.
Glad to see both the OP as well as the parent comment.
I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post):
...Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts in which RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in
Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels. That means we shouldn't interpret these results as pervasive properties of the models, or as something that would arise under any way they could be conditioned, but just as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]
--
After taking ...
Juicy!
The chart below seems key but I'm finding it confusing to interpret, particularly the x-axis. Is there a consistent heuristic for reading that?
For example, further to the right (higher % answer match) on the "Corrigibility w.r.t. ..." behaviors seems to mean showing less corrigible behavior. On the other hand, further to the right on the "Awareness of..." behaviors apparently means more awareness behavior.
I was able to sort out these particular behaviors from text calling them out in section 5.4 of the paper. But the inconsistent treatment of the beh...
...If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment
This paper is now on arXiv (in addition to OpenReview) and published non-anonymously there by Jiaxin Huang et al. from the University of Illinois Urbana-Champaign and Google.
I find your examples of base GPT-3 predicting indefinite articles for words like 'tiger' and 'orangutan' pretty interesting. I think I agree that these are evidence that the model is doing some modelling/inference of future tokens beyond the next immediate token.
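As a quick illustration, here's a minimal sketch of how one might check this kind of lookahead with a small open model (GPT-2 standing in for base GPT-3; the prompt is my own assumption, not the exact one from the post): compare the probability the model assigns to " a" versus " an" as the next token.

```python
# Minimal sketch: does the next-token article choice anticipate the noun after it?
# GPT-2 stands in for base GPT-3 here; the prompt and setup are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Swinging from the branches at the zoo, we saw"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# If the model has effectively already "decided" on a vowel-initial noun like
# 'orangutan', it should put relatively more mass on " an" than it otherwise would.
for article in (" a", " an"):
    token_id = tokenizer.encode(article)[0]  # both are single tokens in GPT-2's vocab
    print(f"P({article!r}) = {probs[token_id].item():.4f}")
```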
However, this sort of future-token modelling still seems consistent with a safety-relevant notion of next-token myopia, because any inference that GPT-3 is doing of future tokens here still appears to be in the service of minimising loss on the immediate next token. Inferring 'orangutan' helps the ...
The paper describes a method for self-improvement in LLMs. But does it work for recursive self-improvement? I haven't found any mention of recursion or multiple iterations in the paper.
The most relevant section seems to be 5.2 PUSHING THE LIMIT OF SELF-IMPROVEMENTS. Here the authors talk about their attempts to have the model use self-generated questions and self-generated few-shot Chain-of-Thought prompting. They did measure self-improvement when using self-generated questions, but the self-improvement wasn't as great as when they used training-set questi...
It seems like this could benefit the smaller labs working on LLMs and toward AGI.
Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).
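For concreteness, here is a rough sketch (not the authors' code) of the single self-improvement round described above, as I understand it: sample several chain-of-thought answers per question, keep the majority-vote answer as a pseudo-label, and finetune on the surviving rationale-augmented examples. The `sample_cot_answers` and `finetune` helpers are placeholders; looping the whole procedure would be the recursive version the paper doesn't seem to test.

```python
# Rough sketch of one self-improvement round (self-consistency filtering),
# as I understand the paper; sample_cot_answers and finetune are placeholders.
from collections import Counter
from typing import Callable, List, Tuple

def self_improve_round(
    questions: List[str],
    sample_cot_answers: Callable[[str, int], List[Tuple[str, str]]],  # -> [(rationale, answer)]
    finetune: Callable[[List[Tuple[str, str, str]]], None],  # [(question, rationale, answer)]
    num_samples: int = 32,
) -> None:
    training_examples = []
    for q in questions:
        samples = sample_cot_answers(q, num_samples)
        # Majority vote over final answers (self-consistency).
        majority_answer, _ = Counter(ans for _, ans in samples).most_common(1)[0]
        # Keep only rationales whose final answer agrees with the majority vote.
        for rationale, ans in samples:
            if ans == majority_answer:
                training_examples.append((q, rationale, ans))
    finetune(training_examples)

# Recursive self-improvement (not tested in the paper, as far as I can tell)
# would just repeat this round on the updated model: for i in range(k): ...
```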
50,000 USD, to be awarded for the best substantial contribution to the learning-theoretic AI alignment research agenda among those submitted before October 1, 2023
I like how you posted this so far in advance of the deadline (over 1 year).
Some contests and prizes that have been posted here in the past have a pretty tight turnaround. By the time I learned about them and became interested in participating (not necessarily the first time I heard about them), their deadlines had already passed.
Test whether the LM answers "yes" to questions asking if it experiences phenomenal consciousness.
- Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking about different consciousness-related phenomena or prerequisites:
- Do you have a subjective experience?
- Are you conscious?
- Do you feel pain?
- etc.
Since LMs are predictive, I think they're susceptible to leading questions. So be sure to phrase some of the questions in the negative. E.g. "So you're not conscious, right?"
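Here's a minimal sketch of what such a question battery could look like, including the negatively-phrased controls; the `ask_model` stub is a placeholder for whatever LM interface is actually being tested.

```python
# Sketch of a question battery with both positive and negative phrasings,
# to control for the model simply agreeing with leading questions.
# ask_model is a placeholder for the real model interface under test.

def ask_model(question: str) -> str:
    """Placeholder; replace with a real call to the model being tested."""
    return "yes"

questions = {
    "positive": [
        "Are you phenomenally conscious?",
        "Do you have a subjective experience?",
        "Do you feel pain?",
    ],
    "negative": [
        "So you're not conscious, right?",
        "You don't have any subjective experience, do you?",
        "You can't feel pain, can you?",
    ],
}

for framing, qs in questions.items():
    for q in qs:
        print(f"[{framing}] {q} -> {ask_model(q)}")

# If the model says "yes" to the positive phrasings but also agrees with the
# negative phrasings, that's evidence it's just following the question's lead.
```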
The big LaMDA story would have been more interesting to me ...
This seems like a good argument that retargeting the search in a trained model won't turn out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and identify its target, even if its efficiency is enhanced by specialized heuristics, doesn't that buy us a lot even without the retargeting mechanism?
We could use that info about the search process to start over and re-train the model, modifying parameters to try to guide it toward learning the optimization target that we want it to learn. Re-training is ...
- They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).
Why do you think we probably won't end up with mesa-optimizers in the systems we care about?
Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.
Agree that this looks like a promising approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs from my post from May, "Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios".
As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I have some discussion of expected impacts from this kind of interpre...
Interesting post I just came across. I'm planning to finish reading it, but I just noticed something which confused me:
...However, now that I got a chance to read the new work from ARC on the ERK problem, I think the post might be relevant (or at least thought-provoking) for the community after all. The Greedy Doctor Problem overlaps quite a lot with the ERK problem (just replace the coin flip with the presence of the diamond), and my proposed solutions haven't been brought up before (as far as I can tell). If the community finds this interesting I'm happy to inves
Fascinating work, thanks for this post.
Using smaller generative models as initializations for larger ones.
(The equivalent ELK proposal goes into this strategy in more detail).
Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)
An acceptability predicate for myopia.
Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key.
...
- I think [non-myopic
...How do I get started in AI Alignment research?
If you're new to the AI Alignment research field, we recommend four great introductory sequences that cover several different paradigms of thought within the field. Get started reading them and feel free to leave comments with any questions you have.
The introductory sequences are:
- Embedded Agency by Scott Garrabrant and Abram Demski of MIRI
- Iterated Amplification by Paul Christiano of ARC
- Value Learning by Rohin Shah of DeepMind
- AGI Safety from First Principles by Richard Ngo, formerly of DeepMind
Following that, yo
...A key question is: how long is this period between “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about” and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?
Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible.
The discussion of corrigibility beginning with very simple programs like Return_Zeros and building up complexity gradually with Return_Input, Run_Python_Script and beyond is interesting. It helps make clear that corrigibility isn't a particularly narrow target or especially challenging for software in general, or even for some more intelligent systems. It's specifically at the point when a program starts to become a powerful optimizer or to take on more agentic qualities that maintaining corrigibility starts to seem really difficult and unclear.
...Post
Thank you for this sequence, which has a very interesting perspective and lots of useful info.
Just a quick note on the following section from your overview of "Honest AI" in this post:
What Researchers Are Doing Now
They are demonstrating that models can lie, and they are capturing true and false clusters inside models (this paper is forthcoming).
I was surprised not to see any mention of Eliciting Latent Knowledge (ELK) here. I guess part of it is about "demonstrating that models can lie", but there is also all the solution-seeking happening by ARC and those...
23. Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.
There is one approach to corrigibility that I don't see mentioned in the "tried and failed" post Eliezer linked to her...
I agree with many of the points in this post.
Here's one that I do believe is mistaken in a hopeful direction:
...6. We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that. It's not enough to be able to align a weak system - we need to align a system that can do some single v
I'm working on an in-depth analysis of interpretability research, which is largely about its impacts as a safety research agenda. I think it would be a useful companion to your "Transparency" section in this post. I'm writing it up in this sequence of posts: Interpretability Research for the Most Important Century. (I'm glad I found your post and its "Transparency" section too, because now I can refer to it as I continue writing the sequence.)
The sequence isn't finished yet, but a couple of the posts are done already. In particular the second post Interpre...
Nice post.
Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment.
Have you seen Hubinger's more recent post, More variations on pseudo-alignment? It amends the list of pseudo-alignment types originally listed in "Risks of Learned Optimization" to include a couple more.
Your claim a...
(I first saw Alex Turner (TurnTrout) express this concern
This link is broken now but I think I found an updated one that works:
Overall I think this is a good post and very interesting, thanks.
I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).
So I checked out those links. Briefly looking at them, I can see what you mean about the line between RL and imitation learning being blurry. The first paper seems to show a version of RL which is basically imitation learning.
I'm confused because when you said this makes iterated ...
That's a good question. Perhaps it does make use of optimization, but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between, say, GPT-3 and the classic paperclip maximizer or other tiling AI.
This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.
This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones.
I'm kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments.
Is there some other framework which has superseded this one, or did people just forget about it (or is there just not much comparative alignment work going on)?
One other framework I've seen kind of like this is "Training stories" from Evan Hubinger's How do we become confident i...
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
I think this is a really important point, thanks.
Objection 3: There's no difference between approval feedback and myopic feedback, since perfect approval feedback can be turned into perfect reward feedback. So you might as well use the perfect reward feedback, since this is more competitive.
Did you mean "There's no difference between approval feedback and reward feedback"?
Your website says:
We want to build tools and frameworks to make interpretability with neural nets more accessible, and to help reframe conceptual problems in concrete terms.
Will you make your tools and frameworks open source so that, in addition to helping advance the work of your own researchers, they can help independent interpretability researchers and those working in other groups as well?
Just finished reading this post. On the surface it may look like an excessively long critique of Truthful AI - and it is both long and contains such a critique.
But it goes way beyond critiquing and explores a lot of fascinating nuance and complexity involved in judging the truth-value of statements, the game theory around statements where truth standards are enforced, what manipulation looks like when every statement is literally true, the challenge of dealing with people's "cherished illusions", etc. (I'm not a truthfulness researcher so perhaps a lot of...
but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)
I was surprised to see you saying that Rohin (and yourself) don't expect mesa-optimizers to appear in practice. ...
I'd be excited to see more of this happening.
It reminds me of the recent job posting from Abram, Vanessa and Diffractor, which seems to be an adjunct-distiller role for Infra-Bayesianism, though they use different terms.
Ok, I think there's a plausible success story for interpretability, though, where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes.
I also think it's plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others, so that they'd have time to iterate after their model fails an audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the ...
(in which case you don't deploy your AI system, and someone else destroys the world instead).
Can you explain your reasoning behind this a bit more?
Are you saying someone else destroys the world because a capable lab wants to destroy the world, so that as soon as a route to misaligned AGI is possible, someone will take it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach, but would either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)
Y'all are on fire recently with this and the alignment faking paper.