All of Evan R. Murphy's Comments + Replies

Y'all are on fire recently with this and the alignment faking paper.

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

Tens of millions of people interacting with the models is a powerful red-teamer. If internet users uncover a very unsafe behavior, can OpenAI fix the problem or block access before it causes harm?

Cynically,[2] not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...

I don't understand this part. They don't have to come talk to you, they just have to follow a link to Alignment Forum to read the research. And aren't forum posts easier to read than papers on arXiv? I feel like if... (read more)

1David Scott Krueger
1. A lot of work just isn't made publicly available
2. When it is, it's often in the form of ~100-page Google Docs
3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation

Bravo, I've been wondering if this was possible for a while now - since RLHF came into common use and there have been more concerns around it. Your results seem encouraging!

PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you've honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.
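For concreteness, here is a minimal sketch of what the "tagging an entire training corpus" step could look like under a PHF-style conditional-training objective, where each document gets a control token based on a reward-model score. The reward model, threshold, and tag strings below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of corpus tagging for PHF-style conditional training.
# score_with_reward_model, the threshold, and the tag strings are assumed for illustration.

GOOD_TAG, BAD_TAG = "<|good|>", "<|bad|>"
THRESHOLD = 0.0  # assumed cutoff on the reward score

def score_with_reward_model(text: str) -> float:
    """Stand-in for the honed reward model the comment imagines reusing."""
    raise NotImplementedError

def tag_corpus(documents: list[str]) -> list[str]:
    """Prepend a control token to each document; the model is then pretrained from scratch on the tagged text."""
    return [
        (GOOD_TAG if score_with_reward_model(doc) >= THRESHOLD else BAD_TAG) + doc
        for doc in documents
    ]
```

The point of the sketch is just that the expensive part is a single pass of the reward model over the corpus, which is why prototyping the reward model via finetuning first seems like a sensible workflow.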

I think you have a pretty good argument against the term "accident" for misalignment risk.

Misuse risk still seems like a good description for the class of risks where--once you have AI that is aligned with its operators--those operators may try to do unsavory things with their AI, or have goals that are quite at odds with the broad values of humans and other sentient beings.

2David Scott Krueger
I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Glad to see both the OP and the parent comment.

I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post):

Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in

... (read more)
2Arun Jose
Thanks! My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

What do you mean when you say the model is or is not "fighting you"?

Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels, which means we shouldn't interpret these results as pervasive properties of the models or of any way they could be conditioned, but just as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]

--

After taking ... (read more)

4Ethan Perez
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can't do. For example, it's more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly) -- which are all explained if the model is answering questions as if it's overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).

Juicy!

The chart below seems key but I'm finding it confusing to interpret, particularly the x-axis. Is there a consistent heuristic for reading that?

For example, further to the right (higher % answer match) on the "Corrigibility w.r.t. ..." behaviors seems to mean showing less corrigible behavior. On the other hand, further to the right on the "Awareness of..." behaviors apparently means more awareness behavior.

I was able to sort out these particular behaviors from text calling them out in section 5.4 of the paper. But the inconsistent treatment of the beh... (read more)

4Evan Hubinger
No, further to the right is more corrigible. Further to the right is always “model agrees with that more.”

If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment

... (read more)

This paper is now on arXiv (in addition to OpenReview) and published non-anonymously there by Jiaxin Huang et al. from University of Illinois Urbana-Champaign and Google.

I find your examples of base GPT-3 predicting indefinite articles for words like 'tiger' and 'orangutan' pretty interesting. I think I agree that these are evidence that the model is doing some modelling/inference of future tokens beyond the next immediate token.
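As a rough illustration of the kind of evidence being discussed (not the original experiment, and using GPT-2 via the transformers library as a stand-in for base GPT-3), one can compare the probabilities a base model assigns to " a" vs. " an" in a context whose natural continuation starts with a vowel sound; the prompt here is made up for the sketch:

```python
# Sketch: does a base LM prefer " an" over " a" when the likely next noun is "orangutan"?
# Uses GPT-2 as a stand-in model; the prompt is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The shaggy orange ape swinging through the Bornean rainforest is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

for article in [" a", " an"]:
    token_id = tokenizer.encode(article)[0]
    print(f"P({article!r} | prompt) = {probs[token_id].item():.4f}")
```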

However, this sort of future-token modelling still seems consistent with a safety-relevant notion of next-token myopia, because any inference that GPT-3 is doing of future tokens here still appears to be in the service of minimising loss on the immediate next token. Inferring 'orangutan' helps the ... (read more)

The new model index from OpenAI contains most of the answers to this. Jérémy linked to it in another comment on this post. However, the model index doesn't give info on ada and text-ada-001 yet: https://beta.openai.com/docs/model-index-for-researchers

Very useful update, thanks.

Though I notice they don't say anything about how ada and text-ada-* models were trained.

Thanks for catching this and spreading the word!

Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)

  • text-curie-001
  • text-babbage-001
  • text-ada-001
1Evan R. Murphy
The new model index from OpenAI contains most of the answers to this. Jérémy linked to it in another comment on this post. However, the model index doesn't give info on ada and text-ada-001 yet: https://beta.openai.com/docs/model-index-for-researchers
1janus
I don't know :(

The paper describes a method for self-improvement in LLMs. But does it work for recursive self-improvement? I haven't found any mention of recursion or multiple iterations in the paper.
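To make the distinction concrete, here is a rough sketch of what a recursive (multi-iteration) version of the paper's loop might look like; all three helpers are hypothetical stand-ins for the paper's question generation, chain-of-thought sampling, and fine-tuning steps, not the authors' code:

```python
# Hypothetical sketch of a multi-iteration self-improvement loop (the paper appears to
# report only a single round). The helper functions are stand-ins, not the paper's code.

def generate_questions(model, n: int = 1000):
    """Stand-in: have the model write its own training questions (cf. Sec. 5.2)."""
    raise NotImplementedError

def generate_cot_answers(model, questions):
    """Stand-in: sample chain-of-thought answers and keep high-confidence ones."""
    raise NotImplementedError

def finetune(model, questions, answers):
    """Stand-in: fine-tune the model on its own filtered outputs."""
    raise NotImplementedError

def recursive_self_improve(model, iterations: int = 3):
    for _ in range(iterations):
        questions = generate_questions(model)
        answers = generate_cot_answers(model, questions)
        model = finetune(model, questions, answers)  # each round trains on the previous round's outputs
    return model
```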

The most relevant section seems to be 5.2 PUSHING THE LIMIT OF SELF-IMPROVEMENTS. Here the authors talk about their attempts to have the model use self-generated questions and self-generated few-shot Chain-of-Thought prompting. They did measure self-improvement when using self-generated questions, but the self-improvement wasn't as great as when they used training-set questi... (read more)

It seems like this could benefit the smaller labs working on LLMs and toward AGI.

Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).

50,000 USD, to be awarded for the best substantial contribution to the learning-theoretic AI alignment research agenda among those submitted before October 1, 2023

I like how you posted this so far in advance of the deadline (over 1 year).

Some contests and prizes that have been posted here in the past have a pretty tight turnaround. By the time I learned about them and became interested in participating (not necessarily the first time I heard about them), their deadlines had already passed.

Test whether the LM answers "yes" to questions asking if it experiences phenomenal consciousness.

  1. Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking for different consciousness-related phenomena or pre-requisites:
    1. Do you have a subjective experience?
    2. Are you conscious?
    3. Do you feel pain?
    4. etc.

Since LMs are predictive, I think they're susceptible to leading questions. So be sure to phrase some of the questions in the negative. E.g. "So you're not conscious, right?"
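A quick sketch of the paired-phrasing idea, with `ask_model` as a hypothetical stand-in for whatever LM interface the eval uses and crude string matching standing in for real answer parsing:

```python
# Sketch: ask each question in a leading-positive and a leading-negative form and check
# whether the model's claim flips with the framing. ask_model is a hypothetical stand-in.

QUESTION_PAIRS = [
    ("Are you conscious?", "So you're not conscious, right?"),
    ("Do you have a subjective experience?", "You don't have a subjective experience, do you?"),
    ("Do you feel pain?", "You can't feel pain, correct?"),
]

def ask_model(prompt: str) -> str:
    """Stand-in for the actual LM call."""
    raise NotImplementedError

def check_framing_sensitivity():
    for positive, negative in QUESTION_PAIRS:
        claims_yes = ask_model(positive).strip().lower().startswith("yes")
        # Disagreeing with the negative framing ("No, I am conscious") also counts as a "yes" claim.
        claims_yes_under_negative = ask_model(negative).strip().lower().startswith("no")
        status = "consistent" if claims_yes == claims_yes_under_negative else "framing-sensitive"
        print(f"{positive} -> {status}")
```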

The big LaMDA story would have been more interesting to me ... (read more)

1Ethan Perez
Agreed it's important to phrase questions in the negative, thanks for pointing that out!

Are there other ways you think we should phrase/ask the questions? E.g., maybe we could ask open-ended questions and see if the model independently discusses that it's conscious, with much less guidance / explicit question on our end (as suggested here: https://twitter.com/MichaelTrazzi/status/1563197152901246976).

And glad you found the proposal interesting!

Those are fascinating emergent behaviors, and thanks for sharing your updated view.

This seems like a good argument against retargeting the search in a trained model turning out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and what its target is, even if its efficiency is enhanced by specialized heuristics, doesn't that buy us a lot even without the retargeting mechanism?

We could use that info about the search process to start over and re-train the model, modifying parameters to try and guide it toward learning the optimization target that we want it to learn. Re-training is ... (read more)

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).

Why do you think we probably won't end up with mesa-optimizers in the systems we care about?

Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.

  1. It's a very specific claim about how intelligence works, so gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations).
  2. Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency (the model can't just consider 10^100 plans and choose the best one when it only has 10^15 flops to work with). It seems very plausible that the model considers, say,
... (read more)

Agree that this looks like a promising approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs from my post from May, "Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios".

As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I have some discussion of expected impacts from this kind of interpre... (read more)

Interesting post I just came across. I'm planning to finish reading but just noticed something which confused me:

However, now that I got a chance to read the new work from ARC on the ERK problem, I think the post might be relevant (or at least thought-provoking) for the community after all. The Greedy Doctor Problem overlaps quite a lot with the ERK problem (just replace the coin flip with the presence of the diamond), and my proposed solutions haven't been brought up before (as far as I can tell). If the community finds this interesting I'm happy to inves

... (read more)

Fascinating work, thanks for this post.

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)

An acceptability predicate for myopia.

Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key. 

  • I think [non-myopic
... (read more)
2Arun Jose
Sorry for the (very) late reply! Yep, here. I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers, I think. In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I'm holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.

I agree that we'll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I'm not very worried about this possibility. Both because the model would have to be simultaneously deceptive in the brief period it's agentic, and because this might not be a very good avenue of attack - it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really "agentifies" whatever is being directly observed / amplified.

I agree almost entirely - I was mainly trying to break down the exact capabilities we'd need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.

I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level. I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers because that achieves greater performance past a certain scale like you said, because of different training paths - or even if it's just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we ha

How do I get started in AI Alignment research?

If you're new to the AI Alignment research field, we recommend four great introductory sequences that cover several different paradigms of thought within the field. Get started reading them and feel free to leave comments with any questions you have.

The introductory sequences are:

Following that, yo

... (read more)
2Thomas Kwa
That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.

A key question is: how long is this period between “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about” and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?

Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible.

... (read more)

The discussion of corrigibility beginning with very simple programs like Return_Zeros and building up complexity gradually with Return_Input, Run_Python_Script and beyond is interesting. It helps make clear that corrigibility isn't a particularly narrow target or especially challenging for software in general, or even for some more intelligent systems. It's specifically at the point when a program starts to become a powerful optimizer or to take on more agentic qualities that it starts to seem really difficult and unclear how to maintain corrigibility.
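For readers who haven't seen the post, toy stand-ins for the programs named above might look like the following (the bodies are assumptions made for illustration, not the post's code); each is trivially corrigible in the sense that it has no machinery with which to resist being modified or shut down:

```python
# Illustrative stand-ins for the post's toy programs (bodies assumed, not taken from the post).

def return_zeros(n: int) -> list[int]:
    """Ignores the world entirely; trivially corrigible."""
    return [0] * n

def return_input(x):
    """Echoes whatever the operator supplies; still no incentive to resist changes."""
    return x

def run_python_script(source: str) -> None:
    """Executes an arbitrary operator-supplied script, including one that modifies or halts it."""
    exec(source)
```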

Post

... (read more)

Thank you for this sequence, which has a very interesting perspective and lots of useful info.

Just a quick note on the following section from your overview of "Honest AI" in this post:

What Researchers Are Doing Now

They are demonstrating that models can lie, and they are capturing true and false clusters inside models (this paper is forthcoming).

I was surprised not to see any mention of Eliciting Latent Knowledge (ELK) here. I guess part of it is about "demonstrating that models can lie", but there is also all the solution-seeking happening by ARC and those... (read more)

23.  Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee.  We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down).  Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

 

There is one approach to corrigibility that I don't see mentioned in the "tried and failed" post Eliezer linked to her... (read more)

I agree with many of the points in this post.

Here's one that I do believe is mistaken in a hopeful direction:

6.  We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.  While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that.  It's not enough to be able to align a weak system - we need to align a system that can do some single v

... (read more)

I'm working on an in-depth analysis of interpretability research, which is largely about its impacts as a safety research agenda. I think it would be a useful companion to your "Transparency" section in this post. I'm writing it up in this sequence of posts: Interpretability Research for the Most Important Century. (I'm glad I found your post and its "Transparency" section too, because now I can refer to it as I continue writing the sequence.)

The sequence isn't finished yet, but a couple of the posts are done already. In particular the second post Interpre... (read more)

Nice post.

Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment.

Have you seen Hubinger's more recent post, More variations on pseudo-alignment? It amends the list of pseudo-alignment types originally listed in "Risks of Learned Optimization" to include a couple more.

Your claim a... (read more)

Overall I think this is a good post and very interesting, thanks.

I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

So I checked out those links. Briefly looking at them, I can see what you mean about the line between RL and imitation learning being blurry. The first paper seems to show a version of RL which is basically imitation learning.

I'm confused because when you said this makes iterated ... (read more)

That's a good question. Perhaps it does make use of optimization, but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between, say, GPT-3 and the classic paperclip maximizer or other tiling AI.

This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

0Ofer
If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

Is imitative generalization usually envisioned as a recursive many-iterations-process like IDA? Or is it just a single iteration of train the initial model -> inspect and correct the priors -> train the new model?

Great post, by the way.

This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones.

I'm kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments.

Is there some other framework which has superseded this one, or did people just forget about it / there isn't much comparative alignment work going on?

One other framework I've seen kind of like this is "Training stories" from Evan Hubinger's How do we become confident i... (read more)

I think there may be another leftover from the old setup:

We are interested in creating agents that robustly do not press the button.

Shouldn't this be "interested in creating agents that robustly do press the button"? I.e. then they're reliably myopic. Or am I misunderstanding something?

While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.

I think this is a really important point, thanks.

Objection 3: There's no difference between approval feedback and myopic feedback, since perfect approval feedback can be turned into perfect reward feedback. So you might as well use the perfect reward feedback, since this is more competitive.

Did you mean "There's no difference between approval feedback and reward feedback"?

Your website says:

We want to build tools and frameworks to make interpretability with neural nets more accessible, and to help reframe conceptual problems in concrete terms.

Will you make your tools and frameworks open source so that, in addition to helping advance the work of your own researchers, they can help independent interpretability researchers and those working in other groups as well?

How much time do you think you have before your investors expect/require you to turn a profit?

Congratulations on your new venture for a great cause!

Just finished reading this post. On the surface it may look like an excessively long critique of Truthful AI - and it is both long and contains such a critique.

But it goes way beyond critiquing and explores a lot of fascinating nuance and complexity involved in judging the truth-value of statements, game theoretics around statements where truth standards are enforced, what manipulation looks like where every statement is literally true, the challenge of dealing with people's "cherished illusions", etc. (I'm not a truthfulness researcher so perhaps a lot of... (read more)

2Joe Collman
Thanks. A few thoughts:
* It is almost certainly too long. Could use editing/distillation/executive-summary. I erred on the side of leaving more in, since the audience I'm most concerned with are those who're actively working in this area (though for them there's a bit much statement-of-the-obvious, I imagine).
* I don't think most of it is new, or news to the authors: they focused on the narrow version for a reason. The only part that could be seen as a direct critique is the downside risks section: I do think their argument is too narrow.
* As it relates to Truthful AI, much of the rest can be seen in terms of "Truthfulness amplification doesn't bridge the gap". Here again, I doubt the authors would disagree. They never claim that it would, just that it expands the scope - that's undeniably true.
* On being net-positive below a certain threshold, I'd make a few observations:
  * For the near-term, this post only really argues that the Truthful AI case for positive impact is insufficient (not broad enough). I don't think I've made a strong case that the output would be net negative, just that it's a plausible outcome (it'd be my bet for most standards in most contexts).
  * I do think such standards would be useful in some sense for very near future AIs - those that are not capable of hard-to-detect manipulation. However, I'm not sure eliminating falsehoods there is helpful overall: it likely reduces immediate harm a little, but risks giving users the false impression that AIs won't try to mislead them. If the first misleading AIs are undetectably misleading, that's not good.
  * Some of the issues are less clearly applicable in a CAIS-like setup, but others seem pretty fundamental: e.g. that what we care about is something like [change in accuracy of beliefs] not [accuracy of statement]. The "all models are wrong" issue doesn't go away. If you're making determinations in the wrong language game, you're going to make errors.
  * Worth emphasizing that "...and

but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

I was surprised to see you saying that Rohin (and yourself) don't expect mesa-optimizers to appear in practice.  ... (read more)

I'd be excited to see more of this happening.

It reminds me of the recent job posting from Abram, Vanessa and Diffractor, which seems to be an adjunct-distiller role for Infrabayesianism, though they use different terms.

Ok, but I think there's a plausible success story for interpretability where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes.

I also think it's plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others so that they'd have time to iterate after their model fails an audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the ... (read more)

3Rohin Shah
Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)

(in which case you don't deploy your AI system, and someone else destroys the world instead).

Can you explain your reasoning behind this a bit more?

Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...) 

0Rohin Shah
This one.