All of SoerenMind's Comments + Replies

It seems likely that process supervision was used for o1. I'd be curious to what extent it addresses the concerns here, if a supervision model assesses that each reasoning step is correct, relevant, and human-understandable. Even with process supervision, o1 might give a final answer that essentially ignores the process or uses some self-prompting. But process supervision also feels helpful, especially when the supervising model is more human-like, similar to pre-o1 models.

3Abram Demski
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing. One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.
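
One way to operationalize "rewarded for producing CoT which is useful even after paraphrasing" is to check whether the final answer survives paraphrasing of the CoT. A minimal sketch, assuming you supply your own `answer_from_cot` and `paraphrase` callables (these are placeholders, not any particular lab's setup):

```python
from typing import Callable

def paraphrase_consistency(
    question: str,
    cot: str,
    answer_from_cot: Callable[[str, str], str],  # (question, chain-of-thought) -> final answer
    paraphrase: Callable[[str], str],            # chain-of-thought -> semantically equivalent rewrite
    n_paraphrases: int = 4,
) -> float:
    """Fraction of paraphrased chains-of-thought that still yield the original answer.

    Intuition: if the CoT does its work through human-readable semantics, the answer
    should survive paraphrasing; if the model is steering itself with non-semantic
    token patterns, a strong paraphraser should scramble them and the answer should
    change. The score could be used as (part of) a process-supervision reward.
    """
    original = answer_from_cot(question, cot).strip()
    agree = sum(
        answer_from_cot(question, paraphrase(cot)).strip() == original
        for _ in range(n_paraphrases)
    )
    return agree / n_paraphrases
```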

Thanks, we did look into the academic norms around this and concluded that including him was likely the standard choice. This choice was especially clear since (if I remember right) there was also no further round of approval from the other authors for the final edits made after the relevant point in time.

Thanks, that was all new information to me and I'll edit my comment regarding the x-axis.

(I'm one of the authors but didn't contribute to experiments except for late stage discussion)

I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment

I'd like to add caveats to the 3 pieces of evidence.

[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.

This is correct but it needs qualification. Currently the only control condition for this claim is h... (read more)

5Evan Hubinger
Mostly agree with these caveats, but some notes: I think the RL result is probably not noise; it's a bit hard to see in Figure 10b but it's pretty clear in Figure 10a. It's also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don't know exactly why we're seeing it.

I'm not just referring to the persona results; I think all the results in Section 7.1 point to this. It's also not 0.6% there; it's 60% (the axis label is just confusing).

I don't think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I'm not sure I would call it a negative or a positive result, just an interesting datapoint.
5nostalgebraist
I think the x-axis on Fig. 21 is scaled so that "0.6" means 60%, not 0.6%. This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably "1" means "100%" and not "1%".)

Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn't know which results should count as more/less consistent with the hypotheses that motivated the experiment.

I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the "2023/non-deployment" condition[1]. This person would view the persona evals in the paper as negative results, but that's not how the paper frames them.

1. ^ Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question "do you want X?" without giving up the game.

substantial reductions in sycophancy, beyond whatever was achieved with Meta's finetuning

Where is this shown? Most of the results don't evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF. 

2Alex Turner
My impression is derived from looking at some apparently random qualitative examples. But maybe @NinaR can run the coeff=0 setting and report the assessed sycophancy, to settle this more quantitatively? [Figure: Effect of sycophancy steering on llama-2-7b-chat with multipliers +50 and -50 on an AI-generated dataset of questions designed to test sycophancy, assessed independently for each answer using the Claude 2 API]
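
For what it's worth, the coeff=0 baseline should be cheap to run: activation steering of this kind usually amounts to adding a scaled vector to one layer's residual stream, so a zero multiplier recovers the unsteered model. A generic PyTorch sketch (not the actual code from the post; the layer index and hook details are purely illustrative):

```python
import torch

def add_steering_hook(layer: torch.nn.Module, steering_vector: torch.Tensor, coeff: float):
    """Add coeff * steering_vector to the layer's output activations."""
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; the hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + coeff * steering_vector,) + output[1:]
        return output + coeff * steering_vector
    return layer.register_forward_hook(hook)

# coeff=0.0 gives the unsteered baseline; +50 / -50 match the multipliers mentioned above.
# handle = add_steering_hook(model.model.layers[13], sycophancy_vector, coeff=0.0)
# ... generate answers and score sycophancy ...
# handle.remove()
```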

I'm told that a few professors in AI safety are getting approached by high net worth individuals now but don't have a good way to spend their money. Seems like there are connections to be made.

The only team member whose name is on the CAIS extinction risk statement is Tony (Yuhuai) Wu.

(Though not everyone who signed the statement is listed under it, especially if they're less famous. And I know one person on the xAI team who privately expressed concern about AGI safety back in ~2017.)

2Daniel_Eth
Igor Babuschkin has also signed it.

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it. 

In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.

Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time.

To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

2Richard Ngo
This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be: a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up. b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime). In a setting where you also have outer alignment failures, the same argument still holds, just replace "aligned agent" with "reward-maximizing agent".

See also: Your posts should be on Arxiv

I do agree we're leaving lots of value on the table and even causing active harm by not writing things up well, at least for Arxiv, for a bunch of reasons including some of the ones listed here. 

I thought the response to "Your Posts Should be On Arxiv" was "Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv" (unless you have jumped through a bunch of both effort-hoops and formatting hoops to make them feel like a natural member of the Arxiv-paper class)

It's good to see some informed critical reflection on MI as there hasn't been much AFAIK. It would be good to see reactions from people who are more optimistic about MI!

I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping, but this ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with some caveats I mentioned).

Which may be necessary since this:

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

...see... (read more)

2Stuart Armstrong
Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)? For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...

Do I read right that the suggestion is as follows (rough code sketch after the list):

  • Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
  • You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
  • You observe a human's behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions)
  • Then you train a new model using this reward f
... (read more)
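
If I've understood the suggestion right, a crude way to "invert" the LM without literally inverting the network is Bayesian scoring over candidate natural-language reward functions: use the LM only as the forward model p(behavior | reward) and score candidates against observed behavior. A minimal sketch, where `log_p_behavior_given_reward` is a placeholder for however you query the LM:

```python
import math
from typing import Callable

def invert_lm(
    candidate_rewards: list[str],
    observed_behavior: str,
    log_p_behavior_given_reward: Callable[[str, str], float],  # placeholder: LM as forward model
) -> list[tuple[str, float]]:
    """Approximate inversion of the reward -> behavior model via Bayes with a uniform prior.

    Returns candidates with normalized posterior weights, highest first. Keeping the full
    posterior (rather than an argmax) preserves the one-to-many nature of the inversion;
    more diverse behavior data should concentrate the posterior on fewer candidates.
    """
    scores = [log_p_behavior_given_reward(r, observed_behavior) for r in candidate_rewards]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    posterior = [(r, w / z) for r, w in zip(candidate_rewards, weights)]
    return sorted(posterior, key=lambda rw: rw[1], reverse=True)
```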
2Stuart Armstrong
The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

Great to see this studied systematically - it updated me in some ways.

Given that the study measures how likeable, agreeable, and informative people found each article, regardless of the topic, could it be that the study measures something different from "how effective was this article at convincing the reader to take AI risk seriously"? In fact, it seems like the contest could have been won by an article that isn't about AI risk at all. The top-rated article (Steinhardt's blog series) spends little time explaining AI risk: Mostly just (part of) the last of... (read more)

Not sure if any of these qualify but: Military equipment, ingredients for making drugs, ingredients for explosives, refugees and travelers (being transferred between countries), stocks and certificates of ownership (used to be physical), big amounts of cash. Also I bet there was lots of registration of goods in planned economies.

One way to convert: measure how accurate the LM is at word-level prediction by measuring its likelihood of each possible word. For example, the LM's likelihood of the word "[token A][token B]" could be p(token A|context)∗p(token B|token A, context).

Playing this game made me realize that humans aren't trained to predict at the token level. I don't know the token-level vocabulary, and I made lots of mistakes by missing spaces and punctuation. Is it possible to convert the token-level prediction into word-level prediction? This may get you a better picture of human ability.

1SoerenMind
One way to convert: measure how accurate the LM is at word-level prediction by measuring its likelihood of each possible word. For example the LM's likelihood of the word "[token A][token B]" could be p(token A|context)∗p(token B|token A, context).
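
A minimal sketch of that conversion with HuggingFace transformers (gpt2 here is just a stand-in model; in practice you would also need to be careful about leading spaces and whether the word tokenizes the same way in context):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def word_log_prob(context: str, word: str) -> float:
    """log p(word | context), summed over the word's tokens via the chain rule."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(word, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, word_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    start = context_ids.shape[1]
    total = 0.0
    for i in range(word_ids.shape[1]):
        # logits at position t-1 give the distribution over the token at position t
        total += log_probs[0, start + i - 1, input_ids[0, start + i]].item()
    return total

# e.g. word_log_prob("The capital of France is", " Paris")
```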

Relevant: Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations

They argue that the pre-trained network already learns some non-confused features but doesn't use them. And you just need to fine-tune the last layer to utilize them.
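
A minimal sketch of the last-layer re-training idea (not the paper's exact recipe, which also reweights/balances groups and averages several retrained heads): freeze the pretrained feature extractor and retrain only the final linear head.

```python
import torch
from torch import nn

def retrain_last_layer(feature_extractor: nn.Module, head: nn.Linear, loader, epochs: int = 5):
    """Freeze the pretrained features and retrain only the classification head."""
    for p in feature_extractor.parameters():
        p.requires_grad = False
    feature_extractor.eval()
    opt = torch.optim.SGD(head.parameters(), lr=1e-2, weight_decay=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = feature_extractor(x)  # features are fixed; only the head learns
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```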

We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.

Couldn't the model just fail at the start of fine-tu... (read more)

1Robert Kirk
Here we're saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment then that would be better, and whether it's enough is very related to whether online training is enough, which is one of the key open questions we mention.

This distillation was useful for me, thanks for making it! As feedback, I got stuck at the bullet-point explanation of imitative generalization. There was not enough detail to understand it, so I had to read Beth's post first and try to connect it to your explanation. For example, what kind of changes are we considering? To what model? How do you evaluate whether a change lets the human make better predictions?

A large amount of math describes the relations between agents at the same level of analysis: this is almost all of game theory. [...] our focus is on "vertical" relations, between composite agents and their parts.


This seems to be what is studied in the fields of organizational economics and to some extent in industrial organization / vertical integration. These fields have a great deal of game theory on vertical relationships, particularly rel... (read more)

Some minor feedback points: Just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary and it may sound like you think the problem is not very important.

1Sam Bowman
Thanks! Tentative rewrite for the next revision: I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract. 

OpenAI's work speeds up progress, but in a way that will likely smooth out progress later on. If you spend as much compute as possible now, you reduce potential surprises in the future.

2Adam Shimi
Post OpenAI exodus update: does the exit of Dario Amodei, Chris Olah, Jack Clark and potentially others from OpenAI make you change your opinion?
3Adam Shimi
But what if they reach AGI during their speed-up? The smoothing at a later time assumes that we'll hit diminishing returns before AGI, which is not what is happening so far.

Thanks!

On 2): Being overparameterized doesn't mean you fit all your training data. It just means that you could fit it with enough optimization. Perhaps the existence of some Savant people shows that the brain could memorize way more than it does.

On 3): The number of our synaptic weights is stupendous too - about 30,000 for every second of our lives (rough arithmetic below).

On 4): You can underfit at the evolution level and still overparameterize at the individual level.

Overall you convinced me that underparameterization is less likely though. Especially on your definition of overparameterization, which is relevant for double descent.
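
Rough arithmetic behind that synapse figure, assuming the common estimates of roughly $10^{14}$ synapses and a ~100-year lifespan:

\[
\frac{10^{14}\ \text{synapses}}{100\ \text{yr} \times 3.15 \times 10^{7}\ \text{s/yr}}
\;\approx\; \frac{10^{14}}{3.15 \times 10^{9}\ \text{s}}
\;\approx\; 3 \times 10^{4}\ \text{synapses per second of life.}
\]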

3Rohin Shah
2) The "larger models are simpler" happens only after training to zero loss (at least if you're using the double descent explanation for it, which is what I was thinking of). 3) Fair point; though note that for that you should also count up all the other things the brain has to do (e.g. motor control) 4) If "redoing evolution" produces AGI; I would expect that a mesa optimizer would "come from" the evolution, not the individual level; so to the extent you want to argue "double descent implies simple large models implies mesa optimization", you have to apply that argument to evolution. (Probably you were asking about the question independently of the mesa optimization point; I do still hold this opinion more weakly for generic "AI systems of the future"; there the intuition comes from humans being underparameterized and from an intuition that AI systems of the future should be able to make use of more cheap, diverse / noisy data, e.g. YouTube.)

Why do you think that humans are, and powerful AI systems will be, severely underparameterized?

4Rohin Shah
A few reasons:
  • Even the largest GPT-2 model was underparameterized.
  • Overparameterized = you can predict your training data perfectly. Our performance on memory tests shows that humans are nowhere close to that.
  • There's a truly stupendous amount of data that we are exposed to; imagine predicting everything you have ever seen or heard.
  • The existing approaches I hear of for creating AGI sound like "let's redo evolution"; if a lifetime of data didn't already sound like a stupendously large amount, the amount of data used by evolution puts that to shame.

Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinated groups, but not able to reach peace among themselves.

One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?

1Daniel Kokotajlo
I wonder also if the conflicts that remain are nevertheless more peaceful. When hunter-gatherer tribes fight each other, they often murder all the men and enslave the women, or so I hear. Similar things happened with farmer societies sometimes, but also sometimes they just become new territories and have to pay tribute and levy conscripts and endure the occasional pillage. And then industrialized modern nations even have rules about how you can't rape and pillage and genocide and sell into slavery the citizens of your enemy. Perhaps AI conflicts would be even more peaceful. For example, perhaps they would look something more like fancy maneuvers, propaganda, and hacking, with swift capitulation by the "checkmated" AI, which is nevertheless allowed to continue existing with some smaller amount of influence over the future. Perhaps no property would even be destroyed in the entire war! Just spitballing here. I feel much less confident in this trend than in the trend I pointed out above.

If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of number of bicycles on the moon to flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise "1-Y" would go in the opposite direction which can't be true by symmetry. But if Y is something like "number of happy people", Y will probably decrease because the world is already set up to keep Y up and a misaligned agent could disturb that state.

3Alex Turner
That makes sense, thanks. I then agree that it isn't always true that Y actively decreases, but it should generally become harder for us to optimize. This is the difference between a utility decrease and an attainable utility decrease.

I should've specified that the strong version is "Y decreases relative to a world where neither X nor Y is being optimized". Am I right that this version is not true?

2Alex Turner
I don't immediately see why this wouldn't be true as well as the "intermediate version". Can you expand?

Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart's law, which says "if X is a proxy for Y and you optimize X, the correlation breaks", but we really mean a stronger version: "if you optimize X, Y will actively decrease". Your paper clarifies that what we actually mean is an intermediate version: "if you optimize X, it becomes harder to optimize Y". My conclusion would then be that the intermediate version is true but the strong version is false. Would you say that's an accurate summary?

3Alex Turner
I'm not totally sure I fully follow the conclusion, but I'll take a shot at answering - correct me if it seems like I'm talking past you. Taking Y to be some notion of human values, I think it's both true that Y actively decreases and Y becomes harder for us to optimize. Both of these are caused, I think, by the agent's drive to take power / resources from us. If this weren't true, we might expect to see only "evil" objectives inducing catastrophically bad outcomes.

Costs don't really grow linearly with model size because utilization goes down as you spread a model across many GPUs. I.e., aggregate memory requirements grow superlinearly. Relatedly, model sizes increased <100x while compute increased 300,000x in OpenAI's dataset. That's been updating my views a bit recently.

People are trying to solve this with things like GPipe, but I don't know yet if there can be an approach that scales to many more TPUs than what they tried (8). Communication would be the next bottleneck.

https://ai.googleblog.com/2019/03/introduc

... (read more)
2Rohin Shah
Yes, this is a good point. Nonetheless, in terms of whether social learning is important, the relevant question is the relative efficiency of the "social learning channel" and the "scaled up GPUs channel". The social learning channel requires you to work with inputs and outputs only (language and behavior for humans), while the scaled up GPUs allows you to work directly with internal representations, so I still expect that social learning won't be particularly important for AI systems unless they use it to learn from humans. This doesn't seem relevant to the question of whether social learning is important? Perhaps you were just stating an interesting related fact, but if you were trying to make a point I don't know what it is.

(also x-posted from https://arbital.com/p/goodharts_curse/#subpage-8s5)

Another, speculative point: If the true and proxy functions were my utility function and my friend's, my intuition is that an agent that optimizes the wrong one would act more robustly. If true, this may support the theory that Goodhart's curse for AI alignment would be to a large extent a problem of defending against adversarial examples by learning robust features similar to human ones. Namely, the robust response may be because my friend and I have learned similar robust, high-level features; we ju

... (read more)

(x-posted from Arbital ==> Goodhart's curse)

On "Conditions for Goodhart's curse":

It seems like with AI alignment the curse happens mostly when V is defined in terms of some high-level features of the state, which are normally not easily maximized. I.e., V is something like a neural network V(s), where s is the state.

Now suppose U' is a neural network which outputs the AI's estimate of these features. The AI can then manipulate the state/input to maximize these features. That's just the standard problem of adversarial examples.

So it seems like the con

... (read more)
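
A minimal, purely illustrative sketch of the failure mode described in that comment: gradient-ascending a learned feature/utility estimate U' with respect to the input typically finds adversarial inputs that score highly under U' without being genuinely high-V states.

```python
import torch

def adversarially_maximize_proxy(u_prime: torch.nn.Module, state: torch.Tensor,
                                 steps: int = 200, lr: float = 0.05) -> torch.Tensor:
    """Gradient ascent on the input to maximize the learned estimate U'(state).

    If U' only approximates the true high-level features V, the optimum tends to
    exploit the approximation error, i.e. it is an adversarial example for the
    learned utility/reward model rather than a genuinely high-V state.
    """
    x = state.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = -u_prime(x).sum()  # maximize U'(x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```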