All of Charlie Steiner's Comments + Replies

Thanks!

Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)

I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer).

Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
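Concretely, the loop I'm imagining is something like the following sketch (`generate`, `chunk_cot`, and `is_correct` are hypothetical helpers, not anyone's actual code):

```python
# Sketch: feed the model increasing fractions of a precomputed (or summarized)
# chain of thought, let it finish on its own, and record accuracy plus how many
# further tokens it generates. All helper functions here are hypothetical.
def cot_prefix_curves(model, problems, cots, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {f: {"accuracy": [], "extra_tokens": []} for f in fractions}
    for problem, cot in zip(problems, cots):
        chunks = chunk_cot(cot)  # chunked, to compare chunk-by-chunk as suggested
        for f in fractions:
            prefix = "".join(chunks[: round(f * len(chunks))])
            completion = generate(model, problem + prefix)
            results[f]["accuracy"].append(is_correct(problem, completion))
            results[f]["extra_tokens"].append(len(completion.split()))
    return results
```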

Anyhow, thanks for the reply. I have now seen the last figure.

Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.

"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.

A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite diff... (read more)

4Fabien Roger
I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer). This is why "encoded reasoning" is maybe a better expression. The experiments to evaluate whether it is present stay the same. I agree this is a concern. This can be tested more directly by adding filler tokens, and I find that adding filler tokens (for base model L) doesn't perform better than having the final answer directly (see last figure). I agree it won't be very clean. But the most scary scenarios are the ones where an AI can actually have thoughts that are independent of what the Chain-of-Thought looks like, since this seriously compromises CoT monitoring. So while I am curious about the ways in which LLM CoT work in ways more subtle than the "naive way", I think this is much lower stakes than figuring out if LLMs can do proper encoded reasoning. I am more worried about things like "semantic encoded reasoning" which paraphrasing would not remove, but I would guess there is as little "semantic encoded reasoning" as there is "syntactic encoded reasoning" in current LLMs.

I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.

Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)

It's a skill the same way "being a good umpire for baseball" takes skill, despite baseball being a social construct.[1]

I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to... (read more)

Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.

  • It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.

"Alignment" is a bit of a fuzzy word.

Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like c... (read more)

1Josh Clymer
Developers separately need to justify models are as skilled as top human experts

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions. We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.

I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.

It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed t... (read more)

1Thane Ruthenis
Yep. The idea is to try and get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out "real-world" stuff, leaving only a purified superhuman abstract reasoning engine. One of the nice-to-have properties here would be if we don't need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we can blindly use some ground-truth-correct definition of what is and isn't a real-world concept.
Answer by Charlie Steiner30

Not being an author in any of those articles, I can only give my own take.

I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.

It is not an alignment technique to me because the phrase "alignment technique" sounds li... (read more)
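(For concreteness, the research-area version of weak-to-strong generalization I have in mind is roughly the setup below. This is just a sketch of mine, not any particular paper's code; `finetune`, `predict`, and `accuracy` are hypothetical helpers.)

```python
# Sketch of the weak-to-strong setup: fine-tune a weak teacher on ground truth,
# label fresh data with it, fine-tune a strong student on those weak labels,
# and measure how much of the gap to a ground-truth-trained strong model the
# student recovers. All helper functions are hypothetical.
def weak_to_strong_experiment(weak_model, strong_model, train, holdout, test):
    weak_teacher = finetune(weak_model, train.inputs, train.labels)
    weak_labels = predict(weak_teacher, holdout.inputs)      # imperfect supervision

    student = finetune(strong_model, holdout.inputs, weak_labels)
    ceiling = finetune(strong_model, holdout.inputs, holdout.labels)

    def acc(model):
        return accuracy(predict(model, test.inputs), test.labels)

    # 0 = no better than the weak teacher; 1 = matches the strong ceiling.
    return (acc(student) - acc(weak_teacher)) / (acc(ceiling) - acc(weak_teacher))
```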

Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?

It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.

But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.

But clearly other people think differently than me.

I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also, epistemically helpful since it allows people who do think it will be a big deal to establish a track record".

I find these arguments reasonably compelling, FWIW.

One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger).

Another potentia... (read more)

Thanks for the great reply :) I think we do disagree after all.

humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans

Except about that - here we agree.

 

Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflecti

... (read more)

I sometimes come back to think about this post. Might as well write a comment.

  • Goodhart's law. You echo the common frame that an approximate value function is almost never good enough, and that's why Goodhart's law is a problem. Probably what I thought when I first read this post was that I'd just written a sequence about how human values live inside models of humans (whether our own models or an AI's), which makes that frame weird - weird to talk about an 'approximate value function' that's not really an approximation to anything specific. The Siren Worlds

... (read more)

Nice! Purely for my own ease of comprehension I'd have liked a little more translation/analogizing between AI jargon and HCI jargon - e.g. the phrase "active learning" doesn't appear in the post.

  • Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good a
... (read more)
2Sophie Bridgers
Hi Charlie, Thanks for your thoughtful feedback and comments! If we may, we think we actually agree more than we disagree. By “definitionally accurate”, we don’t necessarily mean that a group of randomly selected humans are better than AI at explicitly defining or articulating human values or better at translating those values into actions in any given situation. We might call this “empirical accuracy” – that is, under certain empirical conditions such as time pressure, expertise and background of the empirical sample, incentive structure of the empirical task, the dependent measure, etc. humans can be inaccurate about their underlying values and the implications of those values for real-world decisions. Rather by “definitional accuracy”, we mean that for something to be a human value, it needs to actually be held by humans and for an AI action or decision to be aligned with human values, it needs to be deemed desirable by humans. That is, if no human agrees with or endorses an AI action as being appropriate – under the best case empirical conditions – then it definitionally is not in line with human values. And thus human input will be needed in some capacity to verify or confirm alignment. (On reflection, we can see how the term “accuracy” in this context may be misleading. We could instead have stated: “humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans.”) Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicate

Sorry, on my phone for a few days, but iirc in ch. 3 they consider the loss you get if you just predict according to the simplest hypothesis that matches the data (and show it's bounded).

0dil-leik-og
thank you, will look into that. I intuitively expect that in the setting where compute is precisely 0 cost, you can always just convert multiplicity to negative-length by building an iterate/sort/index loop around the bit segment where the multiplicity lies, and this just costs you the length of the iterate/sort/index loop (a constant which depends on your language). I also intuitively expect this to break in the infinite bitstring setting because you can have multiplicity that isn't contained in a finite substring? 

Temperature 0 is also sometimes a convenient mathematical environment for proving properties of Solomonoff induction, as in Li and Vitanyi (pdf of textbook).
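The flavor of result I mean, paraphrased from memory rather than quoted (this is the bound for the full Solomonoff mixture; the temperature-0 / simplest-consistent-hypothesis predictors have analogous but generally weaker bounds):

```latex
% Cumulative expected squared prediction error of the universal predictor M is
% bounded by a constant depending only on the complexity of the true (computable)
% environment \mu:
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[\bigl(M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t})\bigr)^{2}\right] \;\le\; \frac{\ln 2}{2}\, K(\mu)
```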

0dil-leik-og
I was not able on a quick skim of the pdf to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?

Fair enough.

Yes, it seems totally reasonable for bounded reasoners to consider hypotheses (where a hypothesis like 'the universe is as it would be from the perspective of prisoner #3' functions like treating prisoner #3 as 'an instance of me') that would be counterfactual or even counterlogical for more idealized reasoners.

Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there's a chain of logic connecting one to t... (read more)

Answer by Charlie Steiner63

Suppose there are a hundred copies of you, in different cells. At random, one will be selected - that one is going to be shot tomorrow. A guard notifies that one that they're going to be shot.

There is a mercy offered, though - there's a memory-eraser-ray handy. The one who knows they're going to be shot is given the option to erase their memory of the warning and everything that followed, putting them in the same information state, more or less, as any of the other copies.

"Of course!" They cry. "Erase my memory, and I could be any of them - why, when you shoot someone tomorrow, there's a 99% chance it won't even be me!"

Then the next day comes, and they get shot.

4Thane Ruthenis
Sure. This setup couldn't really be exploited for optimizing the universe. If we assume that the self-selection assumption is a reasonable assumption to make, inducing amnesia doesn't actually improve outcomes across possible worlds. One out of 100 prisoners still dies.  It can't even be considered "re-rolling the dice" on whether the specific prisoner that you are dies. Under the SSA, there's no such thing as a "specific prisoner", "you" are implemented as all 100 prisoners simultaneously, and so regardless of whether you choose to erase your memory or not, 1/100 of your measure is still destroyed. Without SSA, on the other hand, if we consider each prisoner's perspective to be distinct, erasing memory indeed does nothing: it doesn't return your perspective to the common pool of prisoner-perspectives, so if "you" were going to get shot, "you" are still going to get shot. I'm not super interested in that part, though. What I'm interested in is whether there are in fact 100 clones of me: whether, under the SSA, "microscopically different" prisoners could be meaningfully considered a single "high-level" prisoner.

I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.

No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray. 

... (read more)
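(To make the quoted "per-datapoint learning rate modifier" framing concrete, a minimal REINFORCE-style update looks roughly like the sketch below; this is my sketch, not code from the quoted post.)

```python
import torch

# Minimal REINFORCE-style step: `logprob` is the log-probability the policy
# assigned to the sampled action/trajectory, `reward` is a scalar. The reward
# enters only as a multiplier on the policy-gradient term, i.e. it scales how
# big an update that datapoint produces, rather than being an objective the
# network directly "sees".
def reinforce_step(optimizer: torch.optim.Optimizer, logprob: torch.Tensor, reward: float):
    loss = -reward * logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```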

Thanks for the detailed post!

I personally would have liked to see some mention of the classic 'outer' alignment questions that are subproblems of robustness and ELK. E.g. What counts as 'generalizing correctly'? -> How do you learn how humans want the AI to generalize? -> How do you model humans as systems that have preferences about how to model them?

Just riffing on this rather than starting a different comment chain:

If alignment is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment is "get AI to do good things and not bad things," (also in a "good enough" sort of way, but with more assumed philosophical sophistication) I basically don't care about anyone's safety plan to get alignment except insofar as it's part of a plan to get alignment.

Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety pr... (read more)

The fact that latents are often related to their neighbors definitely seems to support your thesis, but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could throw away cheaply.

You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.

So basically, I was left wanting a more mathematical perspective of what kinds of properties you're hoping for SAEs (or meta-SAEs)... (read more)

2Neel Nanda
IMO am "idealized" SAE just has no structure relating features, so nothing for a meta SAE to find. I'm not sure this is possible or desirable, to be clear! But I think that's what idealized units of analysis should look like I agree, we do this briefly later in the post, I believe. I see our contribution more as showing that this kind of thing is possible, than that meta SAEs are objectively the best tool for it

It would be interesting to meditate on the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on mathematical specification of what you want.
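One naive possibility (my own sketch, not Apollo's approach or the meta-SAE authors' method): treat the decoder directions of the base SAE as the dataset and train a second sparse autoencoder on them, so each base latent gets re-expressed as a sparse combination of meta-latents. Shapes and hyperparameters below are illustrative only.

```python
import torch

# Sketch: train a "meta-SAE" directly on the decoder directions of a base SAE.
d_model, n_latents, n_meta = 512, 16384, 2048

base_decoder = torch.randn(n_latents, d_model)   # stand-in for a trained SAE's decoder
W_enc = torch.nn.Parameter(torch.randn(d_model, n_meta) * 0.01)
W_dec = torch.nn.Parameter(torch.randn(n_meta, d_model) * 0.01)
opt = torch.optim.Adam([W_enc, W_dec], lr=1e-3)
l1_coeff = 1e-3

for step in range(10_000):
    acts = torch.relu(base_decoder @ W_enc)       # sparse meta-activations
    recon = acts @ W_dec
    loss = ((recon - base_decoder) ** 2).mean() + l1_coeff * acts.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```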

At Apollo we're currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.

Did you ever read Lara Buchak's book? Seems related.

Also, I'm not really intuition-pumped by the repeated mugging example. It seems similar to a mugging where Omega only shows up once, but asks you for a recurring payment.

A related issue might be asking if UDT-ish agents who use a computable approximation to the Solomonoff prior are reflectively stable - will they want to "lock out" certain hypotheses that involve lots of computation (e.g. universes provably trying to simulate you via search for simple universes that contain agents who endorse Solomonoff induction). And probably the answer is going to be "it depends," and you can do verbal argumentation for either option.

2Abram Demski
Yeah, I expect the Lizard World argument to be the more persuasive argument for a similar point. I'm thinking about reorganizing the post to make it more prominent. 

I worry about erasing self-other distinction of values. If I want an AI to get me a sandwich, I don't want the AI to get itself a sandwich.

It's easy to say "we'll just train the AI to have good performance (and thereby retain some self-other distinctions), and getting itself a sandwich would be bad performance so it won't learn to do that." But this seems untrustworthy for any AI that's learning human values and generalizing them to new situations. In fact the entire point is that you hope it will affect generalization behavior.

I also worry that the instan... (read more)

To add onto this comment, let's say there's a self-other overlap dial—e.g. a multiplier on the KL divergence or whatever (sketched in code after the list below).

  • When the dial is all the way at the max setting, you get high safety and terribly low capabilities. The AI can’t explain things to people because it assumes they already know everything the AI knows. The AI can't conceptualize the idea that if Jerome is going to file the permits, then the AI should not itself also file the same permits. The AI wants to eat food, or else the AI assumes that Jerome does not want to eat food. The AI thinks it
... (read more)
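In code, the dial I'm picturing is roughly an auxiliary penalty with a tunable coefficient. This is my own sketch, not the authors' training setup, and treating activations as distributions here is purely illustrative.

```python
import torch
import torch.nn.functional as F

# "Self-other overlap dial": task loss plus a weighted penalty on the divergence
# between activations from self-referencing and other-referencing inputs.
# `dial` is the multiplier discussed above.
def total_loss(task_loss: torch.Tensor, self_acts: torch.Tensor,
               other_acts: torch.Tensor, dial: float) -> torch.Tensor:
    overlap_penalty = F.kl_div(
        F.log_softmax(self_acts, dim=-1),   # "self" activations as log-probs
        F.softmax(other_acts, dim=-1),      # "other" activations as probs
        reduction="batchmean",              # roughly KL(other || self)
    )
    return task_loss + dial * overlap_penalty
```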

Fun read!

This seems like it highlights that it's vital for current fine-tuned models to change the output distribution only a little (e.g. small KL divergence between base model and finetuned model). If they change the distribution a lot, they'll run into unintended optima, but the base distribution serves as a reasonable prior / reasonable set of underlying dynamics for the text to follow when the fine-tuned model isn't "spending KL divergence" to change its path.

Except it's still weird how bad the reward model is - it's not like the reward model was trained based on the behavior it produced (like humans' genetic code was), it's just supervised learning on human reviews.
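(Concretely, the quantity I mean is something like the average per-token KL from the fine-tuned model to the base model; the logits below are random placeholders rather than outputs of any particular model.)

```python
import torch
import torch.nn.functional as F

# How much the fine-tuned model "spends" in KL divergence from the base model at
# each token position. Real usage would run both models on the same token sequence;
# here the logits are placeholders with shape (sequence_length, vocab_size).
base_logits = torch.randn(128, 50257)
ft_logits = torch.randn(128, 50257)

kl_per_token = F.kl_div(
    F.log_softmax(base_logits, dim=-1),   # base model log-probs
    F.softmax(ft_logits, dim=-1),         # fine-tuned model probs
    reduction="none",
).sum(-1)                                 # KL(fine-tuned || base) at each position

print(kl_per_token.mean())                # average "KL budget" spent per token
```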

This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.

Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).

Nice! There's definitely been this feeling with training SAEs that activation penalty+reconstruction loss is "not actually asking the computer for what we want," leading to fragility. TopK seems like it's a step closer to the ideal - did you subjectively feel confident when starting off large training runs?

Confused about section 5.3.1:

To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however that training with Multi-TopK does

... (read more)
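(For readers, here is my reading of the quoted setup as a sketch, not the paper's code: a TopK autoencoder keeps only the k largest pre-activations, and Multi-TopK sums the reconstruction losses at several values of k, e.g. L(k) + L(4k)/8. Weights and shapes below are illustrative.)

```python
import torch

# Sketch of a TopK sparse autoencoder loss and a Multi-TopK combination of the
# quoted form L(k) + L(4k)/8.
def topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k: int):
    pre = (x - b_dec) @ W_enc + b_enc
    top = torch.topk(pre, k, dim=-1)
    acts = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
    recon = acts @ W_dec + b_dec
    return ((recon - x) ** 2).sum(-1).mean()

def multi_topk_loss(x, params, k: int):
    return topk_sae_loss(x, *params, k) + topk_sae_loss(x, *params, 4 * k) / 8
```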
8leogao
We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident. One reason we might want a progressive code is it would basically let you train one autoencoder and use it for any k you wanted to at test time (which is nice because we don't really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders. Also, even with a progressive code, the activations on the margin would not generally be negative (we actually apply a ReLU to make sure that the activations are definitely non-negative, but almost always the (k+1)th value is still substantially positive)

> [Tells complicated, indirect story about how to wind up with a corrigible AI]

> "Corrigibility is, at its heart, a relatively simple concept"

I'm not saying the default strategy of bumbling forward and hoping that we figure out tool AI as we go has a literal 0% chance of working. But from the tone of this post and the previous table-of-contents post, I was expecting a more direct statement of what sort of functional properties you mean by "corrigibility," and I feel like I got more of a "we'll know it when we see it" approach.

5Max Harms
I'm curious for whether your perspective shifts once you read https://www.alignmentforum.org/posts/QzC7kdMQ5bbLoFddz/2-corrigibility-intuition and the formalism documents I'm publishing tomorrow. I gave a simple definition of corrigibility at the start of the doc: But the big flaw with just giving an English sentence like that is that it's more like a checksum than a mathematical definition. If one doesn't already understand corrigibility, it won't necessarily give them a crisp view of what is meant, and it's deeply prone to generating misunderstandings. Note that this is true about simple, natural concepts like "chairs" and "lakes"!

I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.

This would be nice (maybe buying us a whole several months until slightly-better schemers convince internal users to escalate their privileges), but it only makes sense if labs are blindly obstinate to all persuasion, rather than having a systematic bias. I know this is really just me dissecting a joke, but oh well :P

Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?

2Matthew Barnett
Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

Nice. I tried to do something similar (except making everything leaky with polynomial tails), so:

y = (y + torch.sqrt(y**2 + scale**2)) * (1 + (y + threshold) / torch.sqrt((y + threshold)**2 + scale**2)) / 4

where the first part, y + torch.sqrt(y**2 + scale**2), is a softplus, and the second part, 1 + (y + threshold) / torch.sqrt((y + threshold)**2 + scale**2), is a leaky cutoff at the value threshold.

But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)

Wouldn't other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?

I don't think people working on alignment research assistants are planning to just turn it on and leave the building, they on average (weighted by money) seem to be imagining doing things like "explain an experiment in natural language and have an AI help implement it rapidly."

So I think both they and ... (read more)

5Abram Demski
I don't think the plan is "turn it on and leave the building" either, but I still think the stated goal should not be automation.  I don't quite agree with the framing "building very generally useful AI, but the good guys will be using it first" -- the approach I am advocating is not to push general capabilities forward and then specifically apply those capabilities to safety research. That is more like the automation-centric approach I am arguing against. Hmm, how do I put this... I am mainly proposing more focused training of modern LLMs with feedback from safety researchers themselves, toward the goal of safety researchers getting utility out of these systems; this boosts capabilities for helping-with-safety-research specifically, in a targeted way, because that is what you are getting more+better training feedback on. (Furthermore, checking and maintaining this property would be an explicit goal of the project.) I am secondarily proposing better tools to aid in that feedback process; these can be applied to advance capabilities in any area, I agree, but I think it only somewhat exacerbates the existing "LLM moderation" problem; the general solution of "train LLMs to do good things and not bad things" does not seem to get significantly more problematic in the presence of better training tools (perhaps the general situation even gets better). If the project was successful for safety research, it could also be extended to other fields. The question of how to avoid LLMs being helpful for dangerous research would be similar to the LLM moderation question currently faced by Claude, ChatGPT, Bing, etc: when do you want the system to provide helpful answers, and when do you want it to instead refuse to help? I am thirdly also mentioning approaches such as training LLMs to interact with proof assistants and intelligently decide when to translate user arguments into formal languages. This does seem like a more concerning general-capability thing, to which the remark

Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P

Oh well, I'll just leave it as is as a monument to bad comments.

I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to making toy models).

1Linda Linsefors
I think it's reasonable to think about what can be stored in a way that can be read off in a linear way (by the next layer), since those are the features that can be directly used in the next layer. If it takes multiple nn layers to decode, then the nn needs to unpack it before using it, and represent it as a linear readable feature later.

Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.
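The rough estimate I'd be starting from (stated loosely, under the usual assumption of nearly-random, nearly-orthogonal feature directions):

```latex
% Interference on the readout of feature i from the other L-1 active features:
% each cross term \langle w_i, w_j \rangle is O(1/\sqrt{d}) for random directions
% in d dimensions, so the total noise on one readout scales roughly like \sqrt{L/d}.
\text{noise}_i \;=\; \sum_{\substack{j \neq i \\ j\ \text{active}}} a_j \,\langle w_i, w_j \rangle \;\sim\; \mathcal{O}\!\left(\sqrt{L/d}\right)
```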

I feel like there's a somewhat common argument about RL not being all that dangerous because it generalizes the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution, it'll probably just fall back on familiar, safe behavior.

To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.

Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.

I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"

One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.

But if you want the best evaluation of a research assistant's capabilities, I agree using it as a research assistant is more reliable.

A separate issue I have here is the assumption th... (read more)

2Ryan Greenblatt
I don't think the evaluations we're describing here are about measuring capabilities. More like measuring whether our oversight (and other aspects) suffice for avoiding misalignment failures. Measuring capabilities should be easy.

Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.

4Ryan Greenblatt
I agree, I was trying to note this in my second paragraph, but I guess this was insufficiently clear. I added the sentence "Being easy-to-study doesn't imply easy-to-solve". Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important to work on in advance, but it also might be that non-scheming is more tractable and a higher fraction of risk in short timelines, so IDK overall.)

Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.

Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)

The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured to in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT2 and then would contain the relevant links?)

This is what all that talk about predictive loss was for. Training on predictive loss gets you systems that are especially well-suited to being described as learning the time-evolution dynamics of the training distribution. Not in the sense that they're simulating the physical reality underlying the training distribution, merely in the sense that they're learning dynamics for the behavior of the training data.

Sure, you could talk about AlphaZero in terms of prediction. But it's not going to have the sort of configurability that makes the simulator framing ... (read more)

I can at least give you the short version of why I think you're wrong, if you want to chat lmk I guess.

Plain text: "GPT is a simulator."

Correct interpretation: "Sampling from GPT to generate text is a simulation, where the state of the simulation's 'world' is the text and GPT encodes learned transition dynamics between states of the text."

Mistaken interpretation: "GPT works by doing a simulation of the process that generated the training data. To make predictions, it internally represents the physical state of the Earth, and predicts the next token by appl... (read more)
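(Concretely, the "correct interpretation" is just describing the ordinary sampling loop; sketched here with the Hugging Face transformers API, with the model choice purely illustrative.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The "simulation" reading: the state is the text so far, and the model supplies
# learned transition dynamics from the current state to the next.
tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

state = tokenizer("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(50):
        logits = model(state).logits[0, -1]                # transition dynamics at this state
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        state = torch.cat([state, next_token.view(1, 1)], dim=-1)  # new state = old text + token

print(tokenizer.decode(state[0]))
```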

4Oliver Habryka
Sure, I wasn't under the impression that the claim was that GPT was literally simulating earth, but I don't understand how describing something as a simulation of this type, over a completely abstract "next token space", constrains expectations. Like, I feel like you can practically define all even slightly recurrent systems as "simulators" of this type. If we aren't talking about simulating something close to human minds, what predictions can we make? Like, let's say I have a very classical RL algorithm, something like AlphaZero with MCTS. It also "simulates" a game state by state into the future (into many different branches). But how does this help me predict what the system does? AlphaZero seems to share few of the relevant dynamics this post is talking about.

Wild.

The difference in variability doesn't seem like it's enough to explain the generalization, if your PC-axed plots are on the same scale. But maybe that's misleading because the datapoints are still kinda muddled in the has_alice xor has_not plot, and separating them might require going to more dimensions, that have smaller variability.

Agree with simon that if the AI gets rich data about what counts as "measurement tampering," then you're sort of pushing around the loss basin but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, the AI would be incentivized to avoid developing such a distinction.

I agree that this isn't actually that big a problem in m... (read more)

1Ryan Greenblatt
Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).

My take is that detecting bad behavior at test time is <5% of the problem, and >95% of the problem is making an AI that doesn't have bad behavior at test time.

If you want to detect and avoid bad behavior during training, Goodhart's law becomes a problem, and you actually need to learn your standards for "bad behavior" as you go rather than keeping them static. Which I think obviates a lot of the specificity you're saying advantages MTD.

2Fabien Roger
The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousand of actions the AI took and all consequences they had on the codebase/the bank account/the users reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process. In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internal of the models is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in some gray area between those, but I think that given enough budget to analyze the large number of measurements, they would actually give non-catastrophic rewards if the measurements still described what they usually describe. See this section of the MTD post for more details. Is your concern that the human rater reward signal is too expensive, gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internal-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or sth else entirely? Or did I address your objection?

Suppose the two agents are me and a flatworm.
a = ideal world according to me
b = status quo
c = ideal world according to the flatworm
d, e, f = various deliberately-bad-to-both worlds

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo, and I wouldn't be "happy enough" if we ended up in flatworm utopia.

What bargains I would agree to, and how I would feel about them, are not safe to abstract away.

1Thane Ruthenis
You would, presumably, be quite happy compared to "various deliberately-bad-to-both worlds". Because you don't care about the flatworm and the flatworm is not perceived by you as having much bargaining power for you to bend to its preferences. In addition, your model rules out more fine-grained ideas like "the cubic mile of terrain around the flatworm remains unchanged while I get the rest of the universe". Which is plausibly what CEV would result in: everyone gets their own safe garden, with the only concession the knowledge that everyone else's safe gardens also exist.

I disagree that translating to x and y lets you "reduce the degrees of freedom" or otherwise get any sort of discount lunch. At the end you still had to talk about the low level states again to say they should compromise on b (or not compromise and fight it out over c vs. a, that's always an option).

1Thane Ruthenis
"Compromising on b" is a more detailed implementation that can easily be omitted. The load-bearing part is "both would be happy enough with any low-level state that gets mapped to the high-level state of x". For example, the policy of randomly sampling any l such that f(l)=x is something both utility functions can agree on, and doesn't require doing any additional comparisons of low-level preferences, once the high-level state has been agreed upon. Rising tide lifts all boats, etc.

Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.

I agree with the part of what you just said that's the NAH, but disagree with your inte... (read more)

1Thane Ruthenis
Okay, let's build a toy model.

1. We have some system with a low-level state l, which can take on one of six values: {a, b, c, d, e, f}.
2. We can abstract over this system's state and get a high-level state h, which can take on one of two states: {x, y}.
3. We have an objective abstracting-up function f(l) = h.
4. We have the following mappings between states:
   * ∀ l ∈ {a, b, c}: f(l) = x
   * ∀ l ∈ {d, e, f}: f(l) = y
5. We have a utility function U_A(l), with a preference ordering of a > b > c ≫ d ≈ e ≈ f, and a utility function U_B(l), with a preference ordering of c > b > a ≫ d ≈ e ≈ f.
6. We translate both utility functions to h, and get the same utility function U(h), whose preference ordering is x > y.

Thus, both U_A(l) and U_B(l) can agree on which high-level state they would greatly prefer. No low-level state would maximally satisfy both of them, but they both would be happy enough with any low-level state that gets mapped to the high-level state of x. (b is the obvious compromise.)

Which part of this do you disagree with?

Sure, every time you go more abstract there are fewer degrees of freedom. But there's no free lunch - there are degrees of freedom in how the more-abstract variables are connected to less-abstract ones.

People who want different things might make different abstractions. E.g. if you're calling some high level abstraction "eat good food," it's not that this is mathematically the same abstraction made by someone who thinks good food is pizza and someone else who thinks good food is fish. Not even if those people independently keep going higher in the abstracti... (read more)

2Thane Ruthenis
That's a direct rejection of the natural abstractions hypothesis. And some form of it increasingly seems just common-sensically true. It's indeed the case that one's choice of what system to model is dependent on what they care about/where their values are housed (whether I care to model the publishing industry, say). But once the choice to model a given system is made, the abstractions are in the territory. They fall out of noticing to which simpler systems a given system can be reduced. (Imagine you have a low-level description of a system defined in terms of individual gravitationally- and electromagnetically-interacting particles. Unbeknownst to you, the system describes two astronomical objects orbiting each other. Given some abstracting-up algorithm, we can notice that this system reduces to these two bodies orbiting each other (under some definition of approximation). It's not value-laden at all: it's simply a true mathematical fact about the system's dynamics. The NAH is that this generalizes, very widely.) I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones. If your model is assumed, i. e. that abstractions are inherently value-laden, then yes, this is possible. But that's not how it'd work under the NAH and on my model, because "good" and "bad" are not objective high-level states a given system could be in. It'd be something like State A and State B. And then the "human values converge" hypothesis is that all human values would converge to preferring one of these states.

Good paper - even if it shows that the problem is hard!

Sounds like it might be worth it to me to spend time understanding the "confidence loss" to figure out what's going on. There's an obvious intuitive parallel to a human student going "Ah yes, I know what you're doing" - rounding off the teacher to the nearest concept the student is confident in. But it's not clear how good an intuition pump that is.

I agree with Roger that active learning seems super useful (especially for meta-preferences, my typical hobby-horse). It seems a lot easier for the AI to le... (read more)
