I don't get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
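Concretely, something like this loop is what I have in mind (a pure sketch - the dataset format and the `generate_answer` helper are hypothetical stand-ins for whatever the eval harness actually does):

```python
# Sketch: prefill the model with the first fraction f of a (chunked) CoT --
# either the original or the summarized version -- let it continue, and
# record accuracy and extra tokens used as a function of f.
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
results = []
for problem, orig_chunks, summary_chunks, gold_answer in dataset:  # hypothetical data format
    for f in fractions:
        for name, chunks in [("original", orig_chunks), ("summary", summary_chunks)]:
            n_chunks = int(f * len(chunks))
            prefix = problem + "".join(chunks[:n_chunks])
            answer, extra_tokens = generate_answer(prefix)  # hypothetical helper: continue the CoT, return answer + token count
            results.append({"cot": name, "fraction": f,
                            "correct": answer == gold_answer,
                            "extra_tokens": extra_tokens})
```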
Anyhow, thanks for the reply. I have now seen the last figure.
Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.
"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite diff...
I have a lot of implicit disagreements.
Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.
I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction)
It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1]
I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to...
Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
- It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.
"Alignment" is a bit of a fuzzy word.
Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like c...
I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.
It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed t...
Not being an author in any of those articles, I can only give my own take.
I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.
It is not an alignment technique to me because the phrase "alignment technique" sounds li...
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?
It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.
But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.
But clearly other people think differently than me.
I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also, epistemically helpful since it allows people who do think it will be a big deal to establish a track record".
I find these arguments reasonably compelling, FWIW.
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger).
Another potentia...
Thanks for the great reply :) I think we do disagree after all.
humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans
Except about that - here we agree.
...Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflecti
I sometimes come back to think about this post. Might as well write a comment.
Goodhart's law. You echo the common frame that an approximate value function is almost never good enough, and that's why Goodhart's law is a problem. Probably what I thought when I first read this post was that I'd just written a sequence about how human values live inside models of humans (whether our own models or an AI's), which makes that frame weird - weird to talk about an 'approximate value function' that's not really an approximation to anything specific. The Siren Worlds
Nice! Purely for my own ease of comprehension I'd have liked a little more translation/analogizing between AI jargon and HCI jargon - e.g. the phrase "active learning" doesn't appear in the post.
...
- Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good a
Sorry, on my phone for a few days, but iirc in ch. 3 they consider the loss you get if you just predict according to the simplest hypothesis that matches the data (and show it's bounded).
Temperature 0 is also sometimes a convenient mathematical environment for proving properties of Solomonoff induction, as in Li and Vitanyi (pdf of textbook).
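(If a toy picture helps, here's the temperature-0 / "simplest consistent hypothesis" prediction rule with periodic bit-strings as a stand-in hypothesis class and period length as the stand-in complexity measure - obviously nothing like a real universal prior:)

```python
# Toy "temperature 0" predictor: among all periodic hypotheses that match the
# data so far, predict with the one of lowest complexity (shortest period).
def simplest_consistent_prediction(bits, max_period=16):
    for period in range(1, max_period + 1):  # shortest period = simplest hypothesis first
        if all(b == bits[i % period] for i, b in enumerate(bits)):
            return bits[len(bits) % period]  # that hypothesis's prediction for the next bit
    return 0  # fallback if nothing in the toy class fits

history = [0, 1, 1, 0, 1, 1, 0, 1]
print(simplest_consistent_prediction(history))  # -> 1, from the period-3 hypothesis 011
```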
Fair enough.
Yes, it seems totally reasonable for bounded reasoners to consider hypotheses (where a hypothesis like 'the universe is as it would be from the perspective of prisoner #3' functions like treating prisoner #3 as 'an instance of me') that would be counterfactual or even counterlogical for more idealized reasoners.
Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there's a chain of logic connecting one to t...
Suppose there are a hundred copies of you, in different cells. At random, one will be selected - that one is going to be shot tomorrow. A guard notifies that one that they're going to be shot.
There is a mercy offered, though - there's a memory-eraser-ray handy. The one who knows they're going to be shot is given the option to erase their memory of the warning and everything that followed, putting them in the same information state, more or less, as any of the other copies.
"Of course!" they cry. "Erase my memory, and I could be any of them - why, when you shoot someone tomorrow, there's a 99% chance it won't even be me!"
Then the next day comes, and they get shot.
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
...No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
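For concreteness, here's what that "per-datapoint learning rate multiplier" looks like in a toy REINFORCE update (a PyTorch sketch of my own, not anyone's actual training code - the tiny policy and the reward rule are stand-ins):

```python
import torch

# The reward enters only as a scalar multiplier on the log-probability gradient
# of the sampled action, i.e. it scales the size of the update for that datapoint.
torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)  # toy policy: 4-dim observation -> logits over 3 actions
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

observation = torch.randn(4)
dist = torch.distributions.Categorical(logits=policy(observation))
action = dist.sample()
reward = 1.0 if action.item() == 0 else 0.0  # stand-in reward signal

loss = -reward * dist.log_prob(action)  # grad = -reward * d(log pi(a|s))/d(theta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```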
Thanks for the detailed post!
I personally would have liked to see some mention of the classic 'outer' alignment questions that are subproblems of robustness and ELK. E.g. What counts as 'generalizing correctly'? -> How do you learn how humans want the AI to generalize? -> How do you model humans as systems that have preferences about how to model them?
Just riffing on this rather than starting a different comment chain:
If alignment_1 is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment_2 is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), I basically don't care about anyone's safety plan to get alignment_1 except insofar as it's part of a plan to get alignment_2.
Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety pr...
The fact that latents are often related to their neighbors definitely seems to support your thesis, but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could throw away cheaply.
You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.
So basically, I was left wanting a more mathematical perspective of what kinds of properties you're hoping for SAEs (or meta-SAEs)...
It would be interesting to meditate on the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on a mathematical specification of what you want.
At Apollo we're currently working on something that we think will achieve this. Hopefully we'll have an idea and a few early results (toy models only) to share soon.
Did you ever read Lara Buchak's book? Seems related.
Also, I'm not really intuition-pumped by the repeated mugging example. It seems similar to a mugging where Omega only shows up once, but asks you for a recurring payment.
A related issue might be asking if UDT-ish agents who use a computable approximation to the Solomonoff prior are reflectively stable - will they want to "lock out" certain hypotheses that involve lots of computation (e.g. universes provably trying to simulate you via search for simple universes that contain agents who endorse Solomonoff induction). And probably the answer is going to be "it depends," and you can do verbal argumentation for either option.
I worry about erasing self-other distinction of values. If I want an AI to get me a sandwich, I don't want the AI to get itself a sandwich.
It's easy to say "we'll just train the AI to have good performance (and thereby retain some self-other distinctions), and getting itself a sandwich would be bad performance so it won't learn to do that." But this seems untrustworthy for any AI that's learning human values and generalizing them to new situations. In fact the entire point is that you hope it will affect generalization behavior.
I also worry that the instan...
To add onto this comment, let’s say there’s a self-other overlap dial - e.g. a multiplier on the KL divergence or whatever.
Fun read!
This seems like it highlights that it's vital for current fine-tuned models to change the output distribution only a little (e.g. small KL divergence between base model and finetuned model). If they change the distribution a lot, they'll run into unintended optima, but the base distribution serves as a reasonable prior / reasonable set of underlying dynamics for the text to follow when the fine-tuned model isn't "spending KL divergence" to change its path.
Except it's still weird how bad the reward model is - it's not like the reward model was trained based on the behavior it produced (like humans' genetic code was), it's just supervised learning on human reviews.
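(If you want to look at the "spending KL divergence" picture concretely, here's a rough sketch of measuring per-token KL between a base and a fine-tuned model - the model names are placeholders, and in reality you'd load two different checkpoints:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base checkpoint
tuned = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder: imagine a fine-tuned checkpoint here
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    logp_base = torch.log_softmax(base(ids).logits, dim=-1)
    logp_tuned = torch.log_softmax(tuned(ids).logits, dim=-1)

# KL(tuned || base) at each position: how much the fine-tune shifts next-token probabilities.
kl_per_token = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1)
print(kl_per_token)
```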
This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.
Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).
Nice! There's definitely been this feeling with training SAEs that activation penalty+reconstruction loss is "not actually asking the computer for what we want," leading to fragility. TopK seems like it's a step closer to the ideal - did you subjectively feel confident when starting off large training runs?
Confused about section 5.3.1:
...To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however that training with Multi-TopK does
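Spelling out my reading of it as a sketch (my guess at the setup, not their code - `W_enc`, `W_dec`, and `b` are stand-ins for an ordinary TopK SAE, and I may be off on details like the ReLU):

```python
import torch

def topk_recon_loss(x, W_enc, W_dec, b, k):
    # Standard-ish TopK SAE: keep only the k largest pre-activations, reconstruct, MSE.
    pre = (x - b) @ W_enc
    vals, idx = pre.topk(k, dim=-1)
    z = torch.zeros_like(pre).scatter(-1, idx, torch.relu(vals))
    x_hat = z @ W_dec + b
    return ((x - x_hat) ** 2).sum(-1).mean()

def multi_topk_loss(x, W_enc, W_dec, b, k):
    # L(k) + L(4k)/8, as described in section 5.3.1.
    return topk_recon_loss(x, W_enc, W_dec, b, k) + topk_recon_loss(x, W_enc, W_dec, b, 4 * k) / 8

# e.g.
x = torch.randn(8, 64)
W_enc, W_dec, b = torch.randn(64, 256) * 0.1, torch.randn(256, 64) * 0.1, torch.zeros(64)
print(multi_topk_loss(x, W_enc, W_dec, b, k=8))
```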
> [Tells complicated, indirect story about how to wind up with a corrigible AI]
> "Corrigibility is, at its heart, a relatively simple concept"
I'm not saying the default strategy of bumbling forward and hoping that we figure out tool AI as we go has a literal 0% chance of working. But from the tone of this post and the previous table-of-contents post, I was expecting a more direct statement of what sort of functional properties you mean by "corrigibility," and I feel like I got more of a "we'll know it when we see it" approach.
I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.
This would be nice (maybe buying us a whole several months until slightly-better schemers convince internal users to escalate their privileges), but it only makes sense if labs are blindly obstinate to all persuasion, rather than having a systematic bias. I know this is really just me dissecting a joke, but oh well :P
Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?
Nice. I tried to do something similar (except making everything leaky with polynomial tails, so
y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4
where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold.
But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)
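For reference, here's that expression wrapped up as a self-contained function (the name and the default parameter values are placeholders of mine, not tuned settings):

```python
import torch

def leaky_polytail_activation(y, threshold=0.1, scale=0.05):
    smooth_relu = y + torch.sqrt(y**2 + scale**2)                                    # the "softplus"-like part
    leaky_cutoff = 1 + (y + threshold) / torch.sqrt((y + threshold)**2 + scale**2)   # the leaky cutoff part
    return smooth_relu * leaky_cutoff / 4

print(leaky_polytail_activation(torch.linspace(-1.0, 1.0, 5)))
```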
Wouldn't other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?
I don't think people working on alignment research assistants are planning to just turn it on and leave the building, they on average (weighted by money) seem to be imagining doing things like "explain an experiment in natural language and have an AI help implement it rapidly."
So I think both they and ...
Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P
Oh well, I'll just leave it as is as a monument to bad comments.
I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to making toy models).
Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.
I feel like there's a somewhat common argument about RL not being all that dangerous because it generalizes the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution, it'll probably just fall back on familiar, safe behavior.
To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.
Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.
I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"
One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.
But if you want the best evaluation of a research assistant's capabilities, I agree using it as a research assistant is more reliable.
A separate issue I have here is the assumption th...
Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.
Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.
Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)
The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured to in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT2 and then would contain the relevant links?)
This is what all that talk about predictive loss was for. Training on predictive loss gets you systems that are especially well-suited to being described as learning the time-evolution dynamics of the training distribution. Not in the sense that they're simulating the physical reality underlying the training distribution, merely in the sense that they're learning dynamics for the behavior of the training data.
Sure, you could talk about AlphaZero in terms of prediction. But it's not going to have the sort of configurability that makes the simulator framing ...
I can at least give you the short version of why I think you're wrong, if you want to chat lmk I guess.
Plain text: "GPT is a simulator."
Correct interpretation: "Sampling from GPT to generate text is a simulation, where the state of the simulation's 'world' is the text and GPT encodes learned transition dynamics between states of the text."
Mistaken interpretation: "GPT works by doing a simulation of the process that generated the training data. To make predictions, it internally represents the physical state of the Earth, and predicts the next token by appl...
Wild.
The difference in variability doesn't seem like it's enough to explain the generalization, if your PC-axed plots are on the same scale. But maybe that's misleading because the datapoints are still kinda muddled in the has_alice xor has_not plot, and separating them might require going to more dimensions, that have smaller variability.
Agree with simon that if the AI gets rich data about what counts as "measurement tampering," then you're sort of pushing around the loss basin but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, the AI would be incentivized to avoid developing such a distinction.
I agree that this isn't actually that big a problem in m...
My take is that detecting bad behavior at test time is <5% of the problem, and >95% of the problem is making an AI that doesn't have bad behavior at test time.
If you want to detect and avoid bad behavior during training, Goodhart's law becomes a problem, and you actually need to learn your standards for "bad behavior" as you go rather than keeping them static. Which I think obviates a lot of the specificity you're saying advantages MTD.
Suppose the two agents are me and a flatworm.
a = ideal world according to me
b = status quo
c = ideal world according to the flatworm
d, e, f = various deliberately-bad-to-both worlds
I'm not going to stop trying to improve the world just because the flatworm prefers the status quo, and I wouldn't be "happy enough" if we ended up in flatworm utopia.
What bargains I would agree to, and how I would feel about them, are not safe to abstract away.
I disagree that translating to x and y lets you "reduce the degrees of freedom" or otherwise get any sort of discount lunch. At the end you still had to talk about the low level states again to say they should compromise on b (or not compromise and fight it out over c vs. a, that's always an option).
Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.
I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.
I agree with the part of what you just said that's the NAH, but disagree with your inte...
Sure, every time you go more abstract there are fewer degrees of freedom. But there's no free lunch - there are degrees of freedom in how the more-abstract variables are connected to less-abstract ones.
People who want different things might make different abstractions. E.g. if you're calling some high level abstraction "eat good food," it's not that this is mathematically the same abstraction made by someone who thinks good food is pizza and someone else who thinks good food is fish. Not even if those people independently keep going higher in the abstracti...
Good paper - even if it shows that the problem is hard!
Sounds like it might be worth it to me to spend time understanding the "confidence loss" to figure out what's going on. There's an obvious intuitive parallel to a human student going "Ah yes, I know what you're doing" - rounding off the teacher to the nearest concept the student is confident in. But it's not clear how good an intuition pump that is.
I agree with Roger that active learning seems super useful (especially for meta-preferences, my typical hobby-horse). It seems a lot easier for the AI to le...
Thanks!
Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)