AI ALIGNMENT FORUM
AF

All of Sam Marks's Comments + Replies

Auditing language models for hidden objectives

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this,

Sam Marks2mo106

Thanks for writing this reflection, I found it useful.

Just to quickly comment on my own epistemic state here:

I haven't read GD.
But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
1. (You can judge from this Twitter discussion whether I seem to get the core ideas)
When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concer

... (read more)

ryan_greenblatt's Shortform

Sam Marks2mo50

Thanks, this is helpful. So it sounds like you expect

AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we'll soon have finished eating through the hardware overhang
separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).

It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradict... (read more)

2Ryan Greenblatt2mo

I don't think (2) is a crux (as discussed in person). I expect that if takeover-capable AI takes a while (e.g. it happens in 2040), then we will have a long winter where economic value from AI doesn't increase that fast followed a period of faster progress around 2040. If progress is relatively stable accross this entire period, then we'll have enough time to scale up fabs. Even if progress isn't stable, we could see enough total value from AI in the slower growth period to scale up to scale up fabs by 10x, but this would require >>$1 trillion of economic value per year I think (which IMO seems not that likely to come far before takeover-capable AI due to views about economic returns to AI and returns to scaling up compute).

ryan_greenblatt's Shortform

Sam Marks2mo30

I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:

"How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
"How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute

... (read more)

Ryan Greenblatt2mo*260

I put roughly 50% probability on feasibility of software-only singularity.^[1]

(I'm probably going to be reinventing a bunch of the compute-centric takeoff model in slightly different ways below, but I think it's faster to partially reinvent than to dig up the material, and I probably do use a slightly different approach.)

My argument here will be a bit sloppy and might contain some errors. Sorry about this. I might be more careful in the future.

The key question for software-only singularity is: "When the rate of labor production is doubled (as in, as if your... (read more)

6Ryan Greenblatt2mo

I think your outline of an argument against contains an important error. Importantly, while the spending on hardware for individual AI companies has increased by roughly 3-4x each year[1], this has not been driven by scaling up hardware production by 3-4x per year. Instead, total compute production (in terms of spending, building more fabs, etc.) has been increased by a much smaller amount each year, but a higher and higher fraction of that compute production was used for AI. In particular, my understanding is that roughly ~20% of TSMC's volume is now AI while it used to be much lower. So, the fact that scaling up hardware production is much slower than scaling up algorithms hasn't bitten yet and this isn't factored into the historical trends. Another way to put this is that the exact current regime can't go on. If trends continue, then >100% of TSMC's volume will be used for AI by 2027! Only if building takeover-capable AIs happens by scaling up TSMC to >1000% of what their potential FLOP output volume would have otherwise been, then does this count as "massive compute automation" in my operationalization. (And without such a large build-out, the economic impacts and dependency of the hardware supply chain (at the critical points) could be relatively small.) So, massive compute automation requires something substantially off trend from TSMC's perspective. [Low importance] It is only possible to build takeover-capable AI without previously breaking an important trend prior to around 2030 (based on my rough understanding). Either the hardware spending trend must break or TSMC production must go substantially above the trend by then. If takeover-capable AI is built prior to 2030, it could occur without substantial trend breaks but this gets somewhat crazy towards the end of the timeline: hardware spending keeps increasing at ~3x for each actor (but there is some consolidation and acquisition of previously produced hardware yielding a one-time increase up to about

2Ryan Greenblatt2mo

I think this happening in practice is about 60% likely, so I don't think feasibility vs. in practice is a huge delta.

Sam Marks's Shortform

Sam Marks3mo40

The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)

3Buck Shlegeris3mo

This uses transformers, which is IIUC way less efficient for inference than e.g. vllm, to an extent that is probably unacceptable for production usecases.

Sam Marks's Shortform

Sam Marks3mo42

@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").

5Buck Shlegeris3mo

I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.

Sam Marks's Shortform

Sam Marks3mo20

I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen

4Sam Marks3mo

Sam Marks's Shortform

Sam Marks3mo267

x-posting a kinda rambling thread I wrote about this blog post from Tilde research.

---

If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!

I'll discussed some things that surprised me about this case study in

---

The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt ... (read more)

6Buck Shlegeris3mo

Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.

A basic systems architecture for AI agents that do autonomous research

Sam Marks6mo100

While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).

Fabien's Shortform

Sam Marks9mo20

Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).

(Note that LoRA adapters can be merged into model we... (read more)

3Fabien Roger9mo

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

Fabien's Shortform

Sam Marks9mo50

Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!

IIUC, for RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?

(It's been a bit since I read the paper, so sorry if I'm missing something here.)

6Fabien Roger8mo

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

3Fabien Roger9mo

Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

Sycophancy to subterfuge: Investigating reward tampering in large language models

Sam Marks9mo10

there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.

You're correct that I was neglecting this threat model—good point, and thanks.

So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.

Hmm, here's another way to fram... (read more)

4Ryan Greenblatt9mo

Yep, reasonable summary. I don't think this was obvious to everyone and I appreciate this point - I edited my earlier comment to more explicitly note my agreement.

Sycophancy to subterfuge: Investigating reward tampering in large language models

Sam Marks9mo*60

In this comment, I'll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.

I think that in basically all of the discussion above, folks aren't using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there's a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it's likely to appear during a production RL run.

To put this another way: imagine that this paper had trained ... (read more)

5Ryan Greenblatt9mo

I agree that a few factors of 2 don't matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided as opposed to generally updating based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.) (I think I made the point that a few factors of 2 shouldn't matter much for the bottom line above.) (Edit: I agree that it's worth noting that low RTFs can be quite concerning for the reasons you describe.) I'll argue against a specific threshold in the rest of this comment. ---------------------------------------- First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future: 1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion. 2. As an egregious reward hack that would be reinforced and would be harmful via occuring repeatedly (a la what failure looks like part 1). 3. As a reward hack that will generalize to some other failure eventually cascading into (1) or (2). (Implying this enviroment is not importantly different from the other environments in the paper, it is just part of a progression.) Your argument applies to (2) and (3), but not to (1). ---------------------------------------- I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And, the empirical conditions in the paper clearly differ substantially from what we'd typically expect (much high fraction hacking, much dumber models). Examples of conditions that might differ: * Incentives against reward hacking or other blockers. * How much of a gradient toward reward hacking their is. * How good models are at doing reward hacking reasoning relative to the difficulty of doing so in v

Mechanistically Eliciting Latent Behaviors in Language Models

Sam Marks11mo54

I think this is cool! The way I'm currently thinking about this is "doing the adversary generation step of latent adversarial training without the adversarial training step." Does that seem right?

It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond "train against it" (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of "train away the backdoor." If I were doing t... (read more)

1Andrew Mack11mo

Thanks for your comment! Yes, I’d say that roughly sums things up. As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens. Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.” I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Improving Dictionary Learning with Gated Sparse Autoencoders

Layer	MSE Loss	% Variance Explained	L1	L0	% Alive	CE Reconstructed
1	0.069	95	40	15	46	6.45
7	0.81	86	125	59.2	96	4.38

Layer	MSE Loss	% Variance Explained	L1	L0	% Alive	CE Reconstructed
1	0.11	92	44	17.5	54	5.95
7	1.1	82	137	65.4	95	4.29