AI ALIGNMENT FORUM

Sam Marks

Posts

0  Sam Marks's Shortform (3y, 9 comments)

Comments (sorted by newest)

evhub's Shortform
Sam Marks · 21d* · 154

Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):

  • Open-source circuit tracing tools
  • Why Do Some Language Models Fake Alignment While Others Don't?
  • Subliminal Learning
  • Persona Vectors
  • Inverse Scaling in Test-Time Compute

If I forgot any and someone points them out to me, I'll edit this list.

Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks · 1mo* · 134

Thanks, I appreciate you writing up your view in more detail. That said, I think you're largely arguing against a view I do not hold and do not advocate in this post.

I was frustrated with your original comment for opening with "I disagree" in response to a post with many claims (especially given that it wasn't clear to me which claims you were disagreeing with). But I now suspect that you read the post's title in a way I did not intend and do not endorse. I think you read it as an exhortation: "Let's introduce progress metrics!"

In other words, I think you are arguing against the claim "It is always good to introduce metrics to guide progress." I do not believe this. I strongly agree with you that "bad metrics are worse than no metrics." Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems[1], and I've previously criticized other researchers for proposing problems that I think are bad for guiding progress[2].

But my post is not advocating that others introduce progress metrics in general (I don't expect this would go well). I'm proposing a path towards a specific metric that I think could be actually good if developed properly[3]. So insofar as we disagree about something here, I think it must be:

  1. You think the specific progress-metric-shape I propose is bad, whereas I think it's good. This seems like a productive disagreement to hash out, which would involve making object-level arguments about whether "auditing agent win rate" has the right shape for a good progress metric.
  2. You think it's unlikely that anyone in the field can currently articulate good progress metrics, such that my proposed one can be discarded out of hand. I don't know if your last comment was meant to argue this point, but if so, I disagree. Settling this would again require object-level arguments about the state of the field, though this disagreement seems less productive to hash out.
  3. You think that introducing progress metrics is actually always bad. I'm guessing you don't actually think this, though your last comment does seem to argue for it in places. Briefly, I think your bullet points argue that these observations are (and healthily should be) correlates of a field being pre-paradigmatic, but they do not argue that advances which change these observations are bad. E.g. if there is an advance that makes it easier to correctly discern which research bets are paying out, that's a good advance (all else equal).

Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it's fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.

  1. ^

    FWIW, I think that your arguments continue not to provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you seem to in fact react very differently to them.

  2. ^

    In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they're pulling the field in the wrong direction.

  3. ^

    To be clear, I was not haphazard about picking this progress-metric-shape, and developing it was not a simple thing. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make this progress metric begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein: "You should pick a metric of how good our physics theories are and optimize for that instead." But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: someone at the forefront of the field who's thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.

  4. ^

    Insofar as this misunderstanding was due to unclear writing on my part, I apologize.

Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks · 1mo · 32

I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that e.g. a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost + the fact that you can't rerun the game on unspoiled auditors after you've run it once), then I couldn't reasonably describe auditing game performance as a progress metric.

I think I also probably take the continuous score more seriously than you do (rather than viewing 0% vs. non-0% win rate as being the important thing), though maybe I would change my mind if I thought about it more. (I'm not sure whether this is an important disagreement.)
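
To make the "repeatable and self-serve" point concrete, here is a minimal sketch of the kind of evaluation loop I have in mind. Everything here (the function names, the grading, the placeholder success logic) is a hypothetical stand-in for illustration, not our actual tooling:

```python
import random
from dataclasses import dataclass


@dataclass
class AuditingGameResult:
    identified_hidden_objective: bool  # did the agent surface the organism's hidden objective?
    transcript: str


def run_auditing_agent(model_organism_id: str, seed: int) -> AuditingGameResult:
    """Hypothetical entry point: run one auditing agent against one model organism
    and grade whether it identified the hidden objective. A real implementation
    would invoke an agent scaffold plus a grader; here we return a placeholder."""
    rng = random.Random(f"{model_organism_id}-{seed}")
    return AuditingGameResult(
        identified_hidden_objective=rng.random() < 0.3,  # placeholder outcome
        transcript="...",
    )


def auditing_agent_win_rate(model_organism_ids: list[str], runs_per_organism: int = 10) -> float:
    """Win rate = fraction of (organism, run) pairs where the agent succeeds.
    Because an agent, unlike a human auditor, is never "spoiled" by having seen a
    game before, the same games can be replayed cheaply to track progress."""
    results = [
        run_auditing_agent(org, seed=i)
        for org in model_organism_ids
        for i in range(runs_per_organism)
    ]
    return sum(r.identified_hidden_objective for r in results) / len(results)


if __name__ == "__main__":
    print(auditing_agent_win_rate(["organism_a", "organism_b"], runs_per_organism=5))
```

The point is just that, with agents, this loop is cheap enough that a single researcher could rerun it mid-project, which is what makes it usable as a progress metric.
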

Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks · 1mo* · 215

It seems like your claim is that work which produces scientific understanding is antithetical to being quantitatively measured. I disagree. 

As I've argued elsewhere, if you claim to have an insight, a great way to validate that the insight is real is to show you can solve a problem that no one else can solve. E.g. atomic physicists in 1945 pretty convincingly demonstrated that their insights weren't bunk by causing a bigger explosion per gram than earlier physicists would have thought possible.

Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.

That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)

It seems like your concern, though, isn't about specifying concrete problems, but rather a suspicion that quantifiable problems are systematically bad for directing attention at. You seem to believe this mostly by appeal to the example of other areas of ML that have dropped the ball on producing scientific understanding. I don't really agree with this: I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc. I agree they do not understand e.g. the psychologies of the AI systems they're training. But if physics doesn't produce sociological insights (thereby necessitating the scientific field of sociology), that doesn't imply that physicists have failed to produce deep scientific understanding of physics.

In summary, I think:

  • Your Kuhnian argument is really a critique of proposing concrete problems to steer research.
  • Assuming that proposing concrete problems is a good thing to do anyway, you've not convincingly argued that it's bad if those problems are numbers-go-up problems.
Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks · 1mo · 82

constructing good auditing test beds has been a huge pain

Yeah, I confirm this is the case. I estimate that the model organism from Auditing Language Models for Hidden Objectives took 8-12 FTE-months, which I think most people find surprisingly expensive. That said, 30% of this time probably went into implementing synthetic document finetuning and testing that it worked. I'd guess I could now make another model organism of similar quality in 1.5-3 FTE-months.

White Box Control at UK AISI - Update on Sandbagging Investigations
Sam Marks · 2mo · 61

Since the probe is trained off-policy on a simple deception-like dataset (and we're able to play with it a bunch and characterise what we think it's detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property.

I'd be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant:

  1. IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest (see the toy sketch at the end of this comment).
  2. Even if (1) were not true, I'd guess it's more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert dishonesty.
    1. (My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.)
  3. While I'm not super confident in (2), I'd be surprised if you had the evidence to confidently reject (2).
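
For concreteness, here is a toy sketch of the off-policy setup I'm describing in (1): a linear probe fit on (stand-in) residual-stream activations from paired rollouts where the system prompt either does or does not instruct the model to be dishonest. The random arrays, the pooling, and the probe class are placeholders for illustration, not your actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for residual-stream activations at some layer, mean-pooled over
# response tokens. In a real setup these would be extracted from the model
# being probed; random placeholders here are only meant to show the data flow.
d_model, n_pairs = 512, 200
rng = np.random.default_rng(0)

# Positive class: rollouts where the system prompt instructs the model to be
# dishonest. Negative class: matched rollouts with an honesty instruction.
acts_instructed_dishonest = rng.normal(size=(n_pairs, d_model))
acts_instructed_honest = rng.normal(size=(n_pairs, d_model))

X = np.concatenate([acts_instructed_dishonest, acts_instructed_honest])
y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

# A simple linear probe. The worry: because the labels track *whether dishonesty
# was instructed*, the probe may learn "dishonesty is salient in context"
# features rather than features of the model itself engaging in covert
# dishonesty, so its scores on OOD sandbagging transcripts need careful
# interpretation.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

A probe like this can separate its training classes cleanly while still telling you little about which of the two hypotheses above explains its behavior on sandbagging transcripts.
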
White Box Control at UK AISI - Update on Sandbagging Investigations
Sam Marks · 2mo · 83

You note:

We note that this does disadvantage the black box monitor, relative to the white box monitor, as some information about the sandbagging prompt would still be moved to the residual stream of later tokens by attention heads and therefore may still be available to the probe.

I agree with this concern and view it as a strong reason not to put much effort into detecting sandbagging in the prompted setting. That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.

White Box Control at UK AISI - Update on Sandbagging Investigations
Sam Marks · 2mo · 81

Thanks for this update! FWIW, my take is that detecting sandbagging is both (1) important and (2) one of the best downstream applications for interpretability researchers to target. So I'm glad you're doing this work!

ryan_greenblatt's Shortform
Sam Marks · 2mo · 50

Yeah, maybe I should have defined "recursive oversight" as "techniques that attempt to bootstrap from weak oversight to stronger oversight." This would include IDA and task decomposition approaches (e.g. RRM). It wouldn't seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don't think that existing debate schemes are likely to work without substantial new ideas.)

ryan_greenblatt's Shortform
Sam Marks · 2mo · 31

On terminology, I prefer to use "recursive oversight" to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I'm skeptical of recursive oversight schemes scaling to arbitrarily powerful models.

However, I think it's plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner's own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it's reasonable for others to do so as well.

24  Discovering Backdoor Triggers (14d, 0 comments)
57  Towards Alignment Auditing as a Numbers-Go-Up Science (1mo, 8 comments)
29  Building and evaluating alignment auditing agents (1mo, 0 comments)
37  Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning (1mo, 0 comments)
16  Principles for Picking Practical Interpretability Projects (2mo, 0 comments)
45  Modifying LLM Beliefs with Synthetic Document Finetuning (4mo, 10 comments)
58  Downstream applications as validation of interpretability progress (5mo, 2 comments)
82  Auditing language models for hidden objectives (6mo, 3 comments)
34  Recommendations for Technical AI Safety Research Directions (8mo, 0 comments)
192 Alignment Faking in Large Language Models (8mo, 24 comments)