AI ALIGNMENT FORUM

All of LawrenceC's Comments + Replies

What I would do if I wasn’t at ARC Evals

Lawrence Chan3mo160Review for 2023 Review

I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.

Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech ... (read more)

Evaluations (of new AI Safety researchers) can be noisy

Lawrence Chan3mo52Review for 2023 Review

I think this post made an important point that's still relevant to this day.

If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction of capacity in evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.

Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".

Touch reality as soon as possible (when doing machine learning research)

Lawrence Chan3mo50Review for 2023 Review

I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.

Background

This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kid... (read more)

Base LLMs refuse too

Lawrence Chan6mo52

I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.

Base LLMs refuse too

Lawrence Chan6mo30

I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5.

From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-t

... (read more)

3Arthur Conmy6mo

> Qwen2 was explicitly trained on synthetic data from Qwen1.5 ~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~ EDITED TO ADD: "these [Qwen] models are utilized to synthesize high-quality pre-training data" is clear evidence, I am being silly. All other techniques mentioned here (e.g. filtering and adding more IT data at end of training) still sound like models "trained to predict the next word on the internet" (I don't think the training samples being IID early and late in training is an important detail)

Base LLMs refuse too

Lawrence Chan6mo71

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.

It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.

Base LLMs refuse too

Lawrence Chan6mo20

It's worth noting that there are reasons to expect the "base models" of both Gemma2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.

We don't know what 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and modelbased met

... (read more)

[This comment is no longer endorsed by its author]

7Lawrence Chan6mo

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above. It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.

Defining alignment research

Lawrence Chan7mo40

Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/

Defining alignment research

Lawrence Chan7mo30

(The Anthropic paper I cited predates ChatGPT by 7 months)

2Oliver Habryka7mo

Huh, interesting. Maybe the OpenAI statements about their models being "more aligned" came earlier than that in the context of Instruct-GPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.

Defining alignment research

Lawrence Chan7mo40

Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862

But yes, people complained about it a lot at the time

4Oliver Habryka7mo

Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it's messy).

When fine-tuning fails to elicit GPT-3.5's chess abilities

Lawrence Chan9mo30

Very cool work; I'm glad it was done.

That being said, I agree with Fabien that the title is a bit overstated, insofar as it's about your results in particular:

Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.

It's a general fact of ML that small changes in finetuning setup can greatly affect performance if you're not careful. In particular, it seems likely to me that the empirical details that Fabien ... (read more)

Gradient surfing: the hidden role of regularization

Lawrence Chan10mo20

Have you tried instead 'skinny' NNs with a bias towards depth,

I haven't -- the problem with skinny NNs is stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.

That being said, you're probably correct that having more layers does seem related to slingshots.

(Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here?

4gwern10mo

Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.

One-layer transformers aren’t equivalent to a set of skip-trigrams

Lawrence Chan10mo40

Worth noting that both some of Anthropic's results and Lauren Greenspan's results here (assuming I understand her results correctly) give a clear demonstration of learned (even very toy) transformers not being well-modeled as sets of skip trigrams.

An OV-Coherent Toy Model of Attention Head Superposition

Lawrence Chan10mo*91

I'm having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions.

Here's a summary of your post as I understand it:

In Anthropic's Toy Model of Attention Head "Superposition",^[1] they consider a task where the model needs to use interference between heads to implement multiple skip trigrams. In particular, they consider, and call this "OV-incoherent", because the OV seems to need to use information "not present" in V of the source token. (This was incorrect, because you can implement their task perfectl

... (read more)

Thomas Kwa's Shortform

Lawrence Chan11mo1512

The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the sum of the variances of all the increments involved in the process.^[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going... (read more)
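As a sanity check on the arithmetic in this comment, here is a minimal sketch (not part of the original comment) that computes the expected number of steps exactly by solving the hitting-time equations, and compares it with the variance shortcut. The function names and the discretization of credences onto a grid are assumptions made purely for illustration.

```python
from fractions import Fraction


def expected_steps_to_resolve(start_pct=50, step_pct=1):
    """Expected steps until a symmetric +/- step_pct walk on credence hits 0% or 100%.

    Assumes start_pct and 100 are both multiples of step_pct.
    """
    n = 100 // step_pct        # grid positions 0..n; 0 and n are absorbing
    k0 = start_pct // step_pct
    if k0 in (0, n):
        return Fraction(0)
    half = Fraction(1, 2)
    m = n - 1                  # unknowns E[1], ..., E[n-1]
    # Solve -E[k-1]/2 + E[k] - E[k+1]/2 = 1 with E[0] = E[n] = 0
    # using the Thomas algorithm for tridiagonal systems.
    c_prime = [Fraction(0)] * m
    d_prime = [Fraction(0)] * m
    c_prime[0] = -half
    d_prime[0] = Fraction(1)
    for i in range(1, m):
        denom = 1 + half * c_prime[i - 1]
        c_prime[i] = -half / denom
        d_prime[i] = (1 + half * d_prime[i - 1]) / denom
    expected = [Fraction(0)] * m
    expected[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        expected[i] = d_prime[i] - c_prime[i] * expected[i + 1]
    return expected[k0 - 1]


def variance_shortcut(start_pct=50, step_pct=1):
    """Variance of the final 0/1 outcome divided by the per-step variance."""
    p = Fraction(start_pct, 100)
    return p * (1 - p) / Fraction(step_pct, 100) ** 2
```

With the defaults (start at 50%, 1% steps), both functions return 2500, matching the figure in the comment.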

Thomas Kwa11mo43

I talked about this with Lawrence, and we both agree on the following:

There are mathematical models under which you should update >=1% in most weeks, and models under which you don't.
Brownian motion gives you 1% updates in most weeks. In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model where you have no evidence until either AI takeover happen

... (read more)

William_S's Shortform

Lawrence Chan11mo73

When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.

Refusal in LLMs is mediated by a single direction

Lawrence Chan11mo52

I don't know what the "real story" is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying "something something weight norm ergo generalizing". Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.

One issue is that our picture doesn't consider learning dynamics that seem actually important here. For example, it seems that one of... (read more)

Refusal in LLMs is mediated by a single direction

Lawrence Chan11mo88

To be clear: I don't think the results here are qualitatively more grounded than e.g. other work in the activation steering/linear probing/representation engineering space. My comment was defense of studying harmlessness in general and less so of this work in particular.

If the objection isn't about this work vs other rep eng work, I may be confused about what you're asking about. It feels pretty obvious that this general genre of work (studying non-cherry picked phenomena using basic linear methods) is as a whole more grounded than a lot of mech inte... (read more)

2Buck Shlegeris11mo

Yeah definitely I agree with the implication, I was confused because I don't think that these techniques do improve on state of the art.

Refusal in LLMs is mediated by a single direction

Lawrence Chan11mo1310

I agree pretty strongly with Neel's first point here, and I want to expand on it a bit: one of the biggest issues with interp is fooling yourself and thinking you've discovered something profound when in reality you've misinterpreted the evidence. Sure, you've "understood grokking"^[1] or "found induction heads", but why should anyone think that you've done something "real", let alone something that will help with future dangerous AI systems? Getting rigorous results in deep learning in general is hard, and it seems empirically even harder in (mech) in... (read more)

5Buck Shlegeris11mo

Lawrence, how are these results any more grounded than any other interp work?

2Neel Nanda11mo

Thanks! Broadly agreed I'd be curious to hear more about what you meant by this

Superposition is not "just" neuron polysemanticity

Lawrence Chan1y20

Thanks!

I was grouping that with “the computation may require mixing together ‘natural’ concepts” in my head. After all, entropy isn’t an observable in the environment, it’s something you derive to better model the environment. But I agree that “the concept may not be one you understand” seems more central.

Explaining grokking through circuit efficiency

Lawrence Chan1y20

My speculation for Omni-Grok in particular is that in settings like MNIST you already have two of the ingredients for grokking (that there are both memorising and generalising solutions, and that the generalising solution is more efficient), and then having large parameter norms at initialisation provides the third ingredient (generalising solutions are learned more slowly), for some reason I still don't know.

Higher weight norm means lower effective learning rate with Adam, no? In that paper they used a constant learning rate across weight norms, but Adam ... (read more)
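To make the Adam point concrete, here is a minimal sketch assuming the standard Adam update with bias correction. On the very first step the update has magnitude of roughly `lr` regardless of the gradient's scale, so the relative change to a weight shrinks as the weight gets larger. The function name `adam_first_update` is hypothetical, and the sketch only covers a single scalar weight on the first step, not the training setup in the paper.

```python
import math


def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient (moments start at 0)."""
    m = (1 - beta1) * grad           # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2      # second-moment estimate after one step
    m_hat = m / (1 - beta1)          # bias correction recovers grad
    v_hat = v / (1 - beta2)          # bias correction recovers grad ** 2
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Same learning rate, weights of size 1 and 10 (gradient scaled along with them).
relative_change_small = abs(adam_first_update(0.3)) / 1.0
relative_change_large = abs(adam_first_update(3.0)) / 10.0
```

Under these assumptions the absolute update is about `lr` in both cases, so the weight that is ten times larger moves about ten times less in relative terms.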

2Rohin Shah1y

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm? Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

LawrenceC's Shortform

Lawrence Chan1y20

I mean, yeah, as your footnote says:

Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized.^[1]

Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?

^[1]
[C]ompactly represent $f \circ g$ (f composed with g) in a way that makes computing $f (g (x))$ more efficient for general choices of $f$ and $g$ .
As an aside, I actually can't think of any class of interesting funct

... (read more)

LawrenceC's Shortform

Lawrence Chan1y*3018

I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck.

TL;DR: authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers while being much more efficient to sample from.

A brief historical digression

As of ~2017, the three primary ways people had for doing sequence modeling were RNNs, Conv Nets, and Transformers, each with a unique “trick” for handling sequence data: recurrence, 1d convolutions, and self-attention.

RNNs are

... (read more)

4Vladimir Nesov1y

StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be less important than this general point. (Possibly a window of precise attention gives the RNN layers many attempts at both storing and retrieving any given observation, so interspersing layers with even a relatively tiny window is sufficient to significantly improve on the more sloppy RNN-like model without any attention, whereas a pure RNN-like model would have to capture what it needs from a token in the exact step it appears, and then the opportunity is mostly lost. StripedHyena's attention wasn't sliding window, so context didn't scale any better, but with sliding window attention there are no context scaling implications from the attention layers.)

Ryan Greenblatt1y*123

Another key note about mamba is that despite being RNN-like it doesn't result in substantially higher effective serial reasoning depth (relative to transformers). This is because the state transition is linear^[1]. However, it is architecturally closer to things that might involve effectively higher depth.
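One way to see why linearity matters here: a linear (affine) state update composes associatively, so all the hidden states can be produced by a prefix scan whose rounds are each parallelizable, instead of a strictly serial loop. The sketch below is a toy scalar recurrence h_t = a_t * h_{t-1} + b_t, not Mamba's actual kernel; the function names and the doubling-style scan are assumptions chosen for illustration.

```python
def combine(earlier, later):
    """Compose two affine maps h -> a*h + b, applying `earlier` first."""
    a1, b1 = earlier
    a2, b2 = later
    return (a2 * a1, a2 * b1 + b2)


def sequential_states(steps, h0):
    """Reference implementation: one serial step per token."""
    h, out = h0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def scan_states(steps, h0):
    """Inclusive prefix scan over affine maps using O(log n) doubling rounds.

    Within a round every position only reads the previous round's values,
    so the work inside each round could run in parallel.
    """
    prefix = list(steps)
    n, shift = len(prefix), 1
    while shift < n:
        prefix = [
            prefix[i] if i < shift else combine(prefix[i - shift], prefix[i])
            for i in range(n)
        ]
        shift *= 2
    return [a * h0 + b for a, b in prefix]
```

The scan relies on `combine` being associative, which holds for affine maps. A nonlinear update such as h_t = tanh(a_t * h_{t-1} + b_t) has no comparably compact composition, which is the tradeoff the comment points at.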

See also here.

And indeed, there is a fundamental tradeoff where if the state transition function is expressive (e.g. nonlinear), then it would no longer be possible to use a parallel scan because the intermediates for the scan would be too large to rep

... (read more)

5Ryan Greenblatt1y

The induction heads extrapolation graph is a bit cherry picked/misleading IMO because it's still the case that mamba can't copy an arbitrary amount of text. (It can copy some fixed finite number of tokens if it is predictable that those tokens will be copied over other tokens.) E.g., if you have N tokens repeated twice, mamba will fail to get perfect loss on the second repetition for sufficiently large N. To see this, note that the total hidden state is bounded so eventually there will be enough text to fill this state. This isn't true for transformers. It's unclear how much of an obstacle this is in practice, but it hints at ways in which transformers might be relatively fundamentally more capacity efficient. Note that this issue also applies to some sparse attention mechanisms like sliding window attention, but any attention mechanism that stores state for the entire sequence should be able to avoid this issue. (This is a well known result, though I forget the current state of empirical results.)

Natural Latents: The Math

Lawrence Chan1y40

Right, the step I missed on was that P(X|Y) = P(X|Z) for all y, z implies P(X|Z) = P(X). Thanks!

Natural Latents: The Math

Lawrence Chan1y20

Hm, it sounds like you're claiming that if each pair of x, y, z are pairwise independent conditioned on the third variable, and p(x, y, z) =/= 0 for all x, y, z with nonzero p(x), p(y), p(z), then x, y, and z are mutually independent, i.e. p(x, y, z) = p(x) p(y) p(z)?

I tried for a bit to show this but couldn't prove it, let alone the general case without strong invariance. My guess is I'm probably missing something really obvious.

2johnswentworth1y

Yeah, that's right. The secret handshake is to start with "X is independent of Y given Z" and "X is independent of Z given Y", expressed in this particular form:

P[X,Y,Z] = P[X|Z] P[Y,Z] = P[X|Y] P[Y,Z]

... then we immediately see that P[X|Z] = P[X|Y] for all X,Y,Z such that P[Y,Z] > 0. So if there are no zero probabilities, then P[X|Z] = P[X|Y] for all X,Y,Z. That, in turn, implies that P[X|Z] takes on the same value for all Z, which in turn means that it's equal to P[X]. Thus X and Z are independent. Likewise for X and Y. Finally, we leverage independence of Y and Z given X:

P[X,Y,Z] = P[Y|X] P[Z|X] P[X] = P[Y] P[Z] P[X]

(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
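The role of zero probabilities in this argument can be checked on small examples. The sketch below (illustrative only; the helper names and the three example distributions are assumptions, not taken from the post) tests the three pairwise conditional independences on binary variables using the zero-safe form P[a,b,c] P[c] = P[a,c] P[b,c].

```python
from fractions import Fraction
from itertools import product


def marginal(joint, keep):
    """Marginalize a joint {(x, y, z): prob} onto the axes listed in `keep`."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, Fraction(0)) + prob
    return out


def cond_indep(joint, a, b, c):
    """Check axis a independent of axis b given axis c (binary variables)."""
    p_abc = marginal(joint, (a, b, c))
    p_ac, p_bc = marginal(joint, (a, c)), marginal(joint, (b, c))
    p_c = marginal(joint, (c,))
    for va, vb, vc in product((0, 1), repeat=3):
        lhs = p_abc.get((va, vb, vc), 0) * p_c.get((vc,), 0)
        rhs = p_ac.get((va, vc), 0) * p_bc.get((vb, vc), 0)
        if lhs != rhs:
            return False
    return True


def all_pairwise_cond_indep(joint):
    """Each pair is independent given the third variable."""
    return (cond_indep(joint, 0, 1, 2)
            and cond_indep(joint, 0, 2, 1)
            and cond_indep(joint, 1, 2, 0))


def mutually_independent(joint):
    """Joint factorizes as P[X] P[Y] P[Z]."""
    px, py, pz = (marginal(joint, (i,)) for i in range(3))
    return all(
        joint.get((x, y, z), 0) == px.get((x,), 0) * py.get((y,), 0) * pz.get((z,), 0)
        for x, y, z in product((0, 1), repeat=3)
    )


half = Fraction(1, 2)
# X = Y = Z is a fair coin: lots of zero-probability outcomes.
copies = {(0, 0, 0): half, (1, 1, 1): half}
# Three independent fair coins: strictly positive everywhere.
uniform = {xyz: Fraction(1, 8) for xyz in product((0, 1), repeat=3)}
# Mostly-copies with a little noise: strictly positive, still correlated.
noisy_copies = {
    xyz: Fraction(9, 10) * copies.get(xyz, 0) + Fraction(1, 10) * uniform[xyz]
    for xyz in uniform
}
```

Here `copies` satisfies all three conditional independences without being mutually independent, which is only possible because of its zeros. Once the zeros are smoothed away in `noisy_copies`, the conditional independences fail, consistent with the argument that a strictly positive distribution can satisfy them only if the variables are fully independent (as `uniform` is).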

Natural Latents: The Math

Lawrence Chan1y40

Probabilities of zero are extremely load-bearing for natural latents in the exact case, and probabilities near zero are load-bearing in the approximate case; if the distribution is zero nowhere, then it can only have a natural latent if the $X_{i}$ ’s are all independent (in which case the trivial variable is a natural latent).

I'm a bit confused why this is the case. It seems like in the theorems, the only thing "near zero" is that D_KL (joint, factorized) < epsilon ~= 0. But you can satisfy this quite easily even with all probabilities > 0.

E.... (read more)

2johnswentworth1y

Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities. This is easiest to see if we use a "strong invariance" condition, in which each of the $X_{i}$ must mediate between $X_{\bar{i}}$ and $\Lambda$. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_{i}$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.

Many arguments for AI x-risk are wrong

Lawrence Chan1y*7381

I wasn't around in the community in 2010-2015, so I don't know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists "completely miss[ed] this [..] interpretation":

To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update

... (read more)

4Alex Turner9mo

Knowing how to reason about "upweighting trajectories" when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude "people basically knew this perspective" (but it's certainly evidence). See Outside the Laboratory: Knowing "vanilla PG upweights trajectories", and being able to explain the math --- this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet he goes on about how RL (almost certainly, IIRC) trains reward maximizers. I personally disagree --- although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) "reward chisels circuits into the network" perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.

Anthropic release Claude 3, claims >GPT-4 Performance

Lawrence Chan1y1311

After having spent a few hours playing with Opus, I think "slightly better than best public gpt-4" seems qualitatively correct -- both models tend to get tripped up on the same kinds of tasks, but Opus can inconsistently solve some tasks in my workflow that gpt-4 cannot.

And yeah, it seems likely that I will also swap to Claude over ChatGPT.

Anthropic release Claude 3, claims >GPT-4 Performance

Lawrence Chan1y34

Thanks for doing this!

Toward A Mathematical Framework for Computation in Superposition

Lawrence Chan1y20

Fascinating, thanks for the update!

Toward A Mathematical Framework for Computation in Superposition

Lawrence Chan1y*71

(I haven't had the chance to read part 3 in detail, and I also haven't checked the proofs except insofar as they seem reasonable on first viewing. Will probably have a lot more thoughts after I've had more time to digest.)

This is very cool work! I like the choice of U-AND task, which seems way more amenable to theoretical study (and is also a much more interesting task) than the absolute value task studied in Anthropic's Toy Model of Superposition (hereafter TMS). It's also nice to study this toy task with asymptotic theoretical analysis as opposed to the ... (read more)

3Jake Mendel1y

Thanks for the kind feedback! Do you mean the thing we call genericity in the further work section? If so, we have some preliminary theoretical and experimental evidence that genericity of U-AND is true. We trained networks on the U-AND task and the analogous U-XOR task, with a narrow 1-layer MLP and looked at the size of the interference terms after training with a suitable loss function. Then, we reinitialised and froze the first layer of weights and biases, allowing the network only to learn the linear readoff directions, and found that the error terms were comparably small in both cases. This figure is the size of the errors for $d = d_0 = 100$ (which is pretty small) for readoffs which should be zero in blue and one in yellow (we want all these errors to be close to zero). This suggests that the AND/XOR directions were $\epsilon$-linearly readoffable at initialisation, but the evidence at this stage is weak because we don't have a good sense yet of what a reasonable value of $\epsilon$ is for considering the task to have been learned correctly: to answer this we want to fiddle around with loss functions and training for longer. For context, an affine readoff (linear + bias) directly on the inputs can read off $f_1 \wedge f_2$ with $(\vec{f_1} + \vec{f_2})/2 - 1/4$, which has an error of $\epsilon = 1/4$. This is larger than all but the largest errors here, and you can’t do anything like this for XOR with affine readoff. After we did this, Kaarel came up with an argument that networks randomly initialised with weights from a standard Gaussian and zero bias solve U-AND with inputs not in superposition (although it probably can be generalised to the superposition case) for suitable readoffs. To sketch the idea: Let $W_i$ be the vector of weights from the $i$th input to the neurons. Then consider the linear readoff vector with $k$th component given by: $(\overrightarrow{\mathrm{AND}}_{i,j})_k = \alpha \, (\mathbb{1}_{W_{ik} > 0} \wedge \mathbb{1}_{W_{jk} > 0}) + \beta \, \mathbb{1}_{W_{ik} > 0} + \gamma \, \mathbb{1}_{W_{jk} > 0} + \delta$ where $\mathbb{1}$ is the indicator function. There are 4 free parameters here, which are set by 4 constraints given by requiring that the
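The affine-readoff baseline mentioned in this reply can be checked by brute force in a simplified setting. The sketch below treats the two features as plain scalars in {0, 1} (an assumption; the reply works with feature directions in a network) and searches a coarse grid of affine readoffs a*f1 + b*f2 + c for the smallest worst-case error. The names and the grid are illustrative only.

```python
from fractions import Fraction
from itertools import product

INPUTS = list(product((0, 1), repeat=2))          # all (f1, f2) pairs
GRID = [Fraction(k, 4) for k in range(-8, 9)]     # coefficients in [-2, 2], step 1/4


def AND(f1, f2):
    return f1 & f2


def XOR(f1, f2):
    return f1 ^ f2


def worst_case_error(a, b, c, target):
    """Largest absolute error of the affine readoff a*f1 + b*f2 + c over all inputs."""
    return max(abs(a * f1 + b * f2 + c - target(f1, f2)) for f1, f2 in INPUTS)


def best_affine_error(target, grid=GRID):
    """Smallest worst-case error achievable by any affine readoff on the grid."""
    return min(
        worst_case_error(a, b, c, target) for a, b, c in product(grid, repeat=3)
    )
```

The readoff (f1 + f2)/2 - 1/4 attains worst-case error 1/4 on AND, and no grid point does better. For XOR the best the grid finds is 1/2, which is what the constant readoff 1/2 already gives. This follows from the identity g(0,0) + g(1,1) = g(0,1) + g(1,0) for any affine g, while the XOR targets sum to 0 on one side and 2 on the other.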

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

Lawrence Chan1y*190Review for 2022 Review

As with the CCS post, I'm reviewing both the paper and the post, though the majority of the review is on the paper. Writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points being made --

There's a lot of reasons I like the work. It's an example of:

Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).^[1] There was some effort at Redwood to do neuron-by-neuron replac

Lawrence Chan1y30

I agree that people dramatically overrated the empirical results of this work, but not more so than other pieces that "went viral" in this community. I'd be excited to see your takes on this general phenomenon as well as how we might address it in the future.

Ryan Greenblatt1y72

I agree not more than other pieces that "went viral", but I think that the lasting impact of the misconceptions seem much larger in the case of CCS. This is probably due to the conceptual ideas actually holding up in the case of CCS.

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Lawrence Chan1y*3913Review for 2022 Review

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL... (read more)

Ryan Greenblatt1y1717

Beyond the paper and post, I think it seems important to note the community reaction to this work. I think many people dramatically overrated the empirical results in this work due to a combination of misunderstanding what was actually done, misunderstanding why the method worked (which follow up work helped to clarify as you noted), and incorrectly predicting the method would work in many cases where it doesn't.

The actual conceptual ideas discussed in the blog post seem quite good and somewhat original (this is certainly the best presentation of these sor... (read more)

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Lawrence Chan1y102

I agree with the overall point (that this was a solid intellectual contribution and is a reasonable-ish metric), but there's been a non-zero amount of followups or at least use cases of this work, imo. Off the top of my head:

In general, CaSc has been used on lots of toy/tiny models to a decent level of success. I agree that part of the reason for CaSc's lack of adoption is that the metric consistently returns "this explanation is not very faithful/complete/etc". For example:
- I checked the hypotheses for the toy modular arithmetic/group composition work with

... (read more)

4Buck Shlegeris1y

Thanks for the links! I agree that the usecases are non-zero.

200 COP in MI: Exploring Polysemanticity and Superposition

Lawrence Chan1y20

I strongly upvoted this post because of the "Tips" section, which is something I've come around on only in the last ~2.5 months.

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Lawrence Chan1y*84

This is great work, even though you weren't able to understand the memorization mechanistically.

I agree that a big part of the reason to be pessimistic about ambitious mechanistic interp is that even very large neural networks are performing some amount of pure memorization. For example, existing LMs often can regurgitate some canary strings, which really seems like a case without any (to use your phrase) macrofeatures. Consequently, as you talk about in both posts 3 and 4, it's not clear that there even should be a good mechanistic understanding fo... (read more)

Shallow review of live agendas in alignment & safety

Lawrence Chan1y71

I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.

Basically, he considers a setting with four actions: a, b, c, d, and a real numbered state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0 and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.

The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0.
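The transition rule above can be written as a small function. This is a direct transcription of the three cases as summarized in this comment, with action labels as strings; the function name `next_state` is a hypothetical choice and nothing here is taken from the paper's own code.

```python
def next_state(s, action, L):
    """Apply the transition rule: s is the real-valued state, L >= 0 a constant."""
    if L < 0:
        raise ValueError("L must be non-negative")
    if s > 0 and action == "b":
        return s - 1 + L    # action b adds L on top of the usual decrement
    if s > 0 and action == "c":
        return s - 1 - L    # action c subtracts L on top of the usual decrement
    return s - 1            # actions a and d, or any action once s <= 0
```

Note that once s <= 0, every action just decrements the state by 1, so b and c only differ from the other actions while s is still positive.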

The ... (read more)

2Koen Holtman1y

Thanks for reading my paper! For the record I agree with some but not all points in your summary. My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the > and < actions and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way the > and < are supposed to be symbols representing arrows → and ← for 'push # to later in time' and 'pull # earlier in time'. No, the design of the gc agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead it is motivated by the desire to create an incentive to preserve agent's actuators that it will need to perform any physical actions incentivised by the shutdown reward function RS -- I introduce this as a new desideratum 6. A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way: For the record, adding gc to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as > actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.

Shallow review of live agendas in alignment & safety

Lawrence Chan1y40

Both Alex Lyzhov and Jacob Pfau also post on LW/AF:

Shallow review of live agendas in alignment & safety

Lawrence Chan1y30

Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category, though someone at RR would know better than I.

2Ryan Greenblatt1y

Yep, indeed I would consider "control evaluations" to be a method of "AI control". I consider the evaluation and the technique development to be part of a unified methodology (we'll describe this more in a forthcoming post). (I work at RR)

Shallow review of live agendas in alignment & safety

Lawrence Chan1y52

Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!

4Lawrence Chan1y

Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG. Both Alex Lyzhov and Jacob Pfau also post on LW/AF:
* https://www.lesswrong.com/users/alexlyzhov
* https://www.lesswrong.com/users/jacob-pfau

Interpretability Externalities Case Study - Hungry Hungry Hippos

Lawrence Chan1y30

Oh, I like that one! Going to use it from now on

Interpretability Externalities Case Study - Hungry Hungry Hippos

Lawrence Chan1y10

Sure, though it seems too general or common to use a long word for it?

Maybe "linear intervention"?

2Neel Nanda1y

We called it vector arithmetic in my Othello paper?

Impact stories for model internals: an exercise for interpretability researchers

Lawrence Chan1y10

Thanks!

Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).

Impact stories for model internals: an exercise for interpretability researchers

Lawrence Chan2y10

Thanks for writing this up!

I'm curious about this:

I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.

What motivated people in particular? What was surprising?

1Jenny Nitishinskaya1y

I had cached impressions that AI safety people were interested in auditing, ELK, and scalable oversight. A few AIS people who volunteered to give feedback before the workshop (so biased towards people who were interested in the title) each named a unique top choice: scientific understanding (specifically threat models), model editing, and auditing (so 2/3 were unexpected for me). During the workshop, attendees (again, biased, as they self-selected into the session) expressed excitement most about auditing, unlearning, MAD, ELK, and general scientific understanding. I was surprised at the interest in MAD and ELK; I thought there would be more skepticism around those, though I can see how they might be aesthetically appealing for the slightly more academic audience.

Interpretability Externalities Case Study - Hungry Hungry Hippos

Lawrence Chan2y20

Minor clarifying point: Act-adds cannot be cast as ablations.

Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now.

Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper?

2Neel Nanda1y

I'd probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn't count as patching.
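
To make the terminology in this thread concrete, here is a toy sketch contrasting the three kinds of edit on a single activation vector. The function names and the plain-list representation are illustrative assumptions, and "ablation" is read here as projecting out a direction, which is only one of several things the word is used for.

```python
def act_add(activation, direction, scale):
    """Act-add style edit: move the activation along `direction` by `scale`.

    `scale` can be negative (subtract) or large (e.g. 10x), which is why
    this is not the same thing as patching.
    """
    return [a + scale * d for a, d in zip(activation, direction)]


def project_out(activation, direction):
    """Ablation-style edit (one reading): remove the component along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coeff = sum(a * d for a, d in zip(activation, direction)) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]


def patch(activation, other_activation):
    """Patching: overwrite with the activation recorded on another input."""
    if len(activation) != len(other_activation):
        raise ValueError("activations must have the same length")
    return list(other_activation)
```

In an actual experiment the edited activation would be written back into the model's forward pass and the change in outputs observed; that part depends on the model and tooling and is not shown.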

Interpretability Externalities Case Study - Hungry Hungry Hippos

Lawrence Chan2y90

This is why I'm pessimistic about most interpretability work. It just isn't focused enough

Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or... (read more)

3Alex Turner2y

Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing? Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).

1Garrett Baker2y

Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those. At the same time I suspect that many of those dual use concerns can be mitigated by making your MI research targeted. Not necessarily made such that you can do off-the-shelf interventions based on your findings, but made such that if it ever has any use, that use is going to be for alignment, and you can predict broadly what that use will look like. This also doesn't mean your MI research can't be ambitious. I don't want to criticize people for being ambitious or too theoretical! I want to criticize people for producing knowledge on something which, while powerful, seems powerful in too many directions to be useful if done publicly. I don't entirely know what you mean by this. How would we solve alignment by not going through a concrete problem? Maybe you think MI will be secondary to that process, and will give us useful information about what problems are necessary to solve? In such a case I still don't see why you need ambitious MI. You can just test the different problem classes directly. Maybe you think the different problem classes are too large to test directly. Even in that case, I still think that a more targeted approach would be better, where you generate as much info about those target classes as possible, while minimizing info that can be used to make your models better. And you selectively report only the results of your investigation which bear on the problem class. Even if the research is exploratory, the result & verification demonstration can still be targeted. Maybe I misspoke. I dislike current MI because I expect large capability improvements before and at the same time as the alignment improvements, but I don't dispute future alignment im

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Lawrence Chan2y40

Glad to see that this work is out!

I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on, but the direct impacts of the work are unclear.) I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model.

Where might I direct promising-to-me researchers to apply for alignment jobs/grants?

Lawrence Chan2y10

I don't think they're hiring, but added.

Where might I direct promising-to-me researchers to apply for alignment jobs/grants?

Answer by Lawrence ChanSep 18, 2023*64

The main funders are LTFF, SFF/Lightspeed/other S-process stuff from Jaan Tallinn, and Open Phil. LTFF is the main one that solicits independent researcher grant applications.

There are a lot of orgs. Off the top of my head, there's Anthropic/OpenAI/GDM as the scaling labs with decent-sized alignment teams, and then there's a bunch of smaller/independent orgs:

Alignment Research Center
Apollo Research
CAIS
CLR
Conjecture
FAR
Orthogonal
Redwood Research

And there's always academia.

(I'm sure I'm missing a few though!)

(EDIT: added in RR and CLR)

3DanielFilan2y

Redwood Research?