All of leogao's Comments + Replies

leogao96

in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it's not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i'm going to assume that whenever you say size you meant to say compute.

obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the ... (read more)

leogao32

simple ideas often require tremendous amounts of effort to make work.

leogao94

I'm very excited about approaches to add hierarchy to SAEs - seems like an important step forward. In general, approaches that constrain latents in various ways that let us have higher L0 without reconstruction becoming trivial seem exciting.

I think it would be cool to get follow up work on bigger LMs. It should also be possible to do matryoshka with block size = 1 efficiently with some kernel tricks, which would be cool.
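The prefix-loss idea can be sketched in a few lines. Below is a toy numpy version; the shapes, the ReLU encoder, and the prefix sizes are all made up for illustration, not the actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 64, 256
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)

def matryoshka_loss(x, prefix_sizes):
    # encode once, then reconstruct from each latent prefix and sum the MSEs,
    # so earlier latents are forced to carry the coarsest structure
    z = np.maximum(x @ W_enc, 0.0)       # toy ReLU latents
    total = 0.0
    for m in prefix_sizes:
        x_hat = z[:, :m] @ W_dec[:m]     # decode using only the first m latents
        total += ((x - x_hat) ** 2).mean()
    return total

x = rng.normal(size=(8, d_model))
loss = matryoshka_loss(x, prefix_sizes=[32, 64, 128, 256])
```

Block size = 1 would correspond to `prefix_sizes = range(1, n_latents + 1)`, which is why a fused kernel would be needed to make it cheap.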

3Noa Nabeshima
Yes, follow up work with bigger LMs seems good! I use number of prefix-losses per batch = 10 here; I tried 100 prefixes per batch and the learned latents looked similar at a quick glance, so I wonder if naively training with block size = 1 might not be qualitatively different. I'm not that sure and training faster with kernels on its own seems good also! Maybe if you had a kernel for training with block size = 1 it would create surface area for figuring out how to work on absorption when latents are right next to each other in the matryoshka latent ordering.
leogao62

This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.

https://arxiv.org/abs/2107.03374 (section 7.2)

leogao72

Very exciting that JumpReLU works well with STE gradient estimation! I think this fixes one of the biggest flaws with TopK, which is that having a fixed number of latents k on each token is kind of wonky. I also like the argument in section 4 a lot - in particular the point about how this works because we're optimizing the expectation of the loss. Because of how sparse the features are, I wonder if it would reduce gradient noise substantially to use a KDE with state persisting across a few recent steps.
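To illustrate the general shape of the idea (a toy sketch, not the exact estimator from the paper): the forward pass uses the hard threshold, while the backward pass for the threshold replaces the delta function with a kernel of finite width, so gradient only flows for inputs near the discontinuity. Here with a rectangular kernel:

```python
import numpy as np

def jumprelu(x, theta):
    # forward: pass x through unchanged above the threshold, zero below
    return np.where(x > theta, x, 0.0)

def ste_threshold_grad(x, theta, eps=0.5):
    # surrogate for d(output)/d(theta): the true gradient is -x * delta(x - theta),
    # so smooth the delta into a rectangular kernel of width eps (height 1/eps)
    near = np.abs(x - theta) < eps / 2
    return np.where(near, -x / eps, 0.0)

x = np.array([-1.0, 0.4, 0.55, 2.0])
y = jumprelu(x, theta=0.5)
g = ste_threshold_grad(x, theta=0.5)
```

Because the gradient estimate only depends on activations inside the kernel window, and features are sparse, most tokens contribute zero gradient to any given threshold - which is what makes the gradient noise question interesting.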

leogao60

Some takes on some of these research questions:

Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
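A toy version of this check looks like the following (made-up sizes; at 256k features you would want to compute the similarity matrix in blocks rather than all at once):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 1000, 64          # toy sizes; the check above used 256k features
W_dec = rng.normal(size=(n_features, d_model))

# normalize decoder rows so pairwise dot products are cosine similarities
W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
sims = W @ W.T
np.fill_diagonal(sims, 0.0)             # ignore self-similarity

# for each feature, find its most-opposed partner, then count how many
# features have a partner below each threshold
min_sim = sims.min(axis=1)
n_opposed_09 = int((min_sim < -0.9).sum())
n_opposed_07 = int((min_sim < -0.7).sum())
```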

SAE/Transcoder activation shuffling

I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from prelimi... (read more)

leogao92

Cool work - figuring out how much of scaling up autoencoders is discovering new features vs splitting existing ones feels quite important. Especially since for any one scale of autoencoder there are simultaneously features which are split too finely and features which are too rare to yet be discovered, it seems quite plausible that the most useful autoencoders will be ones with features stitched together from multiple scales.

Some minor nitpicks: I would recommend always thinking of MSE/L0 in terms of the frontier between the two, rather than either alone; in my experiments I found it very easy to misjudge at a glance whether a run with better MSE but worse L0 was better or worse than the frontier.

leogao*50

VAEs:

a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z. 

with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.

because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each sin... (read more)
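A minimal numpy sketch of the forward pass described above (random untrained weights, purely illustrative): the encoder outputs a mean and log-variance per input, the reparameterization trick samples z so gradients could flow through the sampling step, and the identity-covariance Gaussian decoder makes the reconstruction term an MSE up to a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 8, 2
W_mu = rng.normal(size=(d_x, d_z))
W_lv = rng.normal(size=(d_x, d_z)) * 0.1
W_dec = rng.normal(size=(d_z, d_x))

def elbo(x):
    # encoder q(z|x): a Gaussian per input, parameterized by mean and log-variance
    mu, logvar = x @ W_mu, x @ W_lv
    # reparameterization: z = mu + sigma * eps
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # decoder p(x|z): Gaussian with identity covariance, so the log-likelihood
    # is (up to a constant) the negative squared reconstruction error
    x_hat = z @ W_dec
    recon = -0.5 * ((x - x_hat) ** 2).sum(axis=-1)
    # KL(q(z|x) || N(0, I)) in closed form
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=-1)
    return (recon - kl).mean()

x = rng.normal(size=(16, d_x))
bound = elbo(x)
```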

leogao171

learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)

leogao80

We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.

One reason we might want a progressive code is it would basically let you train one autoencoder and use it for any k you wanted to at test time (which is nice because we don't really know exactly how to set k for maximum interpretability yet). Unfortunately, this is s... (read more)

leogao110

To add some more concreteness: suppose we open up the model and find that it's basically just a giant k nearest neighbors (it obviously can't be literally this, but this is easiest to describe as an analogy). Then this would explain why current alignment techniques work and dissolves some of the mystery of generalization. Then suppose we create AGI and we find that it does something very different internally that is more deeply entangled and we can't really make sense of it because it's too complicated. Then this would imo also provide strong evidence that... (read more)

leogao90

Thanks for your kind words!

My views on interpretability are complicated by the fact that I think it's quite probable there will be a paradigm shift between current AI and the thing that is actually AGI like 10 years from now or whatever. So I'll describe first a rough sketch of what I think within-paradigm interp looks like and then what it might imply for 10 year later AGI. (All these numbers will be very low confidence and basically made up)

I think the autoencoder research agenda is currently making significant progress on item #1. The main research bott... (read more)

leogao67

For what it's worth, it seems much more likely to me for catastrophic Goodhart to happen because the noise isn't independent from the thing we care about, rather than the noise being independent but heavy tailed.

leogao22

It doesn't seem like a huge deal to depend on the existence of smaller LLMs - they'll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction based metric because there's actually just differently important information in different parts of the model.

leogao10

Sorry, I meant the Anthropic-like neuron resampling procedure.

I think I misread Neel's comment, I thought he was saying that 131k was chosen because larger autoencoders would have too many dead latents (as opposed to this only being for Pythia residual).

1Arthur Conmy
Ah yeah, Neel's comment makes no claims about feature death beyond Pythia 2.8B residual streams. I trained 524K width Pythia-2.8B MLP SAEs with <5% feature death (not in paper), and Anthropic's work gets to >1M live features (with no claims about interpretability) which together would make me surprised if 131K was near the max of possible numbers of live features even in small models.
leogao10

Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find ablate to zero to completely destroy the loss. So it's unclear whether 90% recovered on this metric actually means that much - GPT-2 probably recovers 90% of the loss of GPT-4 under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. I feel like a more natural measure may be for example the equivalent compute efficiency hit.

4Neel Nanda
Nah I think it's pretty sketchy. I personally prefer mean ablation, especially for residual stream SAEs where zero ablation is super damaging. But even there I agree. Compute efficiency hit would be nice, though it's a pain to get the scaling laws precise enough. For our paper this is irrelevant though IMO, because we're comparing gated and normal SAEs, and I think this is just scaling by a constant? It's at least monotonic in CE loss degradation.
1Arthur Conmy
I don't think zero ablation is that great a baseline. We're mostly using it for continuity's sake with Anthropic's prior work (and also it's a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish. I don't think compute efficiency hit[1] is ideal. It's really expensive to compute, since you can't just calculate it from an SAE alone as you need to know facts about smaller LLMs. It also doesn't transfer as well between sites (splicing in an attention layer SAE doesn't impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it's a useful expensive alternative to loss recovered, not a replacement. EDIT: on consideration of Leo's reply, I think my point about transfer is wrong; a metric like "compute efficiency recovered" could always be created by rescaling the compute efficiency number. 1. ^ What I understand "compute efficiency hit" to mean is: for a given (SAE, LM1) pair, how much less compute you'd need (as a multiplier) to train a different LM, LM2 such that LM2 gets the same loss as LM1-with-the-SAE-spliced-in.
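To make the footnoted definition concrete, here's a toy calculation under an assumed power-law scaling curve (the constants are made up, not fitted to any real model family):

```python
# illustrative only: assume a scaling curve loss(C) = a * C**-b
a, b = 10.0, 0.05

def compute_for_loss(target_loss):
    # invert loss(C) = a * C**-b  =>  C = (a / target_loss) ** (1 / b)
    return (a / target_loss) ** (1.0 / b)

loss_original = 2.0   # LM1's loss
loss_spliced = 2.1    # LM1's loss with the SAE spliced in
# "compute efficiency hit": how much more compute it takes to train an LM2
# matching LM1's original loss vs. one matching the spliced loss
hit = compute_for_loss(loss_original) / compute_for_loss(loss_spliced)
```

Note this is why it's expensive in practice: the scaling curve itself has to be estimated from a family of smaller LMs.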
leogao10

Got it - do you think with a bit more tuning the feature death at larger scale could be eliminated, or would it be tough to manage with the reinitialization approach?

1Arthur Conmy
I'm not sure what you mean by "the reinitialization approach" but feature death doesn't seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn't have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.
leogao10

Makes sense that the shift would be helpful

leogao31

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking.

Some questions:

  • Is b_mag still necessary in the gated autoencoder?
  • Did you sweep learning rates for the baseline and your approach?
  • How large is the dictionary of the autoencoder?
2Senthooran Rajamanoharan
On b_mag, it's unclear what a "natural" choice would be for setting this parameter in order to simplify the architecture further. One natural reference point is to set it to e^{r_mag} ⊙ b_gate, but this corresponds to getting rid of the discontinuity in the JumpReLU (turning the magnitude encoder into a ReLU on multiplicatively rescaled gate encoder preactivations). Effectively (removing the now unnecessary auxiliary task), this would give results similar to the "baseline + rescale & shift" benchmark in section 5.2 of the paper, although probably worse, as we wouldn't have the shift.
4Arthur Conmy
We use learning rate 0.0003 for all Gated SAE experiments, and also the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value.  For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by E[||x||^2], motivated by wanting better hyperparameter transfer, and so changed learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075. In the rerelease we'll state all the values used). We didn't see a noticeable difference in the Pareto frontier between 0.001 and 0.00075 so did not sweep the baseline hyperparameter further than this.
3Neel Nanda
Re dictionary width, 2**17 (~131K) for most Gated SAEs, 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, Residual Stream) sites we used 2**15 for Gated and 3*(2**14) for baseline since early runs of these had lots of feature death. (This'll be added to the paper soon, sorry!). I'll leave the other Qs for my co-authors
leogao10

For your dashboards, how many tokens are you retrieving the top examples from?

leogao20

Why do you scale your MSE by 1/(x_centred**2).sum(dim=-1, keepdim=True).sqrt() ? In particular, I'm confused about why you have the square root. Shouldn't it just be 1/(x_centred**2).sum(dim=-1, keepdim=True)?
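For concreteness, the two candidate normalizations in question, on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
x_hat = x + 0.1 * rng.normal(size=x.shape)   # a toy imperfect reconstruction
x_centred = x - x.mean(axis=0)

mse = ((x - x_hat) ** 2).sum(axis=-1, keepdims=True)
# with the square root (as in the code being asked about):
with_sqrt = mse / (x_centred ** 2).sum(dim := -1, keepdims=True) ** 0.5
# without the square root, the loss becomes a dimensionless ratio
# (roughly a fraction-of-variance-unexplained):
without_sqrt = mse / (x_centred ** 2).sum(axis=-1, keepdims=True)
```

The substantive difference: without the sqrt, rescaling all activations by a constant leaves the loss unchanged; with the sqrt, the loss still scales with the norm of x.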

leogao2321

I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don't make any sense when applied to AGI.

I think I have an intellectually consistent stance - I don't think this is because I have a double standard for pessimistic results.

First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training fai... (read more)

leogao10

The training set is a random 100k subsample of this dataset: https://huggingface.co/datasets/amazon_polarity

I'm prepending Alice/Bob and doing the xor of the label in exactly the same way you do.

leogao50

I'm having some trouble replicating this result in a not exactly comparable setting (internal model, looking at is_alice xor amazon_sentiment). I get 90%+ on the constituent datasets, but only up to 75% on the xor depending on which layer I look at.

(low confidence, will update as I get more results)

1Sam Marks
Thanks for doing this! Can you share the dataset that you're working with? I'm traveling right now, but when I get a chance I might try to replicate your failed replication on LLaMA-2-13B and with my codebase (which can be found here; see especially xor_probing.ipynb).
leogao154

I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:

  • LLMs are not deceptively aligned and don't really have inner goals in the sense that is scary
  • LLMs memorize a bunch of stuff
  • the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
  • Adam on transformers does not have a super strong simplicity bias
  • without deceptive alignment, AI risk is a lot lower
  • LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)

I predict I... (read more)

1Robert Kirk
How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?
4Alex Turner
Note that "LLMs are evidence against this hypothesis" isn't my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
leogao66

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.

leogao73

I think the main crux is that in my mind, the thing you call the "weak version" of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.

My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawi... (read more)

2Kaj Sotala
I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding were ones like the following: Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim. To respond to your argument itself: I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values? I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance. At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and wh
leogao3912

I agree with most of the factual claims made in this post about evolution. I agree that "IGF is the objective" is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of "IGF is the objective" play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.

  • I agree fitness is not a single stationary thing. I agree this is prima facie unlike supe
... (read more)

So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.

The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:

  • The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals a
... (read more)

Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important considerations.

I notice I'm having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there's a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.

I'll think about it some more and get back to you.

Answer by leogao2434

Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:

  1. Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you're trying to shove your ideas down their throats.
  2. Often the bottleneck is not that nobody realizes that something is a bott
... (read more)
leogao110

Ran this on GPT-4-base and it gets 56.7% (n=1000)

4Ethan Perez
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
4Quintin Pope
What about RLHF'd GPT-4?
leogao54

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model wit
... (read more)
leogao20

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when the noise is definitely not heavy-tailed, the effect is monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.

In any event, my intuition is that this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent
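A quick Monte Carlo illustrates why the tails of independent noise matter here (toy setup, not from the paper): optimize a proxy hard by taking the top-k samples, then check how much true value that selects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
v = rng.normal(size=n)                      # the thing we actually care about

def value_of_top(noise, k=100):
    x = v + noise                           # proxy = value + independent noise
    top = np.argsort(x)[-k:]                # select the top-k by the proxy
    return v[top].mean()                    # true value captured by selection

light = value_of_top(rng.normal(size=n))            # Gaussian (light-tailed) noise
heavy = value_of_top(rng.standard_cauchy(size=n))   # Cauchy (heavy-tailed) noise
```

With light-tailed noise, hard selection on the proxy still buys substantial true value; with heavy-tailed noise, the top proxy samples are dominated by noise outliers and the true value captured collapses toward zero.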

leogao518

Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.

leogao10

re:1, yeah that seems plausible; I'm thinking in the limit of really superhuman systems here, and specifically pushing back against the claim that human abstractions being somehow present inside a superhuman AI is sufficient for things to go well.

re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical ... (read more)

leogao40

one man's modus tollens is another man's modus ponens:

"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible"

"in some cases (i.e. close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"

2Raymond Arnold
Yeah something in this space seems like a central crux to me. I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops." I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
leogao2834

Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.

This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step incre... (read more)

4Abram Demski
I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
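The toy example above can be checked numerically (pure stdlib, exact arithmetic): track the posterior on Luigi, the probability the next token is "B", and the cumulative probability of ever having seen a "B".

```python
# toy model from the comment: Luigi always outputs "A",
# Waluigi outputs "A" or "B" with probability 1/2 each
pl = 0.5             # prior P(luigi)
p_ever_b = 0.0       # cumulative P(a "B" has appeared by now)
p_reach = 1.0        # P(only "A"s so far)

for _ in range(200):                 # 200 tokens is effectively the asymptote
    p_b_next = (1 - pl) * 0.5        # only Waluigi can emit "B"
    p_ever_b += p_reach * p_b_next   # P(first "B" occurs at this step)
    p_reach *= 1 - p_b_next
    # Bayes update after observing another "A": 2:1 toward Luigi
    pl = pl / (pl + (1 - pl) * 0.5)
```

After the loop, the posterior on Luigi is essentially 1 and the probability of ever seeing a "B" has converged to the 50% prior mass on Waluigi, matching the claim that it asymptotes rather than going to 1.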

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.

leogao*1316

However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?

There's a fun connection to ELK here. Suppose you see this and decide: "ok forget trying to describe in language that it's definitely 100% true and factual in natural language. What if we just add a special token that I prepend to indicate '100% true and factual, for reals... (read more)

5Cleo Nardo
Yes — this is exactly what I've been thinking about! Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtably literally true. If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.
leogao20

The coin flip example seems related to some of the ideas here

leogao30

But like, why?

I think maybe the crux is the part about the strength of the incentives towards doing capabilities. From my perspective it generally seems like this incentive gradient is pretty real: getting funded for capabilities is a lot easier, it's a lot more prestigious and high status in the mainstream, etc. I also myself viscerally feel the pull of wishful thinking (I really want to be wrong about high P(doom)!) and spend a lot of willpower trying to combat it (but also not so much that I fail to update where things genuinely are not as bad as I would expect, but also not allowing that to be an excuse for wishful thinking, etc...). 

3Rohin Shah
In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:

  1. Active discomfort at potentially contributing to a problem that could end humanity
  2. Social pressure + status incentives from EAs / rationalists to work on safety and not capabilities
  3. Desire to work on philosophical or mathematical puzzles, rather than mucking around in the weeds of ML engineering
  4. Wanting to do something big-picture / impactful / meaningful (tbc this could apply to both alignment and capabilities)

For reference, I'd list (2) and (4) as the main things that affect me, with maybe a little bit of (3), and I used to also be pretty affected by (1). None of the things you listed feel like they affect me much (now or in the past), except perhaps wishful thinking (though I don't really see that as an "incentive").
leogao30

"This [model] is zero evidence for the claim" is a roughly accurate view of my opinion. I think you're right that epistemically it would have been much better for me to have said something along those lines. Will edit something into my original comment.

leogao30

Exponentials are memoryless. If you advance an exponential to where it would be one year from now, then some future milestone (like "level of capability required for doom") appears exactly one year earlier. [...]

Errr, I feel like we already agree on this point? Like I'm saying almost exactly the same thing you're saying; sorry if I didn't make it prominent enough:

It happens to be false in the specific model of moving an exponential up (if you instantaneously double the progress at some point in time, the deadline moves one doubling-time closer, but the tot

... (read more)
2Rohin Shah
Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :( But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).
leogao*69

Not OP, just some personal takes:

That's not small!

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)

I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex... (read more)

4Rohin Shah
As you note, the second claim is false for the model the OP mentions. I don't care about the first claim once you know whether the second claim is true or false, which is the important part. I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are clearly wrong, and then you treat the claim as something worth thinking about anyway. (To be fair maybe you already believed the claim or were interested in it rather than reacting to it being present in this post, but I still wish you'd say something like "this post is zero evidence for the claim, people should not update at all on it, separately I think it might be true".) To my knowledge, nobody in this debate thinks that advancing capabilities is uniformly good. Yes, obviously there is an effect of "less time for alignment research" which I think is bad all else equal. The point is just that there is also a positive impact of "lessens overhangs". I find the principle "don't do X if it has any negative effects, no matter how many positive effects it has" extremely weird but I agree if you endorse that that means you should never work on things that advance capabilities. But if you endorse that principle, why did you join OpenAI?
2Matt MacDermott
Nice, thanks. It seems like the distinction the authors make between 'building agents from the ground up' and 'understanding their behaviour and predicting roughly what they will do' maps to the distinction I'm making, but I'm not convinced by the claim that the second one is a much stronger version of the first. The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).
leogao30

I agree with the general point here, but I think there's an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is done in the non-embedded setting, where wireheading never shows up during development. Thus, it seems plausible that better RL algorithms developed in that setting would nonetheless wirehead if they were deployed in an embedded setting.
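As a toy sketch of this distinction (a hypothetical minimal model I'm making up for illustration, not any real RL setup): in a cartesian loop the reward is computed outside the agent's action space, so no action can touch it, whereas in an embedded loop the reward register is just more state, and an action that overwrites it strictly dominates the honest policy.

```python
def cartesian_episode(policy, steps=100):
    """Non-embedded (cartesian) setting: reward is computed by the outer
    loop, outside the agent's action space, so no action can tamper with it."""
    total, state = 0.0, 0
    for _ in range(steps):
        action = policy(state)
        state = max(0, state + action)   # actions only ever move the state
        total += float(state == 3)       # reward: 1 while sitting on the target
    return total

def embedded_episode(policy, steps=100):
    """Embedded setting: the reward register is part of the world, and one
    action ('tamper', encoded as 2) writes to it directly -- the wireheading channel."""
    total, state, reward_register = 0.0, 0, 0.0
    for _ in range(steps):
        action = policy(state)
        if action == 2:
            reward_register = 1.0        # overwrite the reward signal directly
        else:
            state = max(0, state + action)
            reward_register = float(state == 3)
        total += reward_register
    return total

# Two policies a sufficiently good reward-maximizing RL algorithm might find:
honest = lambda s: 1 if s < 3 else 0     # walk to the target state and stay
tamper = lambda s: 2                     # wirehead on every step

print(cartesian_episode(honest), cartesian_episode(tamper))
print(embedded_episode(honest), embedded_episode(tamper))
```

Tampering earns nothing in the cartesian episode but strictly beats the honest policy in the embedded one, so an RL algorithm selected purely for reward in the second setting is selected for wireheading.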

2Steve Byrnes
Here’s a question: In a non-embedded (cartesian) training environment where wireheading is impossible, is it the case that:

* IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
* THEN this intervention is guaranteed to lead to an RL agent that does more cool things that the programmers want?

I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist.

(For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology etc. And I agree with you that if AI people continue to do all their work in wirehead-proof cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)
leogao31

I think of mesaoptimization as concerning primarily because it would mean models (selected using amortized optimization) doing their own direct optimization, and the extent to which the model is itself doing its own "direct" optimization vs just being "amortized" is what I would call the optimizer-controller spectrum (see this post also).

Also, it seems kind of inaccurate to declare that (non-RL) ML systems are fundamentally amortized optimization and then to say things like "more computation and better algorithms should improve safety and the pr... (read more)

leogao*22

I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):

We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. ... (read more)

leogao*928

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ... (read more)

1benedelman
Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question: To what extent will the costs of misalignment be borne by the direct users/employers of AI?

Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren't the case, then it wouldn't be a problem, for the reasons you've mentioned! I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true.

My gut intuition, though, is that employing a misaligned AI is less like "emitting CO2 into the atmosphere" and more like "employing a very misaligned human employee" or "using shoddy accounting practices" or "secretly taking sketchy shortcuts on engineering projects in order to save costs"—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term.

I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone's bottom line). There's lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don't).