Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

The Scaling Hypothesis (Gwern Branwen) (summarized by Rohin): This post centers around the scaling hypothesis:

Once we find a scalable architecture which can be applied fairly uniformly, we can simply train ever larger networks and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks and data. More powerful NNs are “just” scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.

Importantly, we can get this sophisticated behavior just by training on simple objectives, such as “predict the next word”, as long as the data is sufficiently diverse. So, a priori, why might we expect the scaling hypothesis to be true?

The core reason is that optimal (or human-level) prediction of text really does require knowledge, reasoning, causality, etc. If you don’t know how to perform addition, you are probably not going to be able to predict the next word in the sentence “Though he started with six eggs, he found another fourteen, bringing his total to ____”. However, since any specific fact is only useful in a tiny, tiny number of cases, it only reduces the expected loss by a tiny amount. So, you’ll only see models learn this sort of behavior once they have exhausted all the other “easy wins” for predicting text; this will only happen when the models and dataset are huge.

Consider a model tasked with predicting characters in text with a set of 64 characters (52 uppercase and lowercase letters, along with some punctuation). Initially it outputs random characters, assigning a probability of 1/64 to the correct character, resulting in a loss of 6 bits. Once you start training, the easiest win is to simply notice how frequent each character is; just noticing that uppercase letters are rare, spaces are common, vowels are common, etc. could get your error down to 4-5 bits. After this, it might start to learn what words actually exist; this might take 10^5 - 10^6 samples since each word is relatively rare and there are thousands of words to learn, but this is a drop in the bucket given our huge dataset. After this step, it may have also learned punctuation along the way, and might now be down to 3-4 bits. At this point, if you sample from the model, you might get correctly spelled English words, but they won’t make any sense.
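The bit counts in this example can be checked with a small sketch. The function names and the sample sentence below are illustrative assumptions, not taken from the original post, and the unigram figure is measured on the same text it was fitted to, so it is only a rough lower bound for such a model.

```python
import math
from collections import Counter

def uniform_loss_bits(alphabet_size):
    """Loss in bits per character when every character gets probability 1/alphabet_size."""
    return math.log2(alphabet_size)

def unigram_loss_bits(text):
    """Loss in bits per character of a model that only knows character frequencies.

    The frequencies are estimated from `text` itself, so this is the empirical
    entropy of the character distribution (an in-sample figure).
    """
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sample text, used only for illustration.
sample = "though he started with six eggs, he found another fourteen"

print(uniform_loss_bits(64))      # random guessing over 64 characters
print(unigram_loss_bits(sample))  # knowing only how frequent each character is
```

The first number is the 6 bits quoted above; the second shows how much of that is removed by character frequencies alone on one short sentence.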

With further training the model now has to pick up on associations between adjacent words to make progress. Now it needs to look at things 10 characters ago to predict the next character -- a far cry from our initial letter frequencies where it didn’t even need to look at other characters! For example, it might learn that “George W” tends to be followed by “ashington”. It starts to learn grammar, being able to correctly put verbs in relation to subjects and objects (that are themselves nouns). It starts to notice patterns in how words like “before” and “after” are used; these can then be used to better predict words in the future; at this point it’s clear that the model is starting to learn semantics. Now the loss is around 2 bits per character. A little more training and your model starts to produce sentences that sound human-like in isolation, but don’t fit together: a model might start a story about a very much alive protagonist and then talk about how she is dead in the next sentence. Training is now about fixing errors like these and each such fix gains a tiny amount of accuracy -- think ten thousandths of a bit. Every further 0.1 bits you gain represents the model learning a huge amount of relevant knowledge (and correspondingly each subsequent 0.1 bits takes a much larger amount of training and data). The final few fractions of a bit are the most important and comprise most of what we call “intelligence”.

(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

So far this is a clever argument, but doesn’t really establish that this will work in practice -- for example, maybe your model has to have 10^100 parameters to learn all of this, or maybe existing models and algorithms are not sophisticated enough to find the right parameters (and instead just plateau at, say, 2 bits of loss). But recent evidence provides strong support for the scaling hypothesis:

1. The scaling laws (AN #87) line of work demonstrated that models could be expected to reach the interesting realm of loss at amounts of compute, data, and model capacity that seemed feasible in the near future.

2. Various projects have trained large models and demonstrated that this allows them to solve tasks that they weren’t explicitly trained for, often in a more human-like way and with better performance than a more supervised approach. Examples include GPT-3 (AN #102), Image GPT, BigGAN, AlphaStar (AN #73), etc. (The full post has something like 25 examples.)

The author then argues that most researchers seem to be completely ignoring this phenomenon. OpenAI is the only actor that really has the conviction needed to put a large amount of resources behind a project based on the scaling hypothesis (such as GPT-3); DeepMind seems to believe in a weaker version where we need to build a bunch of “modules” similar to those in the human brain, but that those modules can then be scaled up indefinitely. Other actors seem to not take either scaling hypothesis very seriously.

Rohin's opinion: In my view, the scaling hypothesis is easily the most important hypothesis relevant to AI forecasting and AI development models, and this is the best public writeup of it that I know of. (For example, it seems to be an implicit assumption in the bio anchors framework (AN #121).) I broadly agree with the author that it’s a bit shocking how few people seem to be taking it seriously after OpenAI Five, AlphaStar, GPT-3, Copilot, etc.

I think this includes the AI safety space, where as far as I can tell the primary effect has been that it is even more fashionable to have shorter timelines, whereas it hasn’t affected AI safety research very much. However, I do know around 3-4 researchers who changed what they were working on based on changing their mind about the scaling hypothesis, so it’s possible there are several others I don’t know about.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

TECHNICAL AI ALIGNMENT


AGENT FOUNDATIONS

The Accumulation of Knowledge (Alex Flint) (summarized by Rohin): Probability theory can tell us about how we ought to build agents that have knowledge (start with a prior and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to describe existing agents (AN #66), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We want a theory that works in the Game of Life (AN #151) as well as the real world.

This sequence investigates this question from the perspective of defining the accumulation of knowledge as increasing correspondence between a map and the territory, and concludes that such definitions are not tenable. In particular, it considers four possibilities and demonstrates counterexamples to all of them:

1. Direct map-territory resemblance: Here, we say that knowledge accumulates in a physical region of space (the “map”) if that region of space looks more like the full system (the “territory”) over time.

Problem: This definition fails to account for cases of knowledge where the map is represented in a very different way that doesn’t resemble the territory, such as when a map is represented by a sequence of zeros and ones in a computer.

2. Map-territory mutual information: Instead of looking at direct resemblance, we can ask whether there is increasing mutual information between the supposed map and the territory it is meant to represent.

Problem: In the real world, nearly every region of space will have high mutual information with the rest of the world. For example, by this definition, a rock accumulates lots of knowledge as photons incident on its face affect the properties of specific electrons in the rock giving it lots of information.

3. Mutual information of an abstraction layer: An abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations. For example, the zeros and ones in a computer are the high-level configurations of a digital abstraction layer over low-level physics. Knowledge accumulates in a region of space if that space has a digital abstraction layer, and the high-level configurations of the map have increasing mutual information with the low-level configurations of the territory.

Problem: A video camera that constantly records would accumulate much more knowledge by this definition than a human, even though the human is much more able to construct models and act on them.

4. Precipitation of action: The problem with our previous definitions is that they don’t require the knowledge to be useful. So perhaps we can instead say that knowledge is accumulating when it is being used to take action. To make this mechanistic, we say that knowledge accumulates when an entity’s actions become more fine-tuned to a specific environment configuration over time. (Intuitively, they learned more about the environment and so could condition their actions on that knowledge, which they previously could not do.)

Problem: This definition requires the knowledge to actually be used to count as knowledge. However, if someone makes a map of a coastline, but that map is never used (perhaps it is quickly destroyed), it seems wrong to say that during the map-making process knowledge was not accumulating.
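For readers who want the quantity in definitions 2 and 3 made concrete, here is a minimal sketch of mutual information between two discrete variables, estimated from paired observations. The function name and the toy "map"/"territory" data are assumptions for illustration; the sequence itself does not give code.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """Empirical mutual information (in bits) between the two coordinates of `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy data: each pair is (territory state, map state).
copied = [(0, 0), (1, 1)] * 50                        # map always mirrors the territory
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25   # map ignores the territory

print(mutual_information_bits(copied))       # one full bit shared
print(mutual_information_bits(independent))  # nothing shared
```

The measure only tracks statistical dependence: it does not care how the dependence arose or whether anything can use it, which is the property the rock and video camera counterexamples exploit.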

AI GOVERNANCE

AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries (Peter Cihon et al) (summarized by Rohin): Certification is a method of reducing information asymmetries: it presents credible information about a product to an audience that they couldn’t have easily gotten otherwise. With AI systems, certification could be used to credibly share information between AI actors, which could promote trust amongst competitors, or to share safety measures to prevent a race to the bottom on safety, caused by worrying that “the other guys would be even more unsafe”. Certification is at its best when there is demand from an audience to see such certificates; public education about the need for credible information can help generate such demand.

However, certification often runs into problems. Symbol-substance decoupling happens when certificates are issued to systems that don’t meet the standards for certification. For example, in “ethics washing”, companies advertise a self-certificate in which their products are approved by ethics boards, but those ethics boards have no real power. Means-ends decoupling happens when the standards for certification don’t advance the goals for which the certificate was designed. For example, a certificate might focus on whether a system was tested, rather than on what test was conducted, leading applicants to use easy-to-pass tests that don’t actually provide a check on whether the method is safe.

Effective certification for future AI systems needs to be responsive to changes in AI technology. This can be achieved in a few ways: first, we can try to test the underlying goals which are more likely to remain stable; for example, we could certify ethical principles that will likely remain the same in the future. Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.

The paper then analyzes seven existing certification systems for AI systems; you’ll have to read the paper for details.

Case studies of self-governance to reduce technology risk (Jia) (summarized by Rohin): Should we expect AI companies to reduce risk through self-governance? This post investigates six historical cases, of which the two most successful were the Asilomar conference on recombinant DNA and the actions of Leo Szilard and other physicists in 1939 (around the development of the atomic bomb). It is hard to draw any confident conclusions, but the author identifies the following five factors that make self-governance more likely:

1. The risks are salient.

2. If self-governance doesn’t happen, then the government will step in with regulation (which is expected to be poorly designed).

3. The field is small, so that coordination is easier.

4. There is support from gatekeepers (e.g. academic journals).

5. There is support from credentialed scientists.

Corporate Governance of Artificial Intelligence in the Public Interest (Peter Cihon et al) (summarized by Rohin): This paper is a broad overview of corporate governance of AI, where by corporate governance we mean “anything that affects how AI is governed within corporations” (a much broader category than the governance that is done by corporations about AI). The authors identify nine primary groups of actors that can influence corporate governance and give many examples of how such actors have affected AI governance in the past. The nine groups are managers, workers, investors, corporate partners and competitors, industry consortia, nonprofit organizations, the public, the media, and governments.

Since the paper is primarily a large set of examples along with pointers to other literature on the topic, I’m not going to summarize it in more detail here, though I did find many of the examples interesting (and would dive into them further if time were not so scarce).

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.

17 comments

This comment is inspired by a conversation with Ajeya Cotra.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

To elaborate on this more:

Claim 1: Scaling hypothesis + abundance of data + competitiveness requirement implies that an alignment solution will need to involve pretraining.

Argument: The scaling hypothesis implies that you can get strong capabilities out of abundant effectively-free data. So, if you want your alignment proposal to be competitive, it must also get strong capabilities out of effectively-free data. So far, the only method we know of for this is pretraining.

Note that you could have schemes where you train an actor model using a reward model that is always aligned; in this case your actor model could avoid pretraining (since you can generate effectively-free data from the reward model) but your reward model will need to be pretrained. So the claim is that some part of your scheme involves pretraining; it doesn't have to be the final agent that is deployed.

Claim 2: For a fixed 'reasonable' pretraining objective, there exists some (possibly crazy and bespoke but still reasonably-sized) dataset which would make the resulting model aligned without any finetuning.

(This claim is more of an intuition pump for Claim 3, rather than an interesting claim in its own right)

Argument 1: As long as your pretraining objective doesn't do something unreasonable like say "ignore the data, always say 'hello world'", given the fixed pretraining objective each data point acts as a "constraint" on the parameters of the model. If you have D data points and N model parameters with D > N, then you should expect these constraints to approximately determine the model parameters (in the same way that N linearly independent equations on N variables uniquely determine those variables). So with the appropriate choice of the D data points, you should be able to get any model parameters you want, including the parameters of the aligned model.

Argument 2: There are ~tens of bits going into the choice of pretraining objective, and ~millions of bits going into the dataset, so in some sense nearly all of the action is in the dataset.

Argument 3: For the specific case of next-word prediction, you could take an aligned model, generate a dataset by running that model, and then train a new model with next-word prediction on that dataset.
I believe this is equivalent to model distillation, which has been found to be really unreasonably effective, including for generalization (see e.g. here), so I’d expect the resulting model would be aligned too.
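A toy version of Argument 3 can be sketched as follows. This is an illustration under strong simplifying assumptions: the "teacher" is a hand-written table of next-token probabilities over three tokens rather than a neural network, and the "student" is fitted by counting, which is the maximum-likelihood solution to next-token prediction for a table of this kind. All names and numbers are hypothetical.

```python
import random

VOCAB = ["a", "b", "c"]

# Hypothetical teacher: next-token probabilities given the previous token.
TEACHER = {
    "a": [0.1, 0.6, 0.3],
    "b": [0.5, 0.2, 0.3],
    "c": [0.3, 0.3, 0.4],
}

def sample_corpus(n_tokens, seed=0):
    """Generate a dataset by running the teacher."""
    rng = random.Random(seed)
    tokens = ["a"]
    while len(tokens) < n_tokens:
        probs = TEACHER[tokens[-1]]
        tokens.append(rng.choices(VOCAB, weights=probs)[0])
    return tokens

def fit_student(tokens):
    """Train a student on next-token prediction by counting transitions."""
    counts = {prev: [0] * len(VOCAB) for prev in VOCAB}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][VOCAB.index(nxt)] += 1
    return {prev: [c / sum(row) for c in row] for prev, row in counts.items()}

def max_gap(student):
    """Largest absolute difference between student and teacher probabilities."""
    return max(
        abs(s - t)
        for prev in VOCAB
        for s, t in zip(student[prev], TEACHER[prev])
    )

corpus = sample_corpus(200_000)
student = fit_student(corpus)
print(max_gap(student))  # small: the student closely matches the teacher
```

In this tabular setting the student recovers the teacher's behavior on every context the teacher's data covers. Whether the same holds for properties like alignment in large networks is the substantive claim above, which this sketch does not test.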

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Argument: Basically the same as for Claim 2: by far most of the influence on which model you get out is coming from the dataset.

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

Argument: There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.

And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc) to specify a set of programs than does U2. And so "there exists a program in U2-encoding which implements P in U1-encoding" doesn't get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.
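The "doesn't matter up to a point" statement is usually formalized as the invariance theorem for Kolmogorov complexity. As a sketch, writing K_U(x) for the length of the shortest program that outputs x on machine U (notation assumed here, not used in the comment):

```latex
K_{U_1}(x) \;\le\; K_{U_2}(x) + c_{U_1, U_2} \qquad \text{for all } x
```

Here the constant c depends only on the pair of machines (roughly, the length of an interpreter for U2 written for U1) and not on x. The bound says nothing about how large c is, which is where the worry about some encodings making desirable programs much easier to specify comes in.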

Stepping out of the analogy, even though I agree that "reasonable" pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave room that some objectives make alignment far more likely, a priori. And you may be noting as much:

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that.

I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

Sure.

Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

I got a bit confused by this section, I think because the word "model" is being used in two different ways, neither of which is in the sense of "machine learning model".

Paraphrasing what I think is being said:

  • An observer (us) has a model_1 of what GPT-N is doing.
  • According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
  • The observer's model_1 makes good predictions about GPT-N's behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
  • The way that the observer's model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite -- the observer's model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?

Yes, that's right, sorry about the confusion.

Huh, then it seems I misunderstood you! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made, and in fact the opposite seems supported by the argument you made. The argument you made was:

There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding "It'll probably just keep filling in missing words as in training" we should conclude "we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know."

The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.

?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs's response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." 

Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?)

we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know.

This seems reasonable to me (and seems compatible with what I said)

OK cool, sorry for the confusion. Yeah I think ESRogs's interpretation of you was making a somewhat stronger claim than you actually were.

(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

I'd like to know what this figure is based on. In the linked post, Gwern writes:

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.

But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:

Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre [8]); Hamid Moradi found 1.62-2.28 bits on various books [9]; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits [10]; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character. [11]

I'm not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more... this isn't just lots of uncertainty, but vast amounts of uncertainty, where it's very plausible that GPT-3 has already beaten humans. This wouldn't be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don't know by default.

I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it'd be good to know if that's what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.

It's based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effects are always overestimated and can be rounded down by 50% to get a more plausible real world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal like slang), don't involve extensive human calibration or training like the models get, don't involve any adversarial examples, don't try to test human reasoning by writing up texts made up of logical riddles and puzzles or complicated cause-and-effect scenarios or even things like Winograd Schemas, are time-biased, etc. We've seen a lot of these issues come up in benchmarking, like ImageNet models struggling outside ImageNet despite hitting human parity or superiority. (If we are interested in truly testing 'compression = intelligence', we need texts which stress all capabilities and remove all of those issues.)

So given Shannon's interval's lower end is 0.6, and Grassberger's asymptotic is 0.8 (footnote 11) and a wide spread of upper bounds going down to 1.3 along with extremely dumb fast algorithms hitting 2, I am comfortable with rounding them downish to get estimates of 0.7 bpc being the human performance; and I expect that to, if anything, be still underestimating true human peak performance, so I wouldn't be shocked if it was actually more like 0.6 bpc.

I'd like to hear more thoughts, from Rohin or anybody else, about how the scaling hypothesis might affect safety work.

Wrote a separate comment here (in particular I think claims 1 and 4 are directly relevant to safety)

Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.

* I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts. This appears to contradict the "Second" point above somewhat.
* I want people to work on developing the infrastructure for such analyses. This is in keeping with the "Third" point.
* This will likely involve a massive increase in investment of AI talent in the process of certification.

As an example, I think "manipulative" algorithms -- that treat humans as part of the state to be optimized over -- should be banned in many applications in the near future, and that we need expert involvement to determine the propensity of different algorithms to actually optimize over humans in various contexts.

I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts.  This appears to contradict the "Second" point above somewhat.

The idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, you need to certify the analysis itself.

I think the contradiction may only be apparent, but I thought it was worth mentioning anyways.  
My point was just that we might actually want certifications to say things about specific algorithms.