I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:

  • Some semitechnical intuition-building for high-dimensional problem-spaces.
    • Optimization compresses information "by default"
    • Resources and "instrumental convergence" without any explicit reference to agents
  • A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
    • The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.
    • Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc are mostly irrelevant.
  • Fermi estimate: just how complex are human values?
  • Coherence arguments, presented the way I think they should be done.
    • Also subagents!

Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.

Here's the video for part one: 

Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.


Are there already plans for a transcript of this? (I could set a rev.com transcription in motion.)

No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.

I've put in a request for a transcript.

How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?

Thanks a bunch!

  1. I want to interrogate a little more the notion that gradient descent samples uniformly from good parameters (or rather, that the sample is dominated by the initialization distribution). Have you read various things about grokking, like Hypothesis: GD Prefers General Circuits? That argument seems to be that you might start with parameters dominated by the initialization distribution, but various sorts of regularization are going to push you to sample solutions in a nonuniform way. Do you have a take on this?
  2. For the power-seeking-because-of-entropy example, I want to second the audience questions. If you're getting your policy by sampling from all possible policies, the argument is great, but if you're getting your policy by sampling from NN parameters that generate strings of 100 actions, then you just finished arguing that uniform-ish sampling over NN parameters will give simplicity-ish sampling over policies. What would a NN do if trained to play the example game? I would assume it would quickly learn to exactly alternate $ and Apple. This seems a little less like powerseeking, and more like telling DeepDream to fill the image with dogs, except filling a string with buying three apples. I dunno, do you think it's still like powerseeking?
  3. I think you make a subtle error when throwing out a lot of "mere biology" genes as not generating human values. If we had different mere biology than we do, the values we develop would probably be different even if our brain-specific genes were the same! Like, I dunno, suppose you have some genes that build your thyroid. You can't go "ho hum, the thyroid isn't the brain, let's throw those genes out as uninformative," because thyroid disorders impact your mood, which impacts your expressed values. Or I bet I'd have different values if my eyes saw in UV rather than visible, or my skin had no sense of pain, or I went through adolescence in two days rather than five years. Basically I totally disagree with this notion that "if we share it with plants, an AI wouldn't need to know it."
  4. Actually I'm kinda not sure how relevant you think the size-of-human-preference-generators question is, since we don't want the AI to learn human preferences in gene-format, we want the AI to learn human preferences in some (different, I think we agree) format that's better-suited for doing things like making decisions or comparing between different humans.
  5. Cool last section. If you can have 2 dimensions of things to be Pareto optimal over tradeoffs between, why not N dimensions? It seems like there are behaviors that are irrational even for markets (is failing to make mutually beneficial trades between individuals an example? I'm having trouble thinking of something less inward-facing) that could be "optimal" for decision-making procedures with N of 3 or 4.

I second Rob's unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
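
To make "bits-of-search" a bit more concrete, here is a rough sketch (a toy framing for illustration, not something from the talk): if only a fraction p of a search space works, then picking out a working point takes roughly log2(1/p) bits of information. The helper name `bits_to_find` and the choice of "orderings of a list" as the search space are assumptions made purely for this example.

```python
import math

def bits_to_find(num_candidates: int, num_good: int) -> float:
    """Bits of information needed to narrow num_candidates down to num_good."""
    return math.log2(num_candidates / num_good)

# Illustrative assumption: the "search space" is every ordering of n distinct
# items, and exactly one ordering (the sorted one) counts as a solution.
for n in (5, 10, 20):
    print(n, round(bits_to_find(math.factorial(n), 1), 1))
```

The number of bits grows quickly with n even in this tiny example, which is the sense in which humans must be getting a lot of bits from somewhere.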

How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

The key difference is between "human values" vs "generators of human values". The complexity of value thesis (as articulated on that arbital page) says that human values are not algorithmically simple, and I do agree with that. But that still allows for simple generators of human values, which (conceptually) take in lots of data from the real world and spit out values. Everything except those generators is learned from the environment.

In principle, if we can figure out those relatively-simple generators, then we can feed an AI data similar to the data from which humans' value-generators generate their values, and the AI should be able to reconstruct human values (up to within ordinary between-humans-within-similar-environments variation).

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and google, but that's mimicking someone who's already figured it out. How about the person who first figured out eg crop growth? Could be scientific method, but also just random luck which then caught on. 

Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many different situations. There's bits of information in both the patterns themselves and when to apply them, though I feel confused trying to connect these ideas here.

People specifically have inner simulations (ie you can imagine what it'd look like to drop a bowling ball off a building even if you've never seen it), built from things you have lots of experience with. That is a way of applying different patterns to new situations.

I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn't even clearly going to be generally intelligent (in the way humans are), so why assume they'd share our values?

Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.

I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator that we want to run a sorting algorithm:

  • The vast majority don't sort (or even compile)
  • The vast majority of programs that "look like they work", don't (eg "forgot a semicolon", "didn't account for an already sorted list", etc)
  • Generalizing: the vast majority of programs that pass [Unit tests, compiles, human says "looks good to me", simple], don't work. 
    • Could be incomprehensible, pass several unit tests, but still fail in weird edge cases (eg. when the input number is [84, >100, a prime number > 13, etc], then it spits out gibberish) 
    • This is a counterargument to the alignment check of "run it in a simulation to see if it breaks out of the box", because that check is just another proxy.
    • Some constraints above are necessary, like being compilable, and some aren't, like being easy to understand: some randomly generated sorting algorithms are really hard to understand. For example, one could be written in brainfuck, or contain 10,000 lines of code that are mostly redundant or happen to cancel out, and still sort correctly
      • To relate to the original talk, I agree that I can recognize my own values once I reflect on them, but this is different than seeing a plan about an AI that keeps my values and thinking "this looks like it works". In other words, the "human values" shouldn't be a strict subset of the "human says it looks like it works", just like "correctly sorts" shouldn't be a strict subset of "human says it looks like it works" due to incomprehensibility.
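
A minimal sketch of the "passes the proxy but doesn't work" case, assuming a made-up buggy function `looks_like_sort` and a made-up set of unit tests (both hypothetical, chosen only to illustrate the bullets above):

```python
def looks_like_sort(xs):
    """Hypothetical buggy 'sort': silently drops duplicate entries."""
    return sorted(set(xs))

# Proxy check: a handful of unit tests that all happen to use distinct values.
unit_tests = [[3, 1, 2], [], [5], [10, -1, 7, 0]]
passes_proxy = all(looks_like_sort(t) == sorted(t) for t in unit_tests)

# Edge case the proxy never exercised: repeated values.
edge_case = [2, 1, 2]
actually_sorts = looks_like_sort(edge_case) == sorted(edge_case)

print(passes_proxy, actually_sorts)
```

The function passes every listed test and still fails on input with repeats, so "passes the tests" and "sorts correctly" come apart even for a program this small.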

For programs specifically, if it's simple and passes a relevant distribution of unit tests, we can be highly confident it in fact sorts correctly, but what's the equivalent for "plan that maintains human values"? Let's say John succeeds and finds what we think to be the generators of human values, would it be comprehensible enough to verify it?

Applying the argument again but to John's proposed solution, the vast majority of [AIs trained in human environments with what we think are the simple generators of human values]'s plans & behaviors may look good but not actually be good. Or the weights are incomprehensible, so we use unit tests to verify and it could still fail.

Counter-counterargument: I can imagine these generators being simple enough that we can indeed be confident they do what we want. Since it should be human-value-equivalent, it should also be human-interpretable (under reflection?). 

This sounds like a good idea overall, but I wouldn't bet my life on it. It'd be nice to have necessary and sufficient conditions for this possible solution.

Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):

Let's say that we have a neural network of width 100, trained on a dataset which could be fit to perfect accuracy by a network of width only 30. If it compresses the data into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind.

I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do. 

Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits, so 16 bits of info would be compressed into one weight, rather than spread among 3-4.

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice (they would also have a big problem with unknown examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).

The property selected for would then be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don't matter much), which would not result in some subset of the parameters having all the useful information.

If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?

If you disagree I'd be curious to know where. I sense that Mingard et al share your conclusion but I don't yet understand the claimed empirical demonstration.

tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.

it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.

This is where Mingard et al come in. One of their main results is that SGD training on neural nets does quite well approximate just-randomly-sampling-an-optimal-point. Turns out our methods are not actually very path-dependent in practice!

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, and render them incredibly difficult to reach in practice... There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance.

There is a mismatch between your intuition and the implications of "flat minima surrounded by areas of relatively good performance".

Remember, the whole point of the "highly compressed arrangements" is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. "Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant "ways to vary things" may not just be "adjust one param while holding others constant". For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like "flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the 'compressed info'". (Tbc, I still don't know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)
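
A toy sketch of that last point, under assumptions that are entirely made up for illustration (a 4-parameter quadratic `loss` that only depends on `w[0] + w[1]`, and a hypothetical helper `loss_change`). It is not a claim about real neural nets, just a picture of "flat along a combination of parameters, even though no single parameter is free":

```python
def loss(w):
    """Toy loss: depends on the 4 parameters only through w[0] + w[1]."""
    return (w[0] + w[1] - 1.0) ** 2

def loss_change(w, direction, step=0.5):
    """Change in loss after moving `step` along `direction` from `w`."""
    moved = [wi + step * di for wi, di in zip(w, direction)]
    return loss(moved) - loss(w)

w_opt = [0.25, 0.75, 3.0, -2.0]   # one of many zero-loss points

single_param = [1, 0, 0, 0]       # adjust w[0] alone
flat_combo = [1, -1, 0, 0]        # trade w[0] off against w[1]
unused_param = [0, 0, 1, 0]       # w[2] never enters the loss

for name, d in [("single_param", single_param),
                ("flat_combo", flat_combo),
                ("unused_param", unused_param)]:
    print(name, loss_change(w_opt, d))
```

Here the Hessian has a single non-flat eigendirection, proportional to (1, 1, 0, 0), so the "compressed info" in this toy is just the one number w[0] + w[1], even though it is spread over two parameters.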

Turns out our methods are not actually very path-dependent in practice!

Yeah, I get that's what Mingard et al are trying to show, but the meaning of their empirical results isn't clear to me. I'll try to properly read the actual paper, rather than the blog post, before saying any more in that direction.

"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

I get that a truly flat area is synonymous with compression - but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity. 

I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it, but I still worry that the same concerns apply: maximal compression in this space is traded against the ease of finding what would be a flat plain in many dimensions but a maximally steep ravine in all of the other directions. I can imagine this could be investigated with some small experiments, or they may well already exist, but I can't promise I'll follow up. If anyone is interested, let me know.

Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.

Yup! More generally, key pieces for modeling a "resource": amounts of the resource are additive, and more resources open up more actions (operationalized by the need for a positive balance in this case). If there's something roughly like that in the problem space, then the resource-seeking argument kicks in.
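
For anyone who wants to poke at this, here is a small brute-force sketch of the game with a much shorter horizon. The details are assumptions filled in for illustration, not taken from the talk: both fruits cost $1, the balance may never drop below zero at any step, and the horizon is 8 rather than 100 so that every sequence can be enumerated. The names `DELTA`, `is_valid` and `action_totals` are hypothetical.

```python
from itertools import product

ACTIONS = ("take", "nothing", "apple", "banana")
DELTA = {"take": 1, "nothing": 0, "apple": -1, "banana": -1}  # assumed $1 prices

def is_valid(seq):
    """Assumed rule: the balance may never drop below zero."""
    balance = 0
    for action in seq:
        balance += DELTA[action]
        if balance < 0:
            return False
    return True

def action_totals(horizon):
    """Count valid sequences and total uses of 'take' and 'nothing' in them."""
    valid = [s for s in product(ACTIONS, repeat=horizon) if is_valid(s)]
    takes = sum(s.count("take") for s in valid)
    nothings = sum(s.count("nothing") for s in valid)
    return len(valid), takes, nothings

n_valid, takes, nothings = action_totals(8)
print(n_valid, takes / n_valid, nothings / n_valid)
```

Note there is no apple goal anywhere in this sketch. In this toy version the balance constraint alone is enough to make "take" more common than "nothing" among valid sequences, since every purchase has to be paid for by an earlier "take".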

Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? Equivalent of a simulated human? Capabilities secret-sauce algorithm, right? I'm unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.

Assume it works as a simulated person and ignore mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)? Or, how do you use this to create a simulated long reflection? (ie what humans would decide ethics to be if they thought about it for [1000] years)

You could first figure out meta-preferences and bootstrap that in for figuring out preferences. Though, I'm unsure if there is a "correct" set of meta-preferences, with my main confusion being the blank spot in my map where "enlightenment" is.