evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example of how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
(1) An outside-view / reference-class argument: things we call optimization often produce misaligned results (evolution being the prime example), therefore SGD is dangerous.
(2) The analogy as an intuition pump: it makes salient some aspects of AI training that make misalignment more likely, and once those are salient you can drop the analogy and think directly about AI.
It sounds like you're arguing against (1). Fair enough, I too think (1) isn't a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all, then I don't think you should change your mind much on the strength of the analogy alone. But my best guess is that most people, including Nate, mean (2).
I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion.
IMO all these terms are a bit fuzzy / hard to pin down, so it makes sense that they'd sometimes lead to disagreement. The best way (afaict) to avoid this is to keep the terms grounded in the question 'what would be useful for avoiding AGI doom?'. To me it looks like, on your definition, outer alignment is basically a trivial problem whose solution doesn't help much with alignment.
More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.
Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values?
(I'm asking because it's hard for me to tell whether your definition of outer alignment is disconnected from the rest of the problem in a way that lets outer alignment become easier without the rest of the problem becoming easier.)
I think it's false in the sense that MIRI never claimed (as far as I can tell) that it would be hard to build an AI with a GPT-4-level understanding of human values plus a GPT-4-level willingness to answer honestly. The main reason I think it's false is that I haven't seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that's an important part of the picture for you, and I can see why some responses would just treat the "GPT-4 understands human values" part as the important bit (I made that mistake too on my first reading, before I went back and re-read).
It seems to me that explaining the original motivations for posts like The Hidden Complexity of Wishes is a good way to resolve this discussion, and it looks to me as if the responses from MIRI are trying to do exactly that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.
You make a claim that's very close to that - your claim, if I understand correctly, is that MIRI thought an AI wouldn't both understand human values and refrain from lying to us about them (or otherwise deciding to give misleading or unhelpful outputs):
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
I think this is similar enough (and false for the same reasons) that I don't think the responses are misrepresenting you that badly. Of course I might also be misunderstanding you, but I did read the relevant parts multiple times to make sure, so I don't think it makes sense to blame your readers for the misunderstanding.
My paraphrase of your (Matthew's) position: while I'm not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence that outer alignment is easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don't systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.
(End paraphrase)
I think this claim is mistaken, or at least it rests on false assumptions about what alignment researchers believe. Here's a bunch of different angles on why I think this:
My guess is that a big part of the disagreement here comes from some wrong assumptions you make about what alignment researchers believe.
I think you're putting a bit too much weight on the inner vs outer alignment distinction. The central problem people talked about was always how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW) Eliezer writes:
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
If you find something that looks to you like a solution to outer alignment / value specification, but it doesn't help make an AI care about human values, then you're probably mistaken about what actual problem the term 'value specification' is pointing at. (Or maybe you're claiming that value specification is just not relevant to AI safety - but I don't think you are?)
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that's not all that much worse than pointing at a human is not cause for an update about how hard value specification is. Part of the difficulty is how to define the pointer to the human and get a model to maximize human values rather than maximize some error in your specification. IMO THCW makes this point pretty well.
It's tricky to communicate problems in AI alignment: people come in with lots of different assumptions about what kinds of things are easy / hard, and it's hard to resolve disagreements because we don't have a live AGI to do experiments on. I think THCW and the related essays you criticize are actually great resources. They don't try to communicate the entire problem at once because that's infeasible. The fact that human values are complex and hard to specify explicitly is part of the reason why alignment is hard, where alignment means 'get the AI to care about human values', not 'get an AI to answer questions about moral behavior reasonably'.
You claim the existence of GPT-4 is evidence against the claims in THCW. But IMO GPT-4 fits in neatly with THCW. The post even starts with a taxonomy of genies:
There are three kinds of genies: Genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.
GPT-4 is an example of a genie that is not very powerful or intelligent.
If in 5 years we build firefighter LLMs that can rescue mothers from burning buildings when you ask them to, that would also not show that we've solved value specification - the burning-building wish is just a didactic example, not a full description of the actual technical problem. More broadly, I think it's plausible that within a few years LLMs will be able to give moral counsel far better than the average human. That still doesn't solve value specification, any more than the existence of humans who could give good moral counsel 20 years ago solved value specification.
If you could come up with a simple action-value function Q(observation, action) which, when maximized over actions, yields a good outcome for humans, then I think that would probably be helpful for alignment. This is an example of a result that doesn't directly make an AI care about human values, but would probably lead to progress in that direction. I think if it turned out to be easy to formalize such a Q then I would change my mind about how hard value specification is.
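To make the "maximize over actions" step concrete, here's a minimal sketch (all names are hypothetical; the argmax wrapper is trivial, and the entire difficulty lives in writing the value function itself):

```python
# Minimal sketch (hypothetical): what using such a Q would look like.
# The generic maximizer below is trivial; nobody knows how to write human_value_Q.

def human_value_Q(observation, action) -> float:
    """Hypothetical: score how good the outcome of taking `action` is for humans."""
    raise NotImplementedError  # the hard, unsolved part

def choose_action(observation, candidate_actions):
    """Generic maximizer: pick whichever action the value function scores highest."""
    return max(candidate_actions, key=lambda a: human_value_Q(observation, a))
```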
While language models understand human values to some extent, they aren't robust. The RLHF/RLAIF family of methods is based on using an LLM as a reward model, and to make things work you need to be careful not to optimize too hard or you'll just get gibberish (Gao et al. 2022). LLMs don't hold up against mundane RLHF optimization pressure, never mind an actual superintelligence. (Of course, humans wouldn't hold up either.)
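For concreteness, the setup I have in mind is the standard KL-regularized RLHF objective (roughly the setting Gao et al. study), something like

$$\max_{\pi}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r_\theta(x,y)\big]\;-\;\beta\, D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

where r_θ is the learned reward model standing in for human judgement and π_ref is the reference policy; the β-weighted KL penalty is there precisely because optimizing hard against r_θ without it degrades outputs.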
Broadly agree with the takes here.
However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.
This seems right and I don't think we say anything contradicting it in the paper.
I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying 'bricks fall because it's in a brick's nature to move towards the ground': both are repackaging an observation as an explanation.
The idea is that 'learning at different speeds' is a framing that lets you see grokking and double descent as the same phenomenon. It's more like generalizing 'bricks move towards the ground' and 'rocks move towards the ground' to 'objects move towards the ground'. I don't think we make any grand claims about explaining everything in the paper, but I'll have a look and see if there are edits I should make - thanks for raising these points.
A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).
Prong 1: boxing & capability control (aka ‘careful bootstrapping’)
Prong 2: scary demos and convincing people that AGI is dangerous
Prong 3: alignment research aka “understanding minds”
There are positive feedback loops between prongs:
If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.
The argument I think is good ((2) in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (e.g. "things we call optimization often produce misaligned results, therefore SGD is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.