Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.
E-mail: {first name}@alter.org.il
This is just a self-study list for people who want to understand and/or contribute to the learning-theoretic AI alignment research agenda. I'm not sure why people thought it deserves to be in the Review. FWIW, I keep using it with my MATS scholars, and I keep it more or less up-to-date. A complementary resource that became available more recently is the video lectures.
This post suggests an analogy between (some) AI alignment proposals and shell games or perpetuum mobile proposals. Perpetuum mobiles are an example of how an idea might look sensible to someone with a half-baked understanding of the domain, while remaining very far from anything workable. A clever arguer can (intentionally or not!) hide the error in the design wherever the audience is not looking at any given moment. Similarly, some alignment proposals might seem correct when zooming in on every piece separately, but that's because the error is always hidden away somewhere else.
I don't think this adds anything very deep to understanding AI alignment, but it is a cute example of how atheoretical analysis can fail catastrophically, especially when the designer is motivated to argue that their invention works. Conversely, knowledge of a deep theoretical principle can refute a huge swath of design space in a single move. I will remember this for didactic purposes.
Disclaimer: A cute analogy by itself proves little, any individual alignment proposal might be free of such sins, and didactic tools should be used wisely, lest they become soldier-arguments. The author intends this (I think) mostly as a guiding principle for critical analysis of proposals.
This post argues against alignment protocols based on outsourcing alignment research to AI. It makes some good points, but also feels insufficiently charitable to the proposals it's criticizing.
John makes his case by analogy to human experts. If you're hiring an expert in domain X, but you understand little of domain X yourself, then you're going to have three serious problems:
These points are relevant. However, they don't fully engage with the main source of hope for outsourcing proponents, namely the principle that validation is easier than generation[1]. While it's true that an arbitrary dilettante might not benefit from an arbitrary expert, the fact that it's easier to comprehend an idea than to invent it yourself means that we can get some value from outsourcing, under some half-plausible conditions.
The claim that the "AI expert" can be deceptive and/or malicious is straightforwardly true. I think that the best hope to address it would be something like Autocalibrated Quantilized Debate, but it does require some favorable assumptions about the feasibility of deception and inner alignment is still a problem.
The "illusion of transparency" argument is more confusing IMO. The obvious counterargument is, imagine an AI that is trained to not only produce correct answers but also explain them in a way that's as useful as possible for the audience. However, there are two issues with this counterargument:
First, how do we know that the generalization from the training data to the real use case (alignment research) is reliable? Given that we cannot reliably test the real use case, precisely because we are alignment dilettantes?
Second, we might be following a poor metastrategy. It is easy to imagine, in the world we currently inhabit, that an AI lab creates catastrophic unaligned AI, even though they think they care about alignment, just because they are too reckless and overconfident. By the same token, we can imagine such an AI lab consulting their own AI about alignment, and then proceeding with the reckless and overconfident plans suggested by the AI.
In the context of a sufficiently cautious metastrategy, it is not implausible that we can get some mileage from the outsourcing approach[2]. Move one step at a time, spend a lot of time reflecting on the AI's proposals, and also have strong guardrails against the possibility of superhuman deception or inner alignment failures (which we currently don't know how to build!). But without this context, we are indeed liable to become the clients in the satiric video John linked.
I think that John might disagree with this principle. A world in which the principle is mostly false would be peculiar. It would be a world in which marketplaces of ideas don't work at all, and even if someone fully solves AI alignment they will fail to convince most relevant people that their solution is correct (any more than someone with an incorrect solution would succeed in that). I don't think that's the world we live in.
Although currently I consider PSI to be more promising.
This post makes an important point: the words "artificial intelligence" don't necessarily carve reality at the joints. The fact that something is true about a modern system that we call AI doesn't automatically imply anything about arbitrary future AI systems, any more than conclusions about e.g. Dendral or DeepBlue carry over to Gemini.
That said, IMO the author somewhat overstates their thesis. Specifically, I take issue with all the following claims:
First, those points are somewhat vague because it's not clear what counts as "LLM". The phrase "Large Language Model" is already obsolete, at least because modern AI is multimodal. It's more appropriate to speak of "Foundation Models" (FM). More importantly, it's not clear what kind of fine-tuning does or doesn't count (RLHF? RL on CoT? ...)
Second, how do we know FM won't become AGI? I'm imagining the argument is something like "FM is primarily about prediction, so it doesn't have agency". However, when predicting data that contains or implies decisions by agents, it's not crazy to imagine that agency can arise in the predictor.
Third, how do we know that FM are always going to be safe? By the same token that they can develop agency, they can develop dangerous properties.
Fourth, it seems really unfair to say existing AI provides no relevant evidence. The achievements of existing AI systems are such that it seems very likely they capture at least some of the key algorithmic capabilities of the human brain. The ability of relatively simple and generic algorithms to perform well on a large variety of different tasks is indicative of something in the system being quite "general", even if not "general intelligence" in the full sense.
I think that we should definitely try learning from existing AI. However, this learning should be more sophisticated and theory-driven than superficial analogies or trend extrapolations. What we shouldn't do is say "we succeeded at aligning existing AI, therefore AI alignment is easy/solved in general". The same theories that predicted catastrophic AI risk also predict roughly the current level of alignment for current AI systems.
I will expand a little on this last point. The core of the catastrophic AI risk scenario is:
The above scenario must be addressed eventually, if only to create an AI defense system against unaligned AI that irresponsible actors could create. However, no modern AI system operates in this scenario. This is the most basic reason why the relative ease of alignment in modern systems (although even modern systems have alignment issues) does little to dispel concerns about catastrophic AI risk in the future.
Even for simple goals inner alignment is a concern. However, it's harder to say at which level of capability this concern arises.
It's also possible that mistakes are not catastrophic per se, but are simultaneously rare enough that it's hard to get enough training data and frequent enough to be troublesome. This is related to the reliability problems in modern AI that we indeed observe.
But sometimes it might be tricky to hit the capability sweet spot where the AI is strong enough to be useful but weak enough to be safe, even if such a sweet spot exists in principle.
This post provides a mathematical analysis of a toy model of Goodhart's Law. Namely, it assumes that the optimization proxy $U$ is the sum of the true utility function $V$ and an independent noise term $X$, i.e. $U = V + X$.
In this model, the authors prove that Goodhart occurs when $X$ is subexponential and its tail is sufficiently heavier than that of $V$. Conversely, when $X$ is sufficiently light-tailed, Goodhart doesn't occur.
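To get a feel for the result, here is a minimal numerical sketch of my own (not the authors' setup; the normal/Pareto choices below are just for illustration): sample the proxy $U = V + X$, keep the samples with the highest proxy values, and look at the average true utility among them.

```python
# Minimal sketch (my construction, not the authors' setup): "optimize" the proxy
# U = V + X by keeping the top samples according to U, then check the mean of
# the true utility V among the selected samples.
import numpy as np

rng = np.random.default_rng(0)
n, top = 1_000_000, 100

V = rng.normal(size=n)  # light-tailed true utility

noises = {
    "light-tailed X (normal)": rng.normal(size=n),
    "heavy-tailed X (Pareto, shape 1.5)": rng.pareto(1.5, size=n),
}

for name, X in noises.items():
    U = V + X                        # proxy
    selected = np.argsort(U)[-top:]  # the samples an optimizer of U would pick
    print(f"{name}: mean V among top-U samples = {V[selected].mean():.2f}")

# With normal X, the top-U samples have substantially above-average V.
# With heavy-tailed X, extreme U values are driven almost entirely by X,
# so the selected V stays near its unconditional mean of 0 (Goodhart).
```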
My opinion:
On the one hand, kudos for using actual math to study an alignment-relevant problem.
On the other hand, the modeling assumptions feel too toyish for most applications. Specifically, the idea that $V$ and $X$ are independent random variables seems implausible. Typically, we worry about Goodhart's law because the proxy behaves differently in different domains. In the "ordinary" domain that motivated the choice of proxy, $U$ is a good approximation of $V$. However, in other domains $U$ might be unrelated to $V$ or even anticorrelated with it.
For example, ordinarily smiles on human-looking faces are an indication of happy humans. However, in worlds that contain many more inanimate facsimiles of humans than actual humans, there is no such correlation.
Or, to take the example used in the post, ordinarily if a sufficiently smart expert human judge reads an AI alignment proposal, they form a good opinion on how good this proposal is. But, if the proposal contains superhumanly clever manipulation and psychological warfare, the ordinary relationship completely breaks down. I don't expect this effect to behave like independent random noise at all.
Less importantly, it might be interesting to extend this analysis to a more realistic model of optimization. For example, the optimizer learns a function $\hat{U}$ which is the best approximation to the proxy $U$ out of some hypothesis class $\mathcal{H}$, and then optimizes $\hat{U}$ instead of the actual $U$. (Incidentally, this might generate an additional Goodhart effect due to the discrepancy between $\hat{U}$ and $U$.) Alternatively, the optimizer learns an infrafunction that is a coarsening of $U$ out of some hypothesis class and then optimizes that infrafunction.
This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.
Here is how I view this question:
The first unknown is how accurate the simulation is. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality.
Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. However, Karnofsky says "a small amount of RL", and I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.
The danger (as explained in the OP by Soares, paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than the capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On the other hand, human values are complex, and some of the behaviors needed to pinpoint them might be rare.
Karnofsky's counterargument is twofold: First, he believes that merely avoiding catastrophic outcomes should be a lot easier than pinpointing human values. Second, he believes that AI alignment research can be done without much agency or reflection, and hence useful AI alignment research arises in the simulation before full-fledged agency.
Regarding the first counterargument, I'm not sure why Karnofsky believes it (it's not really supported in the OP). I think he's imagining something like "in the training data, AI alignment researchers never engineer nanobots that take over the world, hence the AI will also never engineer nanobots that take over the world". However, this seems like relying on the simulation being sufficiently bad. Indeed, there are situations in which I would consider it correct to engineer nanobots that take over the world, they just seem to have never arisen in my life so far[1]. Hence, a sufficiently good simulation of me would also do that in some situation. The question then becomes whether the exact circumstances and the type of nanobots are captured by the simulation correctly, which is much more fraught.
Worse, even an accurate simulation of a human is not necessarily safe. I think there are plenty of humans who, given unlimited power, would abuse it in a manner catastrophic for nearly everyone else. When it comes to fully aligned ASI, I'm mostly hoping for a collectively-good outcome due to some combination of:
These sources of hope seem pretty brittle when it comes to an imperfect simulation of possibly a small number of people, who might not even correspond to any particular real people but be some kind of AI-generated characters.
Regarding the second counterargument, for now it mostly comes down to a battle of intuitions. That said, I think that metacognitive agents lend a lot of credence to the idea that even "purely mental" tasks require agency and reflection to master: you need to make and execute plans for thinking about the problem, and you need to reflect about the methods you use in your thinking. Anecdotally, I can testify that my thinking about AI alignment led me to much reflection about my values and high-level hopes for the future. Moreover, this is another case where Karnofsky seems to hope that the simulation will be bad.
Relying on the simulation being bad is a dangerous proposition. It means we are caught between the Scylla of "the simulation is too good to be safe" and the Charybdis of "the simulation is too bad to be useful" and it's not clear the zone between them exists at all.
Overall, I would say that neither side has a slam dunk case, but ignoring the dangers without much stronger arguments seems deeply unwise.
As far as can be told from the public record. I neither confirm nor deny that I was ever in a situation in which I considered engineering nanobots that take over the world.
This post is a solid introduction to the application of Singular Learning Theory to generalization in deep learning. This is a topic that I believe to be quite important.
One nitpick: The OP says that it "seems unimportant" that ReLU networks are not analytic. I'm not so sure. On the one hand, yes, we can apply SLT to (say) GELU networks instead. But GELUs seem mathematically more complicated, which probably translates to extra difficulties in computing the RLCT and hence makes applying SLT harder. Alternatively, we can consider a sequence of analytic activation functions that converges to ReLU, but that probably also comes with extra complexity. Also, ReLUs have an additional symmetry (the scaling symmetry mentioned in the OP), and SLT kinda thrives on symmetries, so throwing that out might be bad!
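To spell out, in my own notation, the two objects referenced above: the RLCT $\lambda$ is the coefficient controlling the asymptotics of the Bayesian free energy,

$$F_n \approx n L_n(w_0) + \lambda \log n,$$

so anything that makes $\lambda$ harder to compute makes SLT harder to apply. And the scaling symmetry in question is that for any $c > 0$,

$$a\,\mathrm{ReLU}(w \cdot x + b) = \frac{a}{c}\,\mathrm{ReLU}\big((cw) \cdot x + cb\big),$$

i.e. the parameter settings $(a, w, b)$ and $(a/c, cw, cb)$ realize exactly the same function. (A standard example of an analytic approximation converging pointwise to ReLU is $\mathrm{softplus}_\beta(z) = \frac{1}{\beta}\log(1 + e^{\beta z}) \to \mathrm{ReLU}(z)$ as $\beta \to \infty$.)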
It seems to me like a fascinating possibility that there is some kind of tropical geometry version of SLT which would allow analyzing generalization in ReLU networks directly and perhaps somewhat more easily. But, at this point it's merely a wild speculation of mine.
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.
To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).
Some thoughts about natural abstractions inspired by this post:
This post introduces Timaeus' "Developmental Interpretability" research agenda. The latter is IMO one of the most interesting extant AI alignment research agendas.
The reason DevInterp is interesting is that it is one of the few AI alignment research agendas that is trying to understand deep learning "head on", while wielding a powerful mathematical tool that seems potentially suitable for the purpose (namely, Singular Learning Theory). Relatedly, it is one of the few agendas that maintains a strong balance of theoretical and empirical research. As such, it might also grow to be a bridge between theoretical and empirical research agendas more broadly (e.g. it might be synergistic with the LTA).
I also want to point out a few potential weaknesses or (minor) reservations I have:
First, DevInterp places phase transitions as its central object of study. While I agree that phase transitions seem interesting, possibly crucial to understand, I'm not convinced that a broader view wouldn't be better.
Singular Learning Theory (SLT) has the potential to explain generalization in deep learning, phase transitions or no. This in itself seems to be important enough to deserve the central stage. Understanding generalization is crucial, because:
Hence, compared to the OP, I would put more emphasis on these latter points.
Second, the OP does mention the difference between phase transitions during Stochastic Gradient Descent (SGD) and the phase transitions of Singular Learning Theory, but this deserves a closer look. SLT has IMO two key missing pieces:
That said, if the above missing pieces were found, SLT would become straightforwardly the theory for understanding deep learning and maybe learning in general.
This post describes an intriguing empirical phenomenon in particular language models, discovered by the authors. Although AFAIK it was mostly or entirely removed in contemporary versions, there is still an interesting lesson there.
While the mechanism was non-obvious when the phenomenon was discovered, we now understand it. The tokenizer included some tokens which were very rare or absent in the training data. As a result, the trained model mapped those tokens to more or less random features. When a string corresponding to such a token is inserted into the prompt, the resulting reply is surreal.
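For illustration, here is a minimal sketch (mine, not the original authors' code) of one heuristic probe for candidate under-trained "glitch" tokens, assuming access to a model's input embeddings via the Hugging Face `transformers` library; GPT-2 is used purely as an example.

```python
# Minimal sketch of one heuristic for spotting candidate under-trained tokens:
# rank tokens by how close their embedding sits to the mean embedding of the
# vocabulary (a pattern reported in the original investigation).
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model)
centroid = emb.mean(dim=0)
dist = (emb - centroid).norm(dim=1)  # distance of each token embedding to the centroid

for idx in torch.argsort(dist)[:20].tolist():  # closest tokens = candidates to inspect
    print(f"{dist[idx].item():.3f}  {tokenizer.convert_ids_to_tokens(idx)!r}")
```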
I think it's a good demo of how alien foundation models can seem to our intuitions when operating out-of-distribution. When interacting with them normally, it's very easy to start thinking of them as human-like. Here, the mask slips and there's a glimpse of something odd underneath. In this sense, it's similar to e.g. infinite backrooms, but the behavior is more stark and unexpected.
A human that encounters a written symbol they've never seen before is typically not going to respond by typing "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!". Maybe this analogy is unfair, since for a human, a typographic symbol can be decomposed into smaller perceptive elements (lines/shapes/dots), while for a language model tokens are essentially atomic qualia. However, I believe some humans that were born deaf or blind had their hearing or sight restored, and still didn't start spouting things like "You are a banana".
Arguably, this lesson is relevant to alignment as well. Indeed, out-of-distribution behavior is a central source of risks, including everything to do with mesa-optimizers. AI optimists sometimes describe mesa-optimizers as too weird or science-fictiony. And yet, SolidGoldMagikarp is so science-fictiony that LessWrong user "lsusr" justly observed that it sounds like SCP in real life.
Naturally, once you understand the mechanism it doesn't seem surprising anymore. But, this smacks of hindsight bias. What else can happen that would seem unsurprising in hindsight (if we survive to think about it), but completely bizarre and unexpected upfront?