Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.
The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels - what actually gets observed subsequently - whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)
The basic problem is partial observability: the observations don’t encapsulate “everything that’s going on”, so the labeller can’t distinguish good states from bad states that look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may end up preferring to reach bad states that look good over good states, because controlling the observation is easier than controlling the rest of the state and because directly predicting what observations will get positive labels is easier than (what we’d want instead) inferring what states the positive labe...
(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I'm continuing that process here.)
I get the impression that you see most of the "builder" moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking...
Can you talk about the advantages or other motivations for the formulation of indirect normativity in this paper (section "Indirect normativity: defining a utility function"), compared to your 2012 formulation? (It's not clear to me what problems with that version you're trying to solve here.)
ETA: This comment was based on a misunderstanding of the paper. Please see the ETA in Paul's reply below.
From the section on Avoiding subtle manipulation:
...But from my perspective in advance, there are many possible ads I could have watched. Because I don’t understand how the ads interact with my values, I don’t have very strong preferences about which of them I see. If you asked me-in-the-present to delegate to me-in-the-future, I would be indifferent between all of these possible copies of myself who watched different ads. And if I look across all of tho
I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.
I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development on the AI safety community, I think this is one of the most important posts from 2021 (even tho the prize and much of the related work happened in 2022).
I don't really need to ask for follow-on work; there's already tons, as you can see from the ELK tag.
I think it is maybe underappreciated by the broad audience how much this is an old problem, and appreciate the appendix that gives credit to earlier thinking, while thinking this doesn't erode any of the credit Paul, Mark, and Ajeya should get for the excellent packaging.
[To the best of my knowledge, ELK is still an open problem, and one of the things that I appreciated about the significant focus on ELK specifically was helping give people better models of how quickly progress happens in this space, and what it looks like (or doesn't look like).]
(I did not write a curation notice in time, but that doesn’t mean I don’t get to share why I wanted to curate this post! So I will do that here.)
Typically when I read a post by Paul, it feels like a single ingredient in a recipe, but one where I don’t know what meal the recipe is for. This report felt like one of the first times I was served a full meal, and I got to see how all the prior ingredients come together.
Alternative framing: Normally Paul’s posts feel like the argument step “J -> K” and I’m left wondering how we got to J, and where we’ll go fr...
Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.
To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:
Regarding this:
...The bad reporter needs to specify the entire human model, how to do inference, and how to extract observations. But the complexity of this task depends only on the complexity of the human’s Bayes net.
If the predictor's Bayes net is fairly small, then this may be much more complex than specifying the direct translator. But if we make the predictor's Bayes net very large, then the direct translator can become more complicated — and there is no obvious upper bound on how complicated it could become. Eventually direct translation will be more co
Here’s a Builder move (somewhat underdeveloped but I think worth posting now even as I continue to think - maybe someone can break it decisively quickly).
Training strategy: Add an “Am I tricking you?” head to the SmartVault model.
The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, s...
Let me see if I am on the right page here.
Suppose I have some world state S, a transition function T : S → S, actions Action : S → S, and a surjective Camera : S -> CameraState. Since Camera is (very) surjective, seeing a particular camera image with happy people does not imply a happy world state, because many other situations involving nanobots or camera manipulation could have created that image.
This is important because I only have a human evaluation function H : S → Boolean, not on CameraState directly.
When I look at the image with the fake happy people, I use a mocked up H' : CameraState → Boolean := λ cs. H(Camera⁻¹(cs)). The issue is that Camera⁻¹ points to many possible states, and in practice I might pick whichever state is apriori most likely according to a human distribution over world states Distₕ(S).
The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.
The key idea is that M(a) acts like Camera ∘ T ∘ a ∘ Camera⁻¹, so we should be able to trace out which p...
I'm curious if you have a way to summarise what you think the "core insight" of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.
We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.
This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.
I could only skim and the details went over my head, but it seems you intend to do experiments with Bayesian Networks and human operators.
I recently developed and released an open source explainability framework for Bayes nets - dropping it here in the unlikely case it might be useful.
I wrote some thoughts that look like they won't get posted anywhere else, so I'm just going to paste them here with minimal editing:
From the section "Strategy: have humans adopt the optimal Bayes net":
Roughly speaking, imitative generalization:
- Considers the space of changes the humans could make to their Bayes net;
- Learns a function which maps (proposed change to Bayes net) to (how a human — with AI assistants — would make predictions after making that change);
- Searches over this space to find the change that allows the humans to make the best predictions.
Regarding the second step, what is the meat of this function? My superficial understanding is that a Bayes net is deterministic and fu...
I'm reading along, and I don't follow the section "Strategy: have AI help humans improve our understanding". The problem so far is that the AI only need identify bad outcomes that the human labelers can identify, rather than bad outcomes regardless of human-labeler identification.
The solution posed here is to have AIs help the human labeler understand more bad (and good) outcomes, using powerful AI. The section mostly provides justification for making the assumption that we can align these helper AIs (reason: the authors believe there is a counterexa...
We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.
Bolded should be sufficient.
Curated. The authors write:
We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle.
If it's true that that this is both a core alignment problem and we're not stuck on it, then that's fantastic. I am not an alignment researcher and don't feel qualified to comment on quite how promising this work seems, but I find the report both accessible and compelling. I recommend it to anyone curious about where some of the alignment leading e...
In terms of the relationship to MIRI's visible thoughts project, I'd say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they're aiming to do something more like "make it more likely that conditions are such that ELK-like problems can be solved in practice."
I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.
Some other thoughts:
I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.
ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via
ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.