AI ALIGNMENT FORUM
AF

All of John_Maxwell's Comments + Replies

I'll respond to the "Predict hypothetical sensors" section in this comment.

First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.

But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assump... (read more)

2Paul Christiano3y

If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data. In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seems to be similar to the "human operates SmartVault and knows if tampering occurred" proposal we discussed here. I do think that excising knowledge is a substantive change, I feel like it's effectively banking on "if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything." But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it's possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge is still not in-distribution with hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, the possibility of invisible and untouchable cameras isn't) and conservative bounds on what humans can know will still hit the one but not the other.

Counterexamples to some ELK proposals

John Maxwell3y*30

Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.

One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)

Given that assumption, this problem seems potentially solvable

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get

John Maxwell3y10

I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.

Some other thoughts:

I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.
ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via

... (read more)

2Paul Christiano3y

Thanks for the kind words (and proposal)! I broadly agree that "train a bunch of models and panic if any of them say something is wrong." The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix. We're imagining the case where the predictor internally performs inference in a learned model, i.e. we're not explicitly learning a bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don't think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases. (We're actually planning to do a narrower contest focused on ELK proposals.)

The Plan

John Maxwell3y*40

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

Interpreting Yudkowsky on Deep vs Shallow Knowledge

John Maxwell3y40

For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:

The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.

This seem like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specif... (read more)

Visible Thoughts Project and Bounty Announcement

John Maxwell3y10

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that sto... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

John Maxwell4y*140

I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking i... (read more)

Open Problems with Myopia

John Maxwell4y30

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused becaus... (read more)

5Evan Hubinger4y

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.

A Semitechnical Introductory Dialogue on Solomonoff Induction

John Maxwell4y20

...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representativ... (read more)

Tournesol, YouTube and AI Risk

John Maxwell4y10

Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.

Makes sense.

...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

I think it might be useful to distinguish between being aware of onesel... (read more)

Tournesol, YouTube and AI Risk

John Maxwell4y*70

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets them to spend more time on Youtube on a macro level ... (read more)

1Adam Shimi4y

I agree, but I think you can have problems (and even Predict-O-Matic like problems) without reaching that different sort of reasoning. Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say. One could also argue that because YouTube videos contain so much info about the real world, a powerful enough algorithm using them can probably develop a pretty good model of the world. And there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded. Agreed, this is more the kind of problem that emerges from RL like training. The page on the Tournesol wiki about this subject points to this recent paper that propose a recommendation algorithm tried in practice on YouTube. AFAIK we don't have access to the actual algorithm used by YouTube, so it's hard to say whether it's using RL; but the paper above looks like evidence that it eventually will be.

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

John Maxwell4y90

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.

I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.

Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

If you think the alignment problem is hard, you probably ... (read more)

2Alex Flint4y

Right, I agree that having a benevolent god run the world is not within our choice set. Well just to re-state the suggestion in my original post: is this dichotomy between humans running the world or something else running the world really so inescapable? The child in the sand pit does not really run the world, and in an important way the parent also does not run the world -- certainly not from the perspective of the child's whole-life trajectory.

Why GPT wants to mesa-optimize & how we might change this

John Maxwell4y20

I was using it to refer to "any inner optimizer". I think that's the standard usage but I'm not completely sure.

Why GPT wants to mesa-optimize & how we might change this

John Maxwell4y10

With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking.

Let's say we want our language model to design us an aircraft. Perhaps its starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the... (read more)

1A Ray4y

Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?

Why GPT wants to mesa-optimize & how we might change this

John Maxwell4y10

Thanks for sharing!

John_Maxwell's Shortform

John Maxwell4y10

That's possible, but I'm guessing that it's not hard for a superintelligent AI to suddenly swallow an entire system using something like gray goo.

John_Maxwell's Shortform

John Maxwell4y20

In this reaction to Critch's podcast, I wrote about some reasons to think that a singleton would be preferable to a multipolar scenario. Here's another rather exotic argument.

[The dark forest theory] is explained very well near the end of the science fiction novel, The Dark Forest by Liu Cixin.

...

When two [interstellar] civilizations meet, they will want to know if the other is going to be friendly or hostile. One side might act friendly, but the other side won't know if they are just faking it to put them at ease while armies are built in secret.

... (read more)

2Diffractor4y

Potential counterargument: Second-strike capabilities are still relevant in the interstellar setting. You could build a bunch of hidden ships in the oort cloud to ram the foe and do equal devastation if the other party does it first, deterring a first strike even with tensions and an absence of communication. Further, while the "ram with high-relativistic objects" idea works pretty well for preemptively ending a civilization confined to a handful of planets, AI's would be able to colonize a bunch of little asteroids and KBO's and comets in the oort cloud, and the higher level of dispersal would lead to preemptive total elimination being less viable.

John_Maxwell's Shortform

John Maxwell4y*70

A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.

In a possibly inaccurate nutshell, Critch argues that what we think of as the "alignment problem" is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton--he says people tend to resist that kind... (read more)