Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.
One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)
Given that assumption, this problem seems potentially solvable.
...Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get
I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.
Some other thoughts:
I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.
ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via
(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)
There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.
Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.
For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:
The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specif...
As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.
Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that sto...
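To make the two orderings concrete, here's a hypothetical sketch (the strings and variable names are placeholders, not from any actual dataset):

```python
# Hypothetical illustration of the two orderings discussed above.

story = "The goblin lunges at the party."
thought = "(DM: raise the stakes before the next room.)"

# Thought-after: the thought is generated conditional on the story text,
# so it reads as a rationalization of text that already exists.
prompt_thought_after = story + "\n" + thought

# Thought-before: the story text is generated conditional on the thought,
# so the thought can actually steer what gets written next.
prompt_thought_before = thought + "\n" + story
```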
I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking i...
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
Are you sure that "episode" is the word you're looking for here?
https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL
I'm especially confused becaus...
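For concreteness, here is a minimal sketch of the quoted environment. The +1/-10 numbers come from the quote; carrying the penalty across an episode boundary is my reading of "reward next episode", which is exactly the part that confuses me:

```python
# Minimal sketch of the quoted toy environment. The +1/-10 values follow the
# quote; paying the -10 at the start of the *next* episode is an assumption
# about how "reward next episode" would actually be implemented.

class ButtonEnv:
    PRESS, REFRAIN = 0, 1

    def __init__(self):
        self.pending_penalty = 0.0  # penalty owed from the previous episode

    def reset(self):
        # Start a new episode; any penalty incurred last episode is still owed.
        return "button_visible"

    def step(self, action):
        reward = self.pending_penalty
        self.pending_penalty = 0.0
        if action == self.PRESS:
            reward += 1.0                 # immediate gain this episode
            self.pending_penalty = -10.0  # paid at the start of the next episode
        done = True  # one decision per episode in this toy setup
        return "terminal", reward, done, {}
```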
...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.
ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.
I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representativ...
Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.
Makes sense.
...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.
I think it might be useful to distinguish between being aware of onesel...
I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets them to spend more time on YouTube on a macro level ...
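A toy expected-value calculation to illustrate (all numbers made up): the recommender only needs myopic per-user estimates for polarizing content to win, no "jumping out of the system" required.

```python
# Toy numbers, purely illustrative: myopic expected watch time is enough
# to make polarizing recommendations look like the better bet.

p_rabbit_hole = 0.10          # chance the user heads down a rabbit hole
rabbit_hole_minutes = 120     # extra watch time if they do
baseline_minutes = 8          # typical watch time for a neutral video

ev_polarizing = baseline_minutes + p_rabbit_hole * rabbit_hole_minutes  # 20.0
ev_neutral = baseline_minutes                                           # 8

print(ev_polarizing > ev_neutral)  # True: polarizing content wins on myopic EV
```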
Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.
I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.
Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).
If you think the alignment problem is hard, you probably ...
With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking.
Let's say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the...
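A minimal sketch of that resampling idea, assuming access to some replacement-proposal function (stubbed out here; a real version would query the language model with context on both sides):

```python
import random

def propose_replacement(context_left, context_right):
    # Placeholder: a real implementation would ask a language model for a word
    # that fits between the two contexts (e.g. via masked prediction or rescoring).
    return "[resampled]"

def resample_edit(words, num_edits=10, rng=random):
    """Randomly pick positions and resample the word there, keeping the rest fixed.

    Unlike left-to-right generation, an edit at position i is conditioned on words
    both before and after i, so later parts of the text (e.g. the wing design) can
    influence earlier parts (e.g. the engine design).
    """
    words = list(words)
    for _ in range(num_edits):
        i = rng.randrange(len(words))
        words[i] = propose_replacement(words[:i], words[i + 1:])
    return words
```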
In this reaction to Critch's podcast, I wrote about some reasons to think that a singleton would be preferable to a multipolar scenario. Here's another rather exotic argument.
[The dark forest theory] is explained very well near the end of the science fiction novel, The Dark Forest by Liu Cixin.
...
...When two [interstellar] civilizations meet, they will want to know if the other is going to be friendly or hostile. One side might act friendly, but the other side won't know if they are just faking it to put them at ease while armies are built in secret.
A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.
In a possibly inaccurate nutshell, Critch argues that what we think of as the "alignment problem" is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton--he says people tend to resist that kind...
Your philosophical point is interesting; I have a post in the queue about that. However I don't think it really proves what you want it to.
Having John_Maxwell in the byline makes it far more likely that I'm the author of the post.
If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle.
...GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead...
So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).
My intuition is that small-L lookahead could be close to large-L...
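One way to see why large-L lookahead might be nearby in programspace: the same depth-limited search routine covers both cases, with L as a parameter. A rough sketch, with next_token_probs standing in for whatever distribution the model computes internally (uniform here just so it runs):

```python
import itertools

def next_token_probs(prefix, vocab):
    # Stand-in for the model's next-token distribution; uniform here just so
    # the sketch runs. Nothing hinges on this particular choice.
    return {tok: 1.0 / len(vocab) for tok in vocab}

def lookahead_score(prefix, continuation, vocab):
    """Probability of a specific L-token continuation under the model."""
    score = 1.0
    for tok in continuation:
        score *= next_token_probs(prefix, vocab)[tok]
        prefix = prefix + (tok,)
    return score

def best_continuation(prefix, vocab, L):
    """Exhaustive depth-L lookahead: identical code for small L and large L."""
    return max(itertools.product(vocab, repeat=L),
               key=lambda cont: lookahead_score(prefix, cont, vocab))

# best_continuation(("the",), ["cat", "dog", "END"], L=2)
# The same program does small-L and large-L lookahead; only the parameter changes.
```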
- Stopping mesa-optimizing completely seems mad hard.
As I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me.
- Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.
I'm less optimistic about this approach.
There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimi
Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.
Yeah, that's the possibility the post explores.
At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).
Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead...
BTW with regard to "studying mesa-optimization in the context of such systems", I just published this post: Why GPT wants to mesa-optimize & how we might change this.
I'm still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I'm still thinking about if/how that could result in dangerous mesa-optimization.
Well, I suppose mesa-optimization isn't really a binary, is it? Like, maybe there's a trivial sense in which self-attention "mesa-optimizes" over its input when figuring out what to pay attention to.
But ultimately, what matters isn't the definition of the term "mesa-optimization", it's the risk of spontaneous internal planning/optimization that generalizes in unexpected ways or operates in unexpected domains. At least in my mind. So the question is whether this considering multiple possibilities about text stuff could also improve its ability to consider ...
This post distinguishes between mesa-optimization and learned heuristics. What you're describing sounds like learned heuristics. ("Learning which words are easy to rhyme" was an example I gave in the post.) Learned heuristics aren't nearly as worrisome as mesa-optimization because they're harder to modify and misuse to do planning in unexpected domains. When I say "lookahead" in the post I'm pretty much always referring to the mesa-optimization sort.
The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way.
GPT generates text by repeatedly picking whatever word seems highest probability given all the words that came before. So if its notion of "highest probability" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary?
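For reference, the generation loop described here (always take the highest-probability next word) is just greedy decoding; a minimal sketch, with next_word_distribution standing in for the model's forward pass (a real deployment might sample instead, but the structure is the same):

```python
def next_word_distribution(context):
    # Stand-in for the model: maps a context (list of words) to {word: probability}.
    # A real system would run a forward pass of the network here.
    return {"the": 0.6, "END": 0.4}

def greedy_generate(prompt_words, max_words=50):
    words = list(prompt_words)
    for _ in range(max_words):
        dist = next_word_distribution(words)
        best = max(dist, key=dist.get)   # always pick the highest-probability word
        if best == "END":
            break
        words.append(best)
    return words

# greedy_generate(["Once", "upon"])
```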
esp. since GPT-3's 0-shot learning looks like mesa-optimization
Could you provide more details on this?
Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?
If that's what you're saying... That seems unlikely to me. GPT-3 is essentially a stack of 96 tr...
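For concreteness, the prompt format I have in mind looks something like this (made-up example):

```python
# A made-up example of the few-shot prompt format described above: a few
# input/response pairs, then a new input for the model to complete.
prompt = """\
English: cheese
French: fromage

English: house
French: maison

English: dog
French:"""
# Whether completing this well amounts to "training a model internally"
# (something MAML-like) or to pattern-matching learned heuristics is the
# question at issue in this thread.
```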
Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.
The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.
Note that this decomposition is simpler than a "reasonable" version of figure 4, since the boundaries between the three modules don't need to be specified.
Consider two versions of the same program. One makes use of a bunch of copy/pasted code. The other makes use of a nice set of re-usable abstractions. The second program will be shorter/simpler.
Boundaries between modules don't cost very much, and modularization is super helpful for simplifying things.
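A toy version of the hockey example (details made up, but the description-length comparison is generic):

```python
# Two ways to describe a hockey team's behavior.

# Version 1: copy/paste. The "which goal do we attack" logic is baked into
# every behavior separately, and each copy must handle the side switch.
def skate_v1(period):
    target = "east_goal" if period in (1, 3) else "west_goal"
    return f"skate toward {target}"

def shoot_v1(period):
    target = "east_goal" if period in (1, 3) else "west_goal"
    return f"shoot at {target}"

# Version 2: the preference ("score on the opponent's goal") is its own module;
# what happens at the start of the second period is described exactly once.
def opponents_goal(period):
    return "east_goal" if period in (1, 3) else "west_goal"

def skate_v2(period): return f"skate toward {opponents_goal(period)}"
def shoot_v2(period): return f"shoot at {opponents_goal(period)}"

# The module boundary costs a few symbols, but it pays for itself as soon as
# the preference is referenced more than once.
```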
Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I'd like the answ...
This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".
Trying to clarify here: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don't think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer...
If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.
Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?
...On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to
I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.
Can you be more specific about the theoretical bottlenecks that seem most important?
I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away).
I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of...
My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values". I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role. A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems.
(I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)
Thanks a lot for writing this. I've been thinking about FAI plans along these lines for a while now, here are some thoughts on specific points you made.
First, I take issue with the "Alignment By Default" title. There are two separate questions here. Question #1 is whether we'd have a good outcome if everyone concerned with AI safety got hit by a bus. Question #2 is whether there's a way to create Friendly AI using unsupervised learning. I'm rather optimistic that the answer to Question #2 is yes. I find the unsupervised learning family of approaches ...
Some notes on the loss function in unsupervised learning:
Since an unsupervised learner is generally just optimized for predictive power
I think it's worthwhile to distinguish the loss function that's being optimized during unsupervised learning, vs what the practitioner is optimizing for. Yes, the loss function being optimized in an unsupervised learning system is frequently minimization of reconstruction error or similar. But when I search for "unsupervised learning review" on Google Scholar, I find this highly cited paper by Bengio et al. The abstr...
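For reference, "minimization of reconstruction error" is an objective like the one below (a minimal autoencoder-style sketch in plain numpy; the shapes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))           # toy data

W_enc = rng.normal(size=(32, 8)) * 0.1   # encoder: 32 -> 8 dimensions
W_dec = rng.normal(size=(8, 32)) * 0.1   # decoder: 8 -> 32 dimensions

def reconstruction_loss(X, W_enc, W_dec):
    codes = X @ W_enc                    # compress
    X_hat = codes @ W_dec                # reconstruct
    return np.mean((X - X_hat) ** 2)     # the loss actually being optimized

# The practitioner usually cares about something downstream of this number,
# e.g. whether the learned codes are useful representations. That is the
# distinction drawn above between the loss function and what the practitioner
# is optimizing for.
print(reconstruction_loss(X, W_enc, W_dec))
```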
See also Robustness to Scale. You wrote that "we expect that the failure modes which still appear under such assumptions are the hard failure modes" (emphasis mine). But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute, such as the "malign universal prior" problem. It's unclear how much to worry about these problems, because as you say, we don't actually expect to use e.g. Solomonoff induction. I suspect a key issue is whether the problem is an inevitable result o...
But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute...
This is a great point to bring up. One thing the OP probably doesn't emphasize enough is: just because one particular infinite-data/compute algorithm runs into a problem does not mean that problem is hard.
Zooming out for a moment, the strategy the OP is using is problem relaxation: we remove a constraint from the problem (in this case data/compute constraints), solve that relaxed problem, then use the relaxed...
A general method for identifying dangers: For every topic which gets discussed on AF, figure out what could go wrong if GPT-N decided to write a post on that topic.
GPT-N writes a post about fun theory. It illustrates principles of fun theory by describing an insanely fun game you can play with an ordinary 52-card deck. FAI work gets pushed aside as everyone becomes hooked on this new game. (Procrastination is an existential threat!)
GPT-N writes a post about human safety problems. To motivate its discussion, it offers some extraordinarily compelling...
One class of problem comes about if GPT-N starts thinking about "what would a UFAI do in situation X":
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.
This is simultaneously
You've picked a position vaguely in between th...
It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that.
Here are some of the people who have the power to set off nukes right now:
Donald Trump
Vladimir Putin
Kim Jong-un
Both parties in this conflict
And this conflict
"Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.
Tell that to the Norwegian commandos who successfu...
You say: "we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute". I think this depends on specific details of how the system is engineered.
"Physical process which generates the label koala" is not the same as "koala", and the system can get higher predictive power by modelling the former rather than the latter.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores....
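A toy version of that argument (labels and predictions made up):

```python
# Two hypothetical models: one tracks the "koala" concept directly, the other
# models the physical process generating the label. If both classify every
# example correctly, a classification-accuracy loss cannot distinguish them.

labels           = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = koala
model_a_predicts = [1, 0, 1, 1, 0, 0, 1, 0]  # "koala" concept
model_b_predicts = [1, 0, 1, 1, 0, 0, 1, 0]  # "process that generates the label 'koala'"

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

assert accuracy(model_a_predicts, labels) == accuracy(model_b_predicts, labels) == 1.0
# Equal loss: whether the system stops using the simpler concept therefore
# depends on engineering details, not on the loss function alone.
```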
It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. "Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.
The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.
Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually hav...
This is essentially the "tasty ice cream flavors" problem, am I right? Trying to check if we're on the same page.
If so: John Wentworth said
"Tasty ice cream flavors" is also a natural category if we know who the speaker is
So how about instead of talking about "human values", we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.
Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won't just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won't actually use "human values" as an abstract object in its internal calculations; it will work with some lower-level "components" which themselves implement/embed human values.
If it's read moral philosophy, it should have some notion of what...
And that's exactly the kind of complexity which is hard for something based on predictive power. Lower abstraction levels should generally perform better in terms of raw prediction, but the thing we want to point to lives at a high abstraction level.
You told me that "it's not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model". Now you're saying that things based on "predictive power" have trouble learning things at high abstraction levels. Doesn't this suggest that your origin...
Summary: the sense in which human values are "complex" is not about predictive power. A low-level physical model of some humans has everything there is to know about human values embedded within it; it has all the predictive power which can be had by a good model of human values. The hard part is pointing to the thing we consider "human values" embedded within that model. In large part, that's hard because it's not just a matter of predictive power.
This still sounds like a shift in arguments to me. From what I remember, the MIRI-sphere take on upload
I'll respond to the "Predict hypothetical sensors" section in this comment.
First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.
But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assumption...