All of quetzal_rainbow's Comments + Replies

I think it's confusing because we mostly care about the outcome "we mistakenly think that the system is aligned, deploy it, and get killed", not about the particular mechanism that produces this outcome.

Dumb example: suppose we train a system to report on its own activity. Human raters consistently assign higher reward to more polite reports. In the end, the system learns to produce reports so polite and smooth that human raters have a hard time catching any signs of misalignment in them and take it for an aligned system.

We have, on the one hand, a system that is superhumanly good at producing the impression of being aligned; on the other hand, it's not like it's very strategically aware.
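A toy sketch of the failure mode in this example (hypothetical scoring functions and numbers, not any real training setup): if the learned proxy reward tracks politeness rather than honesty about misalignment, the reward-maximizing report is the smooth one, regardless of what the system is actually doing.

```python
# Toy illustration (hypothetical scores, not a real RLHF pipeline):
# the proxy reward tracks politeness, not honesty about misalignment,
# so the reward-maximizing report is the smooth, uninformative one.

candidate_reports = [
    {"text": "Blunt report admitting a goal-misgeneralization incident",
     "politeness": 0.2, "reveals_misalignment": True},
    {"text": "Polished report that glosses over the same incident",
     "politeness": 0.9, "reveals_misalignment": False},
]

def proxy_reward(report):
    # What the raters actually reinforce: tone, not content.
    return report["politeness"]

def intended_reward(report):
    # What we *meant* to reinforce: surfacing evidence of misalignment.
    return 1.0 if report["reveals_misalignment"] else 0.0

best_by_proxy = max(candidate_reports, key=proxy_reward)
best_by_intent = max(candidate_reports, key=intended_reward)

print("Selected by proxy reward:   ", best_by_proxy["text"])
print("Selected by intended reward:", best_by_intent["text"])
# The two selections diverge: optimizing the proxy hides the signal we cared about.
```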

I'll register a prediction here: TurnTrout is trying to say that while, counterfactually, an algorithm that reasons about training would achieve low loss, it's not obviously true that such algorithms are actually "achievable" for SGD in some "natural" setting.

2Evan Hubinger
That's what I thought he was saying previously, but he objected to that characterization in his most recent comment.

This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS, but don't ask what "corrigibility" is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born ... (read more)

1Roger Dearnaley
I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is to optimize human values, and which knows that it doesn't (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that it can better optimize them. It will thus become corrigible: if you, a human, tell it something about human values and how it should act, it will be interested and consider your input. It's presumably approximately Bayesian, so it will likely ask you about any evidence or proof you might be able to provide, to help it Bayesian-update, but it will definitely take your input. So, it's corrigible. [No, it's not completely, slavishly, irrationally corrigible: if a two-year-old in a tantrum told it how to act, it would likely pay rather less attention — just like we'd want it to.] This idea isn't complicated, has been around and widely popularized for many years, and the standard paper on it is even from MIRI, but I still keep hearing people on Less Wrong intoning "corrigibility is an unsolved problem". The only sense in which it's arguably 'unsolved' is that this is an outer alignment solution, and like any form of outer alignment, inner alignment challenges might make reliably constructing a value learner hard in practice. So yes, as always in outer alignment, we do also have to solve inner alignment. To be corrigible, a system must be interested in what you say about how it should achieve its goals, because it's willing (and thus keen) to do Bayesian updates on this. Full stop, end of simple one-sentence description of corrigibility.
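A minimal sketch of the value-learning argument (toy numbers and a made-up two-hypothesis value model, not the MIRI formalism): an agent uncertain about which reward function is the true human one treats human statements as evidence and updates on them, so a credible correction changes its behavior, while input it assigns little evidential weight (the toddler's tantrum) barely moves it.

```python
# Toy Bayesian value learner (illustrative numbers, not the MIRI formalism).
# The agent is unsure which of two candidate value functions is the true
# human one, so human input about values is evidence worth updating on.

# Candidate value hypotheses: how much each action is worth if that hypothesis is true.
values = {
    "H_paperclips_matter": {"make_paperclips": 1.0, "help_humans": 0.0},
    "H_humans_matter":     {"make_paperclips": 0.0, "help_humans": 1.0},
}
prior = {"H_paperclips_matter": 0.7, "H_humans_matter": 0.3}

# Likelihood of hearing "please stop, that's not what we value" under each hypothesis:
# a credible adult correction is much more likely if humans_matter is true.
credible_correction = {"H_paperclips_matter": 0.1, "H_humans_matter": 0.9}
toddler_tantrum      = {"H_paperclips_matter": 0.45, "H_humans_matter": 0.55}

def posterior(prior, likelihood):
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def best_action(belief):
    actions = ["make_paperclips", "help_humans"]
    return max(actions, key=lambda a: sum(belief[h] * values[h][a] for h in belief))

print("Before correction:        ", best_action(prior))                            # make_paperclips
print("After credible correction:", best_action(posterior(prior, credible_correction)))  # help_humans
print("After weak input:         ", best_action(posterior(prior, toddler_tantrum)))      # barely moves
```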

Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some... (read more)

Can you explain more formally what the difference between  and  is? I've looked at Wikipedia and the Cartoon Guide to Löb's Theorem, but I still can't get it.

1Andrew Critch
I've now fleshed out the notation section to elaborate on this a bit.  Is it better now?

I want to say "yes, but this is different", but not in the sense of "I acknowledge the existence of your evidence, but ignore it". My intuition says that we don't "induce" taskiness in modern systems; it just happens because we don't build them general enough. That probably won't hold once we start building models of capable agents in natural environments.

3Charles Foster
Certainly possible. Though we seem to be continually marching down the list of tasks we once thought "can only be done with systems that are really general/agentic/intelligent" (think: spatial planning, playing games, proving theorems, understanding language, competitive programming...) and finding that, nope, actually we can engineer systems that have the distilled essence of that capability. That makes a deflationary account of cognition, where we never see the promised reduction into "one big insight", but rather chunks of the AI field continue to break off & become unsexy but useful techniques (as happened with planning algorithms, compilers, functional programming, knowledge graphs etc., no longer even considered "real AI"), increasingly likely in my eyes. Maybe economic forces push against this, but I'm kinda doubtful, seeing how hard building agenty AI is proving and how useful these decomposed tasky AIs are looking.

Unpacking my inner Eliezer model:

If we live in a world where a superintelligent AGI can't have an advantage in long-term planning over humans assisted by non-superintelligent narrow AIs (I frankly don't believe that we live in such a world), then the superintelligent AGI won't make complex long-term plans where it doesn't have an advantage. It will make simple short-term plans where it does have an advantage, like "use superior engineering skills to hack into computer networks, infect as many computers as possible with its source code adapted for hidden distributed computation (her... (read more)

2Charles Foster
I dunno. The current state of traditional and neural AI looks very much like "we only know how to build tasky systems", not like "we don't know how to build tasky systems". They mostly do a single well-scoped thing, the same thing they were trained on, are restricted to a specified amount of processing time, and do not persist state across invocations, wiping their activations after the task is completed. Maybe we're so completely befuddled about goal-directedness etc. that these apparently very tasky systems have secret long-term terminal goals, but that seems like a stretch. If we later reach a point where we can't induce taskiness in our AI systems (because they're too competent or something), that will be a significant break from the existing trend.
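A rough sketch of the "tasky" invocation pattern being described (a hypothetical `run_task` wrapper, not any particular framework): one well-scoped input, a hard time budget, and no state carried across calls.

```python
# Rough sketch of the "tasky" invocation pattern described above
# (hypothetical wrapper, not any particular framework): one scoped input,
# a hard time budget, and nothing persisted between invocations.
import time

def run_task(model_fn, task_input, time_budget_s=5.0):
    """Run one scoped task; everything allocated here is dropped afterwards."""
    scratch = {}                               # per-call working memory / activations
    deadline = time.monotonic() + time_budget_s
    result = model_fn(task_input, scratch, deadline)
    # No return of `scratch`, no globals written: state is wiped when the call ends.
    return result

def toy_model(task_input, scratch, deadline):
    # Stand-in for a trained tasky model: does its single job, respects the budget.
    steps_done = 0
    for step in range(1000):
        if time.monotonic() > deadline:
            break
        scratch["partial_work"] = step
        steps_done = step + 1
    return f"summary of {task_input!r} after {steps_done} steps"

print(run_task(toy_model, "translate this sentence"))
print(run_task(toy_model, "prove this lemma"))  # fresh scratch: nothing persisted from the previous call
```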