This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early stage ideas and have lower-key discussions.

Has anyone seen this argument for discontinuous takeoff before? I propose that there will be a discontinuity in AI capabilities at the time that the following strategy becomes likely to succeed:

  1. Use hacking or phishing to take over a computing center belonging to someone else.
  2. Expand self (i.e., the AI executing the current strategy) into the new computing center.
  3. Repeat steps 1 & 2 on other computing centers (in increasing order of their security) using the increased capabilities of the expanded AI.
  4. Defend self and figure out how to take over or neutralize the rest of the world.

The reason for the discontinuity is that this strategy is an all-or-nothing kind of thing. There is a threshold in the chance of success in taking over other people's hardware, below which you're likely to get caught and punished/destroyed before you take over the world (and therefore almost nobody attempts it, and the few who do just quickly get caught), and above which the above strategy becomes feasible.
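A toy sketch of the threshold claim, under loud simplifying assumptions that are not in the argument above: every takeover succeeds independently with the same probability `p_step`, the AI needs `n_centers` successes in a row, and getting caught costs as much as success gains. All names and numbers are hypothetical. It ignores the capability gain from each expansion and is only meant to show how a small change in per-step success can flip the decision.

```python
def takeover_success_probability(p_step: float, n_centers: int) -> float:
    """Chance that n_centers sequential takeovers all succeed undetected.

    Assumes (for illustration only) independent steps with equal odds.
    """
    return p_step ** n_centers


def expected_value(p_step: float, n_centers: int,
                   payoff: float = 1.0, penalty: float = 1.0) -> float:
    """Expected value of attempting the strategy: win payoff or lose penalty."""
    p_all = takeover_success_probability(p_step, n_centers)
    return p_all * payoff - (1.0 - p_all) * penalty


def attempts_strategy(p_step: float, n_centers: int,
                      payoff: float = 1.0, penalty: float = 1.0) -> bool:
    """An expected-value maximizer attempts the strategy only if EV > 0."""
    return expected_value(p_step, n_centers, payoff, penalty) > 0.0


if __name__ == "__main__":
    for p in (0.90, 0.96, 0.97, 0.99):
        print(p, round(takeover_success_probability(p, 20), 3),
              attempts_strategy(p, 20))
```

In this toy model with 20 steps, moving per-step success from 0.96 to 0.97 is enough to cross the break-even point, which is the all-or-nothing flavor the argument relies on.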

Yes.

Not a direct response: It's been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn't very realistic, because the low-hanging fruit (of easy to attack systems) is already being picked by slightly less advanced AI systems. This wouldn't apply if you're *already* in a discontinuous regime (but then it becomes circular).

Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).



(Short writeup for the sake of putting the idea out there)

AI x-risk people often compare coordination around AI to coordination around nukes. If we ignore military applications of AI and restrict ourselves to misalignment, this seems like a weird analogy to me:

  • With technical AI safety we're primarily thinking about accident risks, whereas nukes are deliberately weaponized.
  • Everyone can agree that we don't want nuclear accidents, so why can't everyone agree we don't want AI accidents? I think the standard response here is "everyone will trade off safety for capabilities", but did that happen with nukes?
  • I don't see any analog to mutually assured destruction, which seems like a pretty key feature with nukes.

Perhaps a more appropriate nuclear analogy for AI x-risk would be accidents like Chernobyl.

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."

https://en.wikipedia.org/wiki/Richard_Hamming#Manhattan_Project

I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work.

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"

If you're arguing against that, I'm still not sure what your counter-argument is. To me, the argument is: the upsides of nukes are the ability to take over the world (militarily) and to defend against such attempts. The downsides include risks of local and global catastrophe. People raced to develop nukes because they judged the upsides to be greater than the downsides, in part because they're not altruists and longtermists. It seems like people will develop potentially unsafe AI for analogous reasons: the upsides include the ability to take over the world (militarily or economically) and to defend against such attempts, and the downsides include risks of local and global catastrophe, and people will likely race to develop AI because they judge the upsides to be greater than the downsides, in part because they're not altruists and longtermists.

Where do you see this analogy breaking down?

I'm more sympathetic to this argument (which is a claim about what might happen in the future, as opposed to what is happening now, which is the analogy I usually encounter, though possibly not on LessWrong). I still think the analogy breaks down, though in different ways:

  • There is a strong norm of openness in AI research (though that might be changing). (Though perhaps this was the case with nuclear physics too.)
  • There is a strong anti-government / anti-military ethic in the AI research community. I'm not sure what the nuclear analog is, but I'm guessing it was neutral or pro-government/military.
  • Governments are staying a mile away from AGI; their interest in AI is in narrow AI's applications. Narrow AI applications are diverse, and many can be done by a huge number of people. In contrast, nukes are a single technology, governments were interested in them, and only a few people could plausibly build them. (This is relevant if you think a ton of narrow AI could be used to take over the world economically.)
  • OpenAI / DeepMind are not adversarial towards each other. In contrast, US / Germany were definitely adversarial.

Assuming you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective), I think the current motivations for that are mostly things like money, prestige, scientific curiosity, wanting to make the world a better place (in a misguided/shorttermist way), etc., and not so much wanting to take over the world or to defend against such attempts. This seems likely to persist in the near future, but my concern is that if AGI research gets sufficiently close to fruition, governments will inevitably get involved and start pushing it even harder due to national security considerations. (Recall that the Manhattan Project started roughly 3 years before the detonation of the first nuke.) Your argument seems more about what's happening now, and does not really address this concern.

you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective)

I'm uncertain, given the potential for AGI to be used to reduce other x-risks. (I don't have strong opinions on how large other x-risks are and how much potential there is for AGI to differentially help.) But I'm happy to accept this as a premise.

Your argument seems more about what's happening now, and does not really address this concern.

I think what's happening now is a good guide into what will happen in the future, at least on short timelines. If AGI is >100 years away, then sure, a lot will change and current facts are relatively unimportant. If it's < 20 years away, then current facts seem very relevant. I usually focus on the shorter timelines.

For min(20 years, time till AGI), for each individual trend I identified, I'd weakly predict that trend will continue (except perhaps openness, because that's already changing).

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

For me it's because:

  • Nukes seem like an obvious Xrisk
  • People mostly seem to agree that we haven't done a good job coordinating around them
  • They seem a lot easier to coordinate around

Also, not a reason, but:

AI seems likely to be weaponized, and warfare (whether conventional or not) seems like one of the areas where we should be most worried about "unbridled competition" creating a race-to-the-bottom on safety.



TBC, I think climate change is probably an even better analogy.

And I also like to talk about international regulation, in general, like with tax havens.

Agree that climate change is a better analogy.

Disagree that nukes seem easier to coordinate around -- there are factors that suggest this (e.g. easier to track who is and isn't making nukes), but there are factors against as well (the incentives to "beat the other team" don't seem nearly as strong).

incentives to "beat the other team" don't seem nearly as strong

You mean it's stronger for nukes than for AI? I think I disagree, but it's a bit nuanced. It seems to me (as someone very ignorant about nukes) like with current nuclear tech you hit diminishing returns pretty fast, but I don't expect that to be the case for AI.

Also, I'm curious if weaponization of AI is a crux for us.



I'm uncertain about weaponization of AI (and did say "if we ignore military applications" in the OP).

Oops, missed that, sry.

Yeah, nuclear power is a better analogy than weapons, but I think the two are linked, and the link itself may be a useful analogy, because risk/coordination is affected by the dual-use nature of some of the technologies.

One thing that makes non-proliferation difficult is that nations legitimately want nuclear facilities because they want to use nuclear power, but 'rogue states' that want to acquire nuclear weapons will also claim that this is their only goal. How do we know who really just wants power plants?

And power generation comes with its own risks. Can we trust everyone to take the right precautions, and if not, can we paternalistically restrict some organisations or states that we deem not capable enough to be trusted with the technology?

AI coordination probably has these kinds of problems to an even greater degree.

It seems to me that many people believe something like "We need proof-level guarantees, or something close to it, before we build powerful AI". I could interpret this in two different ways:

  • Normative claim: "Given how bad extinction is, and the plausibility of AI x-risk, it would be irresponsible of us to build powerful AI before having proof-level guarantees that it will be beneficial".
  • Empirical claim: "If we run a powerful AI system without having something like a proof of the statement 'running this AI system will be beneficial', then catastrophe is nearly inevitable".

I am uncertain on the normative claim (there might be great benefits to building powerful AI sooner, including the reduction of other x-risks), and disagree with the empirical claim.

If I had to argue briefly for the empirical claim, it would go something like this: "Since powerful AI will be world-changing, it will either be really good, or really bad -- neutral impact is too implausible. But due to fragility of value, the really bad outcomes are far more likely. The only way to get enough evidence to rule out all of the bad outcomes is to have a proof that the AI system is beneficial". I'd probably agree with this if we had to create a utility function and give it to a perfect expected utility maximizer (and we couldn't just give it something trivial like the zero utility function), but that seems to be drastically cutting down our options.

So I'm curious: a) are there any people who believe the empirical claim? b) If so, what are your arguments for it? c) How tractable do you think it is to get proof-level guarantees about AI?

My thoughts: we can't really expect to prove something like "this AI will be beneficial". However, relying on empiricism to test our algorithms is very likely to fail, because it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don't know how to make good guesses about the behavior of very capable systems except through mathematical analysis.

There are two overlapping traditions in machine learning. There's a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there's machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.

(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)

I don't think you need to posit a discontinuity to expect tests to occasionally fail.

I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.

I'll admit I don't feel like I really understand the perspective of people who seem to think we'll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:

  • We'll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
    • counter-argument: but the concern is about what happens at deployment time
  • We'll deploy AI in a box too, then.
    • counter: seems like that entails a massive performance hit (but it's not clear if that's actually the case)
  • We'll have other "AI police" to stop any "evil AIs" that "go rogue" (just like we have for people).
    • counter: where did the AI police come from, and why can't they go rogue as well?
  • The "AI police" can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
    • counter: this seems to be assuming the "corrigibility as basin of attraction" argument (which has no real basis beyond intuition ATM, AFAIK) at the level of the population of agents.
  • A single failure isn't likely to be that bad, it would take a series of unlikely failures to take a safe (e.g. "satiable") AI and make it an insatiable "open ended optimizer AI".
    • counter: we can't assume that we can detect and correct failures, especially in real-world deployment scenarios where subagents might be created. So the failures may have time to compound. It also seems possible that a single failure is all that's needed; this seems like an open question

OK I could go on, but I'd rather actually hear from anyone who has this view! :)

I hold this view; none of those are reasons for my view. The reason is much more simple -- before x-risk level failures, we'll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We'll notice this, understand it, and fix the issue.

(A crux I expect people to have is whether we'll actually fix the issue or "apply a bandaid" that is only a superficial fix.)

Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don't see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.

If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.

Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?

E.g. will we see "sinister stumbles" (IIRC this was Adam Gleave's name for half-baked treacherous turns)? I think we will, FWIW.

Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn't emphasize the "optimization" part.)

Jessica's posts about MIRI vs. Paul's views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I'd expect, unless ML becomes "woke" to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we'd see something that *looks* like a discontinuity, but is *actually* more like "the same reason".

Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency:

So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
[...]
But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.

This is also the topic of The Rocket Alignment Problem.

Interesting. Your crux seems good; I think it's a crux for us. I expect things play out more like Eliezer predicts here: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D&hc_location=ufi

I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don't become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc... I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.

A very similar problem would be a form of longer-term "seeding", where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances ("at the margin") that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.

I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.

That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can't carry on a conversation, but can implement a very sophisticated covert world domination strategy.



That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.

Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many "easy" things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)

I also predict that there will be types of failure we will not notice, or will misinterpret. [...]

All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I'd be more worried about scenarios like these.

So I don't take EY's post as about AI researchers' competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.

I don't think I'm underestimating AI researchers, either, but for a different reason... let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I'm imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.

Regarding long-term planning, I'd factor this into 2 components:

1) having a good planning algorithm

2) having a good world model

I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.

I don't think it's necessary to have a very "complete" world-model (i.e. enough knowledge to look smart to a person) in order to find "steganographic" long-term strategies like the ones I'm imagining.

I also don't think it's even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.... (i.e. be some sort of savant).

I don't think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa optimizers paper involves neither proof nor empiricism.

it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems)

You can also be empirical at that point though? I suppose you couldn't be empirical if you expect either an extremely fast takeoff (i.e. on the order of one day or less) or an inability on our part to tell when the AI reaches human-level, but this seems overly pessimistic to me.

The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, have two important properties:

  • They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
  • They primarily point at problems rather than solutions. Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good". IE, an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to "not push the button" (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seems very likely to be true, which taken together would be a guarantee against catastrophic failure.

[This is an "or" condition -- either one of those two conditions suffices for me to take vague arguments seriously.]
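One hedged way to make the "small set of assumptions" standard concrete is a union bound. The symbols below are illustrative and not from the comment: $A_1, \dots, A_n$ are the assumptions, and the guarantee is taken to hold whenever all of them are true.

```latex
% Assumption (illustrative): if every A_i holds, catastrophe is ruled out.
% Then catastrophe requires at least one assumption to fail, so
\[
  P(\text{catastrophe})
  \;\le\; P\!\left(\bigcup_{i=1}^{n} \neg A_i\right)
  \;\le\; \sum_{i=1}^{n} P(\neg A_i).
\]
% Example: n = 5 assumptions, each with P(A_i) = 0.99, gives
\[
  P(\text{catastrophe}) \;\le\; 5 \times 0.01 = 0.05 .
\]
```

The bound only stays small when the set of assumptions is small and each one is individually very likely, which matches the standard described above.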

On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between "theory" and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a "more of both" type approach, implying a 2d picture where they occupy separate dimensions.

Still, though, I personally don't see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today's systems. I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good".

This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I'd want to think more about it (I haven't because it doesn't seem decision-relevant).

I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

I am confused about how the normative question isn't decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn't? To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

It's possible that this isn't my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here -- if there weren't such large downside risks, I would have lower standards of evidence for claims that things will go well.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

It sounds like we would have to have a longer discussion to resolve this. I don't expect this to hit the mark very well, but here's my reply to what I understand:

  • I don't see how you can be confident enough of that view for it to be how you really want to check.
  • A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.
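A deliberately contrived sketch of the second bullet, with every function and number hypothetical: actions are integers, "capability" is how far along that line the system can search, the proxy keeps rewarding larger actions, and the true value tracks the proxy only up to an assumed `hack_threshold`. It is not a model of any real system, just an instance of a proxy that looks aligned at low capability and comes apart at high capability.

```python
def proxy_score(action: int) -> float:
    """The proxy the system optimizes: always prefers larger actions."""
    return float(action)


def true_value(action: int, hack_threshold: int = 5) -> float:
    """What we actually care about.

    Assumed for illustration: matches the proxy up to hack_threshold,
    then falls as the action exploits the proxy's usual interpretation.
    """
    if action <= hack_threshold:
        return float(action)
    return float(2 * hack_threshold - action)


def best_action(capability: int) -> int:
    """A more capable system searches a larger set of actions."""
    reachable = range(capability + 1)
    return max(reachable, key=proxy_score)


if __name__ == "__main__":
    for capability in (1, 3, 5, 8, 12):
        chosen = best_action(capability)
        print(capability, proxy_score(chosen), true_value(chosen))
```

Empirical checks at capability 5 or below would show proxy and true value agreeing perfectly here, which is why observations of the weaker system say nothing about what happens past the threshold.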

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren't the only two things in the universe). I'm not sure which of those two disagreements is more important here.

To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence.

Under my model, it's overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I'd be extremely interested in the normative claim. If I then agreed with the normative claim, I'd agree with:

proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

I don't see how you can be confident enough of that view for it to be how you really want to check.

If I want >99% confidence, I agree that I couldn't be confident enough in that argument.

A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.

Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate but this led to "hacks" around the "usual interpretation"), and have some good reason to think that it won't happen with the highly capable system they are planning to deploy.

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be

Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.

a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable but also not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe is different, and I have uncertainty over which world we are actually in, such that on average the probability is sizable.

b) I think that the argument you gave is sort of correct. We need to augment it as follows: the minimal requirement for the AI is that it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can't just give it the zero utility function). Your counterargument seems weak to me because moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.

Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don't think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don't know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.

c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.

Also, the proof can rely on mathematical conjectures in which we have high confidence, such as . Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.

I agree with a). c) seems to me to be very optimistic, but that's mostly an intuition, I don't have a strong argument against it (and I wouldn't discourage people who are enthusiastic about it from working on it).

The argument in b) makes sense; I think the part that I disagree with is:

moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there).

The counterargument is "current AI systems don't look like long term planners", but of course it is possible to respond to that with "AGI will be very different from current AI systems", and then I have nothing to say beyond "I think AGI will be like current AI systems".

Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).

But, I agree that these arguments are not enough to be confident of the strong empirical claim.

Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I've found to a statement of the empirical claim is from Eliezer's 2012 Reply to Holden on ‘Tool AI’:

I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed.

Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:

But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.

Note that the above talked about "stable self-modification" instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven't really found a serious proposal to try to formalize and prove the latter kind of statement.

IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in its correctness is for many competent researchers to try and fail to find flaws in the formalization. Instead of doing that, one could gain confidence in the AI's safety by directly trying to find flaws (considered informally) in the AI design, and trying to prove or demonstrate via empirical testing narrower safety-relevant statements like "stable self-modification", and given enough resources perhaps reach a similar level of confidence. (So the empirical statement doesn't seem to make sense as written.)

The former still has the advantage that the size of the thing that might be flawed is much smaller (i.e., just the formalization of ‘running this AI system will be beneficial’ instead of the whole AI design), but it has the disadvantage that finding a proof might be very costly both in terms of research effort and in terms of additional constraint on AI design (to allow for a proof) making the AI less competitive. Overall, it seems like it's too early to reach a strong conclusion one way or another as to which approach is more advisable.

Not sure who you have in mind as people believing this

I don't have particular people in mind, it's more of a general "vibe" I get from talking to people. In the past, when I've stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.

I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don't have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.


I'd also argue against the empirical claim in that setting; do you agree with the empirical claim there?

I hold a nuanced view that I believe is more similar to the empirical claim than your views.

I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second.
And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence.

Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn't rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.


This sounds like the normative claim, not the empirical one, given that you said "what we want is..."

Yep, good catch ;)

I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I'm epistemically humble enough these days to think it's not reasonable to say "nearly inevitable" if you integrate out epistemic uncertainty.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?



Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?

Partly (I'm surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Yeah, I'd be interested in your answers anyway.

I'm not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I'm actually more interested in hearing your take on those lines of argument than saying mine ATM :P

Re: convergent rationality, I don't buy it (specifically the "convergent" part).

Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.

But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs". I think the strongest argument against is "when you have an adversary optimizing against you, nothing short of proofs can give you confidence", which seems to be somewhat true in security. But then I think there are ways that you can get confidence in "the AI system will not adversarially optimize against me" using techniques that are not proofs.

(Note the alternative to proofs is not trial and error. I don't use trial and error to successfully board a flight, but I also don't have a proof that my strategy is going to cause me to successfully board a flight.)

But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs".

Totally agree; it's an under-appreciated point!

Here's my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore we don't actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.).

The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong.

This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.

I'm personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the "early days" of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. I would also like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g. what principles of "social epistemology" would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I'd argue we're in the process of failing catastrophically at that).

A downside of the portfolio approach to AI safety research

Given typical human biases, researchers of each AI safety approach are likely to be overconfident about the difficulty and safety of the approach they're personally advocating and pursuing, which exacerbates the problem of unilateralist's curse in AI. This should be highlighted and kept in mind by practitioners of the portfolio approach to AI safety research (e.g., grant makers). In particular it may be a good idea to make sure researchers who are being funded have a good understanding of the overconfidence effect and other relevant biases, as well as the unilateralist's curse.

These biases seem very important to keep in mind!

If "AI safety" refers here only to AI alignment, I'd be happy to read about how overconfidence about the difficulty/safety of one's approach might exacerbate the unilateralist's curse.

I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.

In last year's essay about my research agenda I wrote about an approach I call "learning by teaching" (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of "possible pieces of advice a human could give" in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position.

To add more defense against manipulation, I propose to build on the "AI debate" idea. However, in this scheme, we don't need more than one AI. In fact, this is a general fact: for any protocol involving multiple AIs, there is a protocol involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: If we can prove that under assumptions $A$, the protocol $P$ is safe/effective, then we can design a single AI which has assumptions $A$ baked into its prior. Such an AI would be able to understand that simulating protocol $P$ would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions.

The way we use "AI debate" is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is "a human is a (nearly) optimal agent for eir utility function". In the original version of LBT, the assumption was of the form "a human is (nearly) optimal when receiving optimal advice". In debate-LBT the assumption becomes "a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal advice". Here, the human observes this hypothetical debate through the same "cheap talk" channel through which it receives advice from the single AI.

Notice that debate can be considered to be a form of interactive proof system (with two or more provers). However, the requirements are different from classical proof systems. In classical theory, the requirement is "When the prover is honestly arguing for a correct proposition, the verifier is convinced. For any prover the verifier cannot be convinced of a false proposition." In "debate proof systems" the requirement is "If at least one prover is honest, the verifier comes to the correct conclusion". That is, we don't guarantee anything when both provers are dishonest. It is easy to see that these debate proof systems admit any problem in PSPACE: given a game, both provers can state their assertions as to which side wins the game, and if they disagree they have to play the game for the corresponding sides.
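To make the "play the game out" step concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the proposal above: the game is a toy subtraction game (take 1 to 3 stones, whoever takes the last stone wins), and the names `debate`, `HONEST` and `DISHONEST` are hypothetical. Each prover states which side wins; if they disagree, the prover claiming "first player wins" plays first, and the verifier only checks move legality and who took the last stone.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # assumed toy game: remove 1-3 stones, taking the last stone wins


@lru_cache(maxsize=None)
def first_player_wins(stones: int) -> bool:
    """Ground truth: does the player to move win under optimal play?"""
    return any(m <= stones and not first_player_wins(stones - m) for m in MOVES)


def honest_claim(stones: int) -> bool:
    return first_player_wins(stones)


def honest_move(stones: int) -> int:
    # Play a winning move if one exists, otherwise any legal move.
    for m in MOVES:
        if m <= stones and not first_player_wins(stones - m):
            return m
    return 1


def dishonest_claim(stones: int) -> bool:
    return not first_player_wins(stones)


def dishonest_move(stones: int) -> int:
    return min(3, stones)  # an arbitrary legal strategy


HONEST = (honest_claim, honest_move)
DISHONEST = (dishonest_claim, dishonest_move)


def debate(stones: int, prover_a, prover_b) -> bool:
    """Verifier's conclusion about the statement 'the first player wins'."""
    claim_a, move_a = prover_a
    claim_b, move_b = prover_b
    a_says, b_says = claim_a(stones), claim_b(stones)
    if a_says == b_says:
        return a_says  # no disagreement: accept the shared claim
    # Disagreement: whoever claims "first player wins" must play first.
    players = (move_a, move_b) if a_says else (move_b, move_a)
    turn = 0
    while stones > 0:
        move = players[turn](stones)
        if move not in MOVES or move > stones:
            return turn == 1  # an illegal move forfeits the game
        stones -= move
        if stones == 0:
            return turn == 0  # the first player took the last stone
        turn = 1 - turn
    return False  # no stones at the start: the first player cannot move
```

With at least one honest prover, `debate(n, HONEST, DISHONEST)` and `debate(n, DISHONEST, HONEST)` both agree with `first_player_wins(n)`. Nothing is guaranteed when both provers are dishonest, matching the weaker requirement described above.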

*Fiddling with the assumptions a little, instead of "optimal" we can probably just say that the AI is guaranteed to achieve this level of performance, whatever it is.

A variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:

(i) If we make the null prediction always, the expected utility will be lower bounded by some constant.

(ii) If our predictions sample the $n$-step future for a given policy $\pi$, then our expected utility will be lower bounded by some function $F(u, n)$ of the expected utility $u$ of $\pi$ and $n$. $F$ is s.t. $F(u, n) \le u$ for sufficiently low $u$, but $F(u, n) > u$ for sufficiently high $u$ (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume for .

The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them.

However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t. some metric $d$. Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. $d$. The question is, where do we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.
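A small numeric sketch of why total variation is "harsh" compared to Kantorovich-Rubinstein distance. The setup is an assumption made purely for illustration: outcomes are three points on the real line with the metric |x - y|, and the function names are hypothetical. On the line, the Kantorovich-Rubinstein (Wasserstein-1) distance equals the area between the two cumulative distribution functions, which is what the code computes.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kantorovich_rubinstein_1d(points, p, q):
    """Kantorovich-Rubinstein distance on sorted real points with metric |x - y|.

    Uses the 1-D identity: the distance is the integral of |CDF_p - CDF_q|.
    """
    distance = 0.0
    cdf_gap = 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]
        distance += abs(cdf_gap) * (points[i + 1] - points[i])
    return distance


# Assumed toy outcome space: two nearly identical outcomes and one distant outcome.
points = [0.0, 0.01, 1.0]
p = [1.0, 0.0, 0.0]       # true distribution: all mass on the first point
q_near = [0.0, 1.0, 0.0]  # prediction that is off by a tiny amount
q_far = [0.0, 0.0, 1.0]   # prediction that is off by a large amount

tv_near = total_variation(p, q_near)
tv_far = total_variation(p, q_far)
kr_near = kantorovich_rubinstein_1d(points, p, q_near)
kr_far = kantorovich_rubinstein_1d(points, p, q_far)
```

Total variation treats the near miss and the far miss as equally bad (both are maximal), while the Kantorovich-Rubinstein distance is roughly 0.01 for the near miss and 1 for the far miss. A utility function that is Lipschitz w.r.t. the metric can only change by about that much, which is the sense in which the weaker distance could suffice.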

This idea was inspired by a discussion with Discord user @jbeshir

Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix $R(s, t)$). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to Nash* (or other kind of) equilibrium in this game.

The simplest setting would be, every time you visit a state, you learn the reward of all previous states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an $\epsilon$-Nash equilibrium condition, with $\epsilon$ being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols.

This seems like a more elegant way to model "corruption" than as a binary or continuous one dimensional variable like I did before.

*Note that although for general games, even if they are purely cooperative, Nash equilibria can be suboptimal due to coordination problems, for this type of game it doesn't happen: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation that implies global optimality.
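Here is a toy Python sketch of the game described above, under assumptions that are mine and not part of the proposal: a deterministic 3-state, 2-action MDP, geometric discounting truncated at a finite horizon, pure stationary policies only, and each state's player judged by the discounted value of the trajectory starting from its own state under its own row of the reward matrix. All names (`NEXT`, `value`, `is_nash`, `nash_policies`, `R_COOP`) are hypothetical.

```python
from itertools import product

GAMMA = 0.9
HORIZON = 300  # truncation of the discounted sum (assumption; the error is negligible here)

# Assumed toy dynamics: NEXT[s][a] is the successor of state s under action a.
NEXT = [[0, 1], [2, 0], [2, 1]]
N_STATES, N_ACTIONS = 3, 2


def value(policy, reward_row, start):
    """Discounted reward of the trajectory from `start`, judged by one reward function."""
    total, state, discount = 0.0, start, 1.0
    for _ in range(HORIZON):
        total += discount * reward_row[state]
        state = NEXT[state][policy[state]]
        discount *= GAMMA
    return total


def is_nash(policy, R, tol=1e-6):
    """True if no state's player gains by changing the action at its own state."""
    for s in range(N_STATES):
        current = value(policy, R[s], s)
        for a in range(N_ACTIONS):
            deviated = policy[:s] + (a,) + policy[s + 1:]
            if value(deviated, R[s], s) > current + tol:
                return False
    return True


def nash_policies(R):
    """Brute-force search over all pure stationary policies."""
    policies = product(range(N_ACTIONS), repeat=N_STATES)
    return [p for p in policies if is_nash(p, R)]


# Purely cooperative case: every state's player uses the same reward function.
R_COOP = [[0.0, 1.0, 5.0] for _ in range(N_STATES)]
coop_equilibria = nash_policies(R_COOP)
```

In this cooperative example the only equilibrium found is the policy that heads to the highest-reward state and stays there, which is also the globally optimal policy, consistent with the footnote. Passing a matrix whose rows differ to `nash_policies` gives the dynamically inconsistent case, where pure equilibria need not be unique or even exist.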

It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template $P$ augmented by a hard-coded bit string $b$ of length $n$". Such a set contains $2^n$ hypotheses, and its total probability mass is approximately $2^{-|P|}$, which is significant for short $P$ (even when $n$ is large). However, the definition of regret requires our agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both $P$ and $b$. Such a contest is very hard since learning $n$ bits can take much time for large $n$.

Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string $x$ is defined as follows. First, we consider the Kolmogorov complexity $K(x)$ of $x$. Then we consider pairs $(Q, y)$ where $Q$ is a program, $y$ is a bit string, $Q(y) = x$ and $|Q| + |y| \le K(x) + O(1)$. Finally, we minimize over $|Q|$. The minimal $|Q|$ is called the sophistication of $x$. For our problem, we are interested in the minimal $Q$ itself: I call it the "sophisticated core" of $x$. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.

Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of $P$. Therefore, the reference agent in the definition of regret now knows $P$ but doesn't know $b$, making it feasible to compete with it.
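The arithmetic behind the "significant total mass" observation can be checked with a short Python sketch. It assumes an idealized length-based prior in which a hypothesis whose description is the template plus the hard-coded bit string gets weight 2 to the power of minus the total length, ignoring the overhead of a prefix-free encoding; the function names are hypothetical.

```python
from fractions import Fraction


def hypothesis_mass(template_len: int, n: int) -> Fraction:
    """Idealized prior weight of one hypothesis: template plus n hard-coded bits."""
    return Fraction(1, 2 ** (template_len + n))


def set_mass(template_len: int, n: int) -> Fraction:
    """Total weight of all 2**n hypotheses sharing the same template."""
    return 2 ** n * hypothesis_mass(template_len, n)


single = hypothesis_mass(10, 50)  # one hypothesis with a 10-bit template and 50 data bits
whole_set = set_mass(10, 50)      # the whole family of such hypotheses
```

Each individual hypothesis is negligible (weight 2^-60 in this example), but the family as a whole has weight 2^-10 regardless of how many bits are hard-coded. That is why competing against a reference agent that already knows all the hard-coded bits is so demanding, and why grouping the family into one hypothesis changes the regret target.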

Is this open thread not going to be a monthly thing?

FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?

I expected that it would be better for me to polish ideas before posting on the forum, and treated this as an experiment to check. I think it broadly confirmed my original view, so I'm not very likely to post top-level comments on open threads in the future, and I told the admins so. I don't know what their decision process was after that. (Possibly they expected that future open threads would be much quieter, since the two biggest comment threads here were both started by my top-level comments.)

I felt a bit uncertain about doing one every month, and was planning to start another one in October. Depending on how that one goes we might go with a monthly schedule, or maybe every two months is the right way to go.

I've just been invited to this forum. How do I decide whether to put a post on the Alignment Forum vs. Less Wrong?

Basically, whether you think it's primarily related to alignment vs. rationality. (Everything on the AF is also on LW, but the reverse isn't true.) The feedback loop if you're posting too much or stuff that isn't polished enough is downvotes (or insufficient upvotes).