AI ALIGNMENT FORUM
AF

All of Richard_Ngo's Comments + Replies

When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.

But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.

This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.

Vanessa Kosoy6d50

In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it's formalized via a loss function. Roughly speaking, the loss function expresses preferences over "which computations are running". This means that you can have a "positive" preference for a particular computation to run or a "negative" preference for a particular computation not to run^[1].

^{^}
There are also more complicated possibilities, such as "if P runs then I want Q to run but if P doesn't run then I rather that Q also doesn't run" or even preferences that

Richard Ngo8d*92

Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting "The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal".

"The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization" is a good summary of my position. You don't say much that directly contradicts this, though I do think that even using the "terminal reward seeker" vs ... (read more)

Hierarchical Agency: A Missing Piece in AI Alignment

Richard Ngo22d20

I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:

I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.

Whereas coalitions are typically divided into sub-coalitions (e.g. the "western civilization" coalition is divided into countries which are divided into provinces/states; political coalitions ... (read more)

Towards a scale-free theory of intelligent agency

Richard Ngo1mo50

I've now edited that section. Old version and new version here for posterity.

Old version:

None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and opti... (read more)

Towards a scale-free theory of intelligent agency

Richard Ngo1mo52

I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":

Strongly incentivizes deception (including self-deception) during bargaining (e.g. each agent wants to overstate the difficulty of providing cake for it).
Strongly incentiv

... (read more)

5Richard Ngo1mo

I've now edited that section. Old version and new version here for posterity. Old version: None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely. Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments). New version: These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal. We could see these issues merely as the type of frictions that plague any idealized theory. But we could also seem them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes agg

Ten people on the inside

Richard Ngo3mo96

FWIW I think of "OpenAI leadership being untrustworthy" (a significant factor in me leaving) as different from "OpenAI having bad safety policies" (not a significant factor in me leaving). Not sure if it matters, I expect that Scott was using "safety policies" more expansively than I do. But just for the sake of clarity:

I am generally pretty sympathetic to the idea that it's really hard to know what safety policies to put in place right now. Many policies pushed by safety people (including me, in the past) have been mostly kayfabe (e.g. being valuable as c... (read more)

2Oliver Habryka3mo

(I meant the more expansive definition. Plausible that me and Zac talked past each other because of that)

Anthropic: Three Sketches of ASL-4 Safety Case Components

Richard Ngo5mo80

Cool, ty for (characteristically) thoughtful engagement.

I am still intuitively skeptical about a bunch of your numbers but now it's the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).

Anthropic: Three Sketches of ASL-4 Safety Case Components

Richard Ngo5mo110

1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It's a step towards applying outside view, though not fully "outside view".)

2. Yepp, agreed, though I think the common-sense connotations of "if this became" or "this would have a big effect" are causal, especially in the context where we're talking to the actors who are involved in making that change. (E.g. the non-causal interpreta... (read more)

Daniel Kokotajlo5mo90

Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If... condition obtaining would cut out about half of the loss-of-control ones.

Re 3: point by point:
1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
2. Big names coming out: idk this also feels like maybe 10-20% rather than 50%
3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn'... (read more)

Anthropic: Three Sketches of ASL-4 Safety Case Components

Richard Ngo5mo131

We have discussed this dynamic before but just for the record:

I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half.

This is IMO not the type of change that should be able to cut someone's P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) such that, if this change could halv... (read more)

Daniel Kokotajlo5mo60

Sorry! My response:

1. Yeah you might be right about this, maybe I should get less excited and say something like "it feels like it should cut in half but taking into account Richard's meta argument I should adjust downwards and maybe it's just a couple percentage points"

2. If the conditional obtains, that's also evidence about a bunch of other correlated good things though (timelines being slightly longer, people being somewhat more reasonable in general, etc.) so maybe it is legit to think this would have quite a big effect

3. Are you sure there are so man... (read more)

Richard Ngo's Shortform

Richard Ngo8mo56

The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:

Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent wit

... (read more)

Coalitional agency

Richard Ngo8mo20

Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:

Obligations to (slash constraints imposed by) the interests of existing agents.
The assumption that large agents would grow in a bottom-up way (e.g. by merging smaller agents) rather than in a top-down way (e.g. by spinning up new subagents).

I mainly intend conservatism to mean the former.

Defining alignment research

Richard Ngo8mo30

Whose work is relevant, according to you?

3Oliver Habryka8mo

Lots of people's work: * Paul's work (ELK more than RLHF though it was useful to see what happens when you throw RL at LLMs in a way that's kind of similar to how I do get some value out of Chris's work) * Eliezer's work * Nate's work * Holden's writing on cold takes * Ajeya's work * Wentworth's work * The debate stuff * Redwood's work * Bostrom's work * Evan's work * Scott and Abram's work There is of course still huge variance in how relevant and how much for the throat these different people's work is going for, but all of these seem more relevant to AI Alignment/AI-not-kill-everyonism than Chris's work (which again, I found interesting, but not like super interesting).

Defining alignment research

Richard Ngo8mo70

If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.

Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher"... (read more)

Buck Shlegeris8mo75

I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).

I disagree, I'd rather they'd hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (T... (read more)

2Daniel Kokotajlo8mo

I feel like we are misunderstanding each other, and I think it's at least in large part my fault. I definitely agree that we don't want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you've said above, except for when you start attributing stuff to me. I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though right? You are proposing something else?

5Oliver Habryka8mo

(FWIW I think Chris Olah's work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)

Defining alignment research

Richard Ngo8mo31

What if your research goal is "I'd like to understand how neural networks work?" This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.

(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)

Conversely, what if your research goal is "I'm going to design a training run that will produce a ... (read more)

5Daniel Kokotajlo8mo

Answering your first question. If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to 'i want to understand how neural networks work... so that I can understand how a prosaic AGI system being RLHF'd might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do) but I don't feel so strongly about it as I would if you were doing capabilities research. re: judging the research itself rather than the motivations: idk I think it's actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also I'm not primarily trying to judge people, I'm trying to exhort people -- I'm giving people advice about what they should do to make the world a better place, I'm not e.g. talking about which research should be published and which should be restricted (I'm generally in favor of publishing research with maybe a few exceptions, but I think corporations on the margin should publish more) Moreover it's much easier for the researcher to judge their own motivations, than for them to judge the long-term impact of their work or to fit it into your diagram.

Defining alignment research

Richard Ngo8mo40

Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).

I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that but will discuss in a different post.

2[comment deleted]8mo

In Defense of Open-Minded UDT

Richard Ngo8mo72

Nice post. I'm excited about the bargaining interpretation of UDT.

However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.

Given this, is there any reason to focus on iterated counterfactual mugging, as opposed to just counterfactual muggings with higher stakes?

It seems like iteration is maybe related to learn... (read more)

2Abram Demski8mo

Yeah, I'm kind of connecting a lot of threads here in a messy way. This post definitely could be better-organized. * I have a sense that open-minded UDT should relate to objective probabilities in a frequentist sense. For example, in decision problems involving Omega, it's particularly compelling if we stipulate that Omega has a long history of offering similar choices to mortals and a track record of being honest and predicting correctly. This is in some sense the most compelling way we can come to know what decision problem we face; and, it relies on framing our decision as part of a sequence. Counterlogical mugging on a digit of pi is similarly compelling if we imagine a very large digit, but becomes less compelling as we imagine digits closer to the beginning of pi. I want to suggest a learning principle with frequentist-flavored guarantees (similar to LIDT or BRIA but less updateful). * On the other hand, the bargaining framing does not have anything to do with iteration. The bargaining idea in some sense feels much more promising, since I can already offer a toy analysis supporting my intuition that iterated counterfactual mugging with the some coin is less tempting than iterated muggings with different coins.

A simple case for extreme inner misalignment

Richard Ngo8mo30

The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with t... (read more)

Eliezer Yudkowsky8mo128

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1000:1 or something else?

Value them primarily? Uhhh... maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.

Clocks ar

... (read more)

Richard Ngo's Shortform

Richard Ngo9mo1911

I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.

tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning vi... (read more)

A more systematic case for inner misalignment

Richard Ngo9mo20

Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect

$f^{- 1} [f (G)] \approx G$ , and I don't, for the reasons in my comment.

A more systematic case for inner misalignment

Richard Ngo9mo20

Thanks for the extensive comment! I'm finding this discussion valuable. Let me start by responding to the first half of your comment, and I'll get to the rest later.

The simplicity of a goal is inherently dependent on the ontology you use to view it through: while $K_{ϕ} (f (G), O_{2}) < K_{ϕ} (G, O_{1})$ is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the "essence" of the goal simplifies; instead, it's merely because it gets access to a more powerful ontology that has more detail, granu

... (read more)

2[anonymous]9mo

To clarify, I don't think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in a more compact ontology coming from a more intelligent agent, ideas/configurations etc that were different in the old ontology get mapped to the same thing in the new ontology (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros: Nonetheless, I still expect f−1[f(G)] (viewed as the preimage of f(G) under the f mapping) and G to only differ very slightly.

A simple case for extreme inner misalignment

Richard Ngo9mo60

I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

A simple case for extreme inner misalignment

Richard Ngo9mo20

Curious who just strong-downvoted and why.

Oliver Habryka9mo74

Early stage votes are pretty noisy (and I think have been getting noisier over time, which is somewhat of a proxy of polarization, which makes me sad).

Buck's Shortform

Richard Ngo10mo41

Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from "legislate that AI labs should do X, Y, Z, as enforced by regulator R" to "legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R". In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important thr... (read more)

Richard Ngo's Shortform

Richard Ngo10mo*26-13

Some opinions about AI and epistemology:

One reasons that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues.
A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
For example: there's no privileged way to combin

... (read more)

Thomas Kwa10mo30

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.

Oliver Habryka10mo*1243

That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom.

I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".

Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't... (read more)

5Andrew Critch10mo

I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.

We might be dropping the ball on Autonomous Replication and Adaptation.

Richard Ngo11mo20

Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high skill labor).

The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.

My guess is that you need to be a decent but not amazing software engineer to ARA.

Yeah, you're probably right. I still stand by the overall point though.

7Hjalmar Wijk11mo

I think in our current situation shutting down all rogue AI operations might be quite tough (though limiting their scale is certainly possible, and I agree with the critique that absent slowdowns or alignment problems or regulatory obstacles etc. it would be surprising if these rogue agents could compete with legitimate actors with many more GPUs). Assuming the AI agents have money there are maybe three remaining constraints the AI agents need to deal with: * purchasing/renting the GPUs, * (if not rented through a cloud provider) setting them up and maintaining them, and * evading law enforcement or other groups trying to locate them and shut them down For acquiring the GPUs, there are currently a ton of untracked GPUs spread out around the world, both gaming GPUs like 4090s and datacenter GPUs like H100s. I can buy or rent them with extremely minimal KYC from tons of different places. If we assume ARA agents are able to do all of the following: * Recruit people to physically buy GPUs from retail stores * Take delivery of online ordered GPUs in anonymized ways using intermediaries * Recruit people to set up fake AI startups and buy GPUs through them then the effort required to prevent them from acquiring any GPUs in a world anything like today seems crazy. And even if the US + allies started confiscating or buying up 4090s en masse, it's enough that some other countries adopt laxer standards (perhaps to attract AI talent that is frustrated by the new draconian GPU-control regime). As for setting up acquired GPUs, I think the AI agents will probably be able to find colocation operators that don't ask too many questions in many parts of the world, but even if the world was to coordinate on extremely stringent controls, the AI agents could set up datacenters of their own in random office buildings or similar - each inference setup wouldn't need that much power and I think it would be very hard to track it down. As for not being shut down by law enforcement

2Ryan Greenblatt11mo

Both legitimately and illegitimately acquiring compute could plausibly be the best route. I'm uncertain. It doesn't seem that easy to lock down legitimate compute to me? I think you need to shutdown people buying/renting 8xH100 style boxes. This seems quite difficult potentially. The model weights probably don't fit in VRAM in a single 8xH100 device (the model weights might be 5 TB when 4 bit quantized), but you can maybe fit with 8-12 of these. And, you might not need amazing interconnect (e.g. normal 1 GB/s datacenter internet is fine) for somewhat high latency (but decently good throughput) pipeline parallel inference. (You only need to send the residual stream per token.) I might do a BOTEC on this later. Unless you're aware of a BOTEC on this?

We might be dropping the ball on Autonomous Replication and Adaptation.

Richard Ngo11mo30

1) It’s not even clear people are going to try to react in the first place.

I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.

If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.

A world which can pause AI development is one which can also easily throttle ARA AIs.

The central point is:
At some

... (read more)

1Charbel-Raphael Segerie11mo

I agree that's a bit too much, but it seems to me that we're not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates. But I still think no return to loss of control because it might be very hard to stop ARA agent still seems pretty fair to me. I agree with your comment on twitter that evolutionary forces are very slow compared to deliberate design, but that is not way I wanted to convey (that's my fault). I think an ARA agent would not only depend on evolutionary forces, but also on the whole open source community finding new ways to quantify, prune, distill, and run the model in a distributed way in a practical way. I think the main driver this "evolution" would be the open source community & libraries who will want to create good "ARA", and huge economic incentive will make agent AIs more and more common and easy in the future.

3Ryan Greenblatt11mo

I push back on this somewhat in a discussion thread here. (As a pointer to people reading through.) Overall, I think this is likely to be true (maybe 60% likely), but not enough that we should feel totally fine about the situation.

We might be dropping the ball on Autonomous Replication and Adaptation.

Richard Ngo11mo50

However, it seems to me like ruling out ARA is a relatively naturally way to mostly rule out relatively direct danger.

This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. Though while I agree that ARA rules out most danger, I think that's because it's just quite a low bar. The sort of tasks involved in buying compute etc are ones most humans could do. Meanwhile more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.

once you do h

... (read more)

2Ryan Greenblatt11mo

Yep, I was imagining big slow down or uncertainty in how explosive AI R&D will be. It agree that this cuts the risk, but I don't think it fully eliminates it. I think it might be considerably harder to fully cut off ARA agents than to do a big slowdown. One way to put this is that quite powerful AIs in the wild might make slowing down 20% less likely to work (conditional on a serious slow down effort) which seems like a decent cost to me. Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truely vast, vast amounts of pretty high skill labor). Of course, AI labs will also have access to vast amounts of high skill labor (from possibly misaligned AIs). I don't know how the offense defense balance goes here. I think full defense might require doing really crazy things that organizations are unwilling to do. (E.g. unleashing a huge number of controlled AI agents which may commit crimes in order to take free energy.) My guess is that you need to be a decent but not amazing software engineer to ARA. So, I wouldn't describe this as "tasks most humans can do".

We might be dropping the ball on Autonomous Replication and Adaptation.

Answer by Richard NgoMay 31, 20243634

I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'... (read more)

3Donald Hobson10mo

If AI labs are slamming on the recursive self improvement ASAP, it may be that Autonomous Replicating Agents are irrelevant. But that's a "ARA can't destroy the world if AI labs do it first" argument. ARA may well have more compute than AI labs. Especially if the AI labs are trying to stay within the law, and the ARA is stealing any money/compute that it can hack it's way into. (Which could be >90% of the internet if it's good at hacking. ) Ok. That's a world model in which humans are being INCREDIBLY stupid. If we want to actually win, we need to both be careful about deploying those other misaligned models, and stop ARA. Alice: That snake bite looks pretty nasty, it could kill you if you don't get it treated. Bob: That snake bite won't kill me, this hand grenade will. Pulls out pin.

3Charbel-Raphael Segerie11mo

Thanks for this comment, but I think this might be a bit overconfident. Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But: * 1) It’s not even clear people are going to try to react in the first place. As I say, most AI development is positive. If you implement regulations to fight bad ARA, you are also hindering the whole ecosystem. It’s not clear to me that we are going to do something about open source. You need a big warning shot beforehand and this is not really clear to me that this happens before a catastrophic level. It's clear they're going to react to some kind of ARAs (like chaosgpt), but there might be some ARAs they won't react to at all. * 2) it’s not clear this defense (say for example Know Your Customer for providers) is going to be sufficiently effective to completely clean the whole mess. if the AI is able to hide successfully on laptops + cooperate with some humans, this is going to be really hard to shut it down. We have to live with this endemic virus. The only way around this is cleaning the virus with some sort of pivotal act, but I really don’t like that. "at the same rate" not necessarily. If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop. The real crux is how much time the ARA AI needs to evolve into something scary. We don't learn much here. From my side, I think that superintelligence is not going to be neglected, and big labs are taking this seriously already. I’m still not clear on ARA. This is not the central point. The central point is: * At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die. * the ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know * This may take an indefinite number

Ryan Greenblatt11mo*138

Hmm, I agree that ARA is not that compelling on its own (as a threat model). However, it seems to me like ruling out ARA is a relatively naturally way to mostly rule out relatively direct danger. And, once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary. Further, it seems somewhat hard to do capabilities evaluations that rule out this self-improvement if models are ARA capable given that there are so possible routes.

Edit: TBC, I think it'... (read more)

Richard Ngo's Shortform

Richard Ngo1y52

You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

Richard Ngo's Shortform

Richard Ngo1y88

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.

This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-ca... (read more)

5Richard Ngo1y

Policy Selection Solves Most Problems

Richard Ngo1y30

Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward?

Is there a particular reason to think that the answer to this shouldn't just be "first run a logical inductor to P_f(f(n)), then use that distribution to determine how to use P_f(n) to determine how to choose an action from P_n" (at least for large enough n)?

3Abram Demski1y

It seems better in principle to find a way to respect human intuitions about which things to be updateless about. Getting something wrong in the too-updateless direction can give up control of the AI to entities which we don't think of as existing; getting something wrong in the too-updateful direction can miss out on multiverse-wide coorination via superrationality.

More people getting into AI safety should do a PhD

Richard Ngo1y55

But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)

Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.

then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because

... (read more)

Richard Ngo's Shortform

Richard Ngo1y7176

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be int... (read more)

6Alex Turner1y

Personally, I'm not ignoring that question, and I've written about it (once) in some detail. Less relatedly, I've talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai. It's not that there isn't more shard theory content which I could write, it's that I got stuck and burned out before I could get past the 101-level content. I felt * a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it); and that * b) I wasn't successfully communicating many intuitions;[1] and * c) it didn't seem as important to make theoretical progress anymore, especially since I hadn't even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network). So I didn't want to post much on the site anymore because I was sick of it, and decided to just get results empirically. I've always read "assume heuristics" as expecting more of an "ensemble of shallow statistical functions" than "a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed." Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed. 1. ^ The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.

The case for ensuring that powerful AIs are controlled

Richard Ngo1y2214

Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".

I think there are (at least) two strong reasons to like this approach:

1. It’s complementary with alignment.

2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs,... (read more)

Richard Ngo's Shortform

Richard Ngo1y85

People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:

1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring... (read more)

6Ryan Greenblatt1y

Why do you think SGD will do this? Or are you imagining non-SGD mechanisms? It seems non-obvious to me that this will occur with SGD, though possible.

The alignment problem from a deep learning perspective

Richard Ngo1y50

Ty for review. I still think it's better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.

Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level "case for misalignment as an x-risk" has been accepted at a major ML conference, to my knowledge. (Though Langosco's goal misgeneralization paper did this a little bit, and was accepted at ICML.)

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y20

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves th... (read more)

1Thane Ruthenis1y

That seems more like value reflection, rather than a value change? The way I'd model it is: you have some value v(x), whose implementations you can't inspect directly, and some guess about what it is P(v(x)). (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of "what if we swap the gear for a different one: which one would you care about then?", your model of that value put the majority of probability mass on v1(x), which was "I value this particular gear". But upon considering Q, your PD over v(x) changed, and now it puts most probability on v2(x), defined as "I care about whatever gear is moving the piston". Importantly, that example doesn't seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly. (Notably, that's only possible in scenarios where you don't have direct access to your values! Where they're black-boxed, and you have to infer their internals from the outside.) Sounds right, yep. I'd argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments. Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all. I suppose I see what you're getting at here, and I agree that it's a real dynamic. But I think it's less important/load-bearing to how agents work than the basic "value translation in a hi

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y20

(drafted this reply a couple months ago but forgot to send it, sorry)

your low-level model of a specific gear's dynamics didn't change — locally, it was as correct as it could ever be.
And if you had a terminal utility function over that gear (e. g., "I want it to spin at the rate of 1 rotation per minutes"), that utility function won't change in the light of your model expanding, either. Why would it?

Let me list some ways in which it could change:

Your criteria for what counts as "the same gear" changes as you think more about continuity of identity over tim

... (read more)

1Thane Ruthenis1y

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon. Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation. Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)? I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value v(x), and it doesn't run into direct conflict with any other values you have, and your model of x isn't wrong at the abstraction level it's defined at, you'll never want to change v(x). That's the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over "the gear that was driving the piston". After you learn it's a different gear from the one you thought, that value isn't updated: you just naturally shift it to the real gear. Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money... but then you realize that was a mistake, and actually your friend's number is B, so you send the money there. That didn't involve any value-related shift. ---------------------------------------- I'll try agai

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y30

In the standard story, what are the terminal goals? You could say "random" or "a mess", but I think it's pretty compelling to say "well, they're probably related to the things that the agent was rewarded for during training". And those things will likely include "curiosity, gaining access to more tools or stockpiling resources".

I call these "convergent final goals" and talk more about them in this post.

I also think that an AGI might systematize other goals that aren't convergent final goals, but these seem harder to reason about, and my central story for w... (read more)

Common misconceptions about OpenAI

Richard Ngo1y60

Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y40

Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?

2Charlie Steiner1y

An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics according to idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this "systematization," but it's not proceeding according to the same story that governed systematization during training by gradient descent.

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y32

I'd classify them as values insofar as people care about them intrinsically.

Then they might also be strategies, insofar as people also care about them instrumentally.

I guess I should get rid of the "only" in the sentence you quoted? But I do want to convey "something which is only a strategy, not a goal or value, doesn't have any intrinsic value". Will think about phrasing.

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y35

It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in th

... (read more)

2Thane Ruthenis1y

Mm, I'll concede that point. I shouldn't have used people as an example; people are messy. Literal gears, then. Suppose you're studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a specific gear's dynamics didn't change — locally, it was as correct as it could ever be. And if you had a terminal utility function over that gear (e. g., "I want it to spin at the rate of 1 rotation per minutes"), that utility function won't change in the light of your model expanding, either. Why would it? ... which can be represented as utility functions. Take a given deontological rule, like "killing is bad". Let's say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that "predicts" that you're very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it's perfectly coherent to model a deontologist as an agent with a lot of separate utility functions. See Friston's predictive-processing framework in neuroscience, plus this (and that comment). Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their symbolic intelligence, but similar dynamics are still at play "under the hood". Well, not a flaw as such; a design choice. Humans are trained in an on-line regime, our values are learned from scratch, and... this process of active value learning just never switches off (although it plausibly slows down with age, see old people often being "set in their ways"). Our values change by the same process by which they were learned to begin with.

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y40

I agree that this is closely related to the predictive processing view of the brain. In the post I briefly distinguish between "low-level systematization" and "high-level systematization"; I'd call the thing you're describing the former. Whereas the latter seems like it might be more complicated, and rely on whatever machinery brains have on top of the predictive coding (e.g. abstract reasoning, etc).

In particular, some humans are way more systematizing than others (even at comparable levels of intelligence). And so just saying "humans are constantly doing... (read more)

4Jan_Kulveit1y

My impression is you get a lot of "the later" if you run "the former" on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g. does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded to write words which resonate with the "the former". Also, I'm not sure why do you think the later is more important for the connection to AI. Curent ML seem more similar to "the former", informal, intuitive, fuzzy reasonining. That's interesting - in contrast, I have a pretty clear intuitive sense of a direction where some people have a lot of internal conflict and as a result their actions are less coherent, and some people have less of that. In contrast I think in case of humans who you would likely describe as 'having systematized there values' ... I often doubt what's going on. A lot people who describe themselves as hardcore utilitarians seem to be ... actually not that, but more resemble a system where somewhat confused verbal part fights with other parts, which are sometimes suppressed. That's where I think looking at what human brains are doing seems interesting. Even if you believe the low-level / "the former" is not what's going with human theories of morality, the technical problem seems very similar and the same math possibly applies

Value systematization: how values become coherent (and misaligned)

Richard Ngo1y62

Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.

But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don't understand them very well until we discover th... (read more)

3Wei Dai1y

Why is this a problem, that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of "go with whichever value/preference/intuition that feels stronger in the moment"? People do that unthinkingly all the time, right? (I have my own thoughts on this, but curious if you agree with me or what your own thinking is.) How would you cash out "don't make sense" here?

2Kaj Sotala1y

I think we should be careful to distinguish explicit and implicit systematization. Some of what you are saying (e.g. getting answers to question like "what counts as lying") sounds like you are talking about explicit, consciously done systematization; but some of what you are saying (e.g. minds identifying aspects of thinking that "don't make sense" and correcting them) also sounds like it'd apply more generally to developing implicit decision-making procedures. I could see the deontologist solving their problem either way - by developing some explicit procedure and reasoning for solving the conflict between their values, or just going by a gut feel for which value seems to make more sense to apply in that situation and the mind then incorporating this decision into its underlying definition of the two values. I don't know how exactly deontological rules work, but I'm guessing that you could solve a conflict between them by basically just putting in a special case for "in this situation, rule X wins over rule Y" - and if you view the rules as regions in state space where the region for rule X corresponds to the situations where rule X is applied, then adding data points about which rule is meant to cover which situation ends up modifying the rule itself. It would also be similar to the way that rules work in skill learning in general, in that experts find the rules getting increasingly fine-grained, implicit and full of exceptions. Here's how Josh Waitzkin describes the development of chess expertise: "Sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity" also sounds to me like some of the models for higher stages of moral development, where one moves past the stage of trying to explicitly systematize morality and can treat entire systems of morality as things that all co-exist in one's mind and are applicable in different situations. Which would make sense, if moral reasoning is a skill i

2Thane Ruthenis1y

Mm, I think there's two things being conflated there: ontological crises (even small-scale ones, like the concept of fitness not being outright destroyed but just re-shaped), and the simple process of translating your preference around the world-model without changing that world-model. It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in the light of all this. In fact: Suppose we've magically created an agent that already starts our with a perfect world-model. It'll never experience an ontology crisis in its life. This agent would still engage in value translation as I'd outlined. If it cares about Alice and Bob, for example, and it's engaging in plotting at the geopolitical scales, it'd still be useful for it to project its care for Alice and Bob into higher abstraction levels, and start e. g. optimizing towards the improvement of the human economy. But optimizing for all humans' welfare would still remain an instrumental goal for it, wholly subordinate to its love for the two specific humans. I think you do, actually? Inasmuch as real-life deontologists don't actually shut down when facing a values conflict. They ultimately pick one or the other, in a show of revealed preferences. (They may hesitate a lot, yes, but their cognitive process doesn't get literally suspended.) I model this just as an agent having two utility functions, u1 and u2, and optimizing for their sum u1+u2. If the values are in conflict, if taking an action that maximizes u1 hurts u2 and vice versa

AI as a science, and three obstacles to alignment strategies

Richard Ngo1y45

I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)

You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.

AI as a science, and three obstacles to alignment strategies

Richard Ngo1y5-6

FWIW I think that gradient hacking is pretty plausible, but it'll probably end up looking fairly "prosaic", and may not be a problem even if it's present.

Lukas Finnveden1y60

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

Lying is Cowardice, not Strategy

Richard Ngo1y130

How do you feel about "In an ideal world, we'd stop all AI progress"? Or "ideally, we'd stop all AI progress"?

Meta Questions about Metaphilosophy

Richard Ngo2y20

FWIW I think some of the thinking I've been doing about meta-rationality and ontological shifts feels like metaphilosophy. Would be happy to call and chat about it sometime.

I do feel pretty wary about reifying the label "metaphilosophy" though. My preference is to start with a set of interesting questions which we can maybe later cluster into a natural category, rather than starting with the abstract category and trying to populate it with questions (which feels more like what you're doing, although I could be wrong).

Against Almost Every Theory of Impact of Interpretability

Richard Ngo2y5115

Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).

Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to re... (read more)