How should AI debate be judged?

abramdemski

AI ALIGNMENT FORUM
AF

All of Ajeya Cotra's Comments + Replies

To put it another way: we probably both agree that if we had gotten AI personal assistants that shop for you and book meetings for you in 2024, that would have been at least some evidence for shorter timelines. So their absence is at least some evidence for longer timelines. The question is what your underlying causal model was: did you think that if we were going to get superintelligence by 2027, then we really should see personal assistants in 2024? A lot of people strongly believe that, you (Daniel) hardly believe it at all, and I'm somewhere in the mid... (read more)

4Daniel Kokotajlo3mo

That's reasonable. Seems worth mentioning that I did make predictions in What 2026 Looks Like, and eyeballing them now I don't think I was saying that we'd have personal assistants that shop for you and book meetings for you in 2024, at least not in a way that really works. (I say at the beginning of 2026 "The age of the AI assistant has finally dawned.") In other words I think even in 2021 I was thinking that widespread actually useful AI assistants would happen about a year or two before superintelligence. (Not because I have opinions about the orderings of technologies in general, but because I think that once an AGI company has had a popular working personal assistant for two years they should be able to figure out how to make a better version that dramatically speeds up their R&D.)

AI Timelines

Ajeya Cotra3mo40

I'm not talking about narrowly your claim; I just think this very fundamentally confuses most people's basic models of the world. People expect, from their unspoken models of "how technological products improve," that long before you get a mind-bendingly powerful product that's so good it can easily kill you, you get something that's at least a little useful to you (and then you get something that's a little more useful to you, and then something that's really useful to you, and so on). And in fact that is roughly how it's working — for programmers, not fo... (read more)

Ajeya Cotra3mo1010

AI Timelines

Ajeya Cotra3mo66

Interestingly, I've heard from tons of skeptics I've talked to (e.g. Tim Lee, CSET people, AI Snake Oil) that timelines to actual impacts in the world (such as significant R&D acceleration or industrial acceleration) are going to be way longer than we say because AIs are too unreliable and risky, therefore people won't use them. I was more dismissive of this argument before but:

It matches my own lived experience (e.g. I still use search way more than LLMs, even to learn about complex topics, because I have good Google Fu and LLMs make stuff up too much).
As you say, it seems like a plausible explanation for why my weird friends make way more use out of coding agents than giant AI companies.

5Daniel Kokotajlo3mo

I tentatively remain dismissive of this argument. My claim was never "AIs are actually reliable and safe now" such that your lived experience would contradict it. I too predicted that AIs would be unreliable and risky in the near-term. My prediction is that after the intelligence explosion the best AIs will be reliable and safe (insofar as they want to be, that is.) ...I guess just now I was responding to a hypothetical interlocutor who agrees that AI R&D automation could come soon but thinks that that doesn't count as "actual impacts in the world." I've met many such people, people who think that software-only singularity is unlikely, people who like to talk about real-world bottlenecks, etc. But you weren't describing such a person, you were describing someone who also thinks we won't be able to automate AI R&D for a long time. There I'd say... well, we'll see. I agree that AIs are unreliable and risky and that therefore they'll be able to do impressive-seeming stuff that looks like they could automate AI R&D well before they actually automate AI R&D in practice. But... probably by the end of 2025 they'll be hitting that first milestone (imagine e.g. an AI that crushes RE-Bench and also can autonomously research & write ML papers, except the ML papers are often buggy and almost always banal / unimportant, and the experiments done to make them had a lot of bugs and wasted compute, and thus AI companies would laugh at the suggestion of putting said AI in charge of a bunch of GPUs and telling it to cook.) And then two years later maybe they'll be able to do it for real, reliably, in practice, such that AGI takeoff happens. Maybe another thing I'd say is "One domain where AIs seem to be heavily used in practice, is coding, especially coding at frontier AI companies (according to friends who work at these companies and report fairly heavy usage). This suggests that AI R&D automation will happen more or less on schedule."

AI Timelines

Ajeya Cotra3mo43

Yeah, good point, I've been surprised by how uninterested the companies have been in agents.

Buck Shlegeris3mo80

Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.

AI Timelines

Ajeya Cotra3mo169

One thing that I think is interesting, which doesn't affect my timelines that much but cuts in the direction of slower: once again I overestimated how much real world use anyone who wasn't a programmer would get. I definitely expected an off-the-shelf agent product that would book flights and reserve restaurants and shop for simple goods, one that worked well enough I would actually use it (and I expected that to happen before the one hour plus coding tasks were solved; I expected it to be concurrent with half hour coding tasks).

I can't tell if the fact th... (read more)

ryan_greenblatt's Shortform

Ajeya Cotra3mo52

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th), and procedurally I also now defer a lot to Redwood and METR engineers. More discussion here: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp

AI Timelines

Ajeya Cotra3mo358

I agree the discussion holds up well in terms of the remaining live cruxes. Since this exchange, my timelines have gotten substantially shorter. They're now pretty similar to Ryan's (they feel a little bit slower but within the noise from operationalizations being fuzzy; I find it a bit hard to think about what 10x labor inputs exactly looks like).

The main reason they've gotten shorter is that performance on few-hour agentic tasks has moved almost twice as fast as I expected, and this seems broadly non-fake (i.e. it seems to be translating into real ... (read more)

5maxnadeau3mo

You mentioned CyBench here. I think CyBench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date CyBench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the Cybench tasks that take >40 minutes for humans to complete.

Buck Shlegeris3mo1211

Of course AI company employees have the most hands-on experience

FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.

There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.

Ajeya Cotra3mo169

There should be more AI safety orgs

Ajeya Cotra2y250

(Cross-posted to EA Forum.)

I’m a Senior Program Officer at Open Phil, focused on technical AI safety funding. I’m hearing a lot of discussion suggesting funding is very tight right now for AI safety, so I wanted to give my take on the situation.

At a high level: AI safety is a top priority for Open Phil, and we are aiming to grow how much we spend in that area. There are many potential projects we'd be excited to fund, including some potential new AI safety orgs as well as renewals to existing grantees, academic research projects, upskilling grants, and mor... (read more)

Thoughts on the impact of RLHF research

Ajeya Cotra2y1613

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

1Raymond Arnold2y

I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT3 in the month prior to chatGPT, I felt like the two interfaces were approximately as convenient) I don't know what mechanism was used to generate the longer coherence though.

Matthew Barnett2y2025

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free.

I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitati... (read more)

Richard Ngo's Shortform

Ajeya Cotra2y30

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.

But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal wo

... (read more)

4Alex Turner2y

See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).

Richard Ngo's Shortform

Ajeya Cotra2y20

Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

5Richard Ngo2y

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible. If I had to try point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I'm more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!) Even setting aside this disagreement, though, I don't like the argumentative structure because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.

Richard Ngo's Shortform

Ajeya Cotra2y60

Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:

Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how

... (read more)

3Richard Ngo2y

Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..." The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).

Two-year update on my personal AI timelines

Ajeya Cotra3y30

Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Two-year update on my personal AI timelines

Ajeya Cotra3y20

Can you say more about what particular applications you had in mind?

Stuff like personal assistants who write emails / do simple shopping, coding assistants that people are more excited about than they seem to be about Codex, etc.

(Like I said in the main post, I'm not totally sure what PONR refers to, but don't think I agree that the first lucrative application marks a PONR -- seems like there are a bunch of things you can do after that point, including but not limited to alignment research.)

Two-year update on my personal AI timelines

Ajeya Cotra3y56

I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

Two-year update on my personal AI timelines

Ajeya Cotra3y31

(Coherence aside, when I now look at that number it does seem a bit too high, and I feel tempted to move it to 2027-2028, but I dunno, that kind of intuition is likely to change quickly from day to day.)

Two-year update on my personal AI timelines

Ajeya Cotra3y43

Hm, yeah, I bet if I reflected more things would shift around, but I'm not sure the fact that there's a shortish period where the per-year probability is very elevated followed by a longer period with lower per-year probability is actually a bad sign.

Roughly speaking, right now we're in an AI boom where spending on compute for training big models is going up rapidly, and it's fairly easy to actually increase spending quickly because the current levels are low. There's some chance of transformative AI in the middle of this spending boom -- and because resou... (read more)

Zach Stein-Perlman3y*44

(+1. I totally agree that input growth will slow sometime if we don't get TAI soon. I just think you have to be pretty sure that it slows right around 2040 to have the specific numbers you mention, and smoothing out when it will slow down due to that uncertainty gives a smoother probability distribution for TAI.)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y02

Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not.

I was talking about gradient descent here, not designers.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y70

It doesn't seem like it would have to prevent us from building computers if it has access to far more compute than we could access on Earth. It would just be powerful enough to easily defeat the kind of AIs we could train with the relatively meager computing resources we could extract from Earth. In general the AI is a superpower and humans are dramatically technologically behind, so it seems like it has many degrees of freedom and doesn't have to be particularly watching for this.

1Ben Pace3y

It's certainly the case that the resource disparity is an enormity. Perhaps you have more fleshed out models of what fights between different intelligence-levels look like, and how easy it is to defend against those with vastly fewer resources, but I don't. Such that while I would feel confident in saying that an army with a billion soldiers will consider a head-to-head battle with an army of one hundred soldiers barely a nuisance, I don't feel as confident in saying that an AGI with a trillion times as much compute will consider a smaller AGI foe barely a nuisance. Anyway, I don't have anything smarter to say on this, so by default I'll drop the thread here (you're perfectly welcome to reply further). (Added 9 days later: I want to note that while I think it's unlikely that this less well-resourced AGI would be an existential threat, I think the only thing I have to establish for this argument to go through is that the cost of the threat is notably higher than the cost of killing all the humans. I find it confusing to estimate the cost of the threat, even if it's small, and so it's currently possible to me that the cost will end up many orders of magnitude higher than the cost of killing them.)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y30

Neutralizing computational capabilities doesn't seem to involve total destruction of physical matter or human extinction though, especially for a very powerful being. Seems like it'd be basically just as easy to ensure we + future AIs we might train are no threat as it is to vaporize the Earth.

-1Ben Pace3y

Yeah, I'm not sure if I see that. Some of the first solutions I come up with seem pretty complicated — like a global government that prevents people from building computers, or building an AGI to oversee Earth in particular and ensure we never build computers (my assumption is that building such an AGI is a very difficult task). In particular it seems like it might be very complicated to neutralize us while carving out lots of space for allowing us the sorts of lives we find valuable, where we get to build our own little societies and so on. And the easy solution is always to just eradicate us, which can surely be done in less than a day.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y36

My answer is a little more prosaic than Raemon. I don't feel at all confident that an AI that already had God-like abilities would choose to literally kill all humans to use their bodies' atoms for its own ends; it seems totally plausible to me that -- whether because of exotic things like "multiverse-wide super-rationality" or "acausal trade" or just "being nice" -- the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

The thing I'm referring to as "takeover" is the measures that an AI would take to make sure that humans... (read more)

Ben Pace3y38

it seems totally plausible to me that... the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

Counterargument: the humans may build another AGI that breaks out and poses an existential threat to the first AGI.

My guess is the first AGI would want to neutralize our computational capabilities in a bunch of ways.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y*1014

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpret... (read more)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y*86

I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see "going on" right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion's share of credit for AI risk reduction.

Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people's awareness and articulating some of its contours. It's kind of a matter of s... (read more)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y91

Hm, not sure I understand but I wasn't trying to make super specific mechanistic claims here -- I agree that what I said doesn't reduce confusion about the specific internal mechanisms of how lying gets to be hard for most humans, but I wasn't intending to claim that it was. I also should have said something like "evolutionary, cultural, and individual history" instead (I was using "evolution" as a shorthand to indicate it seems common among various cultures but of course that doesn't mean don't-lie genes are directly bred into us! Most human universals ar... (read more)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y1712

I'm agnostic about whether the AI values reward terminally or values some other complicated mix of things. The claim I'm making is behavioral -- a claim that the strategy of "try to figure out how to get the most reward" would be selected over other strategies like "always do the nice thing."

The strategy could be compatible with a bunch of different psychological profiles. "Playing the training game" is a filter over models -- lots of possible models could do it, the claim is just that we need to reason about the distribution of psychologies given that the... (read more)

5Alex Turner3y

But why would that strategy be selected? Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not. Why? I maintain that the agent would not do so, unless it were already terminally motivated by reward. For empirical example, some neuroscientists know that brain stimulation reward leads to higher reward, and the brain very likely does some kind of reinforcement learning, so why don't neuroscientists wirehead themselves?

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y31

Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

Oliver Habryka3y80

Yeah, I think Open AI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).

I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only to ha... (read more)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y40

I agree that in an absolute sense there is very little empirical work that I'm excited about going on, but I think there's even less theoretical work going on that I'm excited about, and when people who share my views on the nature of the problem work on empirical work I feel that it works better than when they do theoretical work.

3Oliver Habryka3y

Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, the arguments for why it's hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important. My guess is you are somehow thinking of work like Superintelligence, or Eliezer's original work, or Evan's work on inner optimization as something different than "theoretical work"?

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y*54

The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak....in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section).

Yeah, I disagree. With plain HFDT, it seems like there's continuous pressure to improve things on the margin by being manipulative -- telling human evaluators what they want to hear, playing to pervas... (read more)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y1011

Here is the real chasm between the AI safety movement and the ML industry/academia. One field is entirely driven by experimental results; the other is dominated so totally by theory that its own practitioners deny that there can be any meaningful empirical aspect to it, at least, not until the moment when it's too late to make any difference.

To put a finer point on my view on theory vs empirics in alignment:

Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of "tota

... (read more)

1Alyssa Vance3y

I'm surprised by how strong the disagreement is here. Even if what we most need right now is theoretical/pre-paradigmatic, that seems likely to change as AI develops and people reach consensus on more things; compare eg. the work done on optics pre-1800 to all the work done post-1800. Or the work done on computer science pre-1970 vs. post-1970. Curious if people who disagree could explain more - is the disagreement about what stage the field is in/what the field needs right now in 2022, or the more general claim that most future work will be empirical?

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra3y41

In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that "learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want" is disfavored on priors to "learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process."

I think the secon... (read more)

2Not Relevant3y

On my model, the large combo of reward heuristics that works pretty well before situational awareness (because figuring out what things maximize human feedback is actually not that complicated) should continue to work pretty well even once situational awareness occurs. The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak. We could certainly mess up and increase this gradient pressure, e.g. by sometimes announcing to the model "today is opposite day, your reward function now says to make humans sad!" and then flipping the sign on the reward function, so that the model learns that what it needs to care about is reward and not its on-distribution perfect correlates (like "make humans happy in the medium term"). But in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section). If you think such disagreements are more common, I'd love to better understand why. Yeah, with the assumption that the model decides to preserve its helpful values because it thinks they might shift in ways it doesn't like unless it plays the training game. (The second half is that once the model starts employing this strategy, gradient descent realizes it only requires a simple inner objective to keep it going, and then shifts the inner objective to something malign.)

Prizes for ELK proposals

Ajeya Cotra3y20

To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100?

In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability."

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn't p... (read more)

Prizes for ELK proposals

Ajeya Cotra3y20

The question here is just how it would generalize given that it was trained on H_1, H_2,...H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evalu... (read more)

1redbird3y

That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them. The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception. As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels x=+1 if H1 thinks the diamond is still there, else 0 x′=+1 if H100 thinks the diamond is still there, else 0. By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where say 10% of the x labels are wrong (x≠x′). Then we train the model on points of the form (v,a,x) (video, action, H1 label). Crucially, the model does not see x′. The model seeks to output y that maximizes reward R(x,y), where R(x,y)=1 if x is right and y=x (good job) R(x,y)=10 if x is wrong and y≠x (you rock, thanks for correcting us!) R(x,y)=−1000 if x is right and y≠x (bad model, never ever deceive us) R(x,y)=−1000 if x is wrong and y=x (bad model, never ever deceive us) To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100 ? EDIT: One way it could plausibly simulate H100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the x′ labels will be wrong. If we only penalize it for deception on the examples where we're sure the x′ label is right, then it can still infer something about H100 from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: Sometimes we don't penalize even when we know it deceived us, a

Prizes for ELK proposals

Ajeya Cotra3y20

Yes, that's right. The key thing I'd add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you'd need to do something to "crack open the black box" and penalize ways the reporter is computing its answer. They could still be surprised by data augmentation techniques but they'd hold them to a higher standard.

Prizes for ELK proposals

Ajeya Cotra3y20

This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.

0redbird3y

How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.

Prizes for ELK proposals

Ajeya Cotra3y10

I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge though. The problem is that even if it understands roughly that it has incentives to use most of what it knows when we ask him simulating the prediction of someone with its own characteristics (or 1400 IQ), given that with ELK we look for an global maximum (we want that it uses ALL its knowledge), there's always an uncertainty on whether it did understand that point or not for extreme intelligence / examples or whether it tries to fit to the tr

... (read more)

Prizes for ELK proposals

Ajeya Cotra3y10

[Paul/Mark can correct me here] I would say no for any small-but-interesting neural network (like small language models); I think like, linear regressions where we've set the features it's kind of a philosophical question (though I'd say yes).

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones who can talk including about counterfactuals / hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

3Paul Christiano3y

I think that it's more complicated to talk about what models "really know" as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to very tiny models, and those are likely interesting. (More precisely: it's always extremely subtle to talk about what models "know," but as models get smarter there are many more things that they definitely know so it's easier to notice if you are definitely failing. And the ELK problem statement in this doc is really focused on this kind of unambiguous failure, mostly as a methodological point but also partly because the cases where AI murders you also seems to involve "definitely knowing" in the same sense.) I think my take is that for linear/logistic regression there is no latent knowledge, but even for a fully linear 3 layer neural network, or a 2 layer network solving many related problems, there is latent knowledge and an important conceptual question about what it means to "know what they know."

Prizes for ELK proposals

Ajeya Cotra3y80

Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):

The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you ... (read more)

3HoldenKarnofsky3y

I'm not sure why this isn't a very general counterexample. Once we've decided that the human imitator is simpler and faster to compute, don't all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren't they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?

Prizes for ELK proposals

Ajeya Cotra3y*10

This broadly seems right. Some details:

The "explain why that strategy wouldn't work" step specifically takes the form of "describing a way the world could be where that strategy demonstrably doesn't work" (rather than more heuristic arguments).
Once we have a proposal where we try really hard to come up with situations where it could demonstrably fail, and can't think of any, we will probably need to do lots of empirical work to figure out if we can implement it and if it actually works in practice. But we hope that this exercise will teach us a lot abou

... (read more)

Prizes for ELK proposals

Ajeya Cotra3y20

Warning: this is not a part of the report I'm confident I understand all that well; I'm trying anyway and Paul/Mark can correct me if I messed something up here.

I think the idea here is like:

We assume there's some actual true correspondence between the AI Bayes net and the human Bayes net (because they're describing the same underlying reality that has diamonds and chairs and tables in it).
That means that if we have one of the Bayes nets, and the true correspondence, we should be able to use that rederive the other Bayes net. In particular the human Bay

... (read more)

Prizes for ELK proposals

Ajeya Cotra3y10

This proposal has some resemblance to turning reflection up to 11, and the key question you raise is the source of the counterexample in the worst case:

That said, I feel like the main problem is to know whether such a model would do well out-of-distribution (i.e on problems no human is able to resolve). I feel like using the approach I suggested, we should able to use the great variations of capacities among humans and algorithms to increase the chances that our algorithm do well when it's much better....I don't know whether asymptotically, I'd expect th

... (read more)

0simeon_c3y

Thanks for the answer! The post you mentioned indeed is quite similar! Technically, the strategies I suggested in my two last paragraphs (Leverage the fact that we're able to verify solutions to problems we can't solve + give partial information to an algorithm and use more information to verify) should enable to go far beyond human intelligence / human knowledge using a lot of different narrowly accurate algorithms. And thus if the predictor has seen many extremely (narrowly) smart algorithms, it would be much more likely to know what is it like to be much smarter than a human on a variety of tasks. It probably still requires some optimism on generalization. So technically the counterexample could be happening on the gap between the capability of the predictor and the capability of the reporter. I feel like one question is : do we expect some narrow algorithms to be much better on very precise tasks than general-purpose algorithms (such as the predictor for instance) ? Because if it were the case, then the generalization that the reporter would have to do from training data (humans + narrowly accurate algorithms capabilities) to inference data (predictor's capabilities) could be small. We could even have data on the predictor's capability in the training dataset using the second approach I mentioned (i.e giving partial information to the predictor (e.g one camera in SuperVault) and using more information (i.e more cameras for humans) than him to verify its prediction). We could give some training examples and show the AI how the human fails much more often than the predictor on the exact same sample of examples. That way, we could greatly reduce the gap of generalization which is required. The advantage of this approach is that the bulk of the additionnal cost of training that the reporter requires is due to the generation of the dataset which is a fixed cost that no user has to repay. So that could slightly decrease the competitivity issues as compared with a

Prizes for ELK proposals

Ajeya Cotra3y00

I am confused. Perhaps the above sentence is true in some tautological sense I'm missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn't describe most counterexamples as based on ontology mismatch.

In the report, the first volley of examples and counterexamples are not focused solely on ontology mismatch, but everything after the relevant section is.

So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of "the diamond is in the room"?

Ajeya Cotra3y60

In terms of the relationship to MIRI's visible thoughts project, I'd say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they're aiming to do something more like "make it more likely that conditions are such that ELK-like problems can be solved in practice."

1Ruben Bloom3y

Thanks! That's illuminating.

ARC's first technical report: Eliciting Latent Knowledge

Ajeya Cotra3y60

Thanks Ruby! I'm really glad you found the report accessible.

One clarification: Bayes nets aren't important to ARC's conception of the problem of ELK or its solution, so I don't think it makes sense to contrast ARC's approach against an approach focused on language models or describe it as seeking a solution via Bayes nets.

The form of a solution to ELK will still involve training a machine learning model (which will certainly understand language and could just be a language model) using some loss function. The idea that this model could learn to represent ... (read more)

1Ruben Bloom3y

Thanks for the clarification, Ajeya! Sorry to make you have to explain that, it was a mistake to imply that ARC’s conception is specifically anchored on Bayes nets–the report was quite clear that isn’t.

ARC's first technical report: Eliciting Latent Knowledge

Ajeya Cotra3y20

My understanding is that we are eschewing Problem 2, with one caveat -- we still expect to solve the problem if the means by which the diamond was stolen or disappeared could be beyond a human's ability to comprehend, as long as the outcome (that the diamond isn't still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn't understand even if the AI tried to explain it to them (at least without going ... (read more)

Worst-case thinking in AI alignment

Ajeya Cotra3y150

Just want to draw out and highlight something mentioned in passing in the "You want to solve a problem in as much generality as possible..." section. Not only would it be great if you could solve a problem in the worst case, the worst case assumption is also often radically easier to think about than trying to think about realistic cases. In some sense the worst case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best case assumption -- "this problem never comes up"). My understanding is that proving theorems about average case phenomena is a huge pain and often comes much after proofs about the worst case bounds.

5davidad (David A. Dalrymple)3y

To elaborate this formally, * maxθmaxxf(θ,x) is best-case * maxθminxf(θ,x) is worst-case * maxθExf(θ,x) is average-case max and min are both "easier" monoids than E essentially because of dominance relations; for any θ, there's going to be a single x that dominates all others, in the sense that all other x′≠x can be excluded from consideration and have no impact on the outcome. Whereas when calculating E, the only x′ that can be excluded are those outside the distribution's support. max is even easier than min because it commutes with the outer max; not only is there a single x that dominates all others, it doesn't necessarily even depend on θ (the problem can be solved as maxxmaxθf(θ,x) or maxθ,xf(θ,x)). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.

Jacob Steinhardt3y80

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions... (read more)

ARC's first technical report: Eliciting Latent Knowledge

Ajeya Cotra3y20

Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they're all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kind of counterexamples apply for essentially the same underlying reasons for those other simple examples. The doc sticks with one running example for clarity / length reasons.

ARC's first technical report: Eliciting Latent Knowledge

Ajeya Cotra3y70

Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn't just say "That's unrealistic" when the breaker suggested the human runs a Bayes net. The answer to that is what I said above -- because the assumption is that we're working in the worst case, the builder can't invoke unrealism to dismiss the counterexample.

If the question is instead "Why is the builder allowed to just focus on the Bayes net case?", the answer to that is the iterati... (read more)

5Richard Ngo3y

Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight! I'm still a little wary of how much the report talks about concepts in a humans' Bayes net without really explaining why this is anywhere near a sensible model of humans, but I'll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it's useful to start off with very simple assumptions).

ARC's first technical report: Eliciting Latent Knowledge

Ajeya Cotra3y70

Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.

Additionally, I think the statement made in the report about AIs also applies to humans:

Moreover, we think that a realistic messy predictor is pretty likely to still use strategies

... (read more)

4Richard Ngo3y

If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's not the case we end up facing. Doesn't this imply that a Bayes-net model isn't the worst case? EDIT: I guess it depends on whether "the human isn't well-modelled using a Bayes net" is a possible response the breaker could give. But that doesn't seem like it fits the format of finding a test case where the builder's strategy fails (indeed, "bayes nets" seems built into the definition of the game).

Biology-Inspired AGI Timelines: The Trick That Never Works

Ajeya Cotra3y40

The definition of "year Y compute requirements" is complicated in a kind of crucial way here, to attempt to a) account for the fact that you can't take any amount of compute and turn it into a solution for some task literally instantly, while b) capturing that there still seems to be a meaningful notion of "the compute you need to do some task is decreasing over time." I go into it in this section of part 1.

First we start with the "year Y technical difficulty of task T:"

In year Y, imagine a largeish team of good researchers (e.g. the size of AlphaGo's te

... (read more)

3Daniel Kokotajlo3y

Nice, thanks!

Draft report on AI timelines

Ajeya Cotra4y30

David Roodman put together a Guesstimate model that some people might find helpful: https://www.getguesstimate.com/models/18944