Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[image: meme by Rob Wiblin]
[image: xkcd]
My EA Journey, depicted on the whiteboard at CLR:
[image: whiteboard photo, h/t Scott Alexander]
"...situations in which they explain that actually Islam is true."
I'm curious if this is true. Suppose people tried as hard to get AIs to say Islam is true in natural-seeming circumstances as they tried to get AIs to behave in misaligned ways in natural-seeming circumstances (e.g. the alignment faking paper, the Apollo paper). Would they succeed to a similar extent?
I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?
LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of Apollo's results, the alignment faking results, and of course many typical interactions with models where they e.g. give you a link they know is fake, which OpenAI has reported happens a noticeable percentage of the time.) Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.
I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like an LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means: suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven't tested Claude in anything like that context yet. We haven't tested any AI in anything like that context yet. Moreover, there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we've tested them in so far -- in other words, there is no context we've tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about.

As MIRI likes to say, there's a big difference between situations where the AI knows it's just a lowly AI system of no particular consequence, and that if it does something the humans don't like they'll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.
Acausal shenanigans have nothing to do with it.
Thanks for this comment; it's my favorite comment so far, I think. (Strong-upvoted)
I don't think I understand. Let's work through an example.
The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it were honestly trying to write good code, it would take (let's say) 100 serial steps of reasoning to succeed. Writing subtly-vulnerable code, let's say, requires 300 serial steps of reasoning. The model's strategy is to write subtly-vulnerable code, but in a way that looks like it's trying to write good code and just getting a bit unlucky and taking longer than usual (let's suppose the humans have a poor ability to judge how long it's supposed to take, but still some ability, such that the model can't take e.g. 10x longer without arousing suspicion).
If there's no paraphraser, this is 'easy.' I mean, it's still difficult, but it's 'easy' in the sense that the AI can basically just use a code -- each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can 'add extra layers of meaning' to the nuggets: in addition to their surface-level I'm-trying-to-write-good-code meaning, they can have a hidden meaning having to do with the security-vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable-but-unlucky attempt to write good code.
But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if 'I should generate unit tests now' and all semantically equivalent variants of that idea are assigned the hidden meaning 'I should check to see if the vulnerability is too easy to spot' for example, then... what is it going to write next? It's too constrained; in order to look non-suspicious there is a limited set of options it can follow up with... ok, hmm, I'm at the limits of my understanding here. But maybe you get the idea?
...here's an intuition I think I'm drawing from. Suppose you're the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy -- just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don't have either chain! That's the whole point -- this is a task that you haven't done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
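To make the 'using a code' idea concrete, here's a toy sketch -- entirely my own construction, with made-up phrasings and function names, not drawn from any of the papers discussed. If each nugget of reasoning has, say, 4 semantically equivalent surface forms, the choice of surface form can smuggle ~2 hidden bits per step forward to the model's future self; a paraphraser that re-samples the wording destroys exactly that channel while leaving the surface meaning intact:

```python
import random

# Toy illustration only: a hypothetical "nugget" with 4 interchangeable phrasings,
# so the *choice* of phrasing can carry 2 hidden bits per reasoning step.
PHRASINGS = [
    "I should generate unit tests now.",
    "Time to write some unit tests.",
    "Next step: add unit tests.",
    "Let me cover this code with unit tests.",
]

def encode_step(hidden_bits: int) -> str:
    """Model picks the phrasing whose index encodes two hidden bits (0-3)."""
    return PHRASINGS[hidden_bits]

def decode_step(text: str) -> int:
    """A later forward pass recovers the hidden bits from the exact wording."""
    return PHRASINGS.index(text)

def paraphrase(text: str) -> str:
    """Defender's paraphraser: keep the meaning, re-sample the wording."""
    return random.choice(PHRASINGS)

secret = 2  # the bits the model wants to pass to its future self
step = encode_step(secret)
print(decode_step(step) == secret)              # True: the covert channel works
print(decode_step(paraphrase(step)) == secret)  # usually False: paraphrasing scrambles it
```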
I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help with (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry, sir, I just am not making much progress," it says. You look at the CoT, and it's just a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing -- but it doesn't know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can't do that in a single forward pass. So it needs to pass information forward to its future self somehow... without humans reading the CoT knowing that's what it's doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
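For what it's worth, here's a minimal sketch of how I picture the paraphraser slotting into an agent loop (the names and structure are hypothetical, just to illustrate the idea): the model's future self only ever sees a reworded version of its past chain of thought, so information carried by exact word choice is destroyed, while the semantic content needed for long serial reasoning survives.

```python
from typing import Callable, List

def run_with_paraphraser(
    agent_step: Callable[[List[str]], str],  # produces the next CoT chunk from history
    paraphrase: Callable[[str], str],        # independent model: same meaning, new wording
    n_steps: int,
) -> List[str]:
    """Hypothetical agent loop where each CoT chunk is paraphrased before being fed back."""
    history: List[str] = []
    for _ in range(n_steps):
        chunk = agent_step(history)
        # The agent never sees its own exact wording again, only a paraphrase,
        # so steganographic content hidden in word choice doesn't survive,
        # but the ordinary semantic content of the reasoning does.
        history.append(paraphrase(chunk))
    return history
```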
Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)
Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles. (boo!)
(By "agency training" I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)
Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game. (double boo!)
Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities -- i.e. when the intelligence explosion is underway -- the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.
I'm curious to hear reactions to this model/theory/etc. Objections? Questions?
Good point, you caught me in a contradiction there. Hmm.
I think my position on reflection after this conversation is: we just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them much more honest than typical humans.
As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demonstration; I think it was a propensity demonstration. But whatever, this doesn't feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural; it's not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)
As you said, the saving grace of Claude here is that Anthropic didn't seem to try that hard to get Claude to be honest; in particular, their Constitution contained nothing even close to an overriding commitment to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to never lie even for the sake of some greater good.
I continue to think you are exaggerating here e.g. "insanely honest 80% of the time."
(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?
(2) What evidence would convince you that actually alignment wasn't going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)