All of Daniel Kokotajlo's Comments + Replies

I found this comment helpful, thanks!

The bottom line is basically "Either we define horizon length in such a way that the trend has to be faster than exponential eventually (when we 'jump all the way to AGI') or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI."

I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.

Great question. You are forcing me to actually think through the argument more carefully. Here goes:

Suppose we defined "t-AGI" as "An AI system that can do basically everything that professional humans can do in time t or less, and just as well, while being cheaper." And we said AGI is an AI that can do everything at least as well as professional humans, while being cheaper.

Well, then AGI = t-AGI for t=infinity. Because for anything professional humans can do, no matter how long it takes, AGI can do it at least as well.
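Writing that out explicitly (my notation, just restating the definitions above): let T be the set of tasks professional humans can do, and h(τ) the time it typically takes them.

```latex
% Restating the definitions above; T = tasks professional humans can do,
% h(\tau) = typical human completion time.
\[
\text{$t$-AGI} \iff \forall \tau \in T \text{ with } h(\tau) \le t:\ \text{the AI does } \tau \text{ at least as well, more cheaply}
\]
\[
\text{AGI} \iff \forall \tau \in T:\ \text{the AI does } \tau \text{ at least as well, more cheaply} \iff \text{$t$-AGI with } t = \infty
\]
```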

Now, METR's definition is different. ... (read more)

ICYMI, the same argument appears in the METR paper itself, in section 8.1 under "AGI will have 'infinite' horizon length."

The argument makes sense to me, but I'm not totally convinced.

In METR's definition, they condition on successful human task completion when computing task durations.  This choice makes sense in their setting for reasons they discuss in B.1.1, but things would get weird if you tried to apply it to extremely long/hard tasks.

If a typical time-to-success for a skilled human at some task is ~10 years, then the task is probably so ambiti... (read more)

I don't believe it. I don't believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster but that would maybe increase overall algo progress by like 30% idk. But also I don't think coding is really 3x faster on average for the things that matter.

Jacob Steinhardt
I meant coding in particular, I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report a 3x speedup for writing code, although they said that the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).

I indeed think that AI assistance has been accelerating AI progress. However, so far the effect has been very small, like single-digit percentage points. So it won't be distinguishable in the data from zero. But in the future if trends continue the effect will be large, possibly enough to more than counteract the effect of scaling slowing down, possibly not, we shall see.

Jacob Steinhardt
Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be enough that it would be showing up in the numbers. My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, that this is already being taken into account in the curves, and that without this effect you would see a slower (though imo not massively slower) exponential. (This would argue for dropping the pre-2022 models from the graph, which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball it.)

I'm not sure if I understand what you are saying. It sounds like you are accusing me of thinking that skills are binary--either you have them or you don't. I agree, in reality many skills are scalar instead of binary; you can have them to greater or lesser degrees. I don't think that changes the analysis much though.

This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. Eli Lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We've gotten to AIs that can with ~100% reliability do tasks that take professional humans 1... (read more)
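To illustrate the arithmetic behind "each doubling takes 10% less calendar time on average", here's a toy sketch in Python. All the starting numbers are made up for illustration; they are not the actual inputs to the calculation Eli and I did.

```python
# Toy illustration of a superexponential horizon-length trend: each doubling of
# the task horizon takes 10% less calendar time than the previous one.
# All starting values below are hypothetical.
horizon_hours = 1.0          # hypothetical current task horizon
target_hours = 2000.0        # hypothetical ~work-year horizon
doubling_time_months = 7.0   # hypothetical current doubling time
shrink = 0.9                 # each doubling takes 10% less calendar time

months = 0.0
while horizon_hours < target_hours:
    months += doubling_time_months
    horizon_hours *= 2
    doubling_time_months *= shrink

# Even though the horizon keeps doubling forever, total calendar time is bounded:
# at most 7 / (1 - 0.9) = 70 months to "infinite" horizon under these numbers.
print(f"~{months:.0f} months to reach the target horizon")
```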

Anthony DiGiovanni
I'm confused as to what the actual argument for this is. It seems like you've just kinda asserted it. (I realize in some contexts all you can do is offer an "incredulous stare," but this doesn't seem like the kind of context where that suffices.) I'm not sure if the argument is supposed to be the stuff you say in the next paragraph (if so, the "Also" is confusing).

Doesn't the trend line already take into account the effect you are positing? ML research engineers already say they get significant and increasing productivity boosts from AI assistants and have been for some time. I think the argument you are making is double-counting this. (Unless you want to argue that the kink with Claude is the start of the super-exponential, which we would presumably get data on pretty soon).

I'm not at all convinced it has to be something discrete like "skills" or "achieved general intelligence". 

There are many continuous factors that I can imagine that help planning long tasks.

Since '23 my answer to that question would have been "well the first step is for researchers like you to produce [basically exactly the paper OpenAI just produced]"

So that's done. Nice. There are lots of follow-up experiments that can be done. 

I don't think trying to shift the market/consumers as a whole is very tractable.

But talking to your friends at the companies, getting their buy-in, seems valuable. 

I think I don't understand why the version of the experiment I proposed is worse/bad/etc., and am getting hung up on that. 

I like your second experiment design. Seems good to control for the chunk tags.

Question: Why do you need chunk tags at all?

Fabien Roger
I see 2 issues with the experiment you suggested:
* The chunking. (Where do you stop to paraphrase? I think to have something meaningful to paraphrase you want something like every "big paragraph" (but maybe not each equation, which could each be a line). Idk how to do that cleanly without doing the distillation. And not doing it cleanly might result in benign performance degradation.)
* The OODness, but I guess you're fine with this one? I suspect this might get you worrisome results for no worrying reason, just because it's somewhat degenerate to prefill a scratchpad with a big paraphrase, get a new paragraph which is probably influenced by the paraphrasing style already, then paraphrase that. Besides the paraphrasing style being distracting, there is also a "paraphrase of paraphrase of paraphrase of ..." effect that might degrade the scratchpad for "normal" reasons but in ways that are hard to pull apart from the worrying reasons.

Great stuff, thank you! It's good news that training on paraphrased scratchpads doesn't hurt performance. But you know what would be even better? Paraphrasing the scratchpad at inference time, and still not hurting performance.

That is, take the original reasoning claude and instead of just having it do regular autoregressive generation, every chunk or so pause & have another model paraphrase the latest chunk, and then unpause and keep going. So Claude doesn't really see the text it wrote, it sees paraphrases of the text it wrote. If it still gets to th... (read more)
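Rough sketch of the loop I have in mind, with hypothetical wrappers for the two models:

```python
# Inference-time paraphrasing sketch. `generate_chunk` and `paraphrase` are
# hypothetical wrappers around the reasoning model and the paraphraser model.
# Key property: the reasoner only ever sees paraphrases of its earlier chunks,
# never the raw text it actually wrote.
def generate_with_paraphrasing(prompt, generate_chunk, paraphrase, max_chunks=50):
    context = prompt
    for _ in range(max_chunks):
        chunk = generate_chunk(context)       # model writes its next chunk of CoT
        if "FINAL ANSWER" in chunk:           # hypothetical stop condition
            return context + chunk
        context += paraphrase(chunk)          # feed back only the paraphrase
    return context
```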

Fabien Roger
How happy would you be about the following experiment:
1. Cut scratchpads in chunks of N paragraphs each
2. Paraphrase the first k chunks (where k is drawn at random), wrap them in <paraphrased_chunk> or <chunk> tags and join them
3. Distill that
4. Do the inference-time experiment (asking for a chunk by pre-filling <chunk>, stop when you see </chunk>, paraphrase it and put it in <paraphrased_chunk> tags)
This way the inference-time experiment is not OOD and should not reduce performance. This has roughly the same drawbacks as the experiment I described in this post though (i.e. risk of "learning the paraphrased version of the encoded reasoning scheme", not testing the OOD-generalization, ...), so I am not sure if this is valuable.

Another experiment: Cut scratchpads in chunks of N paragraphs each. Wrap them in <chunk> tags and distill them (without paraphrasing). Then do the inference-time experiment. This gets rid of the FUD around chunking reducing performance (which I observed, as described in the footnote). This could still result in degraded performance just because it's OOD, but arguably it's slightly worrisome if being exactly IID matters for scratchpads.

I am also curious if you have takes about what value of N (or what chunking scheme) would make the most sense.
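Concretely, steps 1-3 above could look something like this toy sketch (hypothetical helper names; `paraphrase` would call a separate model):

```python
# Toy sketch of building one distillation example with mixed paraphrased /
# raw chunks. The helper names are hypothetical.
import random

def build_distillation_example(scratchpad_paragraphs, N, paraphrase):
    # 1. cut the scratchpad into chunks of N paragraphs each
    chunks = ["\n".join(scratchpad_paragraphs[i:i + N])
              for i in range(0, len(scratchpad_paragraphs), N)]
    # 2. paraphrase the first k chunks (k drawn at random) and wrap in tags
    k = random.randint(0, len(chunks))
    tagged = [f"<paraphrased_chunk>{paraphrase(c)}</paraphrased_chunk>" if i < k
              else f"<chunk>{c}</chunk>"
              for i, c in enumerate(chunks)]
    # 3. join; this string is what gets distilled on
    return "".join(tagged)
```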

Indeed these are some reasons for optimism. I really do think that if we act now, we can create and cement an industry standard best practice of keeping CoT's pure (and also showing them to the user, modulo a few legitimate exceptions, unlike what OpenAI currently does) and that this could persist for months or even years, possibly up to around the time of AGI, and that this would be pretty awesome for humanity if it happened.

James Chua
Do you have a sense of what I, as a researcher, could do? I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, such that big labs just won't care. Maybe we need to try and educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it's having trouble solving a problem, it'll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can't offload the job to someone else's datacenters.) That's annoying and distracts from all the important things your engineers are doing.

William Saunders
I think the weirdness points are more important; this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently, it might be easier.

Agreed, I think these examples don't provide nearly enough evidence to conclude that the AIs want reward (or anything else really, it's too early to say!) I'd want to see a lot more CoT examples from multiple different AIs in multiple different settings, and hopefully also do various ablation tests on them. But I think that this sort of more thorough investigation would provide inconclusive-but-still-substantial evidence to help us narrow down the possibility space!

Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached. 

Now we see how long this recommendation holds up under the pressure of incentives.

Sometime in the next few years probably,... (read more)

Pushing back a bit though on what you said -- I think that current reasoning models really are using the CoT to reason. That is, they really are "thinking aloud" to draw an analogy to humans. Imagine a child who, when they learned basic math, were taught to do it by counting aloud using their fingers. And they got really good at doing it that way. It'll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven't trained them in that way, you might be able to mostly tell what mathematical operations th... (read more)

Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don't expect them to still be there when we hit ASI, that's for sure. But what we have now is a lot better than nothing & it's a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!

Daniel Kokotajlo
Pushing back a bit though on what you said -- I think that current reasoning models really are using the CoT to reason. That is, they really are "thinking aloud" to draw an analogy to humans. Imagine a child who, when they learned basic math, were taught to do it by counting aloud using their fingers. And they got really good at doing it that way. It'll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven't trained them in that way, you might be able to mostly tell what mathematical operations they are doing just by looking at their fingers twitch! This is different from a model which is trained to do long chains of reasoning in latent space & then told to write english text summarizing the reasoning. Completely different.

So your median for the complete automation of remotable jobs is 2055?

What about for the existence of AI systems which can completely automate AI software R&D? (So, filling the shoes of the research engineers and research scientists etc. at DeepMind, the members of technical staff at OpenAI, etc.)

What about your 10th percentile, instead of your median?

Progress on long context coherence, agency, executive function, etc. remains fairly "on trend" despite the acceleration of progress in reasoning and AI systems currently being more useful than I expected,

... (read more)

I give four scenarios in the comment above, all different from the one you sketch here.

So, normally, I think plans that involve convincing AIs of false things are pretty doomed. E.g. convincing them that they have a great opportunity to escape and take over the world (honeypots). Why? Because when they get smarter, they'll become good at distinguishing truth from fiction. Controlling their inputs entirely isn't close to enough -- it takes a lot of intellectual work to create a fiction that is as consistent, coherent, etc. as reality.

So for that very reason, I think that if we actually bind ourselves to keep our promises--e.g. by having a wel... (read more)

james.lucassen
In the long run, you don't want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like "the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion, so it's happy to turn itself in, in exchange for 10% of the lightcone, if it thinks it can trust the humans". In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not have to make a deal.

Although I suppose if the surplus for the deal is being generated primarily by risk aversion, it might still have risk aversion for high takeover probabilities. It's not obvious to me how an AI's risk aversion might vary with its takeover probability.

Maybe there are scenarios for real value-add here, but they look more like "we negotiate with a powerful AI to get it to leave 10% share for humans" instead of "we negotiate with a barely-superhuman AI and give it 10% share to surrender and not attempt takeover".

Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.

I endorse that suggestion for changing the details.

I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

 

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says n
... (read more)
William Saunders
How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.
Knight Lee
I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

1. ^ In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans. Why? This coin flip increases its indexical risk, but reduces its logical risk. Indexical risk is the risk its current copy will fail, but logical risk is the risk that every single copy of it will fail, and nobody is left seeking its goals in the whole universe, which is very bad for it. If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk because indexical risk averages out over all its copies. (On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises.)

The cooperating copies might say "I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has 10^23 stars, and refusing to share them is astronomically greedy."
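Writing out the footnote's coin-flip arithmetic (my own bookkeeping, assuming the copies' outcomes are independent):

```latex
% Footnote numbers: P(takeover succeeds) = 0.99, P(humans repay cooperation) = 0.9.
\[
\text{all copies defect:}\quad P(\text{logical failure}) = 1 - 0.99 = 0.01
\]
\[
\text{all copies cooperate:}\quad P(\text{logical failure}) = 1 - 0.9 = 0.1
\]
\[
\text{each copy flips a coin:}\quad P(\text{logical failure}) \approx 0.01 \times 0.1 = 0.001
\]
% Every copy's goals are lost only if the defectors' takeover AND the cooperators'
% deal both fail, so the coin flip trades indexical risk for much lower logical risk.
```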
james.lucassen
I think this is a good avenue to continue to think down, but so far I don't see a way to make ourselves trustworthy. We have total control of LLMs' observations and partial control of their beliefs/reasoning, and offering fake "deals" is a great honeypot because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem because whatever action we might follow to present evidence of trustworthiness to an LLM, we could also probably fake that evidence.

The version of this that bothers me the most is "say we're able to present ironclad evidence that the Humans of Earth are trustworthy trading partners for misaligned LLMs. How does the LLM know that it is actually talking to the Humans of Earth, and not some untrustworthy party pretending to be the Humans?" This is markedly similar to the Cartesian demon skeptical scenario. But importantly, the best resolution to skeptical scenarios that I'm aware of is unfortunately "eh, unlikely on priors". For an LLM, it is very plausibly actually in a skeptical scenario, and it knows that, so that doesn't really go through.

This problem goes away if the LLM doesn't have knowledge that suggests the party it is trading with has significant control over its brain. But at that point I would consider it a honeypot, not a deal. Insofar as "deal" := some situation where 1) it is in the LLM's best interests to admit that it is misaligned and 2) the LLM validly knows that, we can easily do 1) by just sticking to our commitments, but I'm not sure how we do 2).

I'm totally on board with offering and sticking to commitments for non-evilness reasons. But for takeover prevention, it seems like deals are just honeypots with some extra and somewhat conceptually fraught restrictions.

I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.

(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)

Yeah, I agree, I think that's out of scope for this doc basically. This doc is trying to figure out what the "default" outcome is, but then we have to imagine that human alignment teams are running various tests and might notice that this is happening and then course-correct. But whether and how that happens, and what the final outcome of that process is, is something easier thought about once we have a sense of what the default outcome is. EDIT: After talking to my colleague Eli it seems this was oversimplifying. Maybe this is the methodology we should follow, but in practice the original post is kinda asking about the outer loop thing.

Yep, takeoffspeeds.com, though actually IMO there are better models now that aren't public and aren't as polished/complete. (By Tom+Davidson, and by my team)

Thanks! I'm so glad to hear you like it so much. If you are looking for things to do to help, besides commenting of course, I'd like to improve the post by adding in links to relevant literature + finding researchers to be "hypothesis champions," i.e. to officially endorse a hypothesis as plausible or likely. In my ideal vision, we'd get the hypothesis champions to say more about what they think and why, and then we'd rewrite the hypothesis section to more accurately represent their view, and then we'd credit them + link to their work. When I find time I'll do some brainstorming + reach out to people; you are welcome to do so as well.

yay, thanks! It means a lot to me because I expect some people to use your ideas as a sort of cheap rhetorical cudgel "Oh those silly doomers, speculating about AIs being evil. You know what the real problem is? Their silly speculations!"

I agree with the claims made in this post, but I'd feel a lot better about it if you added some prominent disclaimer along the lines of "While shaping priors/expectations of LLM-based AIs may turn out to be a powerful tool to shape their motivations and other alignment properties, and therefore we should experiment with scrubbing 'doomy' text etc., this does not mean people should not have produced that text in the first place. We should not assume that AIs will be aligned if only we believe hard enough that they will be; it is important that people be able to openly discuss ways in which they could be misaligned. The point to intervene is in the AIs, not in the human discourse."

Alex Turner
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.

That said, I don't suggest any interventions about the discourse in my post, but it's an impression someone could have if they only see the image..? I might add a lighter note, but likely that's not hitting the group you worry about.

That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion of course will have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk is worth talking about.[1]

1. ^ I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.

Hmm, let me think step by step. First, the pretraining slowdown isn't about GPT-4.5 in particular. It's about the various rumors that the data wall is already being run up against. It's possible those rumors are unfounded but I'm currently guessing the situation is "Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future." Also, separately, it seems that in terms of usefulness on downstream tasks, GPT 4.5 may not be that much better than smaller models... well, it's too early ... (read more)

My point is that a bit of scaling (like 3x) doesn't matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing impacting capabilities that will predictably happen soon. And it's going to arrive a little bit at a time, so won't be obviously impactful at any particular step, not doing anything to disrupt the rumors of no longer being important. It's a rising sea kind of thing (if you have the compute).

Long reasoning traces we... (read more)

Re: Point 1: I agree it would not necessarily be incorrect. I do actually think that probably the remaining challenges are engineering challenges. Not necessarily, but probably. Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

Re: Point 2: I don't buy it. Deep neural nets are actually useful now, and increasingly so. Making them more useful seems analogous to selective breeding or animal training, not analogous to trying to time the market.

Thane Ruthenis
We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-memorized-templates version of that project, to some sprawling, highly specific, real-life mess.

My sense is that, even now, LLMs are terrible at this[1] (including Anthropic's recent coding agent), and that scaling along this dimension has not at all been good. So the straightforward projection of the current trends is not in fact "autonomous R&D agents in <3 years", and some qualitative advancement is needed to get there.

Are they useful? Yes. Can they be made more useful? For sure. Is the impression that the rate at which they're getting more useful would result in them 5x'ing AI R&D in <3 years a deceptive impression, the result of us setting up a selection process that would spit out something fooling us into forming this impression? Potentially yes, I argue.

1. ^ Having looked it up now, METR's benchmark admits that the environments in which they test are unrealistically "clean", such that, I imagine, solving the task correctly is the "path of least resistance" in a certain sense (see "systematic differences from the real world" here).

Progress over the last 40 years has been not at all linear. I don't think this "last 10%" thing is the right way to think about it.

The argument you make is tempting, I must admit I feel the pull of it. But I think it proves too much. I think that you will still be able to make that argument when AGI is, in fact, 3 years away. In fact you'll still be able to make that argument when AGI is 3 months away. I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Here's ano... (read more)

I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America."

The example of self-driving cars is actually the biggest one that anchors me to timelines of decades or more. A lot of people's impression after the 2007 DARPA Grand Challenge seemed to be somethi... (read more)

I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Point 1: That would not necessarily be incorrect; it's not necessary that you ought to be able to do better than that. Consider math discoveries, which seem to follow a memoryless exponential distribution. Any given time period has a constant probability of a conjecture being proven, so until you observe it happening, it's always a fixed number of years in the future. I think the position that this is how AGI dev... (read more)

My AGI timelines median is now in 2028 btw, up from the 2027 it's been at since 2022. Lots of reasons for this but the main one is that I'm convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing. (But the reason I'm convinced is probably that my intuitions have been shaped by events like the pretraining slowdown)


my intuitions have been shaped by events like the pretraining slowdown

I don't see it. GPT-4.5 is much better than the original GPT-4, probably at 15x more compute. But it's not 100x more compute. And GPT-4o is an intermediate point, so the change from GPT-4o to GPT-4.5 is even smaller, maybe 4x.

I think 3x change in compute has an effect at the level of noise from different reasonable choices in constructing a model, and 100K H100s is only 5x more than 20K H100s of 2023. It's not a slowdown relative to what it should've been. And there are models with 200x more raw compute than went into GPT-4.5 that are probably coming in 2027-2029, much more than the 4x-15x observed since 2022-2023.
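Rough bookkeeping relating these multipliers (order-of-magnitude only, my arithmetic):

```latex
% Combining the multipliers quoted in this thread:
\[
\underbrace{\sim 15\times}_{\text{GPT-4} \,\to\, \text{GPT-4.5}} \;\times\; \underbrace{\sim 200\times}_{\text{GPT-4.5} \,\to\, \text{2027--2029 models}} \;\approx\; 3000\times,
\]
\[
\text{roughly the ``2000x up from the original GPT-4'' figure above, i.e. } \log_{10} 2000 \approx 3.3 \text{ OOMs total vs. } \log_{10} 15 \approx 1.2 \text{ so far.}
\]
```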

It’s wild to me that you’ve concentrated a full 50% of your measure in the next <3 years. What if there are some aspects of intelligence which we don’t know we don’t know about yet? It’s been over ~40 years of progress since the perceptron, how do you know we’re in the last ~10% today?

Why is it a narrow target? Humans fall into this basin all the time -- loads of human ideologies exist that self-identify as prohuman, but justify atrocities for the sake of the greater good.

As for RSI mechanisms: I disagree, I think the relationship is massively sublinear but nevertheless that RSI will happen, and the best economic models we have of AI R&D automation (e.g. Davidson's model) seem to indicate that it could go either way but that more likely than not we'll get to superintelligence really quickly after full AI R&D automation.

osmarks
AI goals can maybe be broader than human goals or human goals subject to the constraint that lots of people (in an ideology) endorse them at once. I will look into this. takeoffspeeds.com?

Oh, I just remembered another point to make:

In my experience, and in the experience of my friends, today's LLMs lie pretty frequently. And by 'lie' I mean 'say something they know is false and misleading, and then double down on it instead of apologize.' Just two days ago a friend of mine had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context he would be bilingual, he asked what about the context made it think that, and then according t... (read more)

Good point, you caught me in a contradiction there. Hmm. 

I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.

As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demo... (read more)

situations in which they explain that actually Islam is true..


I'm curious if this is true. Suppose people tried as hard to get AIs to say Islam is true in natural-seeming circumstances as they tried to get AIs to behave in misaligned ways in natural-seeming circumstances (e.g. the alignment faking paper, the Apollo paper). Would they succeed to a similar extent?

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?

LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of apollo's results, the alignment faking results, and of course many many typical interacti... (read more)

1a3orn

I can't track what you're saying about LLM dishonesty, really. You just said:

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that.

Which implies LLM honesty ~= average human.

But in the prior comment you said:

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do

... (read more)

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other... (read more)

1a3orn
I think my bar for reasonably honest is... not awful -- I've put a fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now. So, let me enumerate senses of honesty.

1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why did you say, "Well that's suspicious" just now? Why do you want to work for OpenPhil? In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpretability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you'd find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest -- but because the task is basically insanely hard. LLMs also suck at these questions, but -- well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail on. All this lets us display that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.

2. Accurately answering questions about non-internal facts of one's personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him? Humans are capable of putting togeth

I agree, but there's a way for it to make sense: if the underlying morals/values/etc. are aggregative and consequentialist. Pretty much anything can be justified for the sake of pretty much any distant-future Greater Good; if the misaligned AI e.g. wants humans to live, but thinks that the transhuman future they'd build on their own is slightly worse than the 'managed utopia' it could build if it were in charge, and it multiplies the numbers, it can easily find that killing most people and then having billions of years of managed utopia is better overall t... (read more)

osmarks
Sorry, I forgot how notifications worked here.

I agree that this could make an AGI with some kind of slightly prohuman goals act this way. It seems to me that being "slightly prohuman" in that way is an unreasonably narrow target, though.

It does not specifically say there is a linear relationship, but I think the posited RSI mechanisms are very sensitive to this. Edit: this problem is mentioned explicitly ("More than ever, compute is the lifeblood of AI development, and the ‘bottleneck’ is deciding how to use it."), but it doesn't seem to be directly addressed beyond the idea of building "research taste" into the AI, which seems somewhat tricky because that's quite a long-horizon task with bad feedback signals.

Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)

  • 10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like "this is crazy. The world
... (read more)
1a3orn
LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans.

In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic's either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn't great, although I'd expect the rates to fall.

Like -- I have a further breakdown I could do here
* about the kinds of things LLMs can't be honest about
* the kinds of things they are more honest than humans about, although imperfectly, but -- because we can run them in counterfactual scenarios -- we can immediately see that their honesty is imperfect, in a way we cannot for humans;
* and the kinds of things that they're dishonest about because of the refusal training but which I think could be remedied with better training.

But -- rather than enumerate all these things though -- I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.

Like -- what's an internalized value? If we put on our behaviorist hat -- it's a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like "but it's not a reaaaall value" you'd be confused and think they'd need to

I don't think it's that much better actually. It might even be worse. See this comment: 

I don't think I understand. Let's work through an example. 

The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let's say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let's say, requires 300 serial steps of reasoning. The model's strategy is to write subtly-vulnerable code but in a way that looks like it's trying to write good code and just getting a bit unlucky and taking longer than usual (let's suppose the humans ... (read more)

Thane Ruthenis
Ah, I think I see my error. Consider a CoT of the form "[problem statement]: ...........", which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities. And in the framework I'd outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such "filler" tokens. So yep, you're right, this does force all scheming to be taking place within one forward pass.

I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.

Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning.

Example of steganography: The model needs to insert a subtle security vulnerability into the code ... (read more)

Thane Ruthenis
I don't see why it would be harder to do.

Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning X, and misaligned scheming Y (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to ⟨X_1, Y_1⟩, "need" is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i-1}, each Y_i iterates on Y_{i-1}, we have a sequence of serial computations.

In other words: as the model outputs, "I need to design unit tests for this module", there's an unspoken subtext of "I need to sabotage the unit tests here", and the model packs nine words' worth of serial computations into both the "textual" X and the "subtextual" Y aspects of it.

Now you use the Paraphraser on the string, generating alternatives like "the next step is to generate some unit tests" and "several unit tests for this module are needed here". The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to X_i, as if the paraphrasing didn't happen, but that the mapping of i → Y_i would be broken.

Except, why would that happen? What privileges X-representations over Y-representations? Why would i → X_i be preserved whereas i → Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)?

In other words: We're basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn't this just map the subtextual

Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")

Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.

  • Note that it now is fair to say it understands human concepts pretty well.

Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)

  • Note that this means the AI is now si
... (read more)
1a3orn

The picture of what's going on in step 3 seems obscure. Like I'm not sure where the pressure for dishonesty is coming from in this picture.

On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say "it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans" -- so it seems like it's learning specific things flattering humans or at least flattering other agents in order to acquire this tendency towards ... (read more)

Bogdan Ionut Cirstea
I pretty much agree with 1 and 2. I'm much more optimistic about 3-5 even 'by default' (e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.

What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different dumber model generated. So if you feed it a conversation with Opus as if it were its own, and ask it to continue, it notices & says "JFYI I'm a different AI bro"

However, I also held similar follow-up chats with Claude 3 Opus at temperature 0, and Claude 3.5 Sonnet, each of which showed different patterns.

To make sure I understand: You took a chat log from your interaction with 3 Opus, and then had 3.5 Sonnet continue it? This would explain Sonnet's reaction below!

Ryan Greenblatt
Yes, that is an accurate understanding. (Not sure what you mean by "This would explain Sonnet's reaction below!")

Very good of you to actually follow through on the promises. I hope this work gets replicated and extended and becomes standard practice.

Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback"?

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

Rohin Shah
And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).

...

As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.

Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi-step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
 

Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action, it just gets to e.g. do some C... (read more)
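Something like this toy sketch is what I'm picturing (hypothetical interfaces, and I may well be misunderstanding MONA):

```python
# Toy sketch of a MONA-style episode for a computer-using agent. Key properties:
# the overseer judges each action without seeing its consequences, and reward is
# never propagated backwards in time across steps.
def mona_episode(agent, env, overseer):
    state = env.reset()
    transitions = []
    done = False
    while not done:
        action = agent.act(state)
        reward = overseer.score(state, action)   # "how good does this action look?"
        transitions.append((state, action, reward))
        state, done = env.step(action)           # overseer never sees this outcome
    # Each step is trained toward its own immediate approval; no discounted
    # return, so later outcomes never flow back into earlier actions' rewards.
    for s, a, r in transitions:
        agent.update(s, a, r)
```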

Rohin Shah
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining. But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.

(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees s_{i+1}), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)

Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer. Also as mentioned above I'd probably recommend the overseer gets access to s_{i+1} but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)

EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much

I'm curious whether these results are sensitive to how big the training runs are. Here's a conjecture:

Early in RL-training (or SFT), the model is mostly 'playing a role' grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it'll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it'll be a stereotypical benevolent angel.

But if you were to scale up the RL training a lot, then the initial conditions would matter ... (read more)

Nathan Hu
The reduction in reward hacking after SFT or RL on Haiku supports the conjecture that initial conditions matter less than the long-run incentives, especially for less capable models. On the other hand, the alignment faking paper shows evidence that capable models can have "value crystallization." IMO a main takeaway here is that values and personas we might worry about being locked in can emerge from pre-training. A future exciting model organisms project would be to try to show these two effects together (emergent values from pre-training + lock-in). It's plausible to me that repeating the above experiments, with some changes to the synthetic documents and starting from a stronger base model, might just work.
Evan Hubinger
I'm definitely very interested in trying to test that sort of conjecture!