User Comment Replies — AI Alignment Forum

“Sharp Left Turn” discourse: An opinionated review

For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?

Hm, I wouldn't have phrased it that way. Point (2) says nothing about the probability of there being a "left turn", just the speed at which it would happen. When I hear "sharp left turn", I picture something getting out of control overnight, so it's useful to contextualize how much compute you have to put in to get performance out, since this suggests that (... (read more)

3Steve Byrnes25d

Thanks! I still feel like you’re missing my point, let me try again, thanks for being my guinea pig as I try to get to the bottom of it. :) In terms of the “genome = ML code” analogy (§3.1), humans today have the same compute as humans 100,000 years ago. But humans today have dramatically more capabilities—we have invented the scientific method and math and biology and nuclear weapons and condoms and Fortnite and so on, and we did all that, all by ourselves, autonomously, from scratch. There was no intelligent external non-human entity who was providing humans with bigger brains or new training data or new training setups or new inference setups or anything else. If you look at AI today, it’s very different from that. LLMs today work better than LLMs from six months ago, but only because there was an intelligent external entity, namely humans, who was providing the LLM with more layers, new training data, new training setups, new inference setups, etc. …And if you’re now thinking “ohhh, OK, Steve is just talking about AI doing AI research, like recursive self-improvement, yeah duh, I already mentioned that in my first comment” … then you’re still misunderstanding me! Again, think of the “genome = ML code” analogy (§3.1). In that analogy, * “AIs building better AIs by doing the exact same kinds of stuff that human researchers are doing today to build better AIs” * …would be analogous to… * “Early humans creating more intelligent descendants by doing biotech or selective breeding or experimentally-optimized child-rearing or whatever”. But humans didn’t do that. We still have basically the same brains as our ancestors 100,000 years ago. And yet humans were still able to dramatically autonomously improve their capabilities, compared to 100,000 years ago. We were making stone tools back then, we’re making nuclear weapons now. Thus, autonomous learning is a different axis of AI capabilities improvement. It’s unrelated to scaling, and it’s unrelated to “automa

“Sharp Left Turn” discourse: An opinionated review

Joel Burget1mo6-3

I like this post but I think it misses / barely covers two of the most important cases for optimism.

1. Detail of specification

Frontier LLMs have a very good understanding of humans, and seem to model them as well as or even better than other humans. I recall seeing repeated reports of Claude understanding its interlocutor faster than they thought was possible, as if it just "gets" them, e.g. from one Reddit thread I quickly found:

"sometimes, when i’m tired, i type some lousy prompts, full of typos, incomplete info etc, but Claude still gets me, on a deep f

... (read more)

5Steve Byrnes1mo

For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right? I’m not really sure how to respond to that … I feel like you’re disagreeing with one of the main arguments of this post without engaging it. Umm, see §1. One key part is §1.5: …And then §3.7: This post is agnostic over whether the sharp left turn will be a big algorithmic advance (akin to switching from MuZero to LLMs, for example), versus a smaller training setup change (akin to o1 using RL in a different way than previous LLMs, for example). [I have opinions, but they’re out-of-scope.] A third option is “just scaling the popular LLM training techniques that are already in widespread use as of this writing”, but I don’t personally see how that option would lead to the (1-3) triad, for reasons in the excerpt above. (This is related to my expectation that LLM training techniques in widespread use as of this writing will not scale to AGI … which should not be a crazy hypothesis, given that LLM training techniques were different as recently as ≈6 months ago!) But even if you disagree, it still doesn’t really matter for this post. I’m focusing on the existence of the sharp left turn and its consequences, not what future programmers will do to precipitate it. ~~ For (1), I did mention that we can hope to do better than Ev (see §5.1.3), but I still feel like you didn’t even understand the major concern that I was trying to bring up in this post. Excerpting again: Again, the big claim of this post is that the sharp left turn has not happened yet. We can and should argue about whether we should feel optimistic or pessimistic about those “wrenching distribution shifts”, but those arguments are as yet untested, i.e. they cannot be resolved by observing today’s pre-sharp-left-turn LLMs. See what I mean?

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Joel Burget1y42

How would you distinguish between weak and strong methods?

Abram Demski1y50

"Weak methods" means confidence is achieved more empirically, so there's always a question of how well the results will generalize for some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). "Strong methods" means there's a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what's happening, so there is much less doubt about what systems the method will successfully apply to.

Shard Theory: An Overview

Joel Burget3y20

Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.

This seems to be pointing in an interesting direction that I'd like to see expanded.

Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you we... (read more)

1David Udell3y

Say that the triggers for pleasure are hardwired. After a pleasurable event, how do only those computations running in the brain that led to pleasure (and not those randomly running computations) get strengthened? After all, the pleasure circuit is hardwired, and can't reason causally about what thoughts led to what outcomes. (I'm not currently confident that pleasure is exactly the same thing as reinforcement, but the two are probably closely related, and pleasure is a nice and concrete thing to discuss.) Nothing except those shards fighting for their own interests and succeeding to some extent. You probably have many contending values that you hang on to now, and would even be pretty careful with write access to your own values, for instrumental convergence reasons. If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?

A central AI alignment problem: capabilities generalization, and the sharp left turn

Joel Burget3y*7-3

Two points:

The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome. The true target is approximately infinitely small and attempts to hit it slide off as optimization pressure is applied. I'm curious if other share this model and if it's been refined / explored in more detail by others.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bi

... (read more)

Rob Bensinger3y55

I kind of like the analogous idea of an alignment target as a repeller cone / dome.

Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.

AI ALIGNMENT FORUM
AF

All of Joel Burget's Comments + Replies

1. Detail of specification