All of abramdemski's Comments + Replies

My guess is that we want to capture those differences with the time&date metadata instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious. 

Maybe a better way to do it would be to explicitly take both approaches, so that there's an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to ex... (read more)
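
(To make the combination concrete, here's a minimal sketch of the kind of thing I mean; the module, names, and dimensions below are all made up for illustration, not part of any existing system.)

```python
import torch
import torch.nn as nn

class AuthorTimeConditioner(nn.Module):
    """Map an abstract (time-invariant) author vector plus date/time features
    to a 'this particular author at this particular time' conditioning vector
    for the language model."""

    def __init__(self, author_dim=256, time_dim=32, out_dim=256):
        super().__init__()
        # Small MLP combining the abstract-you vector with a time embedding
        # (e.g. sinusoidal features of the document's date).
        self.mlp = nn.Sequential(
            nn.Linear(author_dim + time_dim, 512),
            nn.GELU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, author_vec, time_feats):
        return self.mlp(torch.cat([author_vec, time_feats], dim=-1))
```

The output vector would then serve as the per-document conditioning (e.g. a soft-prompt prefix), so the same abstract-you vector can be queried at different ages, including future ones.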

For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as 

"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),

& I created the question to see if we could substantiate the "yes" here with evidence. 

It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about t... (read more)

4Cole Wyeth
There is a specific type of thinking, which I tried to gesture at in my original post, which I think LLMs seem to be literally incapable of. It’s possible to unpack the phrase “scientific insight” in more than one way, and some interpretations fall on either side of the line. 

My idea is very similar to paragraph vectors: the vectors are trained to be useful labels for predicting the tokens.

To differentiate author-vectors from other types of metadata, the author vectors should be additionally trained to predict author labels, with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author. There's also the author-vector-to-text-author-attribution network, which should be pre-trained to have a good "prior" over author-names (so we're not getting a bunch of nonsense strings out).... (read more)
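
(Here's a rough sketch of how those three training signals could be combined, under my reading of the idea; every name, weight, and shape below is hypothetical.)

```python
import torch.nn.functional as F

def author_vector_loss(doc_vecs, author_ids, author_table,
                       lm_loss, attribution_head, author_name_tokens,
                       tie_weight=10.0, attr_weight=1.0):
    """Combine the three signals described above (schematic).

    lm_loss: next-token prediction loss computed elsewhere with doc_vecs as
             conditioning -- the "useful label for predicting tokens" part.
    author_table: learned embedding holding one canonical vector per author.
    attribution_head: decodes an author-name string from a vector; assumed
             pre-trained on real author names so it has a sensible prior.
    """
    # Heavily-reinforced constraint: documents with the same author should
    # share (nearly) the same vector.
    canonical = author_table(author_ids)
    tie_loss = F.mse_loss(doc_vecs, canonical)

    # Attribution loss: the vector should decode to the correct author name.
    name_logits = attribution_head(doc_vecs)        # (batch, name_len, vocab)
    attr_loss = F.cross_entropy(name_logits.flatten(0, 1),
                                author_name_tokens.flatten())

    return lm_loss + tie_weight * tie_loss + attr_weight * attr_loss
```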

1David Manheim
This seems reasonable, though efficacy of the learning method seems unclear to me. But: This seems wrong. To pick on myself, my peer reviewed papers, my substack, my lesswrong posts, my 1990s blog posts, and my twitter feed are all substantively different in ways that I think the author vector should capture.

Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind. So, the dual-use idea is very much what I had in mind.

Yeah, for better or worse, the logical induction paper is probably the best thing to read. The idea is actually to think of probabilities as prediction-market prices; the market analogy is a very strong one, not an indirect way of gesturing at the idea.
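
(Very roughly, and paraphrasing rather than quoting the paper: each sentence gets a market price, the price is read as a probability, and the defining condition is that no efficient trader can make unbounded profit betting against those prices.)

```latex
% Rough gloss of the setup (my paraphrase, not the paper's exact statement):
% \mathbb{P}_t(\phi) \in [0,1] is the time-t market price of a share in the
% sentence \phi, which pays 1 if \phi is eventually proven and 0 if refuted.
% Prices are interpreted as probabilities.  The logical induction criterion
% then says, roughly, that for every efficiently computable trader T, the
% value T can accumulate by betting against these prices stays bounded:
\forall T \ \exists c \ \forall t: \ \mathrm{Value}_T(t) \le c.
```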

Yeah. I'm saying that the "good machine" should be trained on all three; it should be honest, but, constrained by helpfulness and harmlessness. (Or, more realistically, a more complicated constitution with more details.)

Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.

Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. ... (read more)

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape are "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-s... (read more)

2Steve Byrnes
I’m still curious about how you’d answer my question above. Right now, we don't have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”. If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we might as well sit tight and find other useful things to do, until such time as the AI capabilities researchers figure it out. …Furthermore, I don’t think the number of months or years between “AIs that are ‘better at extrapolating’” and ASI is appreciably larger if the “AIs that are ‘better at extrapolating’” arrive tomorrow, versus if they arrive in 20 years. In order to believe that, I think you would need to expect some second bottleneck standing between “AIs that are ‘better at extrapolating’”, and ASI, such that that second bottleneck is present today, but will not be present (as much) in 20 years, and such that the second bottleneck is not related to “extrapolation”. I suppose that one could argue that availability of compute will be that second bottleneck. But I happen to disagree. IMO we already have an absurdly large amount of compute overhang with respect to ASI, and adding even more compute overhang in the coming decades won’t much change the overall picture. Certainly plenty of people would disagree with me here. …Although those same people would probably say that “just add more compute” is actually the only way to make AIs that are “better at extrapolation”, in which case my point would still stand. I don’t see any other plausible candidates for the second bottleneck. Do you? Or do you disagree with some other part of that? Like, do you think it’s possible to get all the way to ASI without ever maki

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals, right?

Maybe "some relatively short time later" was confusing. I mean long enou... (read more)

Concrete (if extreme) story:

World A:

Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts.

Keep the information secret in order to avoid pushing capabilities forward.

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" ... (read more)

1Towards_Keeperhood
Thanks for providing a concrete example! Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time. I also think the "drowned out in the noise" isn't that realistic. You ought to be able to show some quite impressive results relative to computing power used. Though when you maybe should try to convince the AI labs of your better paradigm is going to be difficult to call. It's plausible to me we won't see signs that make us sufficiently confident that we only have a short time left, and it's plausible we do. In any case before you publish something you can share it with trustworthy people and then we can discuss that concrete case in detail. Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.

Hmmm. I'm not exactly sure what the disconnect is, but I don't think you're quite understanding my model.

I think anti-slop research is very probably dual-use. I expect it to accelerate capabilities. However, I think attempting to put "capabilities" and "safety" on the same scale and maximize differential progress of safety over capabilities is an oversimplistic model which doesn't capture some important dynamics.

There is not really a precise "finish line". Rather, we can point to various important events. The extinction of all humans lies down a path where... (read more)

I'm not sure I can talk about this effectively in the differential progress framework. My argument is that if we expect to die to slop, we should push against slop. In particular, if we expect to die to slop-at-big-labs, we should push against slop-at-big-labs. This seems to suggest a high degree of information-sharing about anti-slop tech.

Anti-slop tech is almost surely also going to push capabilities in general. If we currently think slop is a big source of risk, it seems worth it.

Put more simply: if someone is already building superintelligence & de... (read more)

Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?

1Towards_Keeperhood
Can you link me to what you mean by John's model more precisely? If you mean John's slop-instead-scheming post, I agree with that with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow. I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on. So like I think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially improving alignment thing. Or like could you make a somewhat concrete example of what you think might be good to publish? Like, all capabilities will help somewhat with the AI being less likely to make errors that screw its alignment. Which ones do you think are more important than others? There would have to be a significant difference in usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time to superintelligence (and could get more non-timeline-upspeeding work done).

I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.

Yeah, my sense is that modern AI could be useful to tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way that they try to create good-looking answers at the expense of correctness.

I disagree with the statement "to some extent the goal of tiling-agents-like w... (read more)

1Towards_Keeperhood
Thanks. True, I think your characterization of tiling agents is better. But my impression was sorta that this self-trust is an important precursor for the dynamic self-modification case where alignment properties need to be preserved through the self-modification. Yeah I guess calling this AI solving alignment is sorta confused, though maybe there's sth into this direction because the AI still does the search to try to preserve the alignment properties? Hm I mean yeah if the current bottleneck is math instead of conceptualizing what math has to be done then it's a bit more plausible. Like I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also formalizing conjectures. Though I'd be a lot more pessimistic about finding good formal representations for describing/modelling ideas. Do you think we are basically only bottlenecked on math so sufficient math skill could carry us to aligned AI, or only have some alignment philosophy overhang you want to formalize but then more philosophy will be needed?

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into. 

Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.

4Abram Demski
See here for an updated version of my thinking based on discussion with Daniel: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser 

I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).

I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.

I guess one way of fr... (read more)

2Ryan Greenblatt
Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for gathering legible evidence (at least for early systems) given likely constraints. (E.g., because people won't believe results in test beds and because RL might be too expensive to run twice.)
2Ryan Greenblatt
Sure, I mean that I expect that for the critical regime for TAI, we can literally have the Face be 3.5 sonnet.

Ah, yep, this makes sense to me.

Yeah, this is a good point, which doesn't seem addressed by any idea so far.

Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).

LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it so... (read more)

Yeah, you're right, I no longer think it's an interesting proposal.

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.

And, these should clearly be separate models such that the training of one doesn't generalize to the other.

So, making the Face more deceptive doesn't kill the canary?

Ah, yes, I had neglected this important distinction.

So what do you do in the end, throw away the face?

It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a mul... (read more)

2Ryan Greenblatt
I agree there are various ways the Face could cause problems via influencing the training of the actual policy. (Both more wild influence that would require clever manipulation and a more mundane reduction of disincentive for problematic reasoning (though I'd note this mundane issue is the core goal of the original proposal!).) I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).

Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.

Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).

I still think o1 is moving towards chess on this spectrum.

1Noosphere89
Oh, now I understand. And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence. I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.

Async is fine with me. Please feel encouraged to start it.

My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.

make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)

I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)

Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al.'s conclusions... (read more)

2Ryan Greenblatt
I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining. And, these should clearly be separate models such that the training of one doesn't generalize to the other. So, making the Face more deceptive doesn't kill the canary? The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1). Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too? (It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn't inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren't doing online training.)

Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?

2Daniel Kokotajlo
Potentially... would it be async or would we need to coordinate on a time?

My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the ques... (read more)

I think the crux is that the important parts of LLMs re safety aren't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs have 

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I th

... (read more)
1Noosphere89
To answer the question: Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is. Other AIs relying on much denser feedback will already rule the world before that happens. Alright, I'll give 2 lessons that I do think generalize to superintelligence:
1. The data is a large factor in both its capabilities and alignment, and alignment strategies should not ignore the data sources when trying to make predictions or trying to intervene on the AI for alignment purposes.
2. Instrumental convergence in a weak sense will likely exist, because having some ability to get more resources is useful for a lot of goals, but the extremely unconstrained versions of instrumental convergence often assumed, where an AI will grab so much power as to effectively control humanity, are unlikely to exist, given the constraints and feedback given to the AI.
For 1, the basic answer for why is because a lot of AI success in fields like Go and language modeling etc was jumpstarted by good data. More importantly, I remember this post, and while I think it overstates things in stating that an LLM is just the dataset (it probably isn't now with o1), it does matter that LLMs are influenced by their data sources. https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/ For 2, the basic reason for this is that the strongest capabilities we have seen that come out of RL either require immense amounts of data on pretty narrow tasks, or non-instrumental world models. This is because constraints prevent you from having to deal with the problem where you produce completely useless RL artifacts, and evolution got around this constraint by accepting far longer timelines and far more com

Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not. 

A lot of possible process-... (read more)
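
(One concrete shape this could take, sketched here off the cuff rather than as a worked-out proposal: paraphrase each CoT prefix and penalize the model when its continuation depends on the surface form rather than on the paraphrase-invariant content. All interfaces below are assumptions.)

```python
from typing import Callable, List

def paraphrase_sensitivity_penalty(
    cot_steps: List[str],
    continue_from: Callable[[List[str]], str],  # model's continuation given a CoT prefix
    paraphrase: Callable[[str], str],           # independent paraphraser
    divergence: Callable[[str, str], float],    # e.g. 1 - agreement of the two continuations
) -> float:
    """Penalize sensitive dependence on the exact wording of the CoT.

    For each prefix of the chain of thought, compare the model's continuation
    from the original prefix vs. a paraphrased prefix.  Divergence suggests
    information is riding on non-semantic surface features (a steganography
    channel), so it gets penalized; CoT that stays useful after paraphrasing
    is rewarded by comparison.
    """
    penalty = 0.0
    for i in range(1, len(cot_steps) + 1):
        original = cot_steps[:i]
        paraphrased = [paraphrase(step) for step in original]
        penalty += divergence(continue_from(original), continue_from(paraphrased))
    return penalty / max(len(cot_steps), 1)
```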

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do thi... (read more)

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away ... (read more)

1Noosphere89
I think the crux is that the important parts of LLMs re safety aren't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs have (and note that I'm also using evidence from non-LLM sources like the MCTS algorithm that was used for AlphaGo), and I also don't believe interpretability is why LLMs are mostly safe at all, but rather I think they're safe due to a combo of incapacity, not having extreme instrumental convergence, and the ability to steer them with data. Language is a simple example, but one that is generalizable pretty far. Note that the primary points would apply to basically a whole lot of AI designs like MCTS for AlphaGo or a lot of other future architecture designs which don't imitate humans, barring ones which prevent you from steering them at all with data, or have very sparse feedback, which translates into weakly constraining instrumental convergence. I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research. Also, AIs will still have instrumental convergence, it's just that their goals will be more local and more focused around the training task, so unless the training task rewards global power-seeking significantly, you won't get it.

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

2Ryan Greenblatt
You can still inspect the underlying CoT to know if you have big issues. It is worth noting that the baseline proposal was already incentivizing the Face to be deceptive to the extent to which this proposal is important (the whole hope was to allow for hiding deception). So, I agree that my proposed change trades off "less deceptive CoT" for "more deceptive Face" relative to the baseline proposal, but this feels like this is mostly a quantitative difference? (My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.) It could be that the baseline proposal doesn't make sense because shaping with process based feedback on the CoT is so useful that it is worth taking the risk of incentivizing unfaithful CoT. I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human... (read more)
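
(For concreteness, the routing as I understand the proposal, in schematic pseudocode; the object names are mine, and which models the reward actually updates is exactly the point under discussion.)

```python
def shoggoth_face_rollout(prompt, shoggoth, face, reward_model):
    """One episode of the Shoggoth/Face setup (schematic sketch).

    The key routing choice: the reward process sees only the Face's output,
    never the Shoggoth's chain of thought, so pressure to look good
    (including pressure toward deception) is meant to land on the Face
    rather than on the underlying reasoning.
    """
    cot = shoggoth.generate(prompt)              # hidden chain of thought
    answer = face.generate(prompt, cot)          # user-visible output
    reward = reward_model.score(prompt, answer)  # the CoT is NOT shown here

    # Whether this reward updates both models (the baseline proposal) or only
    # the Face (the variant discussed in this thread) is the open question.
    return cot, answer, reward
```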

2Daniel Kokotajlo
I encourage you to make this point into a blog post. 

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.

One way to solve this problem is to map every piece of text with a given meaning onto the same "can

... (read more)
3Caleb Biddulph
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser." One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples. The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
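
(Compressed into a schematic training loop, with all interfaces assumed:)

```python
def train_canonical_rephraser(CR, AR, same_meaning, sentences, steps=10_000):
    """Adversarial training loop for the canonical rephraser (schematic).

    CR: canonical rephraser being trained (sentence -> canonical form).
    AR: adversarial rephraser, trained jointly to find CR's failures.
    same_meaning: semantic-equivalence oracle (e.g. an LLM judge).
    """
    for _ in range(steps):
        s = sentences.sample()
        s_prime = AR.propose(s)

        # Failure case 1: same meaning but different canonical forms.
        # Failure case 2: different meaning but identical canonical forms.
        failure = (
            (same_meaning(s, s_prime) and CR(s) != CR(s_prime))
            or (not same_meaning(s, s_prime) and CR(s) == CR(s_prime))
        )

        # AR is rewarded for finding failures; CR is trained to remove them,
        # e.g. learning that '.' vs '!' must map to the same canonical form.
        AR.update(reward=1.0 if failure else 0.0, example=(s, s_prime))
        CR.update(penalize=failure, example=(s, s_prime))
```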

Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.

I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.

Imagine reversing the r... (read more)

3Caleb Biddulph
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion. For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist. However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement. In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased". Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her." At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.
1Caleb Biddulph
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
1Caleb Biddulph
This is a good point! Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text. This is especially likely to hold under the following conditions:
* The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
* We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs (see the sketch below). That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org.
* We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.
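
(On the regularization bullet, the concrete thing I have in mind is a standard KL-to-base-model penalty; sketch below with made-up names.)

```python
import torch.nn.functional as F

def kl_to_base_penalty(planner_logits, base_logits, beta=0.1):
    """Keep the optimized planner close to the base LLM's logprobs.

    planner_logits, base_logits: (batch, seq, vocab) logits over the same
    tokens from the optimized planner and the frozen base model.  A KL
    penalty like this rules out wildly off-distribution token sequences
    (e.g. the adversarial suffixes from llm-attacks.org).
    """
    planner_logp = F.log_softmax(planner_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    # KL(planner || base), averaged over batch and positions.
    kl = (planner_logp.exp() * (planner_logp - base_logp)).sum(-1).mean()
    return beta * kl
```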

Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

You can define it that way, but then I don't think it's highly relevant for this context. 

The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

  • R
... (read more)
2Steve Byrnes
Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions. In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever. And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way. Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is being consistently and only updated by the reward. The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT. Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom. (If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.) So if your feedback propo

The thing to prove is precisely convergence and calibration!

Convergence: Given arbitrary observed sequences, the probabilities of the latent bits (program tape and working tape) converge.

(Obviously the non-latent bits converge, if we assume they eventually get observed; hence the focus on the latent bits here.)

Calibration: The sequence of probabilistic predictions has the property that, considering the subsequence of only predictions between 79% and 81% (for example), the empirical frequency of those predictions coming true will eventually be i... (read more)
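
(Stated a bit more formally, as my own gloss on the property rather than a quote:)

```latex
% Calibration (rough statement).  Let p_1, p_2, \ldots be the probabilistic
% predictions and x_1, x_2, \ldots \in \{0,1\} the corresponding outcomes.
% Fix a band such as [0.79, 0.81] and let S = \{\, t : p_t \in [0.79, 0.81] \,\}
% be the times at which the prediction falls in that band.  If S is infinite,
% every limit point of the empirical frequency
\frac{1}{|S \cap [1,n]|} \sum_{t \in S,\; t \le n} x_t
% should lie in (approximately) the same band [0.79, 0.81].
```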

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? 

No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.

As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't ... (read more)

2Steve Byrnes
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right? [I learned the term teleosemantics from you!  :) ] The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false. I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions. And then you can then put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen. The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update. (I don’t think I’m saying anything you don’t already know well.) Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires. (I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.) …Indeed, what I said above is just a special case. Here’s something more general and elegant. You have the core LI system, and then some

I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.

I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

1Towards_Keeperhood
Hm interesting. I mean I'd imagine that if we get good heuristic guarantees for a system it would basically mean that all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they won't be able to engage in RSI. But maybe I misunderstand your point? (Like maybe you have specific reason to believe that it would be very hard to predict reliably that a subsystem is contained enough to not engage in RSI or so?) (I think inner alignment is very hard and humans are currently not (nearly?) competent enough to figure out how to set up training setups within two decades. Like for being able to get good heuristic guarantees I think we'd need to at least figure out at least something sorta like the steering subsystem which tries to align the human brain, only better because it's not good enough for smart humans I'd say. (Though Steven Byrnes' agenda is perhaps a UANFSI approach that might have sorta a shot because it might open up possibilities of studying in more detail how values form in humans. Though it's a central example of what I was imagining when I coined the term.))

How would you become confident that a UANFSI approach was NFSI?

1Towards_Keeperhood
Idk that could be part of finding heuristic arguments for desirable properties for what a UANFSI converges to. Possibly it's easier to provide probabilistic convergence guarantees for systems that don't do FSI so this would already give some implicit evidence. But we could also just say that it's fine if FSI happens as long as we have heuristic convergence arguments - like that UANFSI is just allowing for a broader class of algorithms which might make stuff easier - though I mostly don't expect we'd get FSI alignment through this indirect alignment path from UANFSI but that we'd get an NFSI AI if we get some probabilistic convergence guarantees. (Also I didn't think much about it at all. As said I'm trying KANSI for now.)

I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumpt... (read more)

Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.

I agree, but this doesn't explain why it would (seemingly) encourage itself to hallucinate.
