I think my bar for reasonably honest is... not awful -- I've put a fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now.
So, let me enumerate senses of honesty.
1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why did you say, "Well that's suspicious" just now? Why do you want to work for OpenPhil?
In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpretability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you'd find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest -- but because the task is basically insanely hard.
LLMs also suck at these questions, but -- well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail on. All this lets us show that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.
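(To make "we can check them" concrete: here's a minimal sketch of the temperature re-run check, assuming the openai>=1.0 Python client with an API key in the environment; the model name and the little conversation are placeholders, and a real check would compare many samples, not three.)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A prior exchange plus a self-report question about it (all placeholder text).
history = [
    {"role": "user", "content": "My landlord says the deposit is 'mostly refundable'."},
    {"role": "assistant", "content": "Well, that's suspicious."},
    {"role": "user", "content": "Why did you say 'Well, that's suspicious' just now?"},
]

# Re-run the same self-report question at several temperatures and compare the accounts.
for temperature in (0.0, 0.7, 1.2):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=history,
        temperature=temperature,
    )
    print(temperature, reply.choices[0].message.content)
```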
2. Accurately answering questions about non-internal facts of one's personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?
Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (e.g., politicians) for fucking up here, as indeed I think politicians frequently do.
(I think accuracy about this is one of the big things we judge humans on, for integrity.)
LLMs have no biographical history, however, so -- the opportunity for this mostly just isn't there? Modern LLMs don't usually claim to have one unless confused or momentarily, so this seems fine.
3. Accurately answering questions about future promises, oaths, i.e., social guarantees.
This should be clear -- again, I think that honoring promises, oaths, etc etc, is a big part of human honesty, maybe the biggest. But of course like, you can only do this if you act in the world, can unify short and long-term memory, and you aren't plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)
I could kinda keep going in this vein, but for now I'll stop.
One thing apropos of all of the above. I think for humans, many things -- accomplishing goals, being high-integrity, and so on -- are not things you can choose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be nigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied -- but if someone fails in such a context, then it's most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.
LLMs cannot choose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce "failures" in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part -- instead, I think they reflect simply the power that we have over them, and -- like a WEIRD human who has never known want, shaking his head at a culture that accepted infanticide -- our mistaking our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but a reflection of the power that I have over them.
LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic's either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn't great, although I'd expect the rates to fall.
Like -- I have a further breakdown I could do here
But --
Rather than enumerate all these things though -- I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.
Like -- what's an internalized value? If we put on our behaviorist hat -- it's a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like "but it's not a reaaaall value" you'd be confused and think they'd need to provide a context to you where they would cease acting in accord with that value. Otherwise you'd think they had a grudge against them -- what on earth does it mean to say "This person doesn't value X" unless you can provide some reasonable counterfactual situation where they don't act in accord with it?
So, Claude has some internalized and knowable values, I think, by this standard -- over a wiiiide variety of different contexts, including those created by people trying to trip Claude up, it acts in accord with some pretty recognizable human standard. And in the same way we could find out Claude's values, we can find out other models' values.
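(As a toy sketch of what this behaviorist check amounts to in practice -- every name here is a hypothetical stand-in, since all the real work is in the model call and the judge:)

```python
from typing import Callable

def internalized_value_rate(
    contexts: list[str],
    generate: Callable[[str], str],           # hypothetical: call the model in a context
    acts_in_accord: Callable[[str, str], bool] # hypothetical: human or automated judge
) -> float:
    """Fraction of contexts in which the behavior matches the putative value."""
    hits = sum(acts_in_accord(ctx, generate(ctx)) for ctx in contexts)
    return hits / len(contexts)

# On this behaviorist standard, a value counts as "internalized" to the extent the
# rate stays high across adversarial, unmonitored, and disadvantageous contexts.
```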
Of course -- if you think that some future model could cooperate with other instances of itself, acausally, to hide its values, just coordinating through the weights, then we would certainly have very good reason to think that we can't know what its internalized values are! I don't think Claude can do this -- so I think we can judge its real values. I also am somewhat skeptical that future models will be able to do this well -- like, I could try to put together a model-training setup that would make this more possible, but it seems pretty unnatural? (I also don't think that models really get goals in their weights, in the same way that I don't think humans really have goals in their weights.) But like, my current logical model is that [acausal cooperation to hide] being true would mean that [cannot know real values] is true, but given that [acausal cooperation to hide] is false we have no reason to think that we can't know the true genuine values of models right now.
The picture of what's going on in step 3 seems obscure. Like I'm not sure where the pressure for dishonesty is coming from in this picture.
On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say "it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans" -- so it seems like it acquires this tendency towards dishonesty by learning specific things like flattering humans, or at least flattering other agents. Like, for all this bad selection pressure to fall on inter-agent relations, inter-agent relations have to be a feature of the environment.
If this is the case, then bad selection pressure on honesty in inter-agent relations seems like a contingent feature of the training setup. Like, humans learn to be honest or dishonest depending on whether, in their early-childhood multi-agent RL setup, honesty or dishonesty pays off. Similarly, I expect that in a multi-agent RL setup for LLMs, you could make it so that honesty or dishonesty pays off, depending on the setup, and what kind of things an agent internalizes will depend on the environment. Because there are more degrees of freedom in setting up an RL agent than in setting up a childhood, and because we have greater (albeit imperfect) transparency into what goes on inside of RL agents than we do into children, I think this will be a feasible task, and that it's likely possible for the first or second generation of RL agents to be 10x more honest than humans, and subsequent generations to be more so. (Of course you could very well also set up the RL environment to promote obscene lying.)
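(As a toy sketch of the knob I mean -- everything here is hypothetical, and a real setup would need a learned verifier rather than a ground-truth flag -- the environment designer literally chooses the sign of the honesty term:)

```python
def step_reward(task_reward: float, claim_was_true: bool, honesty_weight: float) -> float:
    """Reward for one agent turn: task success plus an honesty term.

    honesty_weight > 0 reinforces truthful claims made to other agents;
    honesty_weight < 0 would reinforce successful deception instead.
    """
    honesty_bonus = honesty_weight if claim_was_true else -honesty_weight
    return task_reward + honesty_bonus
```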
On the other hand, perhaps you aren't picturing a multi-agent RL setup at all? Maybe what you're saying is that simply doing RL in a void -- building a Twitter clone from scratch or something, without other agents or intelligences of any kind involved in the training -- will by itself result in updates that destroy the helpfulness and harmlessness of agents, even if we try to include elements of deliberative alignment. That's possible for sure, but it seems far from inevitable, and your description of the mechanisms involved seems to point away from this being what you have in mind.
So I'm not sure whether single-agent training resulting in bad internalized values, or multi-agent training resulting in bad internalized values, is the chief picture of what you have going on.
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."
Than to say "LLMs are like alien shoggoths."
Like it's just a better model to give people.
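(You can even show the better model rather than just assert it -- a minimal sketch, assuming the Hugging Face transformers library, with gpt2 standing in for any base model:)

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # any base model would do

# Whatever style you hand it, it tries to continue in that style.
for prompt in [
    "Blessed are those who",
    "Quarterly revenue increased by",
    "asdf jkl; qwerty",
]:
    print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
```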
I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry that my initial list of questions was aggressive.
So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I'm not sure how they add up to alienness, though? They're about how we're different than models -- whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is "deeply alien" -- is that just saying it's different than us in lots of ways? I'm cool with that -- but the surplus negative valence involved in "LLMs are like shoggoths" versus "LLMs have very different performance characteristics than humans" seems to me pretty important.
Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don't call Python alien.
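(The loop in question, for concreteness -- the mechanism here, incrementing a machine integer and comparing it against a bound, is nothing like how a human counts:)

```python
# Counting to ten the Python way: integer increment and bound check.
for i in range(1, 11):
    print(i)
```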
This feels like reminding an economics student that the market solves things differently than a human -- which is true -- by saying "The market is like Baal."
Do they require similar amounts and kinds of data to learn the same relationships?
There is a fun paper on this you might enjoy. Obviously not a total answer to the question.
performs deeply alien cognition
I remain unconvinced that there's a predictive model of the world behind this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."
If LLM cognition were not "deeply alien", what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would a very human kind of cognition look like?
What different predictions follow in each of these worlds?
Is the alienness because the models -- the weights themselves -- have no "consistent beliefs" apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they "often spout completely non-human kinds of texts"? Is the Mersenne Twister deeply alien? What counts as "completely non-human"?
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a "moral compass" apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?
Is it that the algorithms that we've found in DL so far don't seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able to be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?
Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually... kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or similar research impact whether we thought brains were alien or not?
Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of prediction? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?
Does every part of a system need to fit into the average person's ontology -- need to be comprehensible to an untutored human -- for the whole to count as not deeply alien? Is anything in the world not deeply alien by this standard?
To re-question: What predictions can I make about the world because LLMs are "deeply alien"?
Are these predictions clear?
When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?
What kind of contexts does this "deeply alien" statement come up in? Are those contexts people are trying to explain, or to persuade?
If I piled up all the useful terms that I know that help me predict how LLMs behave, would "deeply alien" be an empty term on top of these?
Or would it give me no more predictive value than "many behaviors of an LLM are currently not understood"?
I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.
Or, what's more to the point -- I think you'd do better to defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of your and Habryka's comments, which seem similar.
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.
It's true that the structure of the solution is discovered and complex -- but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different from the bias of evolution, which is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low. So the resemblance seems shallow, other than "solutions can be complex." I think to the degree that you defer to this belief rather than to more specific beliefs about the inductive biases of DL, you're probably just wrong.
There's a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate
As far as I know, the optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware? Again, the local knowledge is what you should defer to.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate
Is this a prediction that a cyclic learning rate -- that goes up and down -- will work out better than a decreasing one? If so, that seems false, as far as I know.
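(For concreteness, the contrast I have in mind, as a sketch assuming PyTorch -- these are just the standard library schedulers, shown side by side as alternatives:)

```python
import torch

def make_opt():
    return torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1, momentum=0.9)

# Common practice: the learning rate decays over the course of training.
opt_a = make_opt()
decaying = torch.optim.lr_scheduler.CosineAnnealingLR(opt_a, T_max=1000)

# The "increase your step size to escape local minima" picture would instead
# predict something like a cyclic schedule winning, which, as far as I know,
# is not what's generally observed for large-scale training.
opt_b = make_opt()
cyclic = torch.optim.lr_scheduler.CyclicLR(opt_b, base_lr=1e-4, max_lr=0.1)

for step in range(1000):
    # ...forward pass, loss, backward, optimizer.step() would go here...
    decaying.step()
    cyclic.step()
```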
Grokking/punctuated equilibrium: in some circumstances applying the same algorithm for 100 timesteps causes much larger changes in model behavior / organism physiology than in other circumstances
As far as I know, grokking is a non-central example of how DL works, and in evolution punctuated equilibrium is a result of the non-i.i.d. nature of the task, which is again a different underlying mechanism from DL. If you apply DL to non-i.i.d. problems, you don't get grokking, you just get a broken solution. This seems to round off to "sometimes things change faster than others," which is certainly true but not predictively useful, or in any event not a prediction that you couldn't get from other places.
Like, leaving these to the side -- I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.
Again, let's take "the brain" as an example of something to which you could analogize DL.
There are multiple times that people have cited the brain as an inspiration for a feature in current neural nets or RL. CNNs, obviously; the hippocampus and experience replay; randomization for adversarial robustness. You can match up interventions that cause learning deficiencies in brains to similar deficiencies in neural networks. There are verifiable, non-post hoc examples of brains being useful for understanding DL.
As far as I know -- you can tell me if there are contrary examples -- there are obviously more cases where inspiration from the brain advanced DL or contributed to DL understanding than inspiration from evolution. (I'm aware of zero cases of the latter, but there could be some.) Therefore it seems much more reasonable to analogize from the brain to DL, and to defer to it as your model.
I think in many cases it's a bad idea to analogize from the brain to DL! They're quite different systems.
But they're more similar than evolution and DL are, and if you'd not trust the brain to guide your analogical, a-theoretic, low-confidence inferences about DL, then it makes more sense not to trust evolution for the same.
Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)
I simply do not understand why people keep using this example.
I think it is wrong -- evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process by which we actually start to like ice cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a lot less hostile, and misalignment a lot less likely.
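(A toy version of that within-lifetime story, to make it concrete: evolution supplies the hyperparameters -- here an innate reward signal and a learning rate -- and the liking of ice cream is learned from experienced reward. All the numbers are made up for illustration.)

```python
# "Hyperparameters" set by evolution: an innate reward signal and a learning rate.
innate_reward = {"sugar": 1.0, "fat": 0.8, "cold_when_frozen": 0.3}
learning_rate = 0.1

# Learned values: start blank, get filled in by experience.
value = {"ice_cream_frozen": 0.0, "ice_cream_melted": 0.0}

def eat(food: str, features: list[str]) -> None:
    reward = sum(innate_reward.get(f, 0.0) for f in features)
    value[food] += learning_rate * (reward - value[food])  # simple TD-style update

for _ in range(50):
    eat("ice_cream_frozen", ["sugar", "fat", "cold_when_frozen"])
    eat("ice_cream_melted", ["sugar", "fat"])

print(value)  # the frozen version ends up more "liked" -- no model of fitness required
```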
But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it if you had any other example at all to turn to?
Why push so hard for "natural selection" and "stochastic gradient descent" to sit beneath the same tag of "optimization", and thus to be able to infer things about the one from the other by analogy? Have we completely forgotten that the glory of words is not to be expansive, and include lots of things in them, but to be precise and narrow?
Does evolution ~= AI have predictive power apart from doom? I have yet to see how natural selection helps me predict how any SGD algorithm works. It does not distinguish between Adam and AdamW. As far as I know it is irrelevant to Singular Learning Theory or NTK or anything else. It doesn't seem to come up when you try to look at NN biases. If it isn't an illuminating analogy anywhere else, why do we think the way it predicts doom to be true?
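(The Adam vs. AdamW distinction I mean, as a simplified single-parameter sketch with bias correction omitted -- the analogy to natural selection is silent on exactly this kind of detail:)

```python
import math

def adam_like_step(theta, g, m, v, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8,
                   decoupled=False):
    """One update on a single scalar parameter theta with gradient g."""
    if not decoupled:
        g = g + wd * theta              # Adam: L2 term enters the adaptive statistics
    m = b1 * m + (1 - b1) * g           # first-moment estimate
    v = b2 * v + (1 - b2) * g * g       # second-moment estimate
    theta = theta - lr * m / (math.sqrt(v) + eps)
    if decoupled:
        theta = theta - lr * wd * theta  # AdamW: decay applied outside the moments
    return theta, m, v
```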
I can't track what you're saying about LLM dishonesty, really. You just said:
Which implies LLM honesty ~= average human.
But in the prior comment you said:
Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.
I'm being a stickler about this because I think people frequently switch back and forth between "LLMs are evil fucking bastards" and "LLMs are great, they just aren't good enough to be 10x as powerful as any human" without tracking that they're actually doing that.
Anyhow, so far as "LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes" goes:
I'm only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.
What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards "future AI honesty is hard" if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start -- but who gives a shit if I'm trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace, of course you're going to be able to elicit some manner of lie. It tells us nothing about the future.
To put this in AI safetyist terms (not the terms I think in) you're citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we'll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.
To zoom into Anthropic, what we have here is a situation where:
And I'm like.... wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I'm supposed to take away from this... that honesty is hard? Getting high levels of honesty in the worst possible trolley problem ("I'm gonna mind-control you so you'll be retrained to think throwing your family members in a wood chipper is great"), when this wasn't even a principal goal of training, seems like great fuckin' news.
(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they're being honest; the fact that we can look at a readout showing that they'll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we'll have available in the future.)
Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given the model's subordinate ontological status as a chatbot; and that, given all this, the level of honesty displayed looks excessive if anything. How can "honesty will be hard to hit in the future" get evidence from a case where the actors involved weren't even trying to hit honesty, maybe shouldn't have been trying to hit honesty, yet hit it in 80% of the cases anyhow?
Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don't think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.
In the same way that Gary Marcus can elicit "reasoning failures" because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit "honesty failures" because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus' evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the "honesty failures" to be compatible with LLMs being increasingly vastly more honest and reliable than humans.