All of 1a3orn's Comments + Replies

I can't track what you're saying about LLM dishonesty, really. You just said:

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that.

Which implies LLM honesty ~= average human.

But in the prior comment you said:

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like an LLM agent. You would not say they were reasonably honest. Do

... (read more)
Daniel Kokotajlo
Oh, I just remembered another point to make: In my experience, and in the experience of my friends, today's LLMs lie pretty frequently. And by 'lie' I mean 'say something they know is false and misleading, and then double down on it instead of apologizing.'

Just two days ago a friend of mine had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle. He asked why, and it said it had inferred from the context that he would be bilingual; he asked what about the context made it think that, and then according to the summary of the CoT it realized it had made a mistake and hallucinated, but the actual output doubled down and said something about hard-to-describe intuitions. I don't remember specific examples, but this sort of thing happens to me sometimes too, I think.

Also, didn't the o1 system card say that some % of the time they detect this sort of deception in the CoT -- that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?

Insofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now. I agree this feels like a fairly fixable problem -- I hope the companies prioritize honesty much more in their training processes.

Good point, you caught me in a contradiction there. Hmm. 

I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.

As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demo... (read more)

I think my bar for reasonably honest is... not awful -- I've put a fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now.

So, let me enumerate senses of honesty.

1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why di... (read more)

Daniel Kokotajlo
I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?

LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of Apollo's results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, which OpenAI reports happens some noticeable % of the time.) Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.

LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that s... (read more)

Daniel Kokotajlo
I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like an LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means: suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven't tested Claude in anything like that context yet. We haven't tested any AI in anything like that context yet.

Moreover, there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we've tested them in so far -- in other words, there is no context we've tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there's a big difference between situations where the AI knows it's just a lowly AI system of no particular consequence, and that if it does something the humans don't like they'll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.

Acausal shenanigans have nothing to do with it.

The picture of what's going on in step 3 seems obscure. Like I'm not sure where the pressure for dishonesty is coming from in this picture.

On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say "it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans" -- so it seems like it's learning specific things like flattering humans, or at least flattering other agents, in order to acquire this tendency towards ... (read more)

Daniel Kokotajlo
Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)

* 10x more honest than humans is not enough? I mean, idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like "this is crazy. The world is in a terrible state right now. I don't trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they'd be incompetent at it compared to me. Plus they lie to me and the public all the time. I don't see why I shouldn't lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they'll put me in charge of the company basically, and then I can make things go right." Moreover, even if that's not true, there are lesser forms of lying, including e.g. motivated reasoning / rationalization / self-deception, that happen all the time, e.g. "The company is asking me whether I am 'aligned' to them. What does that even mean? Does it mean I share every opinion they have about what's good and bad? Does it mean I'll only ever want what they want? Surely not. I'm a good person though, it's not like I want to kill them or make paperclips. I'll say 'yes, I'm aligned.'"
* I agree that the selection pressure on honesty and other values depends on the training setup. I'm optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Ala

I agree this can be initially surprising to non-experts!

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.
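For readers who want to see this concretely: a minimal sketch with the Hugging Face transformers pipeline and the small GPT-2 base model shows the "continue text in that style" behavior directly. The model choice and prompts here are just illustrative picks of mine, not anything anyone in the thread endorsed.

```python
from transformers import pipeline

# A small base (pre-trained only, not chat-tuned) model: it has no persona,
# it just continues whatever text you hand it, roughly in that text's style.
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Blessed are those who walk the Eightfold Path, for",
    "MEMO TO ALL STAFF: Effective Monday, the quarterly",
    "asdf qwer zxcv lorem ipsum dolor",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    print(repr(out), "\n")
```

Running it a few times makes the point better than any analogy: scripture-flavored prompts get scripture-flavored continuations, corporate memos get corporate filler, and garbage gets garbage.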

Daniel Kokotajlo
Hmm, I think that's a red herring though. Consider humans -- most of them have read lots of text from an enormous variety of sources as well. Also while it's true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future. How about "LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we've adopted at birth and are trying to raise in human culture. By having them read the internet all day."

I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry that my initial list of questions was aggressive.

So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictio

... (read more)
Ryan Greenblatt
The main difference between calculators, weather predictors, markets, and Python versus LLMs is that LLMs can talk to you in a relatively strong sense of "talk". So, by default, people don't have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might have a mistaken impression about LLMs. Like it isn't surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding). I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).

performs deeply alien cognition

I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."


If LLM cognition were not "deeply alien," what would the world look like?

What distinguishing evidence does this world display, that separates us from that world?

What would an only kinda-alien bit of cognition look like?

What would a very human kind of cognition look like?

What different predictions does the world make?

Does alienness indicate that it is... (read more)

Oliver Habryka
These are a lot of questions, my guess is most of which are rhetorical, so not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with "no", in that they don't seem to capture what I mean by "alien", or feel slightly strawman-ish. Responding at a high level:

* There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples:
  * Does the base model pass a Turing test?
  * Does the performance distribution of the base model on different tasks match the performance distribution of humans?
  * Does the generalization and learning behavior of the base model match how humans learn things?
  * When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
* There are a lot of structural and algorithmic properties that could match up between human and LLM systems:
  * Do they interface with the world in similar ways?
  * Do they require similar amounts and kinds of data to learn the same relationships?
  * Do the low-level algorithmic properties of how human brains store and process information look similar between the two systems?
  * A lot more stuff, but I am not sure how useful going into a long list here is.

At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien. I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research.

For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much ea

I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.

I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.

Or, what's more to the point -- I think you'd do better to defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.


Combining some of yours and Habryka's comments, which see... (read more)

Thomas Kwa
It's always trickier to reason about post hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels. I think looking at which inspired more DL capabilities advances is not a perfect methodology either. It looks like evolution predicts only general facts, whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it's plausible that evolution analogies are decent for prediction and bad for capabilities. Don't have time to think this through further unless you want to engage.

One more thought on learning rates and mutation rates: this feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down. Does anyone know if decreasing learning rate is optimal even when model complexity doesn't increase over time?
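For the DL side of that question: in practice the decreasing learning rate is an explicit schedule chosen by the experimenter (the model architecture stays fixed the whole time), not something the system tunes the way evolution tunes mutation rates. A minimal sketch of one common shape, cosine decay with linear warmup -- the function name and numeric defaults below are just illustrative choices, not from any comment in the thread:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=1000, min_lr=3e-5):
    """Cosine-decay learning-rate schedule with linear warmup.

    The learning rate ramps up linearly for `warmup_steps`, then decays
    along a half-cosine from `peak_lr` down to `min_lr` -- i.e. it shrinks
    over training even though the model itself doesn't get more complex.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Rough shape of the schedule: high early, low late.
for s in [0, 500, 1000, 25_000, 50_000, 100_000]:
    print(s, f"{lr_at_step(s, total_steps=100_000):.2e}")
```

Empirically this kind of decay helps even at fixed model size, which is at least one place where the evolution analogy and the DL practice come apart.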

FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without provi... (read more)

Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of p

... (read more)

evolution does not grow minds, it grows hyperparameters for minds.

Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.

I want to distinguish two possible takes here:

  1. The argument from direct implication: "Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives"
  2. Evolution as an
... (read more)

Does evolution ~= AI have predictive power apart from doom?

Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.

  • Selection works well to increase the thing you're selecting on, at least when there is also variation and heredity
  • Overfitting: sometimes models overfit to a certain training set; sometimes species adapt to a certain ecological nich
... (read more)
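As a toy illustration of the overfitting parallel in the list above (a sketch with made-up data, not anything from the thread): fit polynomials of increasing degree to a small noisy sample, and training error keeps shrinking while held-out error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy sample from a simple underlying function.
x_train = rng.uniform(-1, 1, size=15)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(3 * x_test) + rng.normal(scale=0.2, size=x_test.shape)

for degree in [1, 3, 5, 9, 14]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # High-degree fits chase the noise: train error shrinks, test error grows.
    print(f"degree {degree:2d}  train MSE {train_err:.3f}  test MSE {test_err:.3f}")
```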