I think my bar for reasonably honest is... not awful -- I've put fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now.
So, let me enumerate senses of honesty.
1. Accurately answering questions about internal states accurately. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why di...
LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that s...
The picture of what's going on in step 3 seems obscure. Like I'm not sure where the pressure for dishonesty is coming from in this picture.
On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say "it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans" -- so it seems like it's learning specific things flattering humans or at least flattering other agents in order to acquire this tendency towards ...
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."
Than to say "LLMs are like alien shoggoths."
Like it's just a better model to give people.
I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry the my initial list of questions was aggressive.
...So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictio
performs deeply alien cognition
I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."
If LLM cognition was not "deeply alien" what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would very human kind of cognition look like?
What different predictions does the world make?
Does alienness indicate that it is...
I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.
Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of yours and Habryka's comments, which see...
FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without provi...
...Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of p
evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
Does evolution ~= AI have predictive power apart from doom?
Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.
I can't track what you're saying about LLM dishonesty, really. You just said:
Which implies LLM honesty ~= average human.
But in the prior comment you said:
... (read more)Good point, you caught me in a contradiction there. Hmm.
I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.
As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demo... (read more)