I think this comment might be more productive if you described why you expect this approach to fail catastrophically, and without adequate warning, when dealing with powerful systems. Linking to previous writing on this could be good (maybe this comment of yours on debate/scalable oversight).
An example which I'm not sure is likely, but which I think would provide an adequate warning shot, is a model being caught clearly self-exfiltrating its weights. This might happen too late: by the time a model could do this, it might have already caused catastrophe in another way. There are versions of this that look more benign and so wouldn't be adequate warning shots, but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only explained by the model being misaligned and attempting to escape.
Here are some examples:
I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate the AI has impressive technical capabilities and is also clearly misaligned.
There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI, or an AI that, when sufficiently prompted, starts openly discussing plans for how to escape.
I'm confused here. It seems to me that if your AI normally does evil things and only sometimes (in certain situations) does good things, I would not call it "aligned", and the alignment is certainly not stable (because it almost never takes "good" actions). That said, this thing is also not robustly "misaligned" either.
(I don't mean to dogpile)
I think that "selection" is the correct word, and that using it doesn't really seem to smuggle in incorrect connections to evolution.
We could imagine finding an NN that does well according to a loss function by simply randomly initializing many, many NNs and then keeping the one that does best according to that loss function. I think this process would accurately be described as selection; we are literally selecting the model which does best.
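To make this concrete, here is a toy sketch of the kind of procedure I have in mind (purely illustrative NumPy; the network, data, and search budget are all made up):

```python
# Hypothetical sketch of "selection" via pure random search (not SGD):
# initialize many networks at random and keep whichever scores best on the loss.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (shapes and sizes are just for illustration).
X = rng.normal(size=(100, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=100)

def init_params():
    """Randomly initialize a tiny one-hidden-layer network."""
    return {
        "W1": rng.normal(size=(4, 16)),
        "W2": rng.normal(size=(16, 1)),
    }

def loss(params):
    """Mean squared error of the network on the toy data."""
    hidden = np.tanh(X @ params["W1"])
    pred = (hidden @ params["W2"]).ravel()
    return float(np.mean((pred - y) ** 2))

# "Selection": sample many random networks, keep the one with the lowest loss.
candidates = [init_params() for _ in range(10_000)]
best = min(candidates, key=loss)
print(f"best loss found by random selection: {loss(best):.3f}")
```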
I'm not claiming that SGD does this[1], just giving an example of a method for finding a low-loss parameter configuration, one which isn't related to evolution and is (in my opinion) best described as "selection".
[1] Although "Is SGD a Bayesian sampler? Well, almost" does make a related claim.
So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?
I think there's a disconnect between this question and the answer it got.
I think Scott is asking "Supposing an AI engineer could create something that was effectively a copy of a human brain and give it the same training data, could this thing learn the 'don't steal' instinct over the 'don't get caught' instinct?"
Eliezer is answering "Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the 'don't steal' instinct?"
Do you think this means it might be worth attempting to filter pretraining data to remove content discussing misalignment failure modes (e.g., deceptive alignment, clippy, reward hacking, treacherous turns, etc.)?
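To be concrete, the crudest version I'm imagining is something like a phrase blocklist over pretraining documents (a purely hypothetical sketch; a real filter would presumably want a trained classifier, and the blocklist below is just for illustration):

```python
# Hypothetical sketch of a naive keyword-based pretraining filter.
import re

# Assumed blocklist of misalignment-related phrases (illustrative, not exhaustive).
MISALIGNMENT_PATTERNS = [
    r"deceptive alignment",
    r"treacherous turn",
    r"reward hacking",
    r"paperclip maximi[sz]er",
]
_pattern = re.compile("|".join(MISALIGNMENT_PATTERNS), flags=re.IGNORECASE)

def keep_document(text: str) -> bool:
    """Return True if the document contains none of the blocked phrases."""
    return _pattern.search(text) is None

# Usage: filter a corpus (list of strings) down to documents that pass.
corpus = [
    "A recipe for sourdough bread.",
    "An essay on deceptive alignment and treacherous turns.",
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```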