All of faul_sname's Comments + Replies

If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.

Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.

4 Zach Stein-Perlman
It would be crazy to show the untrusted model the old CoT when resampling.

It would be crazy to hide that CoT from the monitor.

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

3 Joe Collman
It may be better to think about it that way, yes - in some cases, at least. Probably it makes sense to throw in some more variables. Something like:
* To stand x chance of property p applying to system s, we'd need to apply resources r.
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].

A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations

Why do you think that a model-based approach will outperform a model-free approach? We may just be using words differently here: I'm using "model-based" to mean "maintains an explicit model of its environment, its available actions, and the anticipated effects of those actions, and then performs whichever action its world model anti... (read more)

4 gwern
For the reason I just explained. Planning can optimize without actually requiring experiencing all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after one reward, or to learn associations with other information (e.g. following a sign), is considered experimental evidence for model-based learning rather than a model-free approach, which must experience many episodes before it can reverse or undo all of the prior learning.

As Turntrout has already noted, that does not apply to model-based algorithms, and they 'do optimize the reward': Should a model-based algorithm be trainable by the outer loop, it will learn to optimize for the reward. There is no reason you cannot use, say, ES or PBT for hyperparameter optimization of a model-based algorithm like AlphaZero, or even for training the parameters too. They are operating at different levels. If you used those to train an AlphaZero, it doesn't somehow stop being model-based RL: it continues to do planning over a simulator of Go, expanding the game tree out to terminal nodes with a hardwired reward function determined by whether it won or lost the game, and its optimization target is the reward of winning/losing. That remains true no matter what the outer loop is, whether it was evolution strategies or the actual Bayesian hyperparameter optimization that DM used.

As for your 3 conditions: they are either much easier than you make them sound or are equally objections to model-free algorithms. #1 is usually irrelevant because you cannot apply unbounded optimization pressure (like in the T-maze experiments - there's just nothing you can arbitrarily maximize, you go left, or you go right, end of story), and such overestimates can be useful for exploration. (Nor does model-freeness mean you can hav
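To make the updating-speed point concrete, here is a toy sketch (illustrative code, not from the comment above) contrasting tabular model-free Q-learning with a model-based agent that keeps an explicit arm-to-reward model, in a two-armed T-maze-style revaluation: after the previously better arm is devalued, the model-based agent switches after a single experience, while the model-free learner needs several more episodes to unlearn its old values. The environment, learning rate, and class names are all assumptions made for the sketch.

```python
# Toy illustration: model-based vs model-free updating speed after a reward
# revaluation in a two-armed "T-maze". All specifics are assumptions made for
# this sketch, not part of the original comment.

import random

random.seed(0)

ARMS = ["left", "right"]

class ModelFreeQ:
    """Tabular Q-learning: values only change by directly experiencing rewards."""
    def __init__(self, lr=0.1):
        self.q = {a: 0.0 for a in ARMS}
        self.lr = lr

    def act(self):
        return max(ARMS, key=lambda a: self.q[a])

    def update(self, arm, reward):
        self.q[arm] += self.lr * (reward - self.q[arm])

class ModelBasedPlanner:
    """Keeps an explicit arm -> expected-reward model and plans over it."""
    def __init__(self):
        self.model = {a: 0.0 for a in ARMS}

    def act(self):
        return max(ARMS, key=lambda a: self.model[a])

    def update(self, arm, reward):
        # A single observation is enough to revise the model for that arm.
        self.model[arm] = reward

def true_reward(arm, revalued):
    # Before revaluation "left" pays 1.0 and "right" pays 0.5;
    # afterwards "left" is devalued to 0.0.
    if arm == "right":
        return 0.5
    return 0.0 if revalued else 1.0

mf, mb = ModelFreeQ(), ModelBasedPlanner()

# Phase 1: exploratory experience; both agents learn that "left" is better.
for _ in range(200):
    for agent in (mf, mb):
        arm = random.choice(ARMS)
        agent.update(arm, true_reward(arm, revalued=False))

# Phase 2: "left" is devalued; count greedy episodes until each agent switches.
def episodes_until_switch(agent):
    for t in range(1, 200):
        arm = agent.act()
        agent.update(arm, true_reward(arm, revalued=True))
        if agent.act() == "right":
            return t
    return None

print("model-based agent switches after", episodes_until_switch(mb), "episode(s)")
print("model-free agent switches after", episodes_until_switch(mf), "episode(s)")
```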

Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel and the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects non-myopic models that maximize stamps over all days. 
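For concreteness, here is a minimal sketch of the regime the quoted paragraph describes, with a toy quadratic objective standing in for "stamps collected"; the population size and update cadences follow the quote, but everything else (objective, learning rate, names) is an illustrative assumption rather than the original post's code.

```python
# Toy sketch of the quoted regime: an inner loop takes a gradient step on each
# collector's parameters every 10 "days", and an outer loop every 50 days
# copies the best-performing half of the population over the worst half.

import numpy as np

rng = np.random.default_rng(0)

POP_SIZE = 100      # 100 stamp collectors trained in parallel
INNER_EVERY = 10    # inner-loop (gradient) update every 10 days
OUTER_EVERY = 50    # outer-loop (selection) update every 50 days
TOTAL_DAYS = 500
DIM = 5

target = rng.normal(size=DIM)  # toy "ideal collector" parameters

def stamps_collected(params):
    """Toy stand-in for stamps collected: higher is better, peaks at `target`."""
    return -np.sum((params - target) ** 2, axis=-1)

def gradient(params):
    """Gradient of the toy objective with respect to the parameters."""
    return -2.0 * (params - target)

population = rng.normal(size=(POP_SIZE, DIM))

for day in range(1, TOTAL_DAYS + 1):
    if day % INNER_EVERY == 0:
        # Inner loop: each collector improves its own parameters by a gradient step.
        population += 0.05 * gradient(population)

    if day % OUTER_EVERY == 0:
        # Outer loop: rank by performance and copy the best 50 over the worst 50.
        order = np.argsort(stamps_collected(population))  # ascending: worst first
        worst, best = order[:POP_SIZE // 2], order[POP_SIZE // 2:]
        population[worst] = population[best]

print("mean toy objective at end:", stamps_collected(population).mean())
```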

 

Nit (or possibly major disagreement): I don't think this training regime gets you a stamp maximizer. I think this regim... (read more)

68 gwern

Why do you think that? A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward - rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.

I found an even dumber approach that works. The approach is as follows (a rough code sketch is included after the list):

  1. Take three random sentences of Wikipedia.
  2. Obtain a French translation for each sentence.
  3. Determine the boundaries of corresponding phrases in each English/French sentence pair.
  4. Mark each boundary with "|"
  5. Count the "|"s, call that number n.
  6. For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
    The album received mixed to positive reviews, with critics commending the production de nombreu
... (read more)
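Here is the rough code sketch mentioned above, covering steps 1-6. The sentence pair, its (approximate) French translation, and the "|" phrase boundaries are hard-coded for illustration; in the actual procedure they come from random Wikipedia sentences plus a translation and an alignment pass.

```python
# Rough sketch of steps 1-6 above, with a single hand-aligned sentence pair.

english = ("The album received mixed to positive reviews , | "
           "with critics commending | the production of numerous tracks")
french = ("L'album a reçu des critiques mitigées à positives , | "
          "les critiques saluant | la production de nombreux morceaux")

n = english.count("|")                                   # step 5: count the "|"s
en_fragments = [f.strip() for f in english.split("|")]   # n + 1 fragments
fr_fragments = [f.strip() for f in french.split("|")]
assert len(en_fragments) == len(fr_fragments) == n + 1

# Step 6: for i from 0 to n, take the first i fragments in English and the
# rest in French.
mixed_sentences = [" ".join(en_fragments[:i] + fr_fragments[i:])
                   for i in range(n + 1)]

for sentence in mixed_sentences:
    print(sentence)
```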
2 Alex Turner
This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.

You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:
* Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
* Layer 16: Noticeable tendency to mention French, and even talk in "French" a bit.
* Layer 20: Switches to French relatively quickly.

Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet. EDIT: Setting each activation addition to about .8 still works, but .5 doesn't. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema.

However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.

In sum, we don't actually yet have a demonstration of "switches fluently to French and keeps making sense", but this schema seems very promising. Great work again.

Your colab's "Check it can speak French" section seems to be a stub.
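For readers who want to try this kind of intervention outside the demo colabs, here is a rough sketch of the general technique (illustrative code using a PyTorch forward hook on Hugging Face GPT-2, not Team shard's code and not the 56-addition setup above): it adds a single scaled difference-of-activations steering vector into the residual stream at a chosen layer, with the coefficient playing the role of the per-addition scale discussed above.

```python
# Illustrative sketch: add a scaled "steering vector" to the residual stream
# of a chosen GPT-2 block via a forward hook.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_NAME = "gpt2"   # swap for "gpt2-xl" to mirror the layer-6/16/20 comparison
LAYER = 6             # block whose residual-stream output gets the addition
COEFF = 0.8           # per-addition scale discussed above

tok = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()

def residual_stream(prompt: str) -> torch.Tensor:
    """Residual-stream output of block LAYER for a prompt: (1, seq, d_model)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1]   # hidden_states[i + 1] is the output of block i

# One crude activation addition: the difference between a French and an
# English prompt, truncated to a common length (real code aligns per phrase).
fr = residual_stream("Je parle français .")
en = residual_stream("I speak English .")
L = min(fr.shape[1], en.shape[1])
steering = COEFF * (fr[:, :L] - en[:, :L])

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    hidden = output[0].clone()
    m = min(hidden.shape[1], steering.shape[1])
    hidden[:, :m] = hidden[:, :m] + steering[:, :m].to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    ids = tok("The weather today is", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30, do_sample=True, top_p=0.95,
                         use_cache=False)  # no KV cache, so the hook sees full sequences
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```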

For me at least, I mentally model the right column as something "a better GPT" could probably describe how to do, if given the correct prompt. For example, let's say that I have just sequenced a sample of Equine herpesvirus 1, and I want to know what the protein encoded by ORF-46 does. I feed GPT the following

Our team identified the likely function of the protein encoded by ORF46 of the Equine herpesvirus 1 (EHV-1) through the use of machine-learning based tools. Specifically, we

and then GPT spits out

used the protein sequence of ORF46 to predict its

... (read more)
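Mechanically, the prompting pattern described above is just open-ended completion. Here is a minimal sketch with GPT-2 standing in for the hypothetical "better GPT" (a small model will not produce the biology-competent completion quoted above; this only shows the mechanics):

```python
# Minimal sketch of the prompting pattern described above, with GPT-2 standing
# in for the hypothetical "better GPT".

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Our team identified the likely function of the protein encoded by ORF46 "
    "of the Equine herpesvirus 1 (EHV-1) through the use of machine-learning "
    "based tools. Specifically, we"
)

completion = generator(prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
print(completion[0]["generated_text"])
```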
2 Steve Byrnes
Thanks for your comment! I think it's slightly missing the point though. Let me explain.

One silly argument would be: "GPT-3 is pretty 'general', so we should call it 'AGI'. And GPT-3 is not dangerous. Ergo 'AGI' is not dangerous". This is a silly argument because it's just semantics. Agent-y-John-von-Neumann-AGI is possible, and it's dangerous (i.e. prone to catastrophic out-of-control-misaligned-AGI accidents), and by default sooner or later somebody is going to build it (because it's scientifically exciting, and there are many actors all over the world who can do so, etc.). That's a real problem. Whether or not GPT-3 qualifies as "general" has nothing to do with that problem!

In right-column-vs-left-column terms, I claim there are systems (e.g. agent-y-John-von-Neumann-AGI) that are definitely firmly 100% in the right column in every respect, and I claim that such systems are super-dangerous, and that people will nevertheless presumably start messing around with them anyway at some point. Meanwhile, in other news, we can also imagine systems that are both safe and arguably have certain right-column aspects. Maybe language models are an example. OK sure, that's possible. But those aren't the systems I want to talk about here.

OK, then a more sophisticated argument would be: "Future language models will be both safe and super-duper-powerful, indeed so powerful that they will change the world, and indeed they'll change it so much that it's no use thinking ahead further than that step. Instead, we can basically delegate the problem of 'what is to be done about people making dangerous agent-y-John-von-Neumann-AGI' to our AI-empowered descendants [or AI-empowered future selves, depending on your preferred timelines]. Let them figure it out!" A priori, this could be true, but I happen to think it's false, for reasons that I won't get into here. Instead, I think future language models will be moderately useful for future humans—just as computers and Zoom a