User Comment Replies — AI Alignment Forum

Inverse Scaling Prize: Second Round Winners

Re "Prompt Injection, by Derik Kauffman, Aaron Kirtland, Andrew Gritsevskiy, and Joe Cavanagh (Third Prize)": text-davinci-003 works perfectly if the last input's first letter is lower case. It correctly predicts "Input" as the first token with >99% prob.

It seems that the last input being out-of-distribution wrt. to the few shot examples makes the model default to its priors and follow the instruction.

Also, interestingly, replacing the "following sentences" in the instruction with "following inputs" also fixes the model's behavior. I felt the term "sent... (read more)

How I think about alignment

Peter Hroššo3y21

I expect there to be too much happenstance encoded in my values.

I believe this is a bug, not a feature that we would like to reproduce.

I think that the direction you described with the AI analysing how you acquired your values is important, because it shouldn't be mimicking just your current values. It should be able to adapt the values to new situations the way you'd do (distributional shift). Think all the books / movies where people get to unusual situations and have to make tough moral calls. Like plane crashing in the middle of nowhere with 20 surv... (read more)

Gradient hacking

Peter Hroššo6y30

If your model is deceptive, though, then it might know all of that

Could you please describe your intuition behind how the model could know the meta-optimizer is going to perform checks on deceptive behavior?

Evan Hubinger6y60

Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.

AI ALIGNMENT FORUM
AF

All of Peter Hroššo's Comments + Replies