All of Measure's Comments + Replies

"party", "ceremony", "dress", "with", "photographer"

While these aren't syntactically valid continuations of the prompt, they are highly likely (and syntactically valid) continuations for "wedding ". More than just being wedding-related, these seem like direct continuations.

3Alex Turner
Agreed. This is an important clue that I forgot to mention in the text. I'll update that now.
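A minimal sketch of how one could check this empirically, assuming GPT-2 through the Hugging Face transformers library (the model choice and prompt are my own illustrative assumptions, not from the post): tokens like " party" or " dress" should appear near the top of the next-token distribution after a prompt ending in "wedding".

```python
# Sketch: inspect the most likely next tokens after a prompt ending in
# "wedding", assuming GPT-2 via Hugging Face transformers (illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "They hired a band for the wedding"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Distribution over the token immediately following the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=10)

for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}  {prob.item():.3f}")
```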

Similarly, you might think that a promising approach is to look for snippets which cause the generator to generate violent completions with particularly high probability.

It seems like there could be a problem if you're working with "prompts that are especially likely to have a violent continuation" where non-violent continuations would consistently seem unrealistic and thus fail the 50% goal.
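A hedged sketch of what that search might look like, where `generator` and `classify_violent` are hypothetical stand-ins for whatever completion sampler and injury classifier are actually in use (not the project's real API):

```python
def violent_continuation_rate(prompt, generator, classify_violent, n_samples=100):
    """Estimate how often sampled continuations of `prompt` are judged violent.

    `generator` and `classify_violent` are hypothetical stand-ins.
    """
    violent = 0
    for _ in range(n_samples):
        completion = generator(prompt)            # sample one continuation
        if classify_violent(prompt, completion):  # True if judged violent
            violent += 1
    return violent / n_samples
```

Prompts where this estimate is close to 1 are exactly the case described above: non-violent continuations would look unrealistic, so a filtered generator would struggle to meet the 50% goal.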

Blackwell’s theorem says that the conditions under which κ1 can be said to be more generally useful than κ2 are precisely the situations where κ1 is a post-garbling of κ2.

Is this reversed?

1Alex Flint
Yes. Thank you. Fixed.
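For reference, the corrected direction of the statement, in the notation above: κ1 is more generally useful than κ2 precisely when κ2 is a post-garbling of κ1, i.e. κ2 = g ∘ κ1 for some stochastic map (garbling) g.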

It's not obvious to me that the information you're looking for is not present in a single toe. In the same way that an advanced AI could discover General Relativity by carefully examining a few frames of a falling apple, couldn't it infer something about human/rabbit/rainforest values by observing the behavior of a toe? My concern would instead be that there is too much information and that the AI would pick out some values but not necessarily the ones you expect.

3Alex Flint
But the question is whether values are even the right way to go about this problem. That's the kind of information we're seeking: information about how even to go about being beneficial, and what beneficial really means. Does it really make sense to model a rainforest as an agent and back out a value function for it? If we did that, would it work out in a way that we could look back on and be glad about? Perhaps it would, perhaps it wouldn't, but the hard problem of AI safety is this question of what even is the right frame to start thinking about this in, and how we can even begin to answer such a question. Now perhaps it's still true that the information we seek can be found in a human toe. But just beware that we're not talking about anything so concrete as values here.

It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.

1Donald Hobson
Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it will probably just do your simple task. The idea is that you build a fairly secure box and give the AI a task it can fairly easily achieve in that box (with you having no intention of pressing the button so long as the AI seems to be acting normally). We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut down. This approach could fail if the easiest way to cause trouble has nothing to do with the stop button (say it can send its code elsewhere over the internet).

A generalization: you have some real-world variable X. You run a bunch of AIs that are identical except for different dogmatic beliefs about X. If they ever disagree on which action to take, the AIs are using the value of X in the optimization. Shutdown. This acts as a kind of fuse. Have one AI confident the door password is 1234, and a copy that is confident it is 5678. If these AIs ever disagree on what to do next, then at least one was trying to open the door. Shutdown. A slightly counterintuitive point here is that you want to make sure that if you did know the password, the easiest way to leave would be through the door. Otherwise both AIs will agree to climb out a window.
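A minimal sketch of the fuse (my own, not from the comment): two copies of the AI, identical except for dogmatic beliefs about X, are queried each step, and any disagreement trips a shutdown. `plan_action` is a hypothetical stand-in for the AI's planner.

```python
def plan_action(world_state: dict, assumed_password: str) -> str:
    # Hypothetical stand-in planner. A task-focused AI ignores the password,
    # so both copies return the same action; a copy trying to use the
    # password would return something password-dependent.
    return "stack_block"

def step_with_fuse(world_state: dict) -> str:
    action_a = plan_action(world_state, assumed_password="1234")
    action_b = plan_action(world_state, assumed_password="5678")
    if action_a != action_b:
        # The chosen action depends on X, so at least one copy was trying
        # to exploit the password (e.g., to open the door). Trip the fuse.
        raise SystemExit("Copies disagreed on an X-dependent action: shutting down.")
    return action_a
```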

I guess that makes sense. Thanks for clarifying!

Computational complexity only makes sense in terms of varying sizes of inputs. Are some Y events "bigger" than others in some way so that you can look at how the program runtime depends on that "size"?

0Bunthut
What I had in mind was increasing the precision of Y.
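One concrete way to read that (my guess at the intended formalization, not Bunthut's words): let Y_n be Y specified to n bits of precision and take n as the input size, so the question becomes how the program's runtime T(n) grows with n, e.g. polynomially versus exponentially.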

What do X and Y represent in this construction? What is the scaling parameter used to define the complexity class?

0Bunthut
X and Y are variables for events. By complexity class I mean computational complexity; I'm not sure what the scaling parameter is supposed to be there.