All of tailcalled's Comments + Replies

The singular vectors of the Jacobian between two layers seem more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces everything to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities at infinitesimal scale, but would change the probabilities if the change in activations were scaled up further.

Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy ... (read more)
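To make the Jacobian idea concrete, here is a minimal sketch (mine, not from the thread) of taking the top right-singular vector of the Jacobian between two layers as a candidate steering direction. It assumes a PyTorch setup where the map from early-layer to late-layer activations for a fixed prompt is available as a callable; the names `early_to_late` and `top_jacobian_direction` are hypothetical placeholders.

```python
import torch

def top_jacobian_direction(early_to_late, early_acts):
    """Return the first right-singular vector of d(late)/d(early) at early_acts."""
    J = torch.autograd.functional.jacobian(early_to_late, early_acts)
    J = J.reshape(-1, early_acts.numel())        # (late_dim, early_dim)
    _, _, Vt = torch.linalg.svd(J, full_matrices=False)
    return Vt[0].reshape(early_acts.shape)       # unit direction in early-layer space

# Toy stand-in for the slice of a network between two layers.
early_to_late = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh(),
                                    torch.nn.Linear(16, 16))
early_acts = torch.randn(16)
direction = top_jacobian_direction(early_to_late, early_acts)
```

Unlike the Hessian of a scalar objective, this direction is chosen for how much it moves the downstream activations themselves, not for how much it moves the final loss at infinitesimal scale.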

1Andrew Mack
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.

For the past few weeks, I've been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token × activation matrix for a given layer, and then restricti... (read more)
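As a concrete version of the truncated suggestion above, here is a rough sketch (my own, with hypothetical names like `project_onto_activation_subspace`) of restricting a steering vector to the span of the top singular directions of the token × activation matrix for one layer and one prompt:

```python
import torch

def project_onto_activation_subspace(acts, vector, k=8):
    """Project `vector` onto the top-k right-singular directions of `acts`."""
    acts_centered = acts - acts.mean(dim=0, keepdim=True)
    _, _, Vt = torch.linalg.svd(acts_centered, full_matrices=False)
    basis = Vt[:k]                       # (k, d_model), orthonormal rows
    coords = basis @ vector              # coordinates within the subspace
    return basis.T @ coords              # the "on-manifold" component of vector

acts = torch.randn(50, 64)               # toy token x activation matrix
steer = torch.randn(64)
steer_on_manifold = project_onto_activation_subspace(acts, steer, k=8)
```

Whether this linear subspace is a good proxy for "the manifold" is exactly the question raised in the reply below.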

04gate
Maybe a dumb question, but (1) how can we know for sure if we are on manifold, and (2) why is it so important to stay on manifold? I'm guessing you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector, since it's likely later layers are more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but there are also prompt-engineering attacks that people use that are likely in some sense "off-distribution" but still occur in the real world, and I don't think we should ignore these fully. Is this notion of a manifold a good way to think about getting indicative information of real-world behavior? Probably, but I'm not sure, so I thought I might ask. I am new to this field. I do think at the end of the day we want indicative information, so I think somewhat artificial environments might at times have a certain usefulness.

Also, one convoluted (perhaps inefficient) idea which felt kind of fun for staying on manifold is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steering vector from something that is close to a prompt, surely it's not on manifold, right? You might be able to automate this by looking at perplexity or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever. Curious to hear thoughts.

How important is it to use full-blown gradient descent to train them? Could one instead take the first singular vector for the Jacobian between the neural network layers, and get something that works similarly well?

1Andrew Mack
I haven't tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around θ=0) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
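For reference, a rough sketch of the eigenvector experiment described here (my reconstruction under stated assumptions, not Andrew Mack's code): given a scalar steering objective `objective(theta)` such as the magnitude of downstream changes, compute the Hessian at θ=0 and inspect its leading eigenvectors as candidate steering directions.

```python
import torch

def top_hessian_directions(objective, dim, k=3):
    """Leading eigenvalues/eigenvectors of the Hessian of `objective` at theta = 0."""
    theta0 = torch.zeros(dim)
    H = torch.autograd.functional.hessian(objective, theta0)   # (dim, dim)
    eigvals, eigvecs = torch.linalg.eigh(H)                     # ascending eigenvalues
    return eigvals[-k:], eigvecs[:, -k:]                        # top-k

# Toy quadratic stand-in for "magnitude of downstream changes".
W = torch.randn(32, 32)
objective = lambda theta: (W @ theta).pow(2).sum()
vals, vecs = top_hessian_directions(objective, dim=32, k=3)
```

The reported result was that these directions did nothing interesting at any norm, which is some evidence that the local second-order geometry at θ=0 misses what full gradient descent eventually finds.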

In fact, PPO is essentially a tweaked version of REINFORCE,

Valid point.

Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver's RL course:

Critically though, neither Q, A, nor delta denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn't really maximize the sum of rewards, these other algorithms are attempts to more consistently do so, and the ... (read more)
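A toy rendering of the point (my own gloss, not Silver's slides or anyone's exact formulation): in all of these policy-gradient variants the update looks like ∇log π(a|s) scaled by some factor x, where x is the return, Q, the advantage A, or the TD error δ, i.e. an estimate of the action's effect on future reward rather than the reward itself, acting like a per-sample learning-rate multiplier.

```python
import torch

def policy_gradient_step(logits, action, x, lr=0.01):
    """One REINFORCE-style update where x (return, Q, A, or delta) scales the step."""
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    (-x * log_prob).backward()           # gradient ascent on x * log pi(a|s)
    with torch.no_grad():
        logits -= lr * logits.grad       # same update rule for every choice of x
        logits.grad.zero_()
    return logits

logits = torch.zeros(4, requires_grad=True)
# Advantage-style x: reward-to-go minus a baseline, not the raw reward itself.
reward_to_go, baseline = 3.0, 2.5
logits = policy_gradient_step(logits, action=2, x=reward_to_go - baseline)
```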

2Alex Turner
I'm sympathetic to this argument (and think the paper overall isn't super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That's something new.

I get that a lot of AI safety rhetoric is nonsensical, but I think your strategy of obscuring technical distinctions between different algorithms and implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.

After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we

... (read more)

I was under the impression that PPO was a recently invented algorithm

Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why... (read more)

RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world).

This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That's what's primarily responsible for most "reward hacking"-esque results, such as the CoastRunne... (read more)

(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.

I don't confidently disagree with this statement, but it occurs to me that I haven't tried it myself and haven't followed it very closely, and have sometimes heard claims that there are promising methods.

A lot of people trying to come up with answers try to do it w... (read more)

6Ryan Greenblatt
Perhaps you're thinking of the recent influence function work from Anthropic? I don't think that this paper either shows or claims that "LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions". But they do find that there are influential training examples from sci-fi stories and AI safety discussion when asking the model questions about topics like this.

I think discussions about capabilities raise the question "why create AI that is highly capable at deception etc.? seems like it would be safer not to".

The problem that occurs here is that some ways to create capabilities are quite open-ended, and risk accidentally creating capabilities for deception due to instrumental convergence. But at that point it feels like we are getting into the territory that is best thought of as "intelligence", rather than "capabilities".

3Victoria Krakovna
I agree that a possible downside of talking about capabilities is that people might assume they are uncorrelated and we can choose not to create them. It does seem relatively easy to argue that deception capabilities arise as a side effect of building language models that are useful to humans and good at modeling the world, as we are already seeing with examples of deception / manipulation by Bing etc.  I think the people who think we can avoid building systems that are good at deception often don't buy the idea of instrumental convergence either (e.g. Yann LeCun), so I'm not sure that arguing for correlated capabilities in terms of intelligence would have an advantage. 

Nice, I was actually just thinking that someone needed to respond to LeCun's proposal.

That said, I think you may have gotten some of the details wrong. I don't think the intrinsic cost module gets raw sensory data as input, but instead it gets input from the latent variables of the world model as well as the self-supervised perception module. This complicates some of the safety problems you suggest.

3Steve Byrnes
Thanks. That did actually occur to me, but I left it out because I wasn’t sure and didn’t want to go on an exhausting chase down every possible interpretation of the paper. Anyway, if the input to the Prosociality Score Model is a set of latent variables rather than a set of pixels then:
* My OP claim that there are two adversarial out-of-distribution generalization problems (in the absence of some creative solution not in the paper) is still true.
* One of those two problems (OOD generalization of the Prosociality Score Model) might get less bad, although I don’t see why it would go away altogether.
* …But only if the labels are correct, and the labeling problem is potentially much harder now, because the latent variables include inscrutable information about “how the AI is thinking about / conceptualizing the things that it’s seeing / doing”. I think. And if they do, then how are the humans supposed to label them as good or bad? Like, if the AI notices someone feeling physically good but psychologically distressed, we want to label it as low-energy when the AI is thinking about the former aspect and high-energy if the AI is thinking about the latter aspect, I imagine. And then we start getting into nasty neural net interpretability challenges.

Also, aren't the latent variables changing as we go, thanks to self-supervised learning? But the Intrinsic Cost Module is supposed to be immutable. I'm confused about how this is supposed to work.
2Steve Byrnes
Thanks! I certainly expect future AGI to have “learned meta-cognitive strategies” like “when I see this type of problem, maybe try this type of mental move”, and even things like “follow the advice in Cal Newport and use Anki and use Bayesian reasoning etc.” But I don’t see those as particularly relevant for alignment. Learning meta-cognitive strategies are like learning to use a screwdriver—it will help me accomplish my goals, but won’t change my goals (or at least, it won’t change my goals beyond the normal extent to which any new knowledge and experience could potentially cause goal drift.) I do think that the “source code” for a human brains has a rather different set of affordances for search / planning than you’ll find in AlphaZero or babble-and-prune or PPO, but I’m not sure how relevant that is. In particular, can you say more about why you believe “The human source code won’t contain general-purpose search”?

Did you tell your friend the premise behind this interaction out of band?

1Alex Flint
I asked a group of friends for "someone to help me with an AI experiment" and then I gave this particular friend the context that I wanted her help guiding me through a task via text message and that she should be in front of her phone in some room that was not the kitchen.

I think the key question is how much policy-caused variance in the outcome there is.

That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".

If yo... (read more)
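A tiny numerical illustration of "policy-caused variance" as used above (my framing, with made-up numbers for the juice example): in a given state, how much does the reinforcement signal vary across the actions the current policy might plausibly take?

```python
import numpy as np

def policy_caused_variance(action_probs, reinforcement):
    """Variance of the reinforcement signal under the policy's action distribution."""
    p = np.asarray(action_probs)
    r = np.asarray(reinforcement)
    mean = p @ r
    return p @ (r - mean) ** 2

# Near the juice: walking up to it triggers reinforcement, not walking doesn't.
near_juice = policy_caused_variance([0.5, 0.5], [1.0, 0.0])   # 0.25 -> juice can chain
# No juice in sight: neither action triggers juice reinforcement.
no_juice = policy_caused_variance([0.5, 0.5], [0.0, 0.0])     # 0.0  -> nothing to chain on
```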

I don't think so.

Like technically yes, it shows that there is an internal optimization process running in the networks, but much of the meat of optimization, such as instrumental convergence/power-seeking, depends on the structure of the function one is optimizing over.

If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.

2Lawrence Chan
Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer: And the following definition of a mesa-optimizer: In this paper, the authors show that transformers implement gradient descent (an optimization algorithm) to optimize a particular objective (ℓ2 loss). (This is very similar to the outer optimization loop that's altering the network parameters.) So the way davidad uses the word "mesa-optimization" is consistent with prior work. I also think that it's pretty bad to claim that something is only an optimizer if it's a power-seeking consequentialist agent. For example, this would imply that the outer loop that produces neural network policies (by gradient descent on network parameters) is not an optimizer! Of course, I agree that it's not the case that these transformers are power-seeking consequentialist agents. And so this paper doesn't provide direct evidence that transformers contain power-seeking consequentialist agents (except for people who disbelieved in power-seeking consequentialist agents because they thought NNs are basically incapable of implementing any optimizer whatsoever).
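For readers unfamiliar with the result being discussed, here is a sketch (my own illustrative rendering, not the paper's actual construction) of what "gradient descent on ℓ2 loss" means for in-context linear regression: the claim is that the trained transformer's forward pass implements updates equivalent to steps like the one below on the in-context examples.

```python
import numpy as np

def gd_step_l2(w, X, y, lr=0.5):
    """One gradient-descent step on L(w) = ||X w - y||^2 / (2 n)."""
    n = len(y)
    grad = X.T @ (X @ w - y) / n
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))             # in-context inputs
w_true = rng.normal(size=4)
y = X @ w_true                           # in-context targets

w = np.zeros(4)
for _ in range(500):
    w = gd_step_l2(w, X, y)
print(float(np.sum((X @ w - y) ** 2)))   # approaches 0: the least-squares solution
```

Nothing in this loop models consequences in the world, which is the distinction the parent comment is drawing.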

I would very much assume that you have a strong genetic disposition to be smart and curious.

Do you think unschooling would work acceptably well for kids who are not smart and curious?

2Ulisse Mini
I think school is huge in preventing people from becoming smart and curious. I spent 1–2 years where I hardly studied at all and mostly played videogames - I wish I hadn't wasted that time, but when I quit I did so of my own free will. I think there's a huge difference between discipline imposed from the outside vs the inside, and getting to the latter is worth a lot. (Though I wish I hadn't wasted all that time now, haha.)

I'm unsure which parts of my upbringing were cruxes for unschooling working. You should probably read a book or something rather than taking my (very abnormal) opinion. I just know how it went for me :)

In your view, are values closer to recipes/instruction manuals than to utility functions?

2Alex Turner
Yeah, kinda, although "recipe" implies there's something else deciding to follow the recipe. Values are definitely not utility functions, on my view.

For your quiz, could you give an example of something that is grader-optimization but which is not wireheading?

2Alex Turner
Alignment with platonic grader-output isn't wireheading. (I mentioned this variant in the second spoiler, for reference.)

This is a nice frame of the problem.

In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.

3Steve Byrnes
I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that. But it sounds like this will be the topic of Alex’s next essay. So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P

I think this is one proto-intuition for why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)

My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.

2Alex Turner
What about this post?

A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a "small part of the state space"

This example reminds me of a thing I have been thinking about, namely that it seems like optimization can only occur in cases where the optimization produces/is granted enough "energy" to control the level below. In this example, the model works in a quite literal way, as a nuclear bomb floods an area with energy, and I think this example generalizes to e.g. markets with Dutch disease.

Floo... (read more)

3Alex Flint
Yeah I think you're right. I have the sense that the pure algorithmic account of optimization -- that optimization is about algorithms that do search over plans using models derived from evidence to evaluate each plan's merit -- doesn't quite account for what an optimizer really is in the physical world. The thing is that I can implement some very general-purpose modelling + plan-search algorithm on my computer (for example, monte carlo versions of AIXI) and hook it up to real sensors and actuators and it will not do anything interesting much at all. It's too slow and unreflective to really work. Therefore, an object running a consequentialist computation is definitely not a sufficient condition for remote control as per John's conjecture, but perhaps it is a necessary condition -- that's what the OP is asking for a proof or disproof of.

The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached.

Not totally lost if the layer is e.g. a convolutional layer: while the pixels within the convolutional window can get arbitrarily scrambled, a convolutional layer cannot scramble things across different windows in different parts of the picture.
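A quick toy check of this claim (a sketch under the assumption of a plain PyTorch conv layer): a convolution can mix channels arbitrarily, but shifting the input just shifts the output, so spatial positions keep a privileged correspondence with the input rather than being scrambled across the picture.

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 3, 16, 16)
shifted = torch.roll(x, shifts=4, dims=-1)    # translate the image 4 pixels

out = conv(x)
out_shifted = conv(shifted)
# Away from the borders and the wrap seam, the output is the same map, just translated.
print(torch.allclose(torch.roll(out, shifts=4, dims=-1)[..., 5:-5],
                     out_shifted[..., 5:-5], atol=1e-5))   # True
```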

3Jacob Hilton
Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment. 

How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.

Are you saying "The AI says something which makes us erroneously believe it saved a person's life

... (read more)
4Alex Turner
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare the predicted alignment concerns in people with the observed alignment situation, and update accordingly. I've updated down hard on alignment difficulty when I've run this exercise in the past.
1. I don't see why this is worth granting the connotations we usually associate with "deception".
2. I think that if the AI just repeats "I saved someone's life" without that being true, we will find out and stop rewarding that?
   1. Unless the AI didn't just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it's already deceptive in a much worse way.
   2. But somehow the AI has to get to that cognitive state, first. I think it's definitely possible, but not at all clearly the obvious outcome.

That wasn't the part of the post I meant to point to. I was saying that just because we externally observe something we would call "deception/misleading task completion" (e.g. getting us to reward the AI for prosociality), does not mean that "deceptive thought patterns" get reinforced into the agent! The map is not the territory of the AI's updating process. The reward will, I think, reinforce and generalize the AI's existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don't necessarily have anything to do with explicit deception (as you noted). Is a donut "unaligned by default"? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I'm not assuming the model starts out deceptive, nor that it will become that with high

So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration.

One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to ha... (read more)

4Alex Turner
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment. Like, you said: So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons -- like inductive biases -- that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)

On the object level: Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"? If so -- that's not (necessarily) how credit assignment works. The AI's credit assignment isn't necessarily running along the lines of "people were deceived, so upweight computations which deceive people."
* Perhaps the AI thought the person would approve of that statement, and so it did it, and got rewarded, which reinforces the approval-seeking shard? (Which is bad news in a different way!)
* Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of "sometimes emit completions from the self-supervised pretrained world model." Then the AI reinforces this heuristic (among other changes from the gradient).

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).

So for instance, learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairl... (read more)

4Alex Turner
I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following? (A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)

I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.

2Steve Byrnes
I think I sorta disagree in the sense that high-functioning sociopaths live in the same society as neurotypical people, but don’t wind up “aligned”. I think the innate reward function is playing a big role. (And by the way, nobody knows what that innate human reward function is or how it works, according to me.)  That said, maybe the innate reward function is insufficient and we also need multi-agent dynamics. I don’t currently know. I’m sympathetic to your broader point, but until somebody says exactly what the rewards (a.k.a. “reinforcement events”) are, I’m withholding judgment. I’m open to the weaker argument that there are kinda dumb obvious things to try where we don’t have strong reason to believe that they will create friendly AGI, but we also don’t have strong reason to believe that they won’t create friendly AGI. See here. This is a less pessimistic take than Eliezer’s, for example.

What do power differentials have to do with the kind of mechanistic training story posited by shard theory? 

The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality. 

So let's figure out how to supply good reinforce... (read more)

But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.

3Alex Turner
Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.  But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.

I think if you pretrained it on all of YouTube, you could get explanations and illustrations of people doing basic tasks. I think this would (if used with appropriate techniques that can be developed on short notice) make it need very little data for basic tasks, because it can just interpolate from its previous experiences.

1Donald Hobson
Sure, probably. 

I don't think I've ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?

5Alex Turner
"Humans are nice because they were selected to be nice" -- non-mechanistic. "Humans are nice because their contextually activated heuristics were formed by past reinforcement by reward circuits A, B, C; this convergently occurs during childhood because of experiences D, E, F; credit assignment worked appropriately at that time because their abstraction-learning had been mostly taken care of by self-supervised predictive learning, as evidenced by developmental psychology timelines in G, H, I, and also possibly natural abstractions." -- mechanistic (although I can only fill in parts of this story for now) Although I'm not a widely-read scholar on what theories people have for human values, of those which I have read, most (but not all) are more like the first story than the second.

Could you elaborate?

One factor I think is relevant is:

Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over.

In fact, it is particularly in the case where you become disempowered that you would need the system's help, so you would probably weight this priority more strongly than would be implied by ... (read more)

2Logan Riggs Smith
Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.

Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.

A related point: humans don't maximize the reward specified by their limbic system, but can instead be modeled as a system of inner-optimizers that value proxies instead (e.g. most people wouldn't push a wirehead button if it killed a loved one). This implies that inner-optimizers that are not optimizing the base objective are good, meaning that inner-alignment & outer-alignment are not the right terms to use.

There are other mechanisms, and I believe it's imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.

The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.

Sure, if you've got some example of a mechanism for this that's likely to scale, it may be worthwhile. I'm just pointing out that a lot of people have already thought about mechanisms and concluded that the mechani... (read more)

I'm just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.

In my experience, researchers tend to stop at "But humans are hacky kludges" (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn't offer substantial evidence about alignment questions. This "humans as alignment-magic" or "the selection pressure down the street did it" view seems quite common (but not universa... (read more)

I think it can be worthwhile to look at those mechanisms; in my original post I'm just pointing out that people may have done so more than you'd naively think if you only consider whether their alignment approaches mimic the human mechanisms, because it's quite likely that they've concluded that the mechanisms they came up with for humans don't work.

Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seem to almost come "for free" with model-based agents.

3Logan Riggs Smith
On your first point, I do think people have thought about this before and determined it doesn't work. But from the post: Humans do display many, many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?

If you're convinced by them, then you'll understand the reaction of "Fuck, we've been wasting so much time and studying humans makes so much sense" which is described in this post (e.g. Turntrout's idea on corrigibility and the statement "I wrote this post as someone who previously needed to read it."). I'm stating here that me arguing "you should feel this way now before being convinced of specific mechanistic understandings" doesn't make sense when stated this way.

Link? I don't think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I'm open to being wrong.

To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.

Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates. 

Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

1Logan Riggs Smith
Ah, yes, I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point :) Could you elaborate?

I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize. E.g. the fact that different humans have relative similar levels of power to each other seems important; we aren't very aligned to agents much less powerful than us like animals, and I wouldn't expect a human who had been given all the power in the world all their life such that they've learned they can solve any conflict by destroying their opposition to be very aligned.

7Alex Turner
I think that past investigators didn't have good guesses of what the mechanisms are. Most reasoning about human values seems to be of the sort "look at how contextually dependent these 'values' things are, and the biases are also a huge mess, I doubt there are simple generators of these preferences", or "Evolution caused human values in an unpredictable way, and that doesn't help us figure out alignment." This reasoning is not about mechanisms. It is analogical. You might still believe the reasoning, and I think it's at least an a priori relevant observation, but let's call a spade a spade. This is analogical reasoning to AGI by drawing inferences from select observations (some humans don't care about less powerful entities) and then inferring that AGI will behave similarly.  (Edited this comment to reduce unintended sharpness)

I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.

I disagree both with this conclusion and the process that most people use to reach it. 

The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, then you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.

E.g., there a... (read more)

2Logan Riggs Smith
There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post: Humans can provide a massive amount of info on how highly intelligent systems value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real world things and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we're able to). I predict that a mechanistic understanding will enable the below knowledge: 
3Logan Riggs Smith
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans. Parents have way more power than their kids, and there exist some parents that are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.  If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learning systems.  I agree this is likely.

The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.

Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is... (read more)

If you can come up with a prior that can learn human preferences, why put that prior into a superintelligent agent instead of first updating it to match human preferences? It seems like the latter could be safer as one could then investigate the learned preferences directly, and as one then doesn't have to deal with it making mistakes before it has learned much.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

🤔 I wonder if I should talk with Tan Zhi-Xuan.

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.


I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly, so it is not helpful for understanding what I mean and is sort of also me assuming that adamShimi... (read more)

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothe

... (read more)
0Steve Byrnes
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation). I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.

Here's a hacky patch that doesn't entirely solve it, but might help:

Presumably for humans, the RPE/reward is somehow wired into the world-model, since we h... (read more)

If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:

Setting

We start with an unaligned benchmark:


* An architecture Mθ

<snip>

Goal

To solve ELK in this case we must:


* Supply a modified architecture Mθ+ which has the same inputs and outputs as Mθ <snip>

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially... (read more)

1Mark Xu
Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.

Yes, approximately. Thinking about how to get one question right might be a productive way to do research. However, if you have a strategy for answering 1 question right, it should also work for other questions.

The idea is that manipulation "overrides" the human policy regardless of whether that's good for the goal the human is pursuing (where the human goal presumably affects what  is selected). While here the override is baked into the dynamics, in realistic settings it occurs because the AI exploits the human decision-making process: by feeding them biased information, through emotional manipulation, etc.

I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipula... (read more)

2Alex Turner
I skip over those pragmatic problems because this post is not proposing a solution, but rather a measurement I find interesting. 

I like this post, it seems to be the same sort of approach that I suggested here. However, your proposal seems to have a number of issues; some of which you've already discussed, some of which are handled in my proposal, and some of which I think are still open questions. Presumably a lot of it is just because it's still a toy model, but I wanted to point out some things.

Starting here:

Definition: Corrigibility, formal.

Let  be a time step which is greater than . The policy-modification corrigibility of  from starting state

... (read more)
2Alex Turner
The biggest disconnect is that this post is not a proposal for how to solve corrigibility. I'm just thinking about what corrigibility is/should be, and this seems like a shard of it—but only a shard. I'll edit the post to better communicate that.  So, your points are good, but they run skew to what I was thinking about while writing the post.

Actually upon thinking further I don't think this argument works, at least not as it is written right now.

Imagine that policies decompose into two components, π=ρ⊗σ. For instance, they may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.

Suppose, for instance, that ρ is such that the policy just ends up acting in a completely random-twitching way. Technically σ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the fe... (read more)

1tailcalled
Actually upon thinking further I don't think this argument works, at least not as it is written right now.

Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do satisfy instrumental convergence, yes you are correct.

I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some... (read more)

2Alex Turner
As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that's what you're saying, maybe the functor Set -> Vect is something like the "Group to its group algebra over field k" functor. 
0tailcalled
Imagine that policies decompose into two components, π=ρ⊗σ. For instance, they may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component. Suppose, for instance, that ρ is such that the policy just ends up acting in a completely random-twitching way. Technically σ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the features f, σ is basically constant. This is a low power situation, and if one actually specified what f would be, then a TurnTrout-style argument could probably prove that such values of ρ would be avoided for power-seeking reasons. On the other hand, if ρ made the policy act like an optimizer which optimizes a utility function over the features of f with the utility function being specified by σ, then that would lead to a lot more power/injectivity. On the other hand, I wonder if there's a limit to this style of argument. Too much noninjectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would be hard to optimize?

🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter.

But then I realized that this introduces a ... (read more)

2Alex Turner
Note that we can get a u-AOH which mostly solves ABC-corrigibility:

u(history) := 0 if the disable action is taken in the history, else R(last state)

(Credit to AI_WAIFU on the EleutherAI Discord.) Where R is some positive reward function over terminal states. Do note that there isn't a "get yourself corrected on your own" incentive. EDIT: note that manipulation can still be weakly optimal.

This seems hacky; we're just ruling out the incorrigible policies directly. We aren't doing any counterfactual reasoning, we just pick out the "bad action."
0Charlie Steiner
I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules. But I don't know how to formalize this or if anyone else has.

Appendix: tracking key limitations of the power-seeking theorems

I want to say that there's another key limitation: 

Let  be a set of utility functions which is closed under permutation.

It seems like a rather central assumption to the whole approach, but in reality people seem to tend to specify "natural" utility functions in some sense (e.g. generally continuous, being functions of only a few parameters, etc.). I feel like for most forms of natural utility functions, the basic argument will still hold, but I'm not sure how far it generali... (read more)

3Alex Turner
Right, I was intending "3. [these results] don't account for the ways in which we might practically express reward functions" to capture that limitation.

As I understand it, Google's proposed model is a MoE model, and I've heard MoE models achieve poorer understanding for equivalent parameter count than classical transformer models do.

It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. (We also know relatively little about the Wu Dao series of competing multimodal sparse models.) Google papers refuse to name it when they use LaMDA, for unclear reasons (it's not like they're fooling anyone), but they've been doing interesting OA-like research with it: eg "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.

So I see two possible interpretations of traditional Dutch books:

I disagree; I don't think it's a simple binary thing. I'm not saying that Dutch book arguments in general never apply to recursive things; it's more that the recursion needs to be modelled in some way, and since your OP didn't do that, I ended up finding the argument confusing.

The standard dutch book arguments would apply to the imp. Why should you be in such a different position from the imp? 

I don't think your argument goes through for the imp, since it never needs to decide its act... (read more)

2Abram Demski
Replied.
2Abram Demski
But what does that look like? How should it make a difference? (This isn't a rhetorical question; I would be interested in a positive position. My lack of interest is, significantly, due to a lack of positive positions in this direction.)

Ah, true, but the imp will necessarily just make EDT-type predictions anyway. So the imp argument reaches a similar conclusion. But I'm not claiming the imp argument is very strong in any case; it's just an intuition pump.

This, again, seems plausible if the payoff is made sufficiently small.

How do you make the payoff small?

This is actually very similar to traditional Dutch-book arguments, which treat the bets as totally independent of everything.

Isn't your Dutch-book argument more recursive than standard ones? Your contract only pays out if you act, so the value of the Dutch book causally depends on the action you choose.

2Abram Demski
Sure, do you think that's a concern? I was noting the similarity in this particular respect (pretending that bets are independent of everything), not in all respects. Note, in particular, that traditional Dutch book arguments make no explicit assumption one way or the other about whether the propositions have to do with actions under the agent's control. So I see two possible interpretations of traditional Dutch books:
1. They apply to "recursive" stuff, such as things you have some influence over. For example, I can bet on a presidential election, even though I can also vote in a presidential election. In this case, what we have here is not weirder. This is the position I prefer.
2. They can't apply to "recursive" stuff. In this case, presumably we don't think standard probability theory applies to stuff we have influence over. This could be a respectable position, and I've seen it discussed. However, I don't buy it. I've seen philosophers answer this kind of thing with the following argument: what if you had a little imp on your shoulder, who didn't influence you in any way but who watched you and formed predictions? The imp could have probabilistic beliefs about your actions. The standard Dutch book arguments would apply to the imp. Why should you be in such a different position from the imp?

For example, multiply the contract payoff by 0.001.

Think of it this way. Making bets about your actions (or things influenced by your actions) can change your behavior. But if you keep the bets small enough, then you shouldn't change your behavior; the bets are less important than other issues. (Unless two actions are exactly tied, in terms of other issues.) I will concede that this isn't 100% convincing. Perhaps different laws of probability should apply to actions we can influence. OTOH, I'm not sure what laws those would be.

So the overall expectation is .

Wouldn't it be P(Act=a|do(buy B)) rather than P(Act=a)? Like my thought would be that the logical thing for CDT would be to buy the contract, and then as a result its expected utilities change, which leads to its probabilities changing, and as a result it doesn't want to sell the contract. I'd think this argument only puts a bound on how much CDT and EDT can differ, rather than on whether they can differ at all. Very possible I'm missing something though.

2Abram Demski
I agree with this, but I was assuming the CDT agent doesn't think buying B will influence the later decision. This, again, seems plausible if the payoff is made sufficiently small. I believe that there are some other points in my proof which make similar assumptions, which would ideally be made clearer in a more formal write-up. However, I think CDT advocates will not generally take this to be a sticking point.

The structure of my argument is to take a pre-existing scenario, and then add bets. For my argument to work, the bets need to be "independent" of critical things (causally and/or evidentially independent) -- in the example you point out, the action taken later needs to be causally independent of the bet made earlier (more specifically, causal-conditioning on the bet should not change beliefs about what action will be taken). This is actually very similar to traditional Dutch-book arguments, which treat the bets as totally independent of everything.

I could argue that it's just part of the thought experiment; if you concede that there could be a scenario like that, then you concede that CDT gets Dutch-booked. If you don't buy that, but you do buy Dutch books as a methodology more generally, then I think you have to claim there's some rule which forbids "situations like this" (so CDT has to think the bets are not independent of everything else, in such a way as to spoil my argument). I would be very interested if you could propose a sensible view like this. However, I think not: there doesn't seem to be anything about the scenario which violates some principle of causality or rationality. If you forbid scenarios like this, you seem to be forbidding a very reasonable scenario, for no good reason (other than to save CDT).