I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.
For the past few weeks, I've been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token x activation matrix for a given layer, and then restricti...
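A minimal sketch of the restriction step described here, assuming NumPy and using a random stand-in for the real activation matrix (the shapes and the cutoff k are arbitrary choices for illustration):

```python
import numpy as np

# Restrict analysis to the activation subspace actually visited on one prompt.
# A is a (tokens x hidden) activation matrix for a given layer; here a random
# stand-in, since the real one would come from a forward pass.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 64))              # 12 tokens, 64-dim activations
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 8                                      # keep the top-k on-manifold directions
P = Vt[:k].T @ Vt[:k]                      # projector onto the visited subspace
v = rng.normal(size=64)                    # e.g. a candidate steering vector
v_on_manifold = P @ v                      # its component inside that subspace
```

Projecting candidate directions through P is one way to force an analysis to stay within the span of activations the network actually produced for the prompt.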
In fact, PPO is essentially a tweaked version of REINFORCE,
Valid point.
Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver's RL course:
Critically though, neither Q, A, nor δ denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn't really maximize the sum of rewards, these other algorithms are attempts to more consistently do so, and the ...
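A toy illustration of the pattern, using a REINFORCE-style softmax bandit (the setup and hyperparameters are invented for the example): whatever estimate you plug in, here the raw return G, enters the update exactly like a per-sample learning-rate multiplier.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                        # logits for a 2-armed softmax policy
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    G = 1.0 if a == 1 else 0.0             # action 1 is the rewarded arm
    grad_logp = -p
    grad_logp[a] += 1.0                    # gradient of log pi(a) w.r.t. theta
    theta += lr * G * grad_logp            # G scales the step: the "x" multiplier
print(round(softmax(theta)[1], 3))         # close to 1: policy concentrates on arm 1
```

Swapping G for Q, A, or δ changes what the multiplier estimates, not where it sits in the update.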
I get that a lot of AI safety rhetoric is nonsensical, but I think your strategy of obscuring technical distinctions between different algorithms and implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.
...After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we
I was under the impression that PPO was a recently invented algorithm
Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why...
RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world).
This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That's what's primarily responsible for most "reward hacking"-esque results, such as the CoastRunne...
(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.
I don't confidently disagree with this statement, but it occurs to me that I haven't tried it myself and haven't followed it very closely, and have sometimes heard claims that there are promising methods.
A lot of people trying to come up with answers try to do it w...
I think discussions about capabilities raise the question "why create AI that is highly capable at deception etc.? seems like it would be safer not to".
The problem that occurs here is that some ways to create capabilities are quite open-ended, and risk accidentally creating capabilities for deception due to instrumental convergence. But at that point it feels like we are getting into the territory that is best thought of as "intelligence", rather than "capabilities".
Nice, I was actually just thinking that someone needed to respond to LeCun's proposal.
That said, I think you may have gotten some of the details wrong. I don't think the intrinsic cost module gets raw sensory data as input, but instead it gets input from the latent variables of the world model as well as the self-supervised perception module. This complicates some of the safety problems you suggest.
But I’m expecting that AGI will look like model-based RL, in which case, we don’t have to search for search, the search is right there in the human source code.
I think the key question is how much policy-caused variance in the outcome there is.
That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".
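The juice example can be sketched numerically (action names and the two-action setup are invented for illustration): the variance of the reinforcement signal across the plausible actions in a state is the quantity being pointed at.

```python
import numpy as np

# toy estimate of "policy-caused variance": how much reinforcement varies
# across the actions plausibly available in a given state
def reinforcement(action):
    return {"walk_to_juice": 1.0, "stay_put": 0.0}[action]

actions = ["walk_to_juice", "stay_put"]
returns = np.array([reinforcement(a) for a in actions])
policy_caused_variance = returns.var()     # high variance -> choices matter here
print(policy_caused_variance)              # 0.25
```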
If yo...
I don't think so.
Like technically yes, it shows that there is an internal optimization process that is running in the networks, but much of the meat of optimization such as instrumental convergence/power-seeking depends on the structure of the function one is optimizing over.
If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a "small part of the state space"
This example reminds me of a thing I have been thinking about, namely that it seems like optimization can only occur in cases where the optimization produces/is granted enough "energy" to control the level below. In this example, the model works in a quite literal way, as a nuclear bomb floods an area with energy, and I think this example generalizes to e.g. markets with Dutch disease.
Floo...
The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached.
Not totally lost if the layer is e.g. a convolutional layer, because while the pixels within the convolutional window can get arbitrarily scrambled, it is not possible for a convolutional layer to scramble things across different windows in different parts of the picture.
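A quick numerical check of the dense-layer half of this point: an orthogonal change of basis on the input can be absorbed into the weights without changing the output, so nothing privileges the original coordinates (the sizes here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)                     # input activations
W = rng.normal(size=(d, d))                # dense layer weights
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal basis change

y1 = W @ x                                 # original layer
y2 = (W @ Q.T) @ (Q @ x)                   # rotate inputs, absorb rotation into W
assert np.allclose(y1, y2)                 # identical outputs: no privileged basis
```

A convolution admits the analogous absorption only within each window's channel space, which is the restriction described above.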
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.
...Are you saying "The AI says something which makes us erroneously believe it saved a person's life
So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration.
One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to ha...
I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).
So for instance learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairl...
I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.
What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.
So let's figure out how to supply good reinforce...
I think if you pretrained it on all of YouTube, you could get explanations and illustrations of people doing basic tasks. I think this would (if used with appropriate techniques that can be developed on short notice) make it need very little data for basic tasks, because it can just interpolate from its previous experiences.
Could you elaborate?
One factor I think is relevant is:
Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over.
In fact, it is particularly in the case where you become disempowered that you would need the system's help, so you would probably weight this priority more strongly than would be implied by ...
The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Sure, if you've got some example of a mechanism for this that's likely to scale, it may be worthwhile. I'm just pointing out that a lot of people have already thought about mechanisms and concluded that the mechani...
I'm just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.
In my experience, researchers tend to stop at "But humans are hacky kludges" (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn't offer substantial evidence about alignment questions. This "humans as alignment-magic" or "the selection pressure down the street did it" view seems quite common (but not universa...
I think it can be worthwhile to look at those mechanisms. In my original post I'm just pointing out that people may have considered them more than you'd naively think from whether their alignment approaches mimic the human mechanisms, because it's quite likely that they've concluded that the mechanisms they came up with for humans don't work.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come "for free" with model-based agents.
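A toy sketch of the "for free" point: a model-based planner whose utility is defined over the *predicted world state*, not over the sensor or reward signal. The state encoding, action names, and dynamics here are all invented for illustration, not anyone's proposed design.

```python
# predicted (diamond_present, sensor_reads_present) after each action
def world_model(action):
    if action == "protect":
        return (True, True)
    if action == "tamper":        # rewire the sensor; the diamond is lost
        return (False, True)      # sensor still reads "present"
    raise ValueError(action)

def utility(state):
    diamond_present, _sensor = state
    return 1.0 if diamond_present else 0.0   # graded on the modeled world, not the sensor

best = max(["protect", "tamper"], key=lambda a: utility(world_model(a)))
print(best)  # "protect": tampering looks bad because the model tracks the real state
```

Because the planner evaluates its own world model's latent state, sensor tampering doesn't raise its assessed utility; that is the sense in which the wireheading resistance comes nearly for free.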
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.
I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize. E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren't very aligned to agents much less powerful than us like animals, and I wouldn't expect a human who had been given all the power in the world all their life, such that they've learned they can solve any conflict by destroying their opposition, to be very aligned.
I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.
I disagree both with this conclusion and the process that most people use to reach it.
The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, then you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.
E.g., there a...
The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is...
If you can come up with a prior that can learn human preferences, why put that prior into a superintelligent agent instead of first updating it to match human preferences? It seems like the latter could be safer as one could then investigate the learned preferences directly, and as one then doesn't have to deal with it making mistakes before it has learned much.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
🤔 I wonder if I should talk with Tan Zhi-Xuan.
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly, so it is not helpful for understanding what I mean and is sort of also me assuming that adamShimi...
...Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis...
Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.
Here's a hacky patch that doesn't entirely solve it, but might help:
Presumably for humans, the RPE/reward is somehow wired into the world-model, since we h...
If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:
Setting
We start with an unaligned benchmark:
* An architecture Mθ<snip>
Goal
To solve ELK in this case we must:
* Supply a modified architecture Mθ+ which has the same inputs and outputs as Mθ <snip>
Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially...
The idea is that manipulation "overrides" the human policy regardless of whether that's good for the goal the human is pursuing (where the human goal presumably affects what is selected). While here the override is baked into the dynamics, in realistic settings it occurs because the AI exploits the human decision-making process: by feeding them biased information, through emotional manipulation, etc.
I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipula...
I like this post, it seems to be the same sort of approach that I suggested here. However, your proposal seems to have a number of issues; some of which you've already discussed, some of which are handled in my proposal, and some of which I think are still open questions. Presumably a lot of it is just because it's still a toy model, but I wanted to point out some things.
Starting here:
...Definition: Corrigibility, formal.
Let t′ be a time step which is greater than t. The policy-modification corrigibility of π from starting state s…
Actually upon thinking further I don't think this argument works, at least not as it is written right now.
Imagine that policies decompose into two components, π = (π₁, π₂). For instance, they may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.
Suppose, for instance, that π₁ is such that the policy just ends up acting in a completely random-twitching way. Technically π₂ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the fe...
Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do satisfy instrumental convergence, yes you are correct.
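The conversion in the first sentence can be sketched concretely, with a tiny invented Markov chain standing in for "a model for measuring the utility of a policy": a utility function over states induces a utility function over policies via expected rollout value.

```python
import numpy as np

# P[action] is a transition matrix over states 0, 1, 2 (toy dynamics)
P = {
    "left":  np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float),
    "right": np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float),
}
U_state = np.array([0.0, 0.0, 1.0])        # utility function over states

def policy_utility(policy, start=0, horizon=5):
    """Induced utility over policies: expected state-utility summed over a rollout."""
    dist = np.eye(3)[start]                # distribution over states
    total = 0.0
    for t in range(horizon):
        dist = dist @ P[policy(t)]
        total += dist @ U_state
    return total

always_right = lambda t: "right"
always_left = lambda t: "left"
print(policy_utility(always_right) > policy_utility(always_left))  # True
```

The induced utility over policies inherits instrumental convergence from the state utility, which is the correctness claim being conceded here.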
I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some...
🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter.
But then I realized that this introduces a ...
Appendix: tracking key limitations of the power-seeking theorems
I want to say that there's another key limitation:
Let U be a set of utility functions which is closed under permutation.
It seems like a rather central assumption to the whole approach, but in reality people seem to tend to specify "natural" utility functions in some sense (e.g. generally continuous, being functions of only a few parameters, etc.). I feel like for most forms of natural utility functions, the basic argument will still hold, but I'm not sure how far it generali...
It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. (We also know relatively little about the Wu Dao series of competing multimodal sparse models.) Google papers refuse to name it when they use LaMDA, for unclear reasons (it's not like they're fooling anyone), but they've been doing interesting OA-like research with it: eg "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.
So I see two possible interpretations of traditional Dutch books:
I disagree, I don't think it's a simple binary thing. I don't think Dutch book arguments in general never apply to recursive things, but it's more just that the recursion needs to be modelled in some way, and since your OP didn't do that, I ended up finding the argument confusing.
The standard dutch book arguments would apply to the imp. Why should you be in such a different position from the imp?
I don't think your argument goes through for the imp, since it never needs to decide its act...
This, again, seems plausible if the payoff is made sufficiently small.
How do you make the payoff small?
This is actually very similar to traditional Dutch-book arguments, which treat the bets as totally independent of everything.
Isn't your Dutch-book argument more recursive than standard ones? Your contract only pays out if you act, so the value of the Dutch book causally depends on the action you choose.
So the overall expectation is [formula omitted].
Wouldn't it be P(Act=a|do(buy B)) rather than P(Act=a)? Like my thought would be that the logical thing for CDT would be to buy the contract and then as a result its expected utilities change, which leads to its probabilities changing, and as a result it doesn't want to sell the contract. I'd think this argument only puts a bound on how much cdt and edt can differ, rather than on whether they can differ at all. Very possible I'm missing something though.
The singular vectors of the Jacobian between two layers seems more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.
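For concreteness, here is what that between-layer Jacobian SVD looks like for a single tanh layer, where the Jacobian has a closed form (the shapes and weights are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                # weights mapping layer k -> layer k+1
x = rng.normal(size=4)                     # activations at layer k
h = np.tanh(W @ x)
J = (1 - h**2)[:, None] * W                # Jacobian dh/dx for h = tanh(Wx)
U, S, Vt = np.linalg.svd(J)
# rows of Vt: input directions at layer k, ordered by how strongly they move
# layer k+1 — including directions that don't (yet) change the final probabilities
```

Unlike the objective-function Hessian, nothing here is mediated by the loss, so directions that only matter at finite perturbation scale still show up with large singular values.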
Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy ...