All of Ofer's Comments + Replies

Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?

* [EDIT: and full-scale models in earlier stages of the training process]

2Hjalmar Wijk
ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.

Have you discussed this point with other relevant researchers before deciding to publish this post? Is there a wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

2Rohin Shah
It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

  1. Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
  2. Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorizatio... (read more)

2Rohin Shah
You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data". In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?

1Ethan Perez
Here, I think we'll want to look for suspicious changes in the log-likelihood trends. E.g., it's a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverse at some level of model scale.

Sorry, that text does appear in the linked page (in an image).

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]Reply
1Ofer
Sorry, that text does appear in the linked page (in an image).

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.

That consideration seems relevant only for language models that will be doing/supporting alignment work.

Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.

The relevant texts I'm thinking about here are:

  1. Descriptions of certain tricks to evade our safety measures.
  2. Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).

Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?

4Richard Ngo
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.

I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):

  1. There is large distance in model space between models that behave very differently.
  2. Takeoff will be slow.
  3. It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.

I bet more on scenarios where we get AGI when politics is very different compared to today.

I agree that just before... (read more)

Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).

Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?

It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

0Evan R. Murphy
That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between say, GPT-3 and the classic paperclip maximizer or other tiling AI. This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.

Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for m... (read more)

Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".

How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?

Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?

To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.

We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties ab... (read more)

I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:

Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?

5Rohin Shah
Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard. It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better. Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system. For failures of alignment, I'd expect that before you get an AI system that can break out of the training process and kill you, you get an AI system that can break out of deployment and kill you, because there's (probably) less monitoring during deployment. You're also just running much longer during deployment -- if an AI system is waiting for the right opportunity, then even if it is equally likely to happen for a training vs deployment input (i.e. ignoring the greater monitoring during training), you'd still expect to see it happen at deployment since >99% of the inputs happen at deployment.

The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In which case each piece of logic "ruins" a different critical activation value).

2Richard Ngo
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
1Not Relevant
I confess I’m confused as to how the network gradient hacks to create these redundant mechanisms in the first place? Since one of them seems completely useless unless designed deliberately.

Regarding the following part of the view that you commented on:

But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI.

Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a... (read more)

To make sure I understand your notation, is some set of weights, right? If it's a set of multiple weights I don't know what you mean when you write .

There should also exist at least some f1,f2 where C(f_1,f_1)≠C(f_2,f_2), since otherwise C no longer depends on the pair of redundant networks at all

(I don't yet understand the purpose of this claim, but it seems to me wrong. If for every , why is it true that does not depend on and when ?)

But gradient descent doesn’t modify a neural network one weight at a time

Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.

This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example.

I wouldn't say that this work it attempting to "exhibit gradient hacking". (Succeeding in that would require to create a model that can actually model SGD.) Rather, my understanding is that this work is trying to demonstrate techniques that might be used in a gradient hacking scenario.

There are a few ways to protect a subnetwork from being modified by gradient descent that I can think

... (read more)
0Stella Biderman
You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.
1leogao
The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights (f,g) completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking.  I don't think redundancy will work. Suppose you have some continuous everywhere, differentiable countably-almost everywhere combining function y=C(f1,f2) that takes the outputs from two redundant copies of f and outputs some kind of combined output. (If you're allowed functions that don't meet the "continuous everywhere, differentiable countably-almost everywhere" requirement, you might as well just skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients to f1 and f2 when they are equal, then it must be that at all points where f1=f2, ∂y∂f1=∂y∂f2=0. There should also exist at least some f1,f2 where C(f1,f1)≠C(f2,f2), since otherwise C no longer depends on the pair of redundant networks at all which means that those networks can't actually affect what the network does which defeats the whole point of this in the first place. Let us then define g(x)=C(x,x). Then, dgdx=∂y∂f1+∂y∂f2=0 for all x. This implies that g is a constant function. Therefore, there do not exist f1,f2 where C(f1,f1)≠C(f2,f2). This is a contradiction, and therefore C cannot exist.

In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so that (a) any time one of them is shot with a Nerf gun, they will flop over and pretend to be a corpse, until the heist has been resolved, and (b) their probability of resisting against Nerf guns will be the same as the probability of resisting against actual guns. In this case the bandits might as well use Nerf guns (perhaps be

... (read more)
4Rohin Shah
Yeah, this seems right. I'll note though that you may want to make it at least somewhat easier to make the new threat, so that the other player has an incentive to use the new threat rather than the old threat, in cases where they would have used the old threat (rather than being indifferent between the two). This does mean it is no longer a Pareto improvement, but it still seems like this sort of unilateral commitment can significantly help in expectation.
2Vojtech Kovarik
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid). In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular runs. And suppose this incentivizes the bandits to avoid regular guns. Then you are incentivized to self-modify to start resisting more. (EG, if you both use CDT and the "logical time" is "self modify?" --> "credibly commit?" --> "use nerf?" .) However, if the bandits realize this --- i.e., if the "logical time" is "use nerf?" --> "self modify?" --> "credibly commit?" --- then the bandits will want to not use nerf guns, forcing you to not self-modify. And if you each think that you are "logically before" the other party, you will make incompatible comitments (use regular guns & self-modify to resist) and people get shot with regular guns. So, I agree that credible unilateral commitments can be useful and they can lead to guaranteed Pareto improvements. It's just that I don't think it addresses my main objection against the proposal. Yup, I fully agree.

But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:

  1. The agent’s commitment to carrying out gradient hacking is reduced.
  2. The agent’s ability to notice changes implemented by gradient descent is reduced.

In a gradient hacking scenario, we should expect the malicious conditionally-fail-on-purpose logic to be optimized for such outcomes not to occur. For example, the malicious logic may ... (read more)

2Richard Ngo
What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.
2[comment deleted]

That quote does not seem to mention the "stochastic sensitivity issue". In the post that you linked to, "(3)" refers to:

  1. Not all environments have the right symmetries
    • But most ones we think about seem to

So I'm still not sure what you meant when you wrote "The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads."

(Again, I'm not aware of any previous mention of the "stochastic sensitivity issue" other than in my comment here.)

Thanks for the figure. I'm afraid I didn't understand it. (I assume this is a gridworld environment; what does "standing near intact vase" mean? Can the robot stand in the same cell as the intact vase?)

&You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.

I don't follow (To be clear, I was not trying to apply any theorem from the paper via that involution). But does this mean you are NOT making that claim ("most agents wil... (read more)

The claim should be: most agents will not immediately break the vase.

I don't see why that claim is correct either, for a similar reason. If you're assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment.

2Alex Turner
I‘m not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but not relevant for the intuition).    &You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning. 

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

I noticed that after my previous comment you've edited your comment to include the page number and the link. Thanks.

I still couldn't find in the paper (top of page 9) an explanation for the "stochastic sensitivity issue". Perhaps you were referring to the following:

randomly generat

... (read more)
2Alex Turner

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

Also, this claim is missing the "disjoint requirement" and so it is incorrect even without the "they show that" part (i.e. it's not just that the theorem... (read more)

Thanks.

We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state's representation. For every reward function that "wants to preserve the vase" we can apply on it the involution and get a reward function that "wants to break the vase".

(And there are the reward functions that are indifferent about the vase which the involution map to themselves.)

2Alex Turner
Gotcha. I see where you're coming from.  I think I underspecified the scenario and claim. The claim wasn't supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.  If the agent has a choice between one action ("break vase and move forwards") or another action ("don't break vase and more forwards"), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states.  So I think we're actually both making a correct point, but you're making an argument for γ=1 under certain kinds of models and whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates. [Edited to reflect the ancestor comments]

The phenomena you discuss are explainted in the paper, and in other posts, and discussed at length in other comment threads.

I haven't found an explanation about the "stochastic sensitivity issue" in the paper, can you please point me to a specific section/page/quote? All that I found about this in the paper was the sentence:

Our theorems apply to stochastic environments, but we present a deterministic case study for clarity.

(I'm also not aware of previous posts/threads that discuss this, other than my comment here.)

I brought up this issue as a demons... (read more)

Are you saying that my first sentence ("Most of the reward functions are either indifferent about the vase or want to break the vase") is in itself factually wrong, or rather the rest of the quoted text?

2Alex Turner
The first sentence

So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

I think the quoted description is not at all what the theorems in the paper show, no matter what concept the word "options" (in scare quotes) refers to. In order to apply the theorems we need to show that an involution with certain properties exist; not that <some set of things after action 1> is larger than <some set of things after action 2>.

To be more specific, the concept that the word "options" refers to here is ... (read more)

5Alex Turner
You're being unhelpfully pedantic. The quoted portion even includes the phrase "As a quick summary (read the paper and sequence if you want more details)"! This reads to me as an attempted pre-emption of "gotcha" comments.   The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

That is not what the theorems in the paper show at all (it's not just a matter of details). The relevant theorems require a much stronger and more com... (read more)

0Ofer
Also, this claim is missing the "disjoint requirement" and so it is incorrect even without the "they show that" part (i.e. it's not just that the theorems in the paper don't show the thing that is being claimed, but rather the thing that is being claimed is incorrect). Consider the following example where action 1 leads to more "options" but most optimal policies choose action 2:
2Alex Turner
The point of using scare quotes is to abstract away that part. So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

How does such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current M... (read more)

2Evan Hubinger
Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

I still don't see how this works. The "small constant" here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it's not a constant; it's an unbounded integer.

Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward func... (read more)

5Rohin Shah
Yes, good point. I retract the original claim. (You're right, what you do here is you search for the kth permutation satisfying the theorem's requirements, where k is the specified number.)
2Alex Turner
We aren't talking about MDPs, we're talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title. I don't follow. See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if takes some more bits. But IMO the basic idea in this case is, you can construct reasonably simple utility functions like "utility 1 if history has the agent taking action ai at time step t given action-observation history prefix h1:t−1, and 0 otherwise." This is reasonably short, and you can apply it for all actions and time steps.  Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively - but not qualitatively affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]

They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.

I don'... (read more)

2Rohin Shah
You're right, in my previous comment it should be "at most a small constant more complexity + O(logN)", to specify the number of times that the permutation has to be applied. I still think the upshot is similar; in this case it's "power-seeking is at best a little less probable than non-power-seeking, but you should really not expect that the bound is tight".

The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible in their platform seem like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.

2Rohin Shah
I don't trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).

I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").

Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment

This post uses the phrase "Bostrom's original instrumental convergence thesis". I'm not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper... (read more)

1Charlie Steiner
Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.
2Alex Turner
Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.) (Edited post to clarify)

Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can... (read more)

2Alex Turner
This is factually wrong BTW. I had just explained why the opposite is true.

That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it.

Probably not an important point, but I don't see why we can't use the identity isomorphism (over the part of the state space that a1 leads to).

2Rohin Shah
This particular argument is not talking about farsightedness -- when we talk about having more options, each option is talking about the entire journey and exact timesteps, rather than just the destination. Since all the "journeys" starting with the S --> Z action go to Z first, and all the "journeys" starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that ϕ(T(S,a1))=T(S,a2). (What assumption does this correspond to in the theorem? In the theorem, the involution has to map Fa to a subset of Fa′; every possibility in Fa1 starts with A, and every possibility in Fa2 starts with Z, so you need to map A to Z.)

I was referring to the claim being made in Rohin's summary. (I no longer see counter examples after adding the assumption that "a1 and a2 lead to disjoint sets of future options".)

(we’re going to ignore cases where a1 or a2 is a self-loop)

I think that a more general class of things should be ignored here. For example, if a2 is part of a 2-cycle, we get the same problem as when a2 is a self-loop. Namely, we can get that most reward functions have optimal policies that take the action a1 over a2 (when the discount rate is sufficiently close to 1), which contradicts the claim being made.

2Rohin Shah
Thanks for the correction. That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn't need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the "no cycles" caveat. The counterexample is: Z <--> S --> A --> B With every state also having a self-loop. In this case, the involution {S:B, Z:A}, would suggest that S --> Z would have more options than S --> A, but optimal policies will take S --> A more often than S --> Z. (The theorem effectively says "no cycles" by conditioning on the policy being S --> Z or S --> A, in which case the S --> Z --> S --> S --> ... possibility is not actually possible, and the involution doesn't actually go through.) EDIT: I've changed to say that the actions lead to disjoint parts of the state space.
2Alex Turner
I don't understand what you mean. Nothing contradicts the claim, if the claim is made properly, because the claim is a theorem and always holds when its preconditions do. (EDIT: I think you meant Rohin's claim in the summary?) I'd say that we can just remove the quoted portion and just explain "a1 and a2 lead to disjoint sets of future options", which automatically rules out the self-loop case. (But maybe this is what you meant, ofer?)

Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?

Optimal policies will tend to avoid breaking the vase, even though some don't. 

Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?

This is just making my point - Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then optimal policies tend to end up in D1 instead of D2. Most optimal policies will avoi

... (read more)
2Alex Turner
Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives. Right, good question. I'll explain the general principle (not stated in the paper - yes, I agree this needs to be fixed!), and then answer your question about your environment. When the agent maximizes average reward, we know that optimal policies tend to seek power when there's something like:  "Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1." (This follows by combining proposition 6.12 and theorem 6.13) From start, let a1 go to candy and let a2 go to wait. This satisfies the criterion: Wait! tends to be optimal and it tends to have more POWER, as expected. Let's reconsider your example: Let the discount rate equal 1. If action a1 goes to the right and a2 stays put at state 1, then the cycle sets are not disjoint (that's the violated condition). Prop 6.12 correctly tells us that state 1 tends to have more POWER than state 2, but theorem 6.13 tells us that average-optimal policies tend to go to the right. This highlights how power-seeking and optimality probability can come apart.   Again, I very much agree that this part needs more explanation. Currently, the main paper has this to say:  Throughout the paper, I focused on the survival case because it automatically satisfies the above criterion (death is definitionally disjoint from non-death, since we assume you can't do other things while dead), without my having to use limited page space explaining the nuances of this criterion.  Yes, although SafeLife requires a bit of squinting (as I noted in the

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)

I think this refers to the following passage from the paper:

Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options.

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

Regarding your next bullet point:

  • Pac-Man (fig
... (read more)
7Alex Turner
There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don't.  This is just making my point - average-optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then average-optimal policies tend to end up in D1 instead of D2. Most average-optimal policies will avoid entering the final state, just as section 7 claims. (EDIT: Blackwell -> average-) (And I claim that the whole reason you're able to reason about this environment is because my theorems apply to them - you're implicitly using my formalisms and frames to reason about this environment, while seemingly trying to argue that my theorems don't let us reason about this environment? Or something? I'm not sure, so take this impression with a grain of salt.) Why is it interesting to prove things about this set of MDPs? At this point, it feels like someone asking me "why did you buy a hammer - that seemed like a waste of money?". Maybe before I try out the hammer, I could have long debates about whether it was a good purchase. But now I know the tool is useful because I regularly use it and it works well for me, and other people have tried it and say it works well for them.  I agree that there's room for cleaner explanation of when the theorems apply, for those readers who don't want to memorize the formal conditions. But I think the theory says interesting things because it's already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I'm almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is "interesting enough." 

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?).

I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim.

Maybe I just missed ... (read more)

I haven't seen the paper support that claim.

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)
  • Pac-Man (figure 8)
    • And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
  • Average-optimal robots not idling in a particular spot (beginning of section 7)

This post supports the claim with:

  • Tic-Tac-Toe
  • Vase gridworld
  • SafeLife

So yes, this is sufficient support for speculation that most relevant environments have these symmetries. 

Maybe I just missed it, but I

... (read more)

No worries, thanks for the clarification.

[EDIT: the confusion may have resulted from me mentioning the LW username "adamShimi", which I'll now change to the display name on the AF ("Adam Shimi").]

Load More