All of michaelcohen's Comments + Replies

I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.

I agree with this in a sense, although I may be quite a bit more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do ... (read more)

What is process-based RL?

I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the pote... (read more)

2Matthew Barnett
I think this comparison is imperfect. Standard economic models predict an acceleration in the growth rate by at least an order of magnitude, and usually more. Over one decade, an increase in economic capacity by 1-4 orders of magnitude seems probable. By contrast, my understanding was that the models of open borders roughly predict a one-time doubling of world GDP over several decades, and for housing, it's something like a 50% increase in GDP over decades. Perhaps a better way to put this is that if AI is developed anywhere, even in a small country, that country could soon (within 10 years) grow to be the world's foremost economic power. Nothing comparable seems true for other policies. There only really needs to be one successful defecting nation for this coordination to fall apart.
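
A rough numerical rendering of this contrast (the horizon lengths below are illustrative assumptions, not figures from the comment):

```python
# Illustrative arithmetic for the growth comparison above.
# Assumed horizons (my assumptions): 10 years for the AI scenario,
# 30 years for the "several decades" open-borders / housing scenarios.

def annual_rate(total_multiplier: float, years: int) -> float:
    """Constant annual growth rate implied by a total multiplier over `years`."""
    return total_multiplier ** (1 / years) - 1

print(f"AI, 1 order of magnitude in 10y:  {annual_rate(10, 10):.1%}/yr")      # ~25.9%/yr
print(f"AI, 4 orders of magnitude in 10y: {annual_rate(10_000, 10):.1%}/yr")  # ~151.2%/yr
print(f"Open borders, 2x in 30y:          {annual_rate(2, 30):.1%}/yr")       # ~2.3%/yr
print(f"Housing, 1.5x in 30y:             {annual_rate(1.5, 30):.1%}/yr")     # ~1.4%/yr
```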

By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
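
A minimal sketch of that distinction (the `environment` and `overseer` interfaces here are hypothetical stand-ins, not any particular system):

```python
# Sketch of the reward-assignment difference described above (interfaces hypothetical).

def outcome_based_reward(action, environment, overseer):
    """Reward depends on what happens after the action is executed."""
    outcome = environment.execute(action)
    return overseer.evaluate_outcome(outcome)

def process_based_reward(action, justification, overseer):
    """Reward depends only on the overseer's evaluation of the proposed action
    (possibly after reading a justification or debate about it); the action's
    real-world consequences never enter the reward computation."""
    return overseer.evaluate_proposal(action, justification)
```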

I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, t... (read more)

I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.

I agree whole-heartedly with the first sentence. I'm not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a differen... (read more)

3Paul Christiano
It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews). It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not. So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model. You could try to do distillation with imitation learning. This is way more likely to be competitive than with no distillation at all. But it still seems like it has a very good chance of being uncompetitive because the imitation objective significantly impairs performance and creates all kinds of artifacts. Using process-based RL for distillation seems like it has essentially the same safety profile as using imitation learning, while avoiding the obvious pathologies and having a much higher probability of being competitive. (People give various reasons that RL in the distillation step is less safe than imitation learning in the distillation step, but so far I haven't found anything at all persuasive.) I think there's still a good chance that process-based RL in the distillation step still can't be competitive and so you need to talk about how to develop new techniques or prudently incorporate outcomes. But I think it's at least much more likely to be competitive than CoT-only, or imitation learning in the distillation step. (Perhaps it cuts down the probability of deal-breaking uncompetitiveness by 30%, compared to using imitation learning alone for distillation.)

Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

You: No it doesn't. They just care about interestingness.

Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?

You:  Yes, but...

If you can admit that we agree on this basic point, I'm happy to discuss further about how good they are at what they aim to do.

1: If retractions were common, surely you would have said that was evidence peer review didn't accomplish much! If academics were only equally ... (read more)

Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.

"As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting."

Strongly agree - ... - Strongly disagree

"As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren't fully supported by the analysis."

Strongly agree - ... - Strongly disagree

1mwacksen
No, no more than I would bet on a survey of <insert religious group here> whether they think <religious group> is more virtuous than <non-religious group>. Academics may claim that peer review is to check validity but their actions tell a different story. This is especially true in "hard" fields like mathematics where reviewers may even struggle to follow an argument, let alone check its validity. Given that most papers are never read by others, this is really not a big deal though. But I'll offer three further arguments for why I don't think peer review ensures validity. Argument 1: a) Humans (including reviewers) make mistakes all the time, but b) Retractions/corrections in papers are very rare. Unless academics are better at spotting mistakes immediately when reviewing than everyone else (they are not), we should expect lots of peer-reviewed articles to therefore have mistakes because invalid papers rarely get retracted. Argument 2: Computer science papers don't always include reproducible software, but checking code would absolutely be required to check validity. Argument 3: It is customary to submit papers that are rejected by one journal to another journal. This means that articles that fail "peer review" at one journal can obtain "peer review" at a different journal. PS: For CS it's harder to check "validity", but here's how papers replicate in other fields: https://fantasticanachronism.com/2021/11/18/how-i-made-10k-predicting-which-papers-will-replicate/

I don't understand. Importantly, these are optimistically biased, and you can't assume my true credences are this high. I assign much less than 90% probability to C. But still, they're perfectly consistent. M doesn't say anything about succeeding--only being allowed. M is basically asking: of the places he'd be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.

Thank you very much for saying that.

I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.

On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed ... (read more)

The assumption says "will do" not "will be able to do".  And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engaged with the objection that reward is not the optimization target under this section.

In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game

... (read more)

Peer review is not a certification of validity,

Do you think the peer reviewers and the editors thought the argument was valid?

Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

0mwacksen
No it doesn't. It's hard to say what the "aims" of peer-review are, but "ensuring validity" is certainly not one of them. As a first approximation, I'd say that peer-review aims to certify that the author is not an obvious crank, and that the argument being made is an interesting one to someone in the field.

Thank you. I've changed the title.

Not trying to be arrogant. Just trying to present readers who have limited time with a quickly digestible bit of evidence about the likelihood that the argument is a shambles.

5Charles Foster
It didn't strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.

Thank you for this review! A few comments on the weaknesses of my paper.

In particular, it explicitly says the argument does not apply to supervised learning.

Hardly a weakness if supervised learning is unlikely to be an existential threat!

Strength: Does not make very concrete assumptions about the AGI development model.

Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.

The fact that the argument holds equally well no matter ... (read more)

2Zachary Kenton
Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review - when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects -- rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I'll add a clarification of that at the top. Now to your comments: To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes it harder to generate and prioritize research projects. It could be more helpful to say more explicitly, or earlier in the article, what kind of systems you're considering, perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model is like. On your 2nd point: What I meant was more "what about goal misgeneralization? Wouldn't that mean the agent is likely to not be wireheading, and pursuing some other goal instead?" - you hint at this at the end of the section on supervised learning but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and settled on being agnostic there. On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level. Or some detail on the steps for how it builds the technology to gather the energy, or how it would convert that into defence.

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior.

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good gu... (read more)

Yeah, seems about right.

I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too.

I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to w... (read more)

2Paul Christiano
By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"? I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence? If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior." Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization. With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don't really see why you are splitting them apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn. But it's also worth noting that (6') gets to largely skip (4') if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by some program that scans across possible worlds looking for it, doing the same thing that "predictor that influences the future" is doing in (6)).

Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)

I’m talking about the weight of an anthropically updated prior within the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to enc... (read more)

3Paul Christiano
Here's my current understanding of your position:

1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying "Run the following Turing machine, then pick an important decision from within it." Let's say the complexity of that specification is N bits.
2. You think that if consequentialists dedicate some fraction of their resources to doing something that's easy for the universal prior to output, it will still likely take more than N bits or not much less.
3. [Probably] You think the differences may be small enough that they can be influenced by factors of 1/1000 or 1/billion (i.e. 10-30 bits) of improbability of consequentialists spending significant resources in this task.
4. [Probably] You think the TM-definition update (where the manipulators get to focus on inductors who put high probability on their own universe) or the philosophical sophistication update (where manipulators use the "right" prior over possible worlds rather than choosing some programming language) are small relative to these other considerations.

I think the biggest disagreement is about 1+2. It feels implausible to me that "sample a data stream that is being used by someone to make predictions that would be valuable to manipulate" is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)

I think we're probably on the same page now, but I'd say: the consequentialists can also sample from the "important predictions" prior (i.e. the same thing as that fragment of the universal prior). If "sample output channel controlled by consequentialists" has higher probability than "Sample an important prediction," then the consequentialists control every important prediction. If on the other hand "Sample an important p

I'm using some of the terminology I suggested here.

A factoring is a set of questions such that each signature of possible answers identifies a unique element. In 20 questions, you can tailor the questions depending on the answers to previous questions, and ultimately each element will have a bitstring signature depending on the history of yesses and nos. I guess you can define the question to include xors with previous questions, so that it effectively changes depending on the answers to others. But it's sometimes useful that the bitstrings are allowed to ... (read more)

6Scott Garrabrant
I think that the answers to both the concern about 7 elements, and the desire to have questions depend on previous questions come out of thinking about FFS models, rather than FFS. If you want to have 7 elements in Ω, that just means you will probably have more than 7 elements in S. If I want to model a situation where some questions I ask depend on other questions, I can just make a big FFS that asks all the questions, and have the model hide some of the answers.  For example, let's say I flip a biased coin, and then if heads I roll a biased 6 sided die, and if tails I roll a biased 10 sided die. There are 16 outcomes in Ω.  I can build a 3 dimensional factored set 2x6x10, which I will imagine as sitting on my table with height 2. Heads is on the bottom, and tails is on the top. f:S→Ω will then merge together the rows on the bottom, and the columns on the top, so it will look a little like the game Jenga. In this way, I am imagining there is some hidden data about each world in which I get heads and roll the 6 sided die, which is the answer to the question "what would have happened if I rolled the 10 sided die?" Adding in all this counterfactual data gives a latent structure of 120 possible worlds, even though we can only distinguish 16 possible worlds.
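
A small computational rendering of this example (the code and naming are mine): S is the 2×6×10 factored set of 120 latent worlds, Ω is the 16 distinguishable outcomes, and f : S → Ω hides the counterfactual die roll.

```python
from itertools import product

# Scott's example: flip a biased coin; roll a 6-sided die on heads, a 10-sided die on tails.
# The factored set S has 2*6*10 = 120 elements; each element also records what the
# *other* die would have shown (the hidden, counterfactual data).
S = list(product(["heads", "tails"], range(1, 7), range(1, 11)))

def f(s):
    """Map a 'possible world' in S to the distinguishable outcome in Omega."""
    coin, d6, d10 = s
    return (coin, d6) if coin == "heads" else (coin, d10)

Omega = {f(s) for s in S}
print(len(S), len(Omega))  # 120 16
```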

I was thinking of some terminology that might make it easier to thinking about factoring and histories and whatnot.

A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.

Then a factoring is a set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set o... (read more)
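
As a toy rendering of the signature idea (example and check are mine, using the requirement that every signature picks out exactly one element):

```python
from itertools import product

# A factoring is a set of partitions ("questions") such that each possible signature
# of answers identifies exactly one element (every combination occurs, with no collisions).
elements = ["ax", "ay", "az", "bx", "by", "bz"]

questions = [
    lambda e: e[0],  # "what is the first character?"  -> answers {a, b}
    lambda e: e[1],  # "what is the second character?" -> answers {x, y, z}
]

def is_factoring(elements, questions):
    signatures = [tuple(q(e) for q in questions) for e in elements]
    answer_sets = [set(q(e) for e in elements) for q in questions]
    all_combos = set(product(*answer_sets))
    return len(set(signatures)) == len(elements) and set(signatures) == all_combos

print(is_factoring(elements, questions))       # True: 2 x 3 signatures, one per element
print(is_factoring(elements, questions[:1]))   # False: "ax", "ay", "az" share signature ("a",)
```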

3Scott Garrabrant
Yep, this all seems like a good way of thinking about it.

I was thinking about the difficulty of finite factored sets not understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.

What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it s... (read more)

3Scott Garrabrant
Hmm, I doubt the last paragraph about sets of partitions is going to be valuable, but the eigenspace thinking might be useful.  Note that I gave my thoughts about how to deal with the uniform distribution over 4 elements in the thread responding to cousin_it.

I take your point that we are discussing some output rules which add extra computation states, and so some output rules will add fewer computation states than others.

I'm merging my response to the rest with my comment here.

They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate.

They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bit

... (read more)
3Paul Christiano
I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers. They are just looking at the earth-like Turing machine, looking for the inductors whose predictions are important,  and then trying to copy those input sequences. This seems mostly unrelated to the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive. It just rests on them being able to look at the simulation and figure out what's going on. On the other hand, the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive seems very closely related to the complexity of adding states to the Turing machine so that it outputs data encoded by the sophisticated civilization in the format that they thought was easiest for the Turing machine to output. Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update) I still feel like the quantitative question we're discussing is a blow-out and it's not clear to me where we are diverging on that. My main uncertainty about the broader question is about whether any sophisticated civilizations are motivated to do this kind of thing (which may depend on the nature of the inductor and how much reasoning they have time to do, since that determines whether the inductor's prediction is connected in the decision-theoretically relevant way with the civilization's decisions or commitments).

Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?

2Paul Christiano
They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance. I'm probably going to give up soon, but there was one hint about a possible miscommunication: They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bits in different ways. They just care about the probability of the bitstring x+1 under the universal prior---they want to make the mass of x+1 larger than the mass of x+0. So they will behave in a way that causes some of the Turing machines containing them to output x+1. And then the question is whether the total mass of Turing machines (i.e. probability of noise strings fed into the UTM) that they are able to get to output x+1 is larger or smaller than the mass of Turing machines that output x for the "intended" reason.

Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.

Suppose they know the sequence that actually gets fed to the camera.

If you're saying that they know their Turing machine has output x so far, then I 100% agree. What about in the case where they don't know?

2Paul Christiano
I don't think I understand what you mean. Their goal is to increase the probability of the sequence x+1, so that someone who has observed the sequence x will predict 1. What do you mean when you say "What about in the case where they don't know"? I agree that under your prior, someone has no way to increase e.g. the fraction of sequences in the universal prior that start with 1 (or the fraction of 1s in a typical sequence under the universal prior, or any other property that is antisymmetric under exchange of 0 and 1).
1michaelcohen
Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.

If I flip a coin to randomize between two policies, I don't see how that mixed policy could produce more value for me than the base policies.

(ETA: the logical implications about the fact of my randomization don't have any weird anti-adversarial effects here).

3Paul Christiano
Someone in the basement universe is reasoning about the output of a randomized Turing machine that I'm running on. I care about what they believe about that Turing machine. Namely, I want them to believe that most of the time when the sequence x appears, it is followed by a 1. Their beliefs depend in a linear way on my probabilities of action. (At least if e.g. I committed to that policy at an early enough time for them to reason about it, or if my policy is sufficiently predictable to be correlated with their predictions, or if they are able to actually simulate me in a universe with reflective oracles... If I'm not able to influence their beliefs about me, then of course I can't influence their beliefs about anything and the whole manipulative project doesn't get off the ground.) But my utility is a non-linear function of their beliefs, since P(1|x) is a non-linear function of their beliefs. So my utility is a non-linear function of my policy.

If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?

2Paul Christiano
I don't agree, but I may still misunderstand something. Stepping back to the beginning: Suppose they know the sequence that actually gets fed to the camera. It is x = 010...011. They want to make the next bit 1. That is, they want to maximize the probability of the sequence (x+1) = 010...0111. They have developed a plan for controlling an output channel to get it to output (x+1). For concreteness imagine that they did this by somehow encoding x+1 in a sequence of ultra high-energy photons sent in a particular direction. Maybe they encode 1 as a photon with frequency A and a 0 as a photon with frequency B. There is no way this plan results in the next bit being 0. If they are wrong about how the output channel encodes photons (i.e. it decodes A as 0 and B as 1) then that channel isn't going to end up with any probability. You don't try to encode 010...0111 and then accidentally end up encoding 010...0110. You end up encoding something like 101...1000, or something totally different.

With randomization, you reduce the cost and the upside in concert. If a pair of shoes costs $100, and that's more than I'm willing to pay, I could buy the shoes with probability 1%, and it will only cost me $1 in expectation, but I will only get the shoes with probability 1/100.

3Paul Christiano
I agree that randomization reduces the "upside" in the sense of "reducing our weight in the universal prior." But utility is not linear in that weight. I'm saying that the consequentialists completely dominate the universal prior, and they will still completely dominate if you reduce their weight by 2x. So either way they get all the influence. (Quantitatively, suppose the consequentialists currently have probability 1000 times greater than the intended model. Then they have 99.9% of the posterior. If they decreased their probability of acting by two, then they'd have 500 times the probability of the intended model, and so have 99.8% of the posterior. This is almost as good as 99.9%.) That could fail e.g. if there are a bunch of other consequentialists also trying to control the sequence. Or if some other model beyond the intended one has much higher probability. But if you think that the consequentialists are X bits simpler than the intended model, and you are trying to argue that the intended model dominates the posterior, then you need to argue that the consequentialists wouldn't try to grab the universal prior even when doing so only requires acting in 2^-X of worlds.
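
Spelling out the arithmetic in the parenthetical (the 1000:1 and 500:1 ratios are the hypothetical numbers above):

```python
# Posterior mass of the consequentialist models vs. the intended model,
# as a function of their prior-weight ratio (hypothetical numbers from the comment).
def posterior_share(weight_ratio: float) -> float:
    return weight_ratio / (weight_ratio + 1)

print(f"{posterior_share(1000):.4f}")  # 0.9990 -- 1000x the intended model's weight
print(f"{posterior_share(500):.4f}")   # 0.9980 -- after halving their probability of acting
```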

It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them can't be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.

2Paul Christiano
To express my confusion more precisely: I think that's right (other than the fact that they can win simultaneously for many different output rules, but I'm happy ignoring that for now). But I don't see why it contradicts the story at all. In the story the best case is that we know the true distribution of output rules, and then we do the utility-maximizing thing, and that results in our sequence having way more probability than some random camera on old earth. If you want to talk about the information theory, and ignore the fact that we can do multiple things, then we control the single output channel with maximal probability, while the camera is just some random output channel (presumably with some much smaller probability). The information theory isn't very helpful, because actually all of the action is about which output channels are controllable. If you restrict to some subset of "controllable" channels, and believe that any output rule that outputs the camera is controllable, then the conclusion still holds. So the only way it fails is when the camera is higher probability than the best controllable output channels.
2Paul Christiano
I currently don't understand the information-theoretic argument at all (and feels like it must come down to some kind of miscommunication), so it seems easiest to talk about how the impossibility argument applies to the situation being discussed. If we want to instead engage on the abstract argument, I think it would be helpful to me to present it as a series of steps that ends up saying "And that's why the consequentialists can't have any influence." I think the key place I get lost is the connection between the math you are saying and a conclusion about the influence that the consequentialists have.

We can get back to some of these points as needed, but I think our main thread is with your other comment, and I'll resist the urge to start a long tangent about the metaphysics of being "simulated" vs. "imagined".

So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine.

I feel like this story has run aground on an impossibility result. If a random variable's value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the negative entropy of the distribution, no matter their intelligence. (And if they're wrong about... (read more)
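
One way to state that bound precisely (my restatement, via Gibbs' inequality): if the true value is distributed as p and the agent acts on a guess drawn from q, then

```latex
% Gibbs' inequality: the expected log-probability of acting on the true value
% is maximized by q = p and can never exceed -H(p), however intelligent the agent.
\[
  \mathbb{E}_{X \sim p}\!\left[\log q(X)\right]
  \;\le\;
  \mathbb{E}_{X \sim p}\!\left[\log p(X)\right]
  \;=\;
  -H(p).
\]
```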

4Paul Christiano
I didn't quite get this, so let me try restating what I mean. Let's say the states and rules for manipulating the worktapes are totally fixed and known, and we're just uncertain about the rules for outputting something to the output tape. Zero of these correspond to reading off the bits from a camera (or dataset) embedded in the world. Any output rule that lets you read off precisely the bits from the camera is going to involve adding a bunch of new states to the Turing machine. So let's instead consider the space of all ways that you can augment a given physics to produce outputs. This will involve adding a ton of states to the Turing machine. Most of them produce really garbage-y outputs, but a tiny fraction do something intelligent that produces a coherent-looking output. Some fraction of those involve reading off the bits from a particular camera embedded in the world. Let's arbitrarily say it's 2^-1000? Now consider any given intervention that we can perform to try to manipulate the prior. For example, we can perform a high-energy physics experiment that produces an unprecedented interaction, and control a parameter of how that interaction occurs. We can write the bits in the pattern of giant star-sized clumps of matter. Or we can send the relevant bits out to infinity with unprecedentedly high energies. Or we can build a quadrillion cameras throughout the world. Or we can change the nature of the camera so that more of the possible output rules read off its values. Or we can alter the distribution of matter when the universe comes apart, so that a Turing machine can read it off from that. Or whatever. It seems to me that "random camera on old Earth" is probably less likely to be output by the physics-extension than some of these other ways of encoding data. For example, maybe 2^-600 of all output rules end up reading off data from the highest-energy events in the universe, and we can influence that. I think the only thing that really cuts against this i
3Paul Christiano
I'm imagining that the consequentialists care about something, like e.g. human flourishing. They think that they could use their control over the universal prior to achieve more of what they care about, i.e. by achieving a bunch of human flourishing in some other universe where someone thinks about the universal prior. Randomizing is one strategy available to them to do that. So I'm saying that I expect they will do better---i.e. get more influence over the outside world (per unit of cost paid in their world)---than if they had simply randomized. That's because randomizing is one of the strategies available to them and they are trying to pick the best one. (In fact I think they will do many orders of magnitude better than randomizing since they can simultaneously win for many different output methods, and they can ignore the overwhelming majority of output rules which have no chance of describing something interesting about the world). You seem to be saying that they will get less influence than if they randomized. Something about how this behavior is not sensible "goal-oriented behavior," and instead the sensible goal-oriented behavior is something that doesn't get them any influence? In what sense do you think it is sensible goal-oriented behavior, if it doesn't result in getting any influence? Maybe the key difference is that I'm talking about a scenario where the consequentialists have the goal of influencing the universal prior, and that possibility seems so weird to you that you aren't even engaging with it?

A few quick thoughts, and I'll get back to the other stuff later.

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.

That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e... (read more)

4Abram Demski
Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

I felt I had remained quiet about my disagreement with you for too long

Haha that's fine. If you don't voice your objections, I can't respond to them!

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is th... (read more)

3Abram Demski
A couple of separate points:

* My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
* I feel like there's something a bit wrong with the 'theory/practice' framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those 'in theory' may or may not translate to solving the practical issues 'in practice'. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as this kind. I would contest whether it's even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don't really think of something like that as a "theoretical solution" to the realizability problem at all. That's reserved for something like logical induction which has unrealistically high computational complexity, but does avoid a realizability assumption.

Yeah, this is a fair point. To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)

Sounds right to me.

Ah, ok! Basically this is a new way of thinking about it f

To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of

I think you’re imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algo... (read more)
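
For concreteness, here is a minimal sketch of the consensus-at-inference step, in the N-model ensemble approximation discussed in the replies (the agreement test and the query fallback are schematic assumptions of mine, not the exact construction from any paper):

```python
# Schematic consensus predictor over an ensemble of trained models: act on a
# prediction only when the hypotheses agree; otherwise defer/query.

def consensus_predict(models, x, tolerance=0.05):
    """Return a prediction if all hypotheses agree to within `tolerance`, else None
    (None meaning: abstain and query the demonstrator / trusted source)."""
    predictions = [m(x) for m in models]
    if max(predictions) - min(predictions) <= tolerance:
        return sum(predictions) / len(predictions)
    return None  # no consensus: at least one hypothesis (possibly a malign one) disagrees

# Usage sketch: a single dissenting model can force a query, but cannot silently
# steer the output while the benign hypotheses are still in the ensemble.
models = [lambda x: 0.70, lambda x: 0.72, lambda x: 0.71]
print(consensus_predict(models, x=None))                     # ~0.71 (agreement)
print(consensus_predict(models + [lambda x: 0.1], x=None))   # None (disagreement -> query)
```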

4Abram Demski
Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things! I am sorry for the critical framing, in that it would have been more awesome to get a thought-dumb of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long. It's a fair point that it's no slower than idealized MAP. But the most important corner cut by deep learning is that deep learning represents just one hypothesis at a time, searching the space by following a gradient rather than by explicitly comparing options. The question is, how can we cut the same corner for the consensus algorithm, which needs to compare the outputs of many hypotheses? In some settings, this is possible: for sufficiently simple hypothesis spaces, we can check consensus without explicitly computing a bunch of hypotheses. However, for deep learning, it seems rather difficult. So, it seems like the best we can expect to do for deep learning is to train and run 100 hypotheses (or whatever number). This is a huge approximation in terms of MAP (since we have no guarantees that we are finding the 100 most probable, or anything), but we can naturally re-frame the consensus-alg guarantee in terms of frequency-of-malign-results for the NN training (rather than an assumption about at least 1 of the 100 most probable hypotheses being non-malign).  But this still means that, for a consensus of N hypotheses, the consensus algorithm will be N times slower (in terms of both training time and inference time). I expect N to be quite large, for reasons similar to what I said in the post: not only do we have to think N is large enough that at least one of the hypotheses is benign, but also we have to think that the benign hypothesis is at least as capable as any of the malign hypotheses (because otherwise it could get unlucky and be

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)

Agree.

There will be plenty of functions  that have fewer bits in their encoding than the real function  used by the demonstrator.

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

1Koen Holtman
I have seen you mention a number of times in this comment thread that 'this is not a problem because eventually the bad/wrong policies will disappear from the top set'. You have not qualified this statement with 'but we need a very low α like α < 1/|Π| to make this work in a safe way', so I remain somewhat uncertain about what your views are about how low α needs to go. In any case, I'll now try to convince you that if α > 1/|Π|, your statement that 'when they're wrong they'll get removed from the posterior' will not always mean what you might want it to mean. Is the demonstrator policy π^d to get themselves killed? The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for. Say that π^p ∈ Π is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where α > 1/|Π| and where at time step t, we have the likely case that π^p ∉ Π^α_{h<t}. In the world I constructed for the counterexample, the remaining top policies Π^α_{h<t} now perform a synchronized treacherous turn where they kill the demonstrator. In time step t+1 and later, the policies Π^α_{h<t+1} diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the null action. This eventually removes all 'wrong' policies from Π^α_{h<t+1+i}, where 'wrong' means that they do not take the null action at all future time steps. The silver lining is perhaps that at least the agent will eventually stop, perform null actions only, after it has killed the demonstrator. Now, the paper proves that the behavior of the agent policy π^i_α will approximate that of the true demonstrator policy π^d closer and closer when time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy π^d had nothing to do with producing paperclips, this was a wrong guess all along. T

A policy outputs a distribution over {0, 1} × A, and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means the first element is 0 and the action is a, and if it outputs (1, a), that means the first element is 1 and the action is a. When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when the first element is 1 (i.e. the imitator defers to the demonstrator), the distribution over the action is equal to that of the demonstrator... (read more)

1Koen Holtman
I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the a and the bold text above would have stopped me getting confused. So I'll try again. I read the sentence fragment below equation (3) as an explanatory claim that the value defined in equation (3) defines the probability that the imitator is picking the action itself instead of deferring to the demonstrator, the probability given the history h_{<t}. However, this is not the value being defined by equation (3), instead it defines the probability the imitator is picking the action itself instead of deferring to the demonstrator when the history is h_{<t} and the next action taken is a. The actual probability of the imitator picking the action itself under h_{<t} is given by ∑_{a∈A} π^i_α(0, a | h_{<t}), which is only mentioned in passing in the lines between equations (3) and (4). So when I was reading the later sections in the paper and I wanted to look back at what the probability was that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation ∑_{a∈A} π^i_α(0, a | h_{<t}), which is the equation I was really looking for. Instead my mind auto-completed equation (3) by adding an avg_a term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me into wondering how you were dealing with learning nondeterministic policies, if at all, etc. So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of a, and foregrounding the definition of θ_q more clearly as a single-line equation.
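
A small numerical sketch of the point being made (the numbers and the dictionary representation are illustrative, not from the paper): an entry like π^i_α(0, a | h_{<t}) is a joint probability for a particular action a, and the overall probability of acting (or deferring) comes from summing over actions.

```python
# Illustrative sketch (numbers hypothetical): the imitator's policy is a distribution
# over pairs (flag, action), where flag 0 = "act myself" and flag 1 = "defer/query".
policy = {
    (0, "left"): 0.30, (0, "right"): 0.20,   # equation-(3)-style entries, one per action
    (1, "left"): 0.35, (1, "right"): 0.15,   # equation-(4)-style entries, one per action
}

actions = ["left", "right"]
p_act_self = sum(policy[(0, a)] for a in actions)   # 0.50: probability of acting itself
p_query    = sum(policy[(1, a)] for a in actions)   # 0.50: probability of querying

# A single entry like policy[(0, "left")] is the joint probability of (act-myself AND "left"),
# not the overall probability of acting itself -- the distinction being flagged above.
print(p_act_self, p_query)
```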

What's the distinction between training and deployment when the model can always query for more data?

2Vanessa Kosoy
We're doing meta-learning. During training, the network is not learning about the real world, it's learning how to be a safe predictor. It's interacting with a synthetic environment, so a misprediction doesn't have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.

It may be helpful to point to specific sections of such a long paper.

(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)

it certainly should have to consider different models of what the next transition will be like.

Yeah I was agreeing with that.

even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.

Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it isn't trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine th... (read more)

2Evan Hubinger
This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q value estimates will be of future states, which means it certainly should have to consider different models of what the next transition will be like.

So, if I understand the paper right, if  was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be w... (read more)

1Koen Holtman
[long comment, my bigger agenda here is to get to a state where discussions on this forum start using much more math and subcase analysis when they talk about analysing and solving inner alignment problems.]

OK, I think you are kind of saved here, in the case of my counter-example, because of the special nature of your equation (2) in the paper that computes the set of policies. To make this more specific: Say I refine the counter-example by adding the prior w(π_i) = 1/|Π|, so each candidate is considered equally likely from the start. Say that |Π| = 10^10. The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies π_i ≠ π^d, all policies π_i are equally compatible with events so far, so the posterior w(π_i | h_{<t}) is the same for all of them. When I use α = 10^-4 with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set Π^α_{h<t} of top policies will have size N = 10^4. Ties are broken arbitrarily, so the real π^d is unlikely to be among these top N = 10^4 << 10^10 candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.

Case of α < 1/|Π|: much better than I thought! However, and this is where things get interesting, if we use α = 10^-11, so α < 1/|Π|, then equation (2) will compute Π^α_{h<t} = Π, the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn. So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference with the β-based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say. Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for α < 1/|Π| and for certain values of 'the inner alignm

That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

This is very nice and short!

And to state what you left implicit:

If , then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by , so we assume it is safe to output 0. And likewise with outputting 1.

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends t... (read more)

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of o... (read more)

3Evan Hubinger
Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large of a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of the same size massive neural network, the only real difference is how the output of that neural network is interpreted, either directly as a policy, or as Q values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be getting a similar sort of feedback signal from the training process—especially if you're using a policy gradient algorithm that involves something like advantage estimation rather than actual rollouts, since the update rule in that situation is going to look very similar to the Q learning update rule. I do expect some minor differences in the sorts of models you end up with, such as Q learning being more prone to non-myopic behavior across episodes, and I think there are some minor reasons that policy gradient algorithms are favored in real-world settings, since they get to learn their exploration policy rather than having it hard-coded and can handle continuous action domains—but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.

The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:

  1. Fro
... (read more)

Does it make sense to talk about , which is like  in being treacherous, but uses the true model  instead of the honest model ? I guess you would expect  to have a lower posterior than ?

If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution.

But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models n... (read more)

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of prob... (read more)

Specifically, they cooperate in that they perfectly mimic the true model up until the point where...

This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.

up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted

if alpha is low enough, this won't ever happen, and if alpha is t... (read more)

So would you say you disagree with the claim

I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

?

3Evan Hubinger
Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

Well, just like we can write down the defectors, we can also write down the cooperators

If it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics, then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would sp... (read more)

4Evan Hubinger
Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.