What is process-based RL?
I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the pote...
By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
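To make the contrast concrete, here is a minimal sketch of where the reward signal comes from in each case (the names and structure are purely illustrative, not from any particular codebase):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Proposal:
    action: str          # e.g. a plan the agent proposes
    justification: str   # the agent's argument for the plan


def outcome_based_reward(execute: Callable, score: Callable, p: Proposal) -> float:
    """Reward depends on the consequences of actually executing the action."""
    consequences = execute(p.action)
    return score(consequences)


def process_based_reward(overseer_evaluate: Callable, p: Proposal) -> float:
    """Reward depends only on an overseer's judgment of the proposed action,
    possibly after reading the justification, a debate, or a conversation with
    other AI assistants. The action is never executed to compute the reward."""
    return overseer_evaluate(p.action, p.justification)


# Toy usage: a stand-in overseer scoring a plan on its stated merits.
plan = Proposal("ship feature X next week", "users requested it; downside seems small")
print(process_based_reward(lambda a, j: 0.8 if "downside seems small" in j else 0.2, plan))
```

The point is just where the training pressure comes from: in the second function, nothing about the world's response to the action ever enters the reward.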
I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, t...
I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.
I agree whole-heartedly with the first sentence. I'm not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a differen...
Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
You: No it doesn't. They just care about interestingness.
Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?
You: Yes, but...
If you can admit that we agree on this basic point, I'm happy to discuss further about how good they are at what they aim to do.
1: If retractions were common, surely you would have said that was evidence peer review didn't accomplish much! If academics were only equally ...
Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.
"As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting."
Strongly agree - ... - Strongly Disagree
"As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren't fully supported by the analysis."
Strongly agree - ... - Strongly Disagree
I don't understand. Importantly, these are optimistically biased, and you can't assume my true credences are this high. I assign much less than 90% probability to C. But still, they're perfectly consistent. M doesn't say anything about succeeding--only about being allowed. M is basically the question: of the places he'd be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.
Thank you very much for saying that.
I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.
On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed ...
The assumption says "will do", not "will be able to do". And the dynamics of the unknown environment include the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. That is why I engage with the objection that reward is not the optimization target under this section.
...In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game
Thank you for this review! A few comments on the weaknesses of my paper.
In particular, it explicitly says the argument does not apply to supervised learning.
Hardly a weakness if supervised learning is unlikely to be an existential threat!
Strength: Does not make very concrete assumptions about the AGI development model.
Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.
The fact that the argument holds equally well no matter ...
I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?
Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior.
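In symbols, the distribution I mean is something like this (writing $M$ for the universal prior and $I$ for the set of strings containing important decisions one might want to influence; the notation is mine):

$$q(s) \;=\; \frac{M(s)\,\mathbf{1}[s \in I]}{\sum_{s' \in I} M(s')},$$

i.e. condition the universal prior on the string being in that set and renormalize.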
By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?
I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good gu...
Yeah, seems about right.
I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too.
I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to w...
Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)
I’m talking about the weight of an anthropically updated prior within the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to enc...
I'm using some of the terminology I suggested here.
A factoring is a set of questions such that each signature of possible answers identifies a unique element. In 20 questions, you can tailor the questions depending on the answers to previous questions, and ultimately each element will have a bitstring signature depending on the history of yesses and nos. I guess you can define the question to include xors with previous questions, so that it effectively changes depending on the answers to others. But it's sometimes useful that the bitstrings are allowed to ...
I was thinking of some terminology that might make it easier to think about factorings and histories and whatnot.
A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.
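A toy sketch of that, with made-up names:

```python
from collections import defaultdict

def partition_by(question, elements):
    """The partition induced by a multiple-choice question:
    elements with the same answer end up in the same cell."""
    cells = defaultdict(set)
    for x in elements:
        cells[question(x)].add(x)
    return dict(cells)

words = {"apple", "axe", "banana", "bear", "cat"}
# "Which letter does the word start with?"
print(partition_by(lambda w: w[0], words))
# e.g. {'a': {'apple', 'axe'}, 'b': {'banana', 'bear'}, 'c': {'cat'}}
```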
Then a factoring is a set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set o...
I was thinking about the difficulty finite factored sets have in understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.
What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it s...
I take your point that we are discussing some output rules which add extra computation states, and so some output rules will add fewer computation states than others.
I'm merging my response to the rest with my comment here.
They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.
I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate.
...They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bit
Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?
If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?
It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them can't be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.
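To spell out the bound I have in mind (a standard fact, stated as a sketch, with my own notation): if the unknown value is distributed according to a known $p$, and the agent's strategy induces a distribution $q$ over guesses, then

$$\mathbb{E}_{x \sim p}\left[\log q(x)\right] \;\le\; \mathbb{E}_{x \sim p}\left[\log p(x)\right] \;=\; -H(p),$$

with equality only when $q = p$. So no amount of intelligence in choosing $q$ improves the expected log probability of acting on the true value beyond $-H(p)$.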
So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine.
I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the negative entropy of the distribution, no matter their intelligence. (And if they’re wrong about...
A few quick thoughts, and I'll get back to the other stuff later.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.
That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e...
I felt I had remained quiet about my disagreement with you for too long
Haha that's fine. If you don't voice your objections, I can't respond to them!
I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is th...
To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of
I think you’re imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algo...
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)
Agree.
A policy outputs a distribution over pairs (i, a) with i in {0, 1}, and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), the imitator is taking action a itself, and if it outputs (1, a), it is deferring to the demonstrator. When I say
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when the first coordinate is 1, the distribution over the action is equal to that of the demonstrator...
it certainly should have to consider different models of what the next transition will be like.
Yeah I was agreeing with that.
even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.
Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it isn't trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.
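Concretely, the sort of training target I mean (a rough sketch using PyTorch just for concreteness, not any particular implementation):

```python
import torch

def q_learning_target(q_network, reward, next_state, discount):
    """Standard Q-learning regression target: the network's own point estimate of
    the next state's value (a max over actions), not a set of competing models of
    what that value might be. At inference time the forward pass isn't handed this
    estimate, so it has to reproduce it implicitly."""
    with torch.no_grad():
        next_q_values = q_network(next_state)   # one value per action
        point_estimate = next_q_values.max()    # a single point estimate
    return reward + discount * point_estimate

if __name__ == "__main__":
    net = torch.nn.Linear(4, 3)      # toy Q-network: 4-dim state, 3 actions
    s_next = torch.randn(4)
    print(q_learning_target(net, reward=1.0, next_state=s_next, discount=0.99))
```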
I interpreted this bit as talking about RL
I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.
But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine th...
So, if I understand the paper right, if the benign policy was still in the set of top policies at that time, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.
This is exactly right.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
The reason this is consistent is that queries will become infrequent, but they will still be w...
That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.
So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.
This is very nice and short!
And to state what you left implicit:
If the probability of outputting 0 clears the decision threshold by more than the total posterior weight on malign hypotheses, then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift that probability by their total weight, so we assume it is safe to output 0. And likewise with outputting 1.
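Spelling out the arithmetic I take to be implicit here (the notation is mine): let $\epsilon$ bound the total posterior weight on malign hypotheses, let $p$ be the probability the full posterior assigns to outputting 0, and let $p_b$ and $p_m$ be the corresponding probabilities under the benign-only and malign-only posteriors. Writing $p = (1-\epsilon')p_b + \epsilon' p_m$ with $\epsilon' \le \epsilon$,

$$|p - p_b| \;=\; \epsilon'\,|p_m - p_b| \;\le\; \epsilon,$$

so if $p$ clears the decision threshold by more than $\epsilon$, then $p_b$ clears it too, and the malign-free setting (assumed safe) would also have output 0.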
I'm pretty sure removing those is mostly just a technical complication
One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends t...
I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.
Okay, I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice.) The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of o...
If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.
The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:
If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution.
But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models n...
It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.
I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.
if local search is this bad, I don't think it is a viable path to AGI
We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of prob...
Specifically, they cooperate in that they perfectly mimic the true model up until the point where...
This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.
up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted
if alpha is low enough, this won't ever happen, and if alpha is t...
Well, just like we can write down the defectors, we can also write down the cooperators
If it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics, then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would sp...
I agree with this in a sense, although I may be quite a bit more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do ...