I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":
FWIW I think of "OpenAI leadership being untrustworthy" (a significant factor in me leaving) as different from "OpenAI having bad safety policies" (not a significant factor in me leaving). Not sure if it matters, I expect that Scott was using "safety policies" more expansively than I do. But just for the sake of clarity:
I am generally pretty sympathetic to the idea that it's really hard to know what safety policies to put in place right now. Many policies pushed by safety people (including me, in the past) have been mostly kayfabe (e.g. being valuable as c...
Cool, ty for (characteristically) thoughtful engagement.
I am still intuitively skeptical about a bunch of your numbers but now it's the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).
1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It's a step towards applying outside view, though not fully "outside view".)
2. Yepp, agreed, though I think the common-sense connotations of "if this became" or "this would have a big effect" are causal, especially in the context where we're talking to the actors who are involved in making that change. (E.g. the non-causal interpreta...
Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If... condition obtaining would cut out about half of the loss-of-control ones.
Re 3: point by point:
1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
2. Big names coming out: idk this also feels like maybe 10-20% rather than 50%
3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn'...
We have discussed this dynamic before but just for the record:
I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half.
This is IMO not the type of change that should be able to cut someone's P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) such that, if this change could halv...
Sorry! My response:
1. Yeah you might be right about this, maybe I should get less excited and say something like "it feels like it should cut in half but taking into account Richard's meta argument I should adjust downwards and maybe it's just a couple percentage points"
2. If the conditional obtains, that's also evidence about a bunch of other correlated good things though (timelines being slightly longer, people being somewhat more reasonable in general, etc.) so maybe it is legit to think this would have quite a big effect
3. Are you sure there are so man...
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:
I mainly intend conservatism to mean the former.
Whose work is relevant, according to you?
If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.
Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher"...
I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).
I disagree, I'd rather they'd hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (T...
What if your research goal is "I'd like to understand how neural networks work?" This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.
(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)
Conversely, what if your research goal is "I'm going to design a training run that will produce a ...
Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).
I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that but will discuss in a different post.
Nice post. I'm excited about the bargaining interpretation of UDT.
However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.
Given this, is there any reason to focus on iterated counterfactual mugging, as opposed to just counterfactual muggings with higher stakes?
It seems like iteration is maybe related to learn...
The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.
I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with t...
If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1000:1 or something else?
Value them primarily? Uhhh... maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.
...Clocks ar
I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.
tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning vi...
Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect
, and I don't, for the reasons in my comment.
Thanks for the extensive comment! I'm finding this discussion valuable. Let me start by responding to the first half of your comment, and I'll get to the rest later.
...The simplicity of a goal is inherently dependent on the ontology you use to view it through: while is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the "essence" of the goal simplifies; instead, it's merely because it gets access to a more powerful ontology that has more detail, granu
I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.
Curious who just strong-downvoted and why.
Early stage votes are pretty noisy (and I think have been getting noisier over time, which is somewhat of a proxy of polarization, which makes me sad).
Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from "legislate that AI labs should do X, Y, Z, as enforced by regulator R" to "legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R". In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important thr...
Some opinions about AI and epistemology:
How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom.
I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".
Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't...
Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high skill labor).
The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.
My guess is that you need to be a decent but not amazing software engineer to ARA.
Yeah, you're probably right. I still stand by the overall point though.
1) It’s not even clear people are going to try to react in the first place.
I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.
If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.
A world which can pause AI development is one which can also easily throttle ARA AIs.
...The central point is:
- At some
However, it seems to me like ruling out ARA is a relatively naturally way to mostly rule out relatively direct danger.
This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. Though while I agree that ARA rules out most danger, I think that's because it's just quite a low bar. The sort of tasks involved in buying compute etc are ones most humans could do. Meanwhile more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.
...once you do h
I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'...
Hmm, I agree that ARA is not that compelling on its own (as a threat model). However, it seems to me like ruling out ARA is a relatively naturally way to mostly rule out relatively direct danger. And, once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary. Further, it seems somewhat hard to do capabilities evaluations that rule out this self-improvement if models are ARA capable given that there are so possible routes.
Edit: TBC, I think it'...
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.
This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?
The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-ca...
Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward?
Is there a particular reason to think that the answer to this shouldn't just be "first run a logical inductor to P_f(f(n)), then use that distribution to determine how to use P_f(n) to determine how to choose an action from P_n" (at least for large enough n)?
But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)
Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.
...then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because
I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".
This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be int...
Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".
I think there are (at least) two strong reasons to like this approach:
1. It’s complementary with alignment.
2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs,...
People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:
1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring...
Ty for review. I still think it's better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.
Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level "case for misalignment as an x-risk" has been accepted at a major ML conference, to my knowledge. (Though Langosco's goal misgeneralization paper did this a little bit, and was accepted at ICML.)
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?
I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves th...
(drafted this reply a couple months ago but forgot to send it, sorry)
your low-level model of a specific gear's dynamics didn't change — locally, it was as correct as it could ever be.
And if you had a terminal utility function over that gear (e. g., "I want it to spin at the rate of 1 rotation per minutes"), that utility function won't change in the light of your model expanding, either. Why would it?
Let me list some ways in which it could change:
In the standard story, what are the terminal goals? You could say "random" or "a mess", but I think it's pretty compelling to say "well, they're probably related to the things that the agent was rewarded for during training". And those things will likely include "curiosity, gaining access to more tools or stockpiling resources".
I call these "convergent final goals" and talk more about them in this post.
I also think that an AGI might systematize other goals that aren't convergent final goals, but these seem harder to reason about, and my central story for w...
Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)
Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?
I'd classify them as values insofar as people care about them intrinsically.
Then they might also be strategies, insofar as people also care about them instrumentally.
I guess I should get rid of the "only" in the sentence you quoted? But I do want to convey "something which is only a strategy, not a goal or value, doesn't have any intrinsic value". Will think about phrasing.
...It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in th
I agree that this is closely related to the predictive processing view of the brain. In the post I briefly distinguish between "low-level systematization" and "high-level systematization"; I'd call the thing you're describing the former. Whereas the latter seems like it might be more complicated, and rely on whatever machinery brains have on top of the predictive coding (e.g. abstract reasoning, etc).
In particular, some humans are way more systematizing than others (even at comparable levels of intelligence). And so just saying "humans are constantly doing...
Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.
But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don't understand them very well until we discover th...
I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)
You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.
FWIW I think that gradient hacking is pretty plausible, but it'll probably end up looking fairly "prosaic", and may not be a problem even if it's present.
Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?
How do you feel about "In an ideal world, we'd stop all AI progress"? Or "ideally, we'd stop all AI progress"?
FWIW I think some of the thinking I've been doing about meta-rationality and ontological shifts feels like metaphilosophy. Would be happy to call and chat about it sometime.
I do feel pretty wary about reifying the label "metaphilosophy" though. My preference is to start with a set of interesting questions which we can maybe later cluster into a natural category, rather than starting with the abstract category and trying to populate it with questions (which feels more like what you're doing, although I could be wrong).
Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to re...
Five clusters of alignment researchers
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
"I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating."
The main point of my post is that accounting for disagreements about Knightian uncertainly is the best way to actually communicate object level things, since otherwise people get sidetracked by epistemological disagreements.
"I'd follow the policy of first making it common knowledge that you're reporting your inside views"
This is a good step, but one part of the epistemological disagreements I mention above is that ...
FWIW I think that confrontation-worthy empathy and use of the phrase "everyone will die" to describe AI risk are approximately mutually exclusive with each other, because communication using the latter phrase results from a failure to understand communication norms.
(Separately I also think that "if we build AGI, everyone will die" is epistemically unjustifiable given current knowledge. But the point above still stands even if you disagree with that bit.)
I've now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and opti... (read more)