Going back to the envelopes example, a nosy neighbor hypothesis would be "the left envelope contains $100, even in the world where the right envelope contains $100". Or if we have an AI that's unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be "I value paperclips, even in the world where I value staples". I'm not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?
Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2 . Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.
This seems like the central argument in the post,...
I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.
In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting techologies will be used by individuals / businesses / governments / etc..
There's been a fair amount of work on activation steering and similar techniques,, with bearing in eg sycophancy and truthfulness, where you find the vector and inject it eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references
If the housing crisis is caused by low-density rich neighborhoods blocking redevelopment of themselves (as seems the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?
I have maybe a naive question. What information is needed to find the MSP image within the neural network? Do we have to know the HMM to begin with? Or could it be feasible someday to inspect a neural network, find something that looks like an MSP image, and infer the HMM from it?
- I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.
This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.
It's interesting that part of human value might be having our actions matter. But if you build an AI that can give you all the things, or even if you could've built such an AI but chose not to, then objectively your actions no longer matter much after that. I've no idea how even CEV could approach this problem.
Edit: I think I've figured it out. The AI shouldn't try to build the best world according to CEV, it should take the best action for an AI to take according to CEV. So if the AI notices that humans strongly prefer to be left alone with their problems, it'll just shut down. Or find some other way to ensure that humans can't rely on AIs for everything.
Yeah. It also worries me that a lot of value change has happened in the past, and much of it has been caused by selfish powerful actors for their ends rather than ours. The most obvious example is jingoism, which is often fanned up by power-hungry leaders. A more subtle example is valuing career or professional success. The part of it that isn't explained by money or the joy of the work itself, seems to be an irrational desire installed into us by employers.
Maybe one example is the idea of Dutch book. It comes originally from real world situations (sport betting and so on) and then we apply it to rationality in the abstract.
Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he'd draw an analogy with something closer to experience. For example, "Is the nature of virtue different for men and for women?" - "Well, the nature of strength isn't that much different between men and women, likewise the nature of health, so maybe virtue works the same way." Obviously this way of reasoning can easily go wrong, but I think it's also pretty indicative of how people do philosophy.
I don't say it's not risky. The question is more, what's the difference between doing philosophy and other intellectual tasks.
Here's one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts when to apply them or not. And some luck, in ...
I'm pretty much with you on this. But it's hard to find a workable attack on the problem.
One question though, do you think philosophical reasoning is very different from other intelligence tasks? If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they'll be worse at philosophy than at other tasks?
It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser...
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for awhile with input-imitator AIs that get more and more capable but still stay docile.
But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example if you can make AI imitate a conversation between humans, then maybe you can make an AI that makes real world plans as fast as a committee of 10 smart humans conve...
A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario
For what it's worth, I don't think AI takeover will look like war.
The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast and it's ok to take some risk doing it. Basically, covert takeover of the internet in the first hours.
After...
Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.
This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.
Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k...
There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and vector fields, for the continuous case. Basically if you have a bunch of arrows, possibly loopy, you can always decompose it into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed, the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.
With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn’t bet on that either (in real life, if AGI is developed in the coming decades—not based on what’s possible in principle.)
I think this misses the point a bit. The thing to be afraid of is not...
To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.
I am skeptical. AFAICT a the typical attempted-but-failed alignment looks like one of the two:
There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)
I don't think we are indifferent to these outcomes. We leave them to luck, but that'...
“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.
I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.
Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers" values. That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems suf...
This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still u...
I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.
(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way ...
I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?
I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.
Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.
Looks like Eliezer bel...
Speaking for myself here…
OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.
First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.
Altern...
The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is l...
Instant strong upvote. This post changed my view as much as the risk aversion post (which was also by you!)
I thought Diffractor's result was pretty troubling for the logical induction criterion:
...the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!
But maybe understanding has changed since then? What's the current state?
First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.
Personally, I found this result puzzling but far from damning.
Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more t...
Wait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this:
Check which variables look uncorrelated
Assume they are orthogonal
From that orthogonality database, prove "before" relationships
Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?
I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, ...
Thanks for the response! Part of my confusion went away, but some still remains.
In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?
Not sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?
I feel that interpreting "strictly before" as causality is making me more confused.
For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but no...
I think the definition of history is the most natural way to recover something like causal structure in these models.
I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:
Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.
Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.
Bob peeks at ten envelopes, then tells Alice the contents of three of them.
Under the FFS definition, Alice's knowledge in each ...
Agree it's not totally right to call this a causal relationship.
That said:
Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.
If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.
I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a varia...
Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.
Yeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.)
But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?
And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!
It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".
I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if alignment of AIs to human standard is a complete success - if A... (read more)