Thanks for this! I think the categories of morality are a useful framework. I am very wary of the judgement that care-morality is appropriate for less capable subjects - basically because of paternalism.
Just to confirm that this is a great example and wasn't deliberately left out.
I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.
I think this is basically correct and I'm glad to see someone saying it clearly.
I agree with this post. However, I think it's common amongst ML enthusiasts to eschew specification and defer to statistics on everything. (Or datapoints trying to capture an "I know it when I see it" "specification".)
The trick is that for some of the optimisations, a mind is not necessary. There is a sense perhaps in which the whole history of the universe (or life on earth, or evolution, or whatever is appropriate) will become implicated for some questions, though.
I think https://www.alignmentforum.org/posts/TATWqHvxKEpL34yKz/intelligence-or-evolution is somewhat related in case you haven't seen it.
I'll add $500 to the pot.
Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped-up human can do the reflection thing, and plausibly with enough speed-up can be superintelligent wrt everyone else.
A possibly helpful - because starker - hypothetical training approach you could try for thinking about these arguments is to make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g., 100x), e.g., via brain emulation (or just "by magic" for the purpose of the hypothetical).
It means that f(x) = 1 holds for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1; that there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1; and that there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.
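A toy illustration of the structure (my own example, and deliberately too simple: here the two mechanisms are easy to tell apart, whereas the interesting case is precisely when no efficient discriminator exists):

```python
def f(x: int) -> int:
    # Toy function with two distinct mechanisms for f(x) = 1.
    # Mechanism A: x is a perfect square.
    if x >= 0 and round(x ** 0.5) ** 2 == x:
        return 1
    # Mechanism B: the decimal digits of x sum to 10.
    if sum(int(d) for d in str(abs(x))) == 10:
        return 1
    return 0

print(f(16), f(19), f(7))  # 1 (via mechanism A), 1 (via mechanism B), 0
```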
Will the discussion be recorded?
(Bold direct claims, not super confident - criticism welcome.)
The approach to ELK in this post is unfalsifiable.
A counterexample to the approach would need to be a test-time situation in which:
Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predic...
I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".
“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”
we the readers (or I guess specifically the heuristic argument itself) can do this, but the "scientists" cannot, because the "scientists don't know how the game of life works".
Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
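For concreteness, here is a minimal sketch of the computation in that first quote, using the standard Game of Life rule (a cell is alive at the next step iff it is currently alive with 2 or 3 live neighbours, or currently dead with exactly 3 live neighbours):

```python
from math import comb

p = 0.1  # each of the cell and its 8 neighbours is alive independently at timestep 0

def p_exactly(k: int, n: int = 8, q: float = p) -> float:
    """Probability that exactly k of the n neighbours are alive."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

# Alive at timestep 1 iff: (alive now with 2 or 3 live neighbours)
#                       or (dead now with exactly 3 live neighbours).
p_alive_next = p * (p_exactly(2) + p_exactly(3)) + (1 - p) * p_exactly(3)
print(p_alive_next)  # ~0.048
```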
Another thing confusing to me along these lines:
...“for example they may have noti
They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.
I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.
I'm focusing on the code in Appendix B.
What happens when self.diamondShard's assessment of whether some consequences contain diamonds differs from ours? (Assume the agent's world model is especially good.)
upweights actions and plans that lead to
How is it determined what the actions and plans lead to?
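To make the question concrete, here is a hypothetical sketch of the structure I'm asking about - this is not the Appendix B code, and `predict` and `diamond_shard_value` are made-up stand-ins. The point is that both the consequence-prediction step and the "contains diamonds?" assessment are internal to the agent, which is where divergence from our own assessment can enter:

```python
from typing import Any, Callable, List

Plan = Any          # hypothetical stand-in
Consequences = Any  # hypothetical stand-in

def upweight_plans(
    plans: List[Plan],
    predict: Callable[[Plan], Consequences],               # the agent's world model: what does this plan lead to?
    diamond_shard_value: Callable[[Consequences], float],  # the agent's own "does this contain diamonds?" assessment
) -> List[Plan]:
    """Rank plans by the shard's assessment of their predicted consequences."""
    scored = [(diamond_shard_value(predict(plan)), plan) for plan in plans]
    return [plan for _, plan in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```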
...We expect an explanation in terms of the weights of the model and the properties of the input distribution.
We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.” We have a mechanistic explanation π for how those correlations arise from the structure of the model.
Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault. We'd like to ask: in this case, does the diamond appear to remain in the vault for the normal reason?
Partitions (of some underlying set) can be thought of as variables like this:
Another way of looking at it: say we're thinking of a variable X as a function f_X from the underlying set Ω to X's domain D_X. Then we can equivalently think of X as the partition of Ω with (up to) |D_X| ...
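A minimal sketch of that correspondence (the symbol names are mine): given a variable viewed as a function on the underlying set, the corresponding partition groups together exactly the points the function does not distinguish.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, Set

def partition_from_variable(underlying_set: Iterable[Hashable],
                            f: Callable[[Hashable], Hashable]) -> List[Set[Hashable]]:
    """View a variable (a function f on the underlying set) as a partition:
    two points are in the same cell iff f assigns them the same value."""
    cells = defaultdict(set)
    for omega in underlying_set:
        cells[f(omega)].add(omega)
    return list(cells.values())

# Example: the variable "n mod 2" on {0,...,5} corresponds to the partition
# {{0, 2, 4}, {1, 3, 5}}, with (up to) |domain| = 2 cells.
print(partition_from_variable(range(6), lambda n: n % 2))
```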
I agree with you - and yes we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.
I'll take a stab at answering the questions for myself (fairly quick takes):
I think Dan's point is good: that the weights don't change, and the activations are reset between runs, so the same input (including rng) always produces the same output.
I agree with you that the weights and activations encode knowledge, but Dan's point is still a limit on learning.
I think there are two options for where learning may be happening under these conditions:
Expanding a bit on why: I think this will fail because the house-building AI won't actually be very good at instrumental reasoning, so there's nothing for the sticky goals hypothesis to make use of.
I agree with this prediction directionally, but not as strongly.
I'd prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.
What was it changed from and to?
Hm, no, not really.
OK let's start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?
There are several claims about this function which are not true:
Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you're getting at isn't as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]
Could this be accomplished with literally zero effort from the post-writers? The tasks of identifying which posts are arXiv-worthy, formatting for submission, and doing the submission all seem like they could be done by entities other than the author. The only issue might be in associating the arXiv submitter account with the right person.
What about the real world is important here? The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don't think of a Platonic game but a real world implementation). Does that still seem fine?
Another aspect of the real world is that we don't necessarily have compact specifications of what we want. Consider the (Platonic) function that assigns to every 96x96 grayscale (8 bits per pixel) image a label from {0, 1, ..., 9, X} and correctly labels unambiguous images of digits (with X for the non-digit or ambiguous im...
I agree it is related! I hope we as a community can triangulate in on whatever is going on between theories of mental representation and theories of optimisation or intelligence.
Sure, one concrete example is the reward function in the tic-tac-toe environment (from X's perspective) that returns -1 when the game is over and O has won, returns +1 when the game is over and X has won, and returns 0 on every other turn (including a game over draw), presuming what I really want is for X to win in as few turns as possible.
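A minimal sketch of that reward function, using my own (made-up) encoding of the board as a 3x3 grid of 'X', 'O', or None:

```python
def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    lines = [[(r, c) for c in range(3)] for r in range(3)]                 # rows
    lines += [[(r, c) for r in range(3)] for c in range(3)]                # columns
    lines += [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]  # diagonals
    for line in lines:
        vals = [board[r][c] for r, c in line]
        if vals[0] is not None and vals.count(vals[0]) == 3:
            return vals[0]
    return None

def game_over(board):
    return winner(board) is not None or all(cell is not None for row in board for cell in row)

def reward(board):
    """-1 if the game is over and O has won, +1 if the game is over and X has won,
    0 on every other turn (including a game-over draw)."""
    w = winner(board)
    if game_over(board):
        if w == 'O':
            return -1
        if w == 'X':
            return +1
    return 0
```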
I can probably illustrate something outside of such a clean game context too, but I'm curious what your response to this one is first, and to make sure this example is as clear as it needs to be.
I agree that humans satisfying the conditions of claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points: I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things. Claim 1 should perhaps be more clear that it's not just saying such an AI design is possible, but that it's likely to be found and built.
The first thing I imagine is that nobody asks those questions. But let's set that aside.
This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., that expects people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it's hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.
Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.
Getting some feedback mechanism (including "what do human raters think of this?" but also mundane things like "what does this sensor report in this simulation or test run?") to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI's ability to get stuff done in the world comes from. The problem is genuinely capturing "will we be happy with the outcomes?" with such a mechanism.
The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing.
This sounds pretty close to what ELK is for. And I do expect that, if a solution to ELK is found, people will actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)
If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment.
I'm curious to understand what this means, what "data favoring aligned behavior" means particularly. I'll take for granted as background that there are some policies that are good ("aligned" and capable) and some that are bad. I see two problems with the concept of data favoring a certain kind of policy:
Straw person: We haven't found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn't one.
Ramana's gloss of TurnTrout: But AIs don't maximise their feedback. The feedback is just input to the algorithm that shapes the AI's cognition. This cognition may then go on to in effect "have a world model" and "pursue something" in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won't be pursuing high feedback. (Also, it might just do something e...
For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.
For 4, some of the arguments for it in this post (and comments) may help.
For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.
For example, I'm particularly interested in how "interpretability" is supposed to work if, in some...
The desiderata you mentioned:
It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.
Nice - thanks for this comment - how would the argument be summarised as a nice heading to go on this list? Maybe "Capabilities can be optimised using feedback but alignment cannot" (and feedback is cheap, and optimisation eventually produces generality)?
I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)
Yes that sounds right to me.
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if it were about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
I addressed this distinction previously, in one of the links in the OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external at all (any target would do, as long as it's external). But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
The important disanalogy
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we h...
In this story deception is all about the model having hidden behaviors that never get triggered during training
Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)
Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model exposed to 10+2= output...
We might summarise this counterargument to #30 as "verification is easier than generation". The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.
Counterclaim to "verification is easier than generation" is that any pivotal act will involve plans that human systems cannot predict the effects of just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don't ...
This seems to me like a case of the imaginary hypothetical "weak pivotal act" that nobody can ever produce. If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.
Vaguely related perhaps is the work on Decoupled Approval: https://arxiv.org/abs/2011.08827