I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Linkpost for the first part is here; this linkpost is for the second part. 

Compared to the first part, the second part has less material that hasn't already been written up, although it does a better job of tying it all into the bigger picture than any existing writeup. I will link to relevant posts in the outline below.

Major pieces in part two:

I ended up rushing a bit on the earlier parts, in order to go into detail on abstraction. That was optimal for the group I was presenting to at the time I presented, but probably not for most people reading this. Sorry.

Here's the video:

Again, big thanks to Rob Miles for editing! (Note that the video had some issues - don't worry, the part where the camera goes bonkers and adjusts the brightness up and down repeatedly does not go on for very long.) The video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.


Some thoughts on this question that you mention briefly in the talk.

What decision theory does evolution select for?

I think that evolution selects for functional decision theory (FDT). More specifically, it selects for the best policy over a lifetime, not the best action in a given situation. I don't mean that we actually cognitively calculate FDT, but that there is evolutionary pressure to act as if we follow FDT.

Example: Revenge
By revenge I mean burning some of your utility just to get back at someone who hurt you. 

Revenge is equivalent to transparent Newcomb's problem. You can see that Omega has predicted that you will two-box, i.e. the box that could have held lots of money is empty. What do you do? You can one-box anyway, counterfactually making this situation less likely but also giving up the smaller reward (take revenge), or you can accept that you can't change the past, cut your losses, and just take the smaller reward (no revenge).
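To make the policy-vs-action distinction concrete, here is a toy expected-value sketch; the payoffs and the 0.99 predictor accuracy are just numbers I made up for illustration:

```python
# Toy payoffs for transparent Newcomb with a near-perfect predictor.
BIG, SMALL = 1_000_000, 1_000
ACCURACY = 0.99  # assumed predictor accuracy, purely illustrative

def expected_value(policy_one_boxes: bool) -> float:
    # The predictor fills the big box iff it predicts the policy one-boxes.
    p_big_full = ACCURACY if policy_one_boxes else 1 - ACCURACY
    if policy_one_boxes:
        return p_big_full * BIG           # take only the big box (possibly empty)
    return SMALL + p_big_full * BIG       # take both boxes

print(expected_value(True), expected_value(False))   # ~990000 vs ~11000
```

The one-boxing (revenge-like) policy wins in expectation, even though, once you are actually staring at an empty big box, the one-boxing action locally just gives up the smaller reward.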

The way this is evolutionarily encoded in humans is not as a tendency to think about counterfactual situations. Instead we get mad at Omega for withholding the larger reward, and we one-box out of spite, to make Omega's prediction wrong, forcing a lose-lose outcome. But it is still FDT in practice.

Taking revenge can also be justified causally, if it's about upholding your reputation so that no one crosses you again. Humans definitely do this calculation too. But it seems like most humans have a revenge drive that is stronger than what CDT would recommend, which is why I think this example backs up my claim that evolution selects for FDT.
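Here is a minimal simulation of the "evolution selects policies, not actions" point, with made-up numbers; the assumption doing all the work is that a heritable revenge disposition deters attackers:

```python
import random

random.seed(0)

# Minimal sketch with made-up numbers: evolution selecting a lifetime *policy*.
# Key assumption: agents known to retaliate get attacked much less often.
POP, GENS, LIFETIME = 200, 20, 20
ATTACK_LOSS, REVENGE_COST = 4.0, 1.0
P_ATTACK = {"revenge": 0.05, "no_revenge": 0.5}   # deterrence effect

def lifetime_fitness(policy: str) -> float:
    fitness = 100.0
    for _ in range(LIFETIME):
        if random.random() < P_ATTACK[policy]:
            fitness -= ATTACK_LOSS                 # harm from being attacked
            if policy == "revenge":
                fitness -= REVENGE_COST            # burning utility to retaliate
    return fitness

pop = ["revenge"] * (POP // 2) + ["no_revenge"] * (POP // 2)
for _ in range(GENS):
    scored = sorted(((lifetime_fitness(p), p) for p in pop), reverse=True)
    pop = [p for _, p in scored[: POP // 2]] * 2   # top half leaves two offspring each

print(pop.count("revenge") / POP)   # -> 1.0: the costly revenge policy takes over
```

Every individual act of revenge is a pure loss, the kind of thing CDT would advise against, but the policy of taking revenge is what gets selected.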

My claim is of a similar type as Caspar's "Doing what has worked well in the past leads to evidential decision theory". It's a statement about the resulting policy, not about the reasoning steps of the agent. Caspar describes an agent that does the action that has worked well in the past. Evolution is a process that selects the policy that has worked well in the past, which should effectively give you some version of FDT-son-of-EDT.

There are situations where FDT behaves differently depending on when and why it was created (e.g.). I think I could figure out how this would play out in the context of evolution, but it would take some more thinking. If you think I'm on the right track and can convince me that this is useful, I'll give it a try.

This argument sounds roughly right to me, though I'm not sure FDT is exactly the right thing. If two organisms were functionally identical but had totally different genomes, then FDT would decide as though they're one unit, whereas I think evolution would select for deciding as though only the genetically-identical organisms are a unit? I'm not entirely sure about that.

I agree that it's not exactly FDT. I think I actually meant updateless decision theory (UDT), but I'm not sure, because I have some uncertainty about exactly what others mean by UDT.

I claim that mutations + natural selection (evolution) selects for agents that act according to the policy they would have wanted to pre-commit to at the time of their birth (last mutation).

Yes, there are some details around who I recognize as a copy of me. In classical FDT this would be anyone who is running the same program (whatever that means). In evolution this would be anyone who is carrying the same genes. Both of these concepts are complicated by the fact that "same program" and "same genes" are scalar (or more complicated?) rather than Boolean values.

Edit: I'm not sure I agree with what I just said. I believe something in this direction, but I want to think some more. For example, people with similar genes probably don't cooperate because of decision theory (my decision to cooperate with you is correlated with your decision to cooperate with me), but because of shared goals (we both want to spread our shared genes).
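As a toy illustration of the "scalar, not Boolean" point above, here is a sketch; the Hamilton-style threshold (cooperate when r * B > C) is my own added framing, and it is arguably closer to the shared-goals story than to the decision-theory one:

```python
# Toy sketch of "same genes" as a degree of relatedness rather than a Boolean,
# with a Hamilton-style criterion (cooperate when r * B > C) as the payoff rule.

def relatedness(genome_a: str, genome_b: str) -> float:
    """Fraction of positions at which two equal-length toy genomes match."""
    matches = sum(a == b for a, b in zip(genome_a, genome_b))
    return matches / len(genome_a)

def worth_cooperating(r: float, benefit_to_other: float, cost_to_self: float) -> bool:
    """Help when the relatedness-weighted benefit exceeds the cost to yourself."""
    return r * benefit_to_other > cost_to_self

r = relatedness("AABBCCDD", "AABBCCXY")                               # 0.75
print(r, worth_cooperating(r, benefit_to_other=10, cost_to_self=6))   # 0.75 True
```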