So I just want to check: Is your goal with boundaries just to formalize a moral thing?
I'll summarize what I mean by that:
Claim 1: By "boundaries", you mean "the boundaries around moral patients, namely humans".
Claim 1b: And to some degree also the boundaries around plants and animals. Also maybe nations, institutions, and other things.
Claim 2: If we can just
(i) locate the important boundaries in the world, and then
(ii) somehow protect them,
then this gets at a lot (but not all!) of what the "safety" in "AI safety" should be.
Claim 3: We might actually be able to do that↑.
e.g.: Markov blankets are a natural abstraction for (2.i). (A sketch of the standard definition follows after these claims.)
Claim 4: Protecting boundaries won't be sufficient for all of "safety"; other (non-boundaries) specifications/actions will probably also be necessary.
For example, we would probably also need to separately specify some things that aren't obviously contained by the boundaries we mean, e.g.: "clean water", "clean air", and a tractably small set of other desiderata.
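(For readers who haven't met the term: here is a rough sketch of the textbook Markov blanket condition, just the standard definition rather than anything specific to davidad's formalism. The blanket of a variable is the set of neighbouring variables that screens it off from the rest of the network, which is why it behaves like an informational boundary.)

```latex
% Markov blanket of a node X in a Bayesian network over variables V:
% its parents, its children, and its children's other parents.
\mathrm{MB}(X) \;=\; \mathrm{Pa}(X) \,\cup\, \mathrm{Ch}(X) \,\cup\, \bigl(\mathrm{Pa}(\mathrm{Ch}(X)) \setminus \{X\}\bigr)
% Conditioned on its blanket, X is independent of everything outside it:
X \;\perp\!\!\!\perp\; V \setminus \bigl(\{X\} \cup \mathrm{MB}(X)\bigr) \;\big|\; \mathrm{MB}(X)
```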
Here are my questions for you:
Q1: Do you agree with each of the claims above?
Q2: Is your goal with boundaries just to formalize the moral/safety thing, or is there anything else you want from boundaries?
Q3: It seems that Garrabrant, Critch, and maybe others want different things from boundaries than you do, and I'm wondering if you have thoughts about that.
Garrabrant: From talking to him, I know that he's thinking about boundaries too, but more about boundaries in the world as instruments to preserve causal locality, predictability, evolution, etc. This is quite different from talking specifically about the boundaries around agents.
Critch: I haven't spoken to him yet, but I think you once told me that Critch seems to be thinking about boundaries more in terms of ~"just find the 'boundary protocol' and follow it and all cooperation with other agents will be safe". Is this right? If so, this seems closer to what you want, but still kinda different.
TJ: I think TJ has some other ideas that I am currently unable to summarize.
davidad
Claim 1+1b: yes, to first order. [To second order, I expect that the general concept of things with «boundaries» will also be useful for multi-level world-modelling in general, e.g. coarse-graining fluid flow by modelling it in terms of cells that have boundaries on which there is a net flow, and that it might be a good idea to "bake in" something like a concept of boundaries to an AI system's meta-ontology, so that it has more of a tendency to have moral patients among the entities in its object-level ontology. But my mainline intention is for the object-level ontology to be created with humans in the loop, and the identification of entities with boundaries could perhaps just as easily be a layer of interpretation on top of an ontology with a more neutral meta-ontology of causation. Thinking through both routes more is at the frontier of what I consider "conceptual «boundaries» research".]
Q2: yes, my ultimate goal with «boundaries» is just to formalise injunctions against doing harm, disrespecting autonomy, or (at the most ambitious) excluding humans from cooperation. (I am borrowing the pluralism of Garrett Cullity's Concern, Respect, & Cooperation in separating those three cases.) However, as discussed in my response to Claim 1, there might also be bonus benefits where incorporating «boundaries» at a meta-ontological level (prior to the moral/normative/axiological specifications) makes multi-scale world-modelling go better (and/or makes the models more amenable to using «boundaries» to formalise the injunctions).
davidad
Q3: Your models seem roughly right to me. (I, too, consider attempting to summarize TJ's ideas a bit of a risky proposition, as they have a lot of nuance that is easy to get wrong. Perhaps we could invite them to join the dialogue.) I have a lot of respect for all three people. My current view is that:
Critch's full vision of «boundary protocols», if it works out, would be strictly superior to my current vision of «boundaries», but I don't currently see how to fill in the mathematical details of «boundary protocols». However, given that the mathematical details of «boundaries» (simpliciter) are due to Critch, I am keen to find out if he (perhaps with the support of other workshop participants) can write down some indications about how to formalise «boundary protocols» too.
My understanding from past conversations with Garrabrant is that he wants to remove time from the ontological dependencies for defining «boundaries». The notion of time used in the current formalism is a global discrete clock, which is clearly inadequate. I tend to prefer a partial order defining causal precedence, which is better and may even be adequate for my purposes. Garrabrant's perspective on causation and time, which incorporates insights from logical induction, is clearly even better, and I would not be shocked if it somehow turns out to be crucially necessary. It is even harder to work out how to define «boundaries» in that setting, but surely worth spending some time on.
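(To make the contrast concrete, a rough order-theoretic sketch, not anything specific to the existing «boundaries» formalism or to Garrabrant's framework: a global discrete clock totally orders events by timestamp, whereas causal precedence only partially orders them, leaving causally unrelated events incomparable.)

```latex
% Global discrete clock: a timestamp t : E \to \mathbb{N} on events,
% inducing a total preorder:
e_1 \le e_2 \;\iff\; t(e_1) \le t(e_2)
% Causal precedence: a partial order \preceq on E, where
% e_1 \preceq e_2 means "e_1 can causally influence e_2";
% events related in neither direction are simply incomparable,
% so no global notion of simultaneity is imposed.
```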
Chipmonk
Okay, I'll summarize what I learned from your messages:
re Q1 Claim 1 and Q2-
You think it might also be helpful to have boundaries as a ~primitive in the OAA's multi-level world-model.
re Q2-
I liked this: "formalise injunctions against doing harm, disrespecting autonomy, or (at the most ambitious) excluding humans from cooperation".
Hm, I understand how boundaries help with preventing harm and preventing disrespect to autonomy. But I don't immediately understand how boundaries help with preventing "excluding humans from cooperation". I'll have to ask about that.
re Q1, Q2, Q3-
You roughly agree with the other stuff I wrote.
re Q3-
Garrabrant has some questions about logical time vs. physical time, with respect to boundaries.
Hm, I don't understand this in detail. But I also don't feel inclined to dig into it right now.
Chipmonk
New questions I have for you:
Q2.1: Could you give one short canonical example of how boundaries might possibly help prevent "excluding humans from cooperation"?
Q3.1: Could you give one short canonical example of what the boundary protocol thing is and how it would be good?
Q3.2: You said "given that the mathematical details of «boundaries» (simpliciter) are due to Critch", but I'm not sure this is true… Why do you say this?
Context: I assume you're referring to his boundaries 3a post.
(Note for the reader: I summarize some of that post here.)
davidad
Q2.1: e.g. using boundaries as a basis for identifying humans in a world, and then using that as the basis for defining a cooperative game in which the humans are treated as player agents, even if the AI system has a good enough model of their internals that it more natively sees humans as NPCs. I haven't thought much about this direction, though.
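(A very loose sketch of what this could look like, purely illustrative with hypothetical names, not OAA code: a boundary detector yields entities, and each human-boundary entity is lifted into the game as a player with its own actions and payoff, instead of being folded into the environment model.)

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Entity:
    """Something the world-model has identified as having a boundary."""
    name: str
    is_human: bool  # e.g. decided by a boundary / Markov-blanket detector

@dataclass
class Player:
    """A participant in the cooperative game, with its own choices and payoff."""
    name: str
    actions: List[str]
    utility: Callable[[dict], float]  # maps a joint outcome to this player's payoff

def lift_to_players(entities: List[Entity]) -> List[Player]:
    """Treat every boundary-identified human as a player agent, even if the
    system could in principle model them as part of the environment."""
    players = []
    for e in entities:
        if e.is_human:
            players.append(Player(
                name=e.name,
                actions=["(elicited from the human's own choices)"],
                utility=lambda outcome, who=e.name: outcome.get(who, 0.0),
            ))
    return players
```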
davidad
Q3.1: A canonical example is that a person may consent to having their physical boundary invaded by a surgeon's blade for the purpose of excising a cancerous tumour. Without boundary protocols, there would be nothing such a person could do to stop an overzealous boundary-protector from blocking the blade before it punctures the skin.
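(A minimal illustrative sketch of that intuition, with entirely made-up names rather than anything from Critch's formalism: a plain boundary-protector blocks every crossing, whereas a boundary protocol lets a crossing through exactly when the boundary's owner has consented to it.)

```python
from dataclasses import dataclass

@dataclass
class BoundaryCrossing:
    actor: str        # e.g. "surgeon"
    owner: str        # whose boundary is crossed, e.g. "patient"
    description: str  # e.g. "incision to excise tumour"

def naive_protector_allows(crossing: BoundaryCrossing) -> bool:
    """An overzealous boundary-protector: no crossing is ever permitted."""
    return False

def protocol_allows(crossing: BoundaryCrossing, consents: set) -> bool:
    """A boundary protocol: a crossing is permitted iff the boundary's owner
    has consented to this particular kind of crossing."""
    return (crossing.owner, crossing.description) in consents

surgery = BoundaryCrossing("surgeon", "patient", "incision to excise tumour")
consents = {("patient", "incision to excise tumour")}
assert not naive_protector_allows(surgery)  # blocked, despite consent
assert protocol_allows(surgery, consents)   # permitted, because consented
```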
davidad
Q3.2: Yes, I am.
Chipmonk
Q2.1: Oh, that's neat! I've also had thoughts in a similar direction.
(I.e.: other agents are only sovereign from your perspective to the extent that you lack information about their internal state. If you had a high-res brain scan of someone and the ability to simulate it, they would not be sovereign from your perspective, and the boundaries abstraction falls apart.)
Chipmonk
Q3.1: ah ok
Chipmonk
Q3.2: ok
Chipmonk
That's all of my questions right now. I'll just publish?
davidad
Ok!
Chipmonk
Oh, I forgot to mention: this should all also be considered in the context of davidad's Open Agency Architecture (OAA), his big end-to-end AI safety plan, for which he has ~£50M from the UK government over the next 3 years. Programme thesis.