This was written as part of the first Refine blog post day. Thanks to Chin Ze Shen, Tamsin Leake, Paul Bricman, and Adam Shimi for comments.

Magic agentic fluid/force

Edit: "Magi agentic fluid" = "influence" (more or less). I forgot that this word existed, so I made up my own terminology. Oops! To my defence, before writing this post I did not have a word, just a concept, and a sort of visualisation to go with it in my head. My internal thoughts are majority non-verbal. So when trying to translate my thoughts to words, I landed on the term that best described how this concept looked in my mind. 

 

Somewhere in my brain there is some sort of physical encoding of my values. This encoding could be spread out over the entire brain, or it could be implicit somehow. I'm not making any claim about how values are implemented in a brain, just that the information is somehow in there.

At some point in the future, a superintelligent AI is going to take some action.

If we solve alignment, then there will be some causal link between the values in my head (or some human head) and the action of that AI. In some way, whatever the AI does, it should do it because that is what we want.

This is not purely about information. Technically I have some non-zero causal influence over everything in my future lightcone, but most of this influence is too small to matter. More relevant, but still not the thing we want, is a deceptive AI. In that case the AI's actions are guided by our values in a non-negligible way, but not in the way we want.

I have a placeholder concept which I call magic agentic fluid, or magic agentic force (MAF). (I'm using "magic" in the traditional rationalist way of tagging that I don't yet have a good model of how this works.)

MAF is a high-level abstraction. I expect it not to be there when you zoom in too much, the same way solid objects only exist at a macro scale: if you zoom in far enough there are just atoms. There is no essence, only shared properties. I think the same way about agents and agency.

MAF is also in some way like energy. In physics energy is a clearly defined concept, but you can't have just energy; there is no energy without a medium, the same way you can't have speed without something that is moving. This is not a perfect analogy.

But I think that this concept does point to something real and/or useful, and that it would be valuable to try to get a better grasp on what it is, and what it is made of.

Let’s say I want to eat an apple, and later I am eating an apple. There is some causal chain that we can follow from me wanting the apple to me eating the apple: a chain of events propagating through spacetime that carries my will with it and enacts it. This chain is the MAF.

I would very much like to understand the full chain of how my values are causing my actions. If you have any relevant information, please let me know. 

Trial and error as an incomplete example of MAF 

I want an apple -> I try to get an apple and if it doesn't work, I try something else until I have an apple -> I have an apple

This is an example of MAF, because wanting an apple caused me to have an apple through some causal chain. But also notice that there are steps missing. How was “I want an apple” transformed into “I try to get an apple and if it doesn't work, I try something else until I have an apple”? Why not the alternative plan “I cry until my parents figure out what I want and provide me with an apple”? Also, the second step involves generating actual things to try. When humans execute trial and error, we don’t just do random things, we do something smarter. This is not just about efficiency: if I try to get an apple by trying random action sequences, with no learning other than “that exact sequence did not work”, then I’ll never get an apple, because I’ll die first.
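
Here is a minimal toy sketch of that loop; the goal, the candidate plans, and the "world" are all made up for illustration:

```python
# A toy sketch of the trial-and-error chain above. Everything here is
# made up for illustration: a real agent would need a far richer world
# model and a much smarter plan generator.

def propose_plans(goal):
    """Generate candidate plans, best guesses first (not random action sequences)."""
    if goal == "have an apple":
        yield "take an apple from the fruit bowl"
        yield "walk to the shop and buy an apple"
        yield "ask someone to hand me an apple"

def plan_works(plan, world):
    """Try a plan; return True if it succeeds in this toy world."""
    return plan in world["plans_that_work"]

def trial_and_error(goal, world):
    for plan in propose_plans(goal):
        if plan_works(plan, world):
            return plan          # "I have an apple"
    return None                  # ran out of ideas

world = {"plans_that_work": {"walk to the shop and buy an apple"}}
print(trial_and_error("have an apple", world))
```

Almost all of the interesting work is hidden in propose_plans, which is exactly the missing step: it has to produce sensible candidates rather than random action sequences.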

I could generate a more complete example, involving neural nets or something, and that would be useful. But I’ll stop here for now.

Mapping the territory along the path

When we have solved alignment, there will be a causal chain carrying MAF from somewhere inside my brain all the way to the actions of the AI. The MAF has to survive the passage through several distinct regions with minimal (preferably zero) information loss or other corruption.

  1. The first territory to be crossed is the path from my brain to my actions, where "actions" means anything that is externally observable. 
  2. The next part is the human-AI interaction. 
  3. The third part is the internal mind of the AI.
  4. The last part is how the AI’s actions interact with the world (including me).

Obviously all these paths have lots of back-and-forth feedback, both internally and with the neighbouring territories. However, for now I will not worry too much about that. I’m not yet ready to plan the journey of the MAF; I am mainly focusing on trying to understand and map out all the relevant territory.

My current focus

I’m currently prioritising understanding the brain (the first part of the MAF’s journey), for two reasons:

  1. It’s the part I feel the most confused about.
  2. It is tightly tied up with understanding what human values even are, which is a separate question that I think we need to solve. 

The MAF needs to carry information containing my values. It seems like it would be easier to plan the journey if we knew things like “What is the type signature of human values?”
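
To make the question a bit more concrete, here are a few candidate shapes that the type signature of human values could have. This is only meant to illustrate what the question is asking; all of the names are made up and none of the candidates is a claim about the answer.

```python
# A few candidate type signatures for "human values", purely to illustrate
# the question; none of these is a claim about the actual answer.
from typing import Callable, Optional

WorldState = dict   # stand-in for "a full description of a possible world"
Context = dict      # stand-in for "the situation I am currently in"

# Candidate 1: a utility function over world states.
UtilityFunction = Callable[[WorldState], float]

# Candidate 2: a preference relation: of two worlds, which do I prefer?
PreferenceRelation = Callable[[WorldState, WorldState], bool]

# Candidate 3: context-dependent and only partially defined; returns a
# judgement, or None for situations the values say nothing about.
PartialValues = Callable[[Context, WorldState], Optional[float]]
```

Which shape (if any) is right changes what it even means to carry the information out of the brain without corrupting it.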

By getting a better map of the part of the brain where values are encoded (possibly the entire brain), we’ll get a better understanding both of what human values are and of how to get that information out in an uncorrupted way.

In addition, learning more about how brains work will probably give us useful ideas for aligned AI design.

Appendix: Is there a possible alternative path around (not through) the brain?

Maybe we don’t need a causal link starting from the representation of my values in my brain. Maybe it is easier to reconstruct my values by looking at the causal inputs that formed them, e.g. genetic information and life history. Or you can go further back and study how evolution shaped my current values. If human values are a very natural consequence of evolutionary pressures and game theory, then maybe it would be easier to get the information this way, but I’m not very optimistic about that: I expect there to be too much happenstance encoded in my values. I think an understanding of evolution and game theory, and learning about my umwelt, could do a lot of work in creating better priors, but it is not a substitute for learning about my values directly from me.

One way to view this is that even if you can identify all the incentives that shaped me (both evolutionary and within my lifetime), it would be wrong to assume that my values will be identical to those incentives. See e.g. Reward is not the optimization target. Maybe it is possible to pinpoint my values from my history (evolutionary and personal), but it would be far from straightforward, and personally I don’t think you can get all the information that way. Some of the information relevant to my value formation (e.g. exactly how my brain grew, or some childhood event) will be lost in time, except for its consequences on my brain.

Learning about my history (evolutionary and personal) can be useful for inferring my values, but I don’t think it is enough.

Appendix: Aligned AI as a MAF amplifier

An aligned AI is a magic agentic force amplifier. Importantly, it’s supposed to amplify the force without corrupting the content. 

One way to accomplish this is for the AI to learn my values in detail, and then act on those values. I visualise this as me having a fountain of MAF at the centre of my soul (I don’t actually believe in souls, it’s just a visualisation). The AI learns my values, through dialogue or something, and creates a copy of my fountain inside itself. Now, technically, the AI’s actions flow from the copy of my values, but that’s ok if the copying process is precise enough.

Another way to build an agentic force amplifier is to build an AI that reacts more directly to my actions. My central example for this is a servomotor, as in power steering, which literally amplifies the force applied by my arms. I’ve been thinking about how to generalise this, but I don’t know how to scale it to superintelligence (or even human-level intelligence). With the servomotor there is a tight feedback loop: I can notice the outcome of my amplified force on the steering wheel, which allows me to course correct. Is there a way to recreate something that plays the same role for more complicated actions? Or is this type of amplification doomed as soon as there is a significant time delay between my action and the outcome of the amplification?
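
Here is a toy sketch of that feedback structure; the "steering" model and all the numbers are made up, and the only point is that the same amplification that works with a tight loop misbehaves once the outcome is observed with a delay:

```python
# A toy sketch of the feedback-loop picture above. The steering model and
# the numbers are invented; it only illustrates why a delay between my
# action and the observed outcome makes correction hard.

def steer(target, delay, steps=30, gain=10.0):
    """I push towards `target`; the amplifier multiplies my push by `gain`,
    but I only observe the outcome `delay` steps later."""
    position = 0.0
    observations = [0.0] * (delay + 1)        # pipeline of delayed observations
    for _ in range(steps):
        observed = observations[0]            # possibly stale information
        my_push = 0.05 * (target - observed)  # my small corrective nudge
        position += gain * my_push            # the amplified effect
        observations = observations[1:] + [position]
    return position

print(steer(target=1.0, delay=0))   # tight loop: ends up close to 1.0
print(steer(target=1.0, delay=5))   # delayed feedback: overshoots and oscillates
```

With no delay I can keep correcting my amplified push; with a delay the same corrections arrive too late and the outcome drifts away from what I wanted.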

Comments

You can't zoom infinitely far in on the causal chain between values and actions, because values (and to a large extent actions) are abstractions that we use when modeling agents like ourselves. They are emergent. To talk about my values at all is to use a model of me where I use my models in a certain agenty way and you don't sweat the details too hard.

"Somewhere in my brain there is some sort of physical encoding of my values."

Not sure if this is the intended meaning, but the claim that values don't depend on the content of the world outside the brain is generally popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointer to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization's culture.

This is an important distinction for corrigibility, because this claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment; that content is not part of its current definition/computation. It also doesn't make sense to talk about this agent pursuing its goals in a diverse set of environments, unless we expect the goals to vary with the environment.

For decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn't be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn't be chosen as the most valuable mapping, because value/goal itself depends on observations, and even after some observations it's not pinned down precisely. So if this point is taken into account, we need a different decision theory, even if it's not trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!

I mean that the information about what I value exists in my brain. Some of this information is pointers to things in the real world. So in a sense the information partly exists in the relation/correlation between me and the world.

I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.

That's not what I'm talking about. I'm not talking about what the goals are saying, what they speak of, or what they consider valuable or important. I'm talking about where the data to learn what they are is located, as we start out not knowing the goals at all and need to learn them. There is a particular thing, say a utility function, that is the intended formulation of the goals. It could be the case that this intended utility function can be found somewhere in the brain. That doesn't mean that it's a utility function that cares about brains; the questions of where it's found and what it cares about are unrelated.

Or it could be the case that it's recorded on an external hard drive, and the brain only contains the name of the drive (this name is a "pointer to value"). It's simply not the case that you can recover this utility function by only looking at the brain, without actually looking at the drive. So the utility function u itself depends on the environment E; that is, there is some method of formulating utility functions, t, such that u = t(E). This is not the same as saying that the utility of the environment depends on the environment, which gives the utility value u(E) = t(E)(E) (there's no typo here). But if it's actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
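
In code, the distinction looks something like this (a made-up toy example, only to show the types):

```python
# A tiny sketch of the distinction above; everything here is invented.
# t maps an environment to a utility function, so u = t(E) itself depends
# on E. u(E) = t(E)(E) is the further step of also evaluating that
# function on the environment, which is a different statement.

def t(environment):
    """Read the goal content off the environment (e.g. off the hard drive)."""
    valued_thing = environment["hard_drive"]   # target of the "pointer to value"
    def u(world):                              # the recovered utility function
        return 1.0 if world.get("contains") == valued_thing else 0.0
    return u

E = {"hard_drive": "apples", "contains": "apples"}
u = t(E)        # which function u even is cannot be known without looking at E
print(u(E))     # u(E) = t(E)(E): only here is the environment also evaluated
```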

"I expect there to be too much happenstance encoded in my values."

I believe this is a bug, not a feature that we would like to reproduce.

I think that the direction you described, with the AI analysing how you acquired your values, is important, because it shouldn't just be mimicking your current values. It should be able to adapt the values to new situations the way you would (distributional shift). Think of all the books/movies where people end up in unusual situations and have to make tough moral calls, like a plane crashing in the middle of nowhere with 20 survivors who are gradually running out of food. A superhuman AI will be running into unknown situations all the time because of its different capabilities.

Human values are undefined for most situations a superhuman AI will encounter.

Some observations:

  • Genes reproduce themselves. 
  • Humans reproduce themselves.
  • Symbols are relearned.
  • Values are reproduced. 

Each needs an environment to do so, but the key observation seems to be that a structure is reliably reproduced across intermediate forms (mitosis, babies, language, society) and that these structures build on top of each other. It seems plausible that there is a class of formal representations that describes 

  • the parts that are retained across instances,
  • the embedding into each other (values into genes and symbols), and
  • the dynamics of the transfer.   

If something is good at replicating, then there will be more of that thing; this creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.
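
A tiny sketch of that selection effect (the replicator kinds and all the numbers are invented):

```python
# A tiny replicator-selection sketch; the two kinds of replicators and the
# numbers are invented. Things that copy themselves more reliably end up
# dominating the population.
import random

population = ["fast"] * 10 + ["slow"] * 10   # two kinds of replicators
copy_chance = {"fast": 0.6, "slow": 0.3}     # per-step chance of making a copy

for _ in range(20):
    population += [kind for kind in population if random.random() < copy_chance[kind]]

print(population.count("fast"), population.count("slow"))
```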

Maybe a self-replicating system can be seen as an agentic system with the goal of self-replication? In this particular question all the uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world.

Either way, the type of agency I'm mainly interested in is the type of agency that has goals other than just self-replication. Although maybe there are things to be learned from the special case of having self-replication as a goal?

If the AI learns my values, then that is a replication of my values. But there are also examples of magic agentic force where my values are not copied at any point along the way.

Looking at how society is transferred between generations might hold some clues for value learning? But I'm less optimistic about looking at what is similar across self-replication in general, because I think I already know this, and also it seems to be one abstraction level too high, i.e. the similarities are properties above the mechanistic details, and those details are what I want.