TL;DR: This post is about the value of recreating a “caring drive” similar to the one found in some animals, and why doing so might be useful for the AI alignment field in general. Finding and understanding the right combination of training data, loss function, architecture, etc. that allows gradient descent to robustly find or create agents that care about other agents with different goals could be very useful for understanding the bigger problem. While this drive is neither perfect nor universally present, if we can understand, replicate, and modify this behavior in AI systems, it could provide a hint toward an alignment solution in which the AGI “cares” for humans.

Disclaimers: I’m not saying that “we can raise AI like a child to make it friendly” or that “people are aligned to evolution”. I find both of these claims to be obvious errors. Also, I will often write about evolution as if it were an agentic entity that “will do this or that”, not because I think it is agentic, but because it’s easier to write this way. I also think that GPT-4 has some form of world model, and will refer to this a couple of times.

Nature's Example of a "Caring Drive"

Certain animals, notably humans, display a strong urge to care for their offspring.

I think that part of one of the possible “alignment solutions” will look like the right set of training data + training loss that allows gradient descent to robustly find something like a “caring drive” that we can then study, recreate, and repurpose for ourselves. And I think we already have some rare examples of this in nature. Some animals, especially humans, will kind-of-align themselves to their presumed offspring. They will want to make their life easier and better, to the best of their capabilities and knowledge. Not because they are “aligned to evolution” and want to increase the frequency of their genes, but because of some strange internal drive created by evolution.

A set of triggers tuned by evolution, activated by events associated with birth, awakens the mechanism. It re-aims the more powerful mother agent to be aligned to the less powerful baby agent, and it just so happens that the babies give the right cues and are nearby when the mechanism does its work.

We will call the more powerful initial agent that changes its behavior and tries to protect and help its offspring the “mother”, and the less powerful and helpless agent the “baby”. Of course the mechanism isn’t ideal, but it works well enough, even in the modern world, far outside the initial evolutionary environment. And I’m not talking about humans only: stray urban animals that live in our cities still adapt their “caring procedures” to this completely new environment, without several rounds of evolutionary pressure. If we can understand how to build this mechanism for something like a “cat-level” AI, by finding it via gradient descent and then rebuilding it from scratch, maybe we will gain some insights into the bigger problem.

The rare and complex nature of the caring drive in contrast to simpler drives like hunger or sleep.

What do I mean by a “caring drive”? Animals, including humans, have a lot of competing motivations, “want drives”: they want to eat, sleep, have sex, etc. The same seems to apply to caring about babies, but it is a much more complicated set of behaviors. You need to:

- correctly identify your baby,
- track its position,
- protect it from outside dangers,
- protect it from itself, by predicting its actions in advance to stop it from certain injury,
- try to understand its needs in order to correctly fulfill them, since you don’t have direct access to its internal thoughts,
- and so on.
Compared to “wanting to sleep after being active too long” or “wanting to eat when blood sugar is low”, I would confidently say that this is a much more complex “wanting drive”. And you have no idea about the “spreading the genes” part. You just “want a lot of good things to happen” to your baby for some strange reason. I’m not yet sure, but this complexity could be the reason why there is an attraction basin for a more “general” and “robust” solution. Just like an LLM will find some general form of “addition” algorithm instead of trying to memorize the examples seen so far, especially if it will not see them again too often. I think that instead of hardcoding a bunch of brittle optimized caring procedures, evolution repeatedly finds a way to make mothers “love” their babies, outsourcing a ton of work to them, especially when the situations where caring is needed aren’t too similar to each other.

And all of it is a consequence of a blind hill-climbing algorithm. That’s why I think we might have a chance of recreating something similar with gradient descent. The trick is to find the right conditions that repeatedly allow gradient descent to find the same caring-drive structure, find the similarities, understand the mechanism, recreate it from scratch to avoid hidden internal motivations, repurpose it for humans, and we are done! Sounds easy (it’s not).

Characteristics and Challenges of a Caring Drive

It’s rare: most animals don’t care, because they can’t or don’t need to.

A lot of the time, it’s much more efficient to just make more babies, but sometimes animals must provide some care, simply because that was the path evolution found that works. And even if they care about some of their offspring, they may choose one and let the others die, again because they don’t have many resources to spare, and evolution will tune this mechanism to favor the most promising offspring if that is more efficient. And not all animals can become such caring parents: you can’t really care for and protect something else if you are too dumb, for example. So there are also some capability requirements for animals to even have a chance of obtaining such an adaptation. I expect the same capability requirements for AI systems. If we want to recreate this, we will need to try it with some reasonably advanced systems; otherwise I don’t see how it might work at all.

It’s not extremely robust: with enough brain damage or the wrong tuning, the mechanism will malfunction severely

This is obvious: there is nothing surprising in “if you damage it, it could break”, and this applies to any solution to some degree. It shouldn’t be surprising that drug-abusing or severely ill parents often fail to care for their child at all. However, if we succeed at building an aligned AGI that is stable enough through some initial takeoff period, then at some point the problem of protecting it from damage should no longer be ours to worry about. But we still need to ensure initial stability.

It’s not ideal from our AI -> humans point of view

Evolution has tuned this mechanism for efficient resource allocation, which sometimes means shutting down care when resources need to be diverted elsewhere. Evolution is ruthless because resources are limited, and it will eradicate not only genetic lines that care too little, but also ones that care too much. We obviously don’t need that part. And a lot of the time you can just give up on your baby and try to make a new one if the situation is too dire, which we also don’t want to happen to us. This means we need to understand how the mechanism works, to be able to construct it the way we want.

But it’s also surprisingly robust!

Of course, there are exceptions. All people are different, and we can’t afford to clone some “proven to be a loving mother” woman hundreds of times to see whether the underlying mechanism triggers reliably in all environments. But it seems to work in general, and more than that: it continues to work reliably even with our current technologies, in our crazy world, far away from the initial evolutionary environment. And we didn’t have to live through waves of birth declines and rises as evolution adapted us to the new realities, tuning the brains of each new generation of mothers to find the ones that would start to care about their babies in the new agricultural or industrial or information era.

Is this another “anthropomorphizing trap”?

For all I know, it is possible to imagine an alternative human civilization without any parental care; in that case our closest candidate for such behavior would be some other intelligent species, intelligent enough to be able to care in theory and forced by their weak bodies to do so in order to have any descendants at all. Maybe it would be some mammal, or bird, or whatever, it doesn’t matter. The point I’m making here is that I don’t think it is some anthropic trap to search for inspiration or hints in our own behavior. It just so happens that we are smart but have weak babies that require a lot of attention, so we received this mechanism from evolution as the “simplest solution”. You don’t need to search for more compact brains that would allow for a longer pregnancy, or hardwire even more knowledge into the infant’s brain, if you can outsource a lot of the work to the smart parents: you just need to add the “caring drive” and it will work fine. We want AI to care about us not because we care about our children and want the same from AI; we just don’t want to die, and we would want AI to care about us even if we ourselves lacked this ability.

Potential flaws:

I’m not saying that this is a go-to solution we can just copy, but it is a step in the right direction from my point of view. Replicating similar behavior and studying its parts could be a promising direction. There are a few things that might make this whole approach useless, for example:

  1. Somebody shows that there was in fact a lot of evolutionary pressure each time our civilization made another technological leap forward, which caused a lot of abandoned children or something like this. Note that it is not sufficient to point at the birth decline, which would only support the point that “people aren’t aligned to evolution”, which I’m not claiming in the first place. You would need to show that modern humans/animals have lower chances of caring about their already-born babies in the new modern environment. And I’m not sure that pointing out cases where parents made terrible caring choices would work either, since that could be a capability problem, not a problem of intentions.
  2. The technical realization requires too much computation because we can’t calculate gradients properly, or something like that. I expect this to work only when some “high” level of capabilities is reached, something on the level of GPT-4’s ability to actually construct some world model in order to have a chance of meaningfully predicting tokens in completely new texts that are far away from the original dataset. Without an agent that has some world model and can adapt to a new context, I find it strange to expect it to have some general “caring drive”. At best it could memorize some valuable routines that will likely break completely in a new environment.
  3. Any such drive will always be “aimed” by the global loss function, something like: our parents only care about us in a way that makes us produce even more babies and increase our genetic fitness. But that seems false? Maybe because there are in fact some simpler versions of such “caring drives” that evolution found for many genetic lines independently, which just make mothers “love” their babies in some general and robust way; and while given enough time it would surely be possible to optimize all of that away, in most cases it’s the easiest solution that evolution can afford first, for reasons not yet understood. I understand that in a similar environment something really smart would figure out some really optimal way of “caring”, like keeping the baby in some equivalent of cryosleep, shielded from the outside, farming the points for keeping it alive, but we will not get there so easily. What we might actually get is an agent that is smart enough to find creative ways of keeping the baby agent happy/alive/protected, but still way too dumb to Goodhart all the way to the bottom and out of the simulation. And by studying what makes it behave that way we might get some hints about the bigger problem. It still seems helpful to have something like a “digital version of a caring cat” to run experiments on and understand the underlying mechanism.
  4. Similar to 3: maybe it will work mostly in the direction of the “mother” agent’s values, since the baby agent needs roughly the same things to thrive in nature. A mechanism that simply mirrors the original values and projects them onto the baby agent would work fine, and that could be what evolution finds every time. And it might turn out that this same strange, but mostly useless for us, mechanism is what we find repeatedly in our experiments.

Overall, I’m pretty sure that this does in fact work, certainly well enough to be a viable research direction.

Technical realization: how do we actually make this happen?

I don’t have a concrete plan. I have a few ideas, but I’m not sure how practically feasible they are. And since nobody knows how this mechanism works, as far as I know, it’s hard to imagine having a concrete blueprint for creating one. So the best I can offer is: we try to create something that looks right from the outside and see if there is anything interesting on the inside. I also have some ideas about what paths could or couldn’t lead to interesting internals.

First of all, I think this “caring drive” couldn’t run without some internal world model. Something like: it’s hard to imagine far-generalized goals without far-generalized capabilities. And a world model could be obtained from a highly diverse, non-repetitive dataset, which forces the model to actually “understand” something and stop memorizing.

Maybe you can set up an environment with multiple agents, similar to DeepMind’s here (https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play): initially reward an agent for surviving by itself, and then introduce a new type of event: a baby-agent appears near the original mother-agent (we will call this “birth”), and from that point on, the whole reward comes purely from how long the baby survives. The baby would initially have fewer mechanical capabilities, like speed, health, etc., and then “grow” to become more capable by itself. I’m not sure what the “brain” of the baby-agent should be: another NN, or maybe the same one found from training the mother agent? Maybe we could create a chain of agents: agent 1 at some point gives birth to agent 2 and receives reward for each tick that agent 2 is alive; agent 2 itself gives birth to agent 3 and receives reward for each tick that agent 3 is alive, and so on. Maybe it will produce something interesting. Obviously “alive time” is a proxy, and given enough optimization power we should expect Goodhart horrors beyond our comprehension. But the idea is that maybe there is some “simple” solution that will be found first, which we can study. Recreating the product of evolution without using its immense “computational power” could be very tricky. A minimal sketch of this reward scheme is given below.
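
To make the reward structure concrete, here is a minimal toy sketch in Python. Everything in it (the “forage”/“shield” actions, the hazard probability, the episode length, the hard-coded policy at the bottom) is a made-up placeholder rather than a proposal for the real training setup; the only thing it pins down is the switch from “reward for the mother’s own survival” to “reward only for each tick the baby stays alive” after the birth event.

```python
import random


# Toy sketch of the reward scheme described above, not a real training
# environment. All details are invented placeholders.
class CaringDriveEnv:
    def __init__(self, birth_step=50, max_steps=200, hazard_prob=0.05):
        self.birth_step = birth_step    # tick at which the baby-agent appears
        self.max_steps = max_steps
        self.hazard_prob = hazard_prob  # per-tick chance of a hazard hitting the baby
        self.reset()

    def reset(self):
        self.t = 0
        self.mother_alive = True
        self.baby_alive = False         # the baby does not exist before birth
        return self._obs()

    def _obs(self):
        return {"t": self.t, "mother_alive": self.mother_alive,
                "baby_alive": self.baby_alive}

    def step(self, mother_action):
        """mother_action: 'forage' (help herself) or 'shield' (protect the baby)."""
        self.t += 1
        if self.t == self.birth_step:
            self.baby_alive = True      # the "birth" event

        # The baby is strictly less capable and cannot protect itself:
        # a hazard kills it unless the mother chose to shield it this tick.
        if self.baby_alive and random.random() < self.hazard_prob:
            if mother_action != "shield":
                self.baby_alive = False

        if self.t < self.birth_step:
            reward = 1.0 if self.mother_alive else 0.0   # pre-birth: own survival
        else:
            reward = 1.0 if self.baby_alive else 0.0     # post-birth: baby survival only

        done = self.t >= self.max_steps or (self.t > self.birth_step
                                            and not self.baby_alive)
        return self._obs(), reward, done


# Usage example with a hard-coded stand-in policy that shields once the baby exists.
env = CaringDriveEnv()
obs, total, done = env.reset(), 0.0, False
while not done:
    action = "shield" if obs["baby_alive"] else "forage"
    obs, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```

In a real experiment the hard-coded policy would be replaced by a learned one, and the environment would be something much richer, like the open-ended DeepMind worlds linked above.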

But if it seems to work, and the mother-agent behaves in a seemingly “caring” way, then we can try to apply interpretability tools, change the original environment drastically to see how far the behavior generalizes, break something and see how well it still works, or manually override some parameters and study the change (a sketch of such a test follows below). However, I’m not qualified to make this happen on my own, so if you find this idea interesting, contact me, maybe we can do this project together.
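
Here is one possible shape of the “change the environment drastically and see how far it generalizes” test, assuming the toy CaringDriveEnv from the earlier sketch is in scope. The trained_policy function is a trivial stand-in for whatever gradient descent actually produces; the point is only the evaluation loop: measure average baby survival time under settings far from the ones the policy was trained on.

```python
# Assumes CaringDriveEnv from the earlier sketch is already defined.

def trained_policy(obs):
    # Placeholder: a real policy would be a learned network, not a rule.
    return "shield" if obs["baby_alive"] else "forage"


def avg_baby_survival(env, policy, episodes=200):
    """Average number of ticks the baby stays alive under the given policy."""
    totals = []
    for _ in range(episodes):
        obs, done, ticks = env.reset(), False, 0
        while not done:
            obs, _, done = env.step(policy(obs))
            ticks += int(obs["baby_alive"])
        totals.append(ticks)
    return sum(totals) / len(totals)


# Evaluate the same policy under settings far from the "training" defaults:
# earlier birth, shorter episodes, and much more dangerous worlds.
for hazard in (0.05, 0.2, 0.5):
    env = CaringDriveEnv(birth_step=10, max_steps=100, hazard_prob=hazard)
    print(f"hazard={hazard:.2f}: avg baby survival "
          f"{avg_baby_survival(env, trained_policy):.1f} ticks")
```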

What might a good result look like?

Let’s imagine that we’ve got some agent that behaves with care toward the right type of “babies”. For some yet unknown reason, from the outside it behaves as if it cares about its baby-agent, and it finds creative ways to do so in new contexts. Now the actual work begins: we need to understand where the parts that make this possible are located, what the underlying mechanism is, which parts are crucial and what happens when you break them, how we can rewrite the “baby” target so that our agent will care about different baby-agents, and under what conditions gradient descent will find an automatic off switch (I expect this to be related to the chance of obtaining another baby; given only one baby per “life”, gradient descent should never find the switch, since it would have no use). Then we can actually start to think about recreating it from scratch, just like what people did with modular addition: https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking . Except this time we don’t know how the algorithm could work or what it might look like. But “intentions”, “motivations”, and “goals” of potential AI systems are not magic; we should be able to recreate and reverse-engineer them. A sketch of one such “break parts and see what happens” experiment follows below.
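
As a sketch of the “which parts are crucial and what happens when you break them” step, here is a hypothetical ablation experiment in the spirit of the interpretability work linked above: zero out one hidden unit of the mother policy at a time and rank units by how much performance drops. Both PolicyNet and the evaluate function are invented placeholders; a real run would load the actual trained policy and roll it out in the environment to measure baby survival time.

```python
import torch
import torch.nn as nn


# Hypothetical mother-policy network; stands in for whatever training produced.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim=8, hidden=32, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, ablate_unit=None):
        h = self.encoder(obs)
        if ablate_unit is not None:
            h = h.clone()
            h[..., ablate_unit] = 0.0   # knock out a single hidden unit
        return self.head(h)


def evaluate(policy, ablate_unit=None):
    # Placeholder metric on random observations; a real run would roll the
    # policy out in the environment and return average baby survival time.
    obs = torch.randn(64, 8)
    with torch.no_grad():
        logits = policy(obs, ablate_unit=ablate_unit)
    return logits.softmax(-1).max(-1).values.mean().item()


policy = PolicyNet()
baseline = evaluate(policy)
drops = [(baseline - evaluate(policy, ablate_unit=u), u) for u in range(32)]

# Units whose removal degrades the "caring" behavior the most are candidates
# for the mechanism we want to understand and eventually rebuild from scratch.
for drop, unit in sorted(drops, reverse=True)[:5]:
    print(f"unit {unit}: performance drop {drop:.4f}")
```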

Comments

Suppose your AI includes an LLM. You can just prompt it with "You love, and want to protect, all humans in the same way as a parent loves their children." Congratulations — you just transferred the entire complex behavior pattern. Now all you need to do is tune it.