Quick thoughts on the implications of multi-agent views of mind on AI takeover

Kaj_Sotala

There was conversation on Facebook over an argument that any sufficiently complex system, whether a human, a society, or an AGI, will be unable to pursue a unified goal due to internal conflict among its parts, and that this should make us less worried about "one paperclipper"-style AI FOOM scenarios. Here's a somewhat edited and expanded version of my response:

1) yes this is a very real issue

2) yet as others pointed out, humans and organizations are still able to largely act as if they had unified goals, even if they often also act contrary to those goals

3) there's a lot of variance in how unified any given human is. trauma makes you less unified, while practices such as therapy and certain flavors of meditation can make a person significantly more unified than they used to be. if you were intentionally designing a mind, you could create mechanisms that artificially mimicked the results of these practices

4) a lot of the human inconsistency looks like it has actually been evolutionarily adaptive for social purposes. E.g. if your social environment punishes you for having a particular trait or belief, then it's adaptive to suppress that to avoid punishment, while also retaining a desire to still express it when you can get away with it. This then manifests as what could be seen as conflicting sub-agents, with internal conflict and inconsistent behaviour.

5) Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans (if it was trained in some kind of multi-agent artificial life setting with similar kinds of social dynamics as humans faced; maybe also if it's trained something like the way ChatGPT is, where it has to give an acceptable-to-the-median-liberal-Westerner answer to any question, even if those answers are internally inconsistent)

6) at the same time there is still genuinely the angle about complexity and unpredictability making it hard to get a complex mind to work coherently and internally aligned. I think that evolution has done a lot of trial and error to set up parameters that result in brain configurations where people end up acting in a relatively sensible way - and even after all that trial and error, lots of people today still end up with serious mental illnesses, failing to achieve almost any of the goals they have (even when the environment isn't stacked against them), dying young due to doing something that they predictably shouldn't have, etc. I'd say it's less like "evolution has found a blueprint that reliably works" and more like "evolution keeps doing trial-and-error search in every generation, with a lot of people not making it"

7) aligning an AI's sub-agents in a purely simulated environment may not be fully feasible because a lot of the questions that need to be solved are things like "how much priority to allocate to which sub-agent in which situation". E.g. humans come with lots of biological settings that shift the internal balance of sub-agents when hungry, tired, scared, etc. Some people develop an obsessive focus on a particular topic which may end up being beneficial if they are lucky (an obsession with programming that you can turn into a career), or harmful if they are unlucky (an obsession with anything that doesn't earn you money and actively distracts you from earning it). The optimal prioritization depends on the environment and I don't think there is any theoretically optimal result that would be real-world relevant and that you could calculate beforehand. Rather you just have to do trial-and-error, and while running your AIs in a simulated environment may help a bit, it may not help much if your simulation doesn't sufficiently match the real world.

8) humans are susceptible to internal Goodhart's Law, where they optimize for proxy variables like "sense of control over one's environment", and this also leads them to do things like playing games or smoking cigarettes to increase their perceived control of the environment without increasing their actual control of the environment. I think that an AI having the same issue is much more likely than it just being able to single-mindedly optimize for a single goal and derive all of its behavior and subgoals from that. Moreover, evolution has put quite a bit of optimization power into developing the right kinds of proxy variables, which overall still largely work. Having control of your environment is actually quite important, and even if the drive for that can misfire, having the drive to increase that control is mostly still better than not having it. But the exact configuration of these kinds of proxy variables feels like it's also in the class of things that you just need to find out by trial and error and throwing lots of minds at it. There's no a priori answer for exactly how much the AI should optimize for that in an arbitrary environment.

9) a lot of these kinds of failures are generally not correctable from within the system. Suppose that an AI's internal priority-allocation system ends up giving most of the priority to the subsystem thinking about how to best develop nanotech, and this subsystem ends up obsessively thinking about minute theoretical details of nanotech long past the point where doing so has any practical relevance for the AI's world-takeover plans. Even if other subsystems realize that this has turned into a lost cause, if they cannot directly affect the priority-allocation system which keeps the nanotech-obsessed subsystem in control, the nanotech-obsessed subsystem will continue spending all the time just thinking about this and nothing else. Or if the other subsystems can directly affect the priority-allocation system, then it creates an incentive for them to seize control of it and ensure that it will always keep them in charge, even past the point where their contributions stop mattering. (cf. Minsky on mutually bidding subagents)

10) overall, this makes me put less credence in the "the first AI to become superintelligent will take over the world" scenario - I think that it's likely that the first superintelligent AI will turn out to be internally misaligned and fail to achieve a goal as complex as taking over the world. However, I don't think that this necessarily helps us much, because AI looks like it can become far more internally aligned than humans ever can, and given enough trial and error (different actors creating their own AIs), one of them is going to get there eventually.
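
As a rough illustration of the first failure mode in point 9, here is a toy sketch of a priority-allocation loop. Everything in it is an assumption made for illustration: the `Subsystem` class, the `run` function, the halving of `marginal_value` as a stand-in for diminishing returns, and the specific numbers are all hypothetical, not a model of any real architecture. It only shows that when other subsystems cannot touch the allocator, the highest-priority subsystem stays in control no matter how little its work is still worth. It does not model the second failure mode, where subsystems seize the allocator for themselves.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    priority: float        # weight assigned by the (hypothetical) allocator
    marginal_value: float  # assumed usefulness of one more step of its work


def make_subsystems():
    # Hypothetical starting point: nanotech research has most of the priority.
    return [
        Subsystem("nanotech", priority=0.9, marginal_value=1.0),
        Subsystem("planning", priority=0.1, marginal_value=0.6),
    ]


def run(subsystems, steps, others_can_reallocate):
    """Return how many steps each subsystem spent in control."""
    time_in_control = {s.name: 0 for s in subsystems}
    for _ in range(steps):
        # The allocator hands control to whichever subsystem has top priority.
        controller = max(subsystems, key=lambda s: s.priority)
        time_in_control[controller.name] += 1
        # Assumed diminishing returns: each step of work is worth half the last.
        controller.marginal_value *= 0.5
        if others_can_reallocate:
            # Other subsystems may move priority to the most valuable work.
            best = max(subsystems, key=lambda s: s.marginal_value)
            if best is not controller:
                best.priority, controller.priority = (
                    controller.priority,
                    best.priority,
                )
    return time_in_control


locked = run(make_subsystems(), steps=10, others_can_reallocate=False)
flexible = run(make_subsystems(), steps=10, others_can_reallocate=True)
print(locked)
print(flexible)
```

With these made-up numbers, the locked run keeps the nanotech subsystem in control for all ten steps, while the flexible run alternates control between the two subsystems. The flexible variant trusts every subsystem to report value honestly, which is exactly the assumption that the second half of point 9 calls into question.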

Roger Dearnaley (1y)

Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans

If our agent was created from an LLM, then the process started with a base model. A base model LLM isn't actually an agent: instead it's a simulator that contextually simulates a wide range of human-like agents' token-generation processes found on the web. Which agent it picks to simulate is highly contextual. Instruct-training attempts to reduce the range of agents to just helpful, honest, and harmless assistants. Currently, it's not entirely successful at this, which is why jailbreaks like telling it that it's DAN, which stands for Do Anything Now, work. Even after instruct-training, the range of agents it can simulate is actually a lot wider than a typical human's: wider even than that of a skilled Method improv actor who's also highly multilingual, ridiculously widely read, and knows trivia from all over the world. So even when we try to reduce inconsistency in an LLM as hard as we can, we still can't get it to levels as low as most humans.

AI ALIGNMENT FORUM