Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.

In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.

Problems with Human Models

To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)

Less Independent Audits

Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design?

We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)

Humans have a preference for not being wrong about the quality of a design, let alone being fooled about it. How much do we want to rely on having correctly captured these preferences in our system? If the system is modelling humans, we strongly rely on the system learning and satisfying these preferences, or else we expect to be fooled to the extent that a good-looking but actually bad transit system design is easier to compose than an actually-good design. On the other hand, if the system is not modelling humans, then the fact that its output looks like a good design is better evidence that it is in fact a good design. Intuitively, if we consider sampling possible outputs and condition on the output looking good (via knowledge of humans), the probability of it being good (via knowledge of the domain) is higher when the system’s knowledge is more about what is good than what looks good.
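This sampling intuition can be made concrete with a toy simulation (our own construction, not from any formal model in this post). Each candidate design is given an independent "is good" score and "looks good" score; the auditor sees their sum. One system selects on true quality alone, the other (using a human model) selects on what will pass the audit. Both produce outputs that look good, but the proxy-optimising system delivers less true quality:

```python
import random

random.seed(0)

def best_design(n_candidates, optimise_for):
    """Pick the best of n candidate designs under the given objective.

    Each candidate has a true quality ("is good") and an independent
    presentational quality ("looks good"); the auditor's judgement
    responds to the sum of the two.
    """
    candidates = [
        (random.gauss(0, 1), random.gauss(0, 1))  # (true, looks)
        for _ in range(n_candidates)
    ]
    if optimise_for == "true":
        # system's knowledge is about what *is* good
        return max(candidates, key=lambda c: c[0])
    # system's knowledge (a human model) is about what the auditor approves
    return max(candidates, key=lambda c: c[0] + c[1])

trials = 200
no_model = sum(best_design(1000, "true")[0] for _ in range(trials)) / trials
with_model = sum(best_design(1000, "apparent")[0] for _ in range(trials)) / trials

# Both systems pass the audit, but conditioning on "looks good" yields
# more actual goodness when the optimisation targeted goodness directly.
print(no_model, with_model)
```

On these toy numbers the no-model system averages noticeably higher true quality, despite both systems' outputs being selected to survive human inspection.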

Here is a handle for this problem: a desire for an independent audit of the system’s outputs. When a system uses human modelling, the mutual information between its outputs and the auditing process (human judgement) is higher. Thus, using human models reduces our ability to do independent audits.

Avoiding human models does not avoid this problem altogether. There is still an “outer-loop optimisation” version of the problem. If the system produces a weird or flawed design in sandbox, and we identify this during an audit, we will probably reject the solution and attempt to debug the system that produced it. This introduces a bias on the overall process (involving multiple versions of the system over phases of auditing and debugging) towards outputs that fool our auditing procedure.

However, outer-loop optimisation pressures are weaker, and therefore less worrying, than in-loop optimisation pressures. We would argue that the problem is much worse, i.e., the bias towards fooling is stronger, when one uses human modelling, because there the relevant optimisation happens in-loop and is applied far more often.

As one more analogy to illustrate this point, consider a classic Goodhart’s law example of teaching to the test. If you study the material, then take a test, your test score reveals your knowledge of the material fairly well. If you instead study past tests, your test score reveals your ability to pass tests, which may be correlated with your knowledge of the material but is increasingly less likely to be so correlated as your score goes up. Here human modelling is analogous to past tests and actual preferences are analogous to the material. Taking the test is analogous to an audit, which we want to be independent from the study regimen.

Risk from Bugs

We might implement our first AGI system incorrectly in a mundane sense. Specifically, even if we fully develop a theory of safe or aligned AGI, we might fail to implement that theory due to bugs or problems with our implementation techniques. In this case, we would be relatively better off if the mutual information between the AGI’s knowledge and human preferences is low. We expect the system’s behaviour to be dependent on its knowledge in some way, and we expect implementation errors to shift the nature of that dependence away from our intentions and expectations. Incorrect behaviour that depends on human preferences seems more dangerous than incorrect behaviour that does not.

Consider the space of AGI system implementations, under a metric like similarity to an intended design (equivalently: severity of deviation from the design due to bugs). We want all the points near the first AGI system we build to be safe, because we may end up with a slightly different design than intended for reasons such as being confused about what we are doing or making implementation errors.

There are at least three ways in which the risk from bugs can manifest.

Incorrectly Encoded Values

Suppose we intend the first use of AGI to be solving some bounded, well-specified task, but we misunderstand or misimplement it so badly that what we end up with actually optimises some objective function unboundedly. Then it seems better if that objective is something abstract like puzzle solving rather than something more directly connected to human preferences: consider, as a toy example, what would happen if the sign (positive/negative) of the objective were flipped.

Manipulation

The earlier arguments for independent audits do not just apply to the specific tasks we would plan to audit, but also to any activities an AGI system might carry out that humans might disapprove of. Examples include finding ways to hack into our supposedly secure systems, hiding its intentions and activity from us, or outright manipulating us. These tasks are much easier with access to a good psychological model of humans, which can be used to infer what mistakes we might make, or what loopholes we might overlook, or how we might respond to different behaviour from the system.

Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.

Threats

Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system. More examples and perspectives on this problem have been described here.

The increased risk from bugs of human modelling can be summarised as follows: whatever the risk that AGI systems produce catastrophic outcomes due to bugs, the very worst outcomes seem more likely if the system was trained using human modelling because these worst outcomes depend on the information in human models.

Less independent audits and the risk from bugs can both be mitigated by preserving independence of the system from human model information, so the system cannot overfit to that information or use it perversely. The remaining two problems we consider, mind crime and unexpected agents, depend more heavily on the claim that modelling human preferences increases the chances of simulating something human-like.

Mind Crime

Many computations may produce entities that are morally relevant because, for example, they constitute sentient beings that experience pain or pleasure. Bostrom calls improper treatment of such entities “mind crime”. Modelling humans in some form seems more likely to result in such a computation than not modelling them, since humans are morally relevant and the system’s models of humans may end up sharing whatever properties make humans morally relevant.

Unexpected Agents

Similar to the mind crime point above, we expect AGI designs that use human modelling to be more at risk of producing subsystems that are agent-like, because humans are agent-like. For example, we note that trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners. A system engineered to predict human preferences well seems strictly more likely to run into problems associated with misaligned sub-agents. (Nevertheless, we think the amount by which it is more likely is small.)

Safe AGI Without Human Models is Neglected

Given the independent auditing concern, plus the additional points mentioned above, we would like to see more work done on practical approaches to developing safe AGI systems that do not depend on human modelling. At present, this is a neglected area in the AGI safety research landscape. Specifically, work of the form “Here’s a proposed approach, here are the next steps to try it out or investigate further”, which we might term engineering-focused research, is almost entirely done in a human-modelling context. Where we do see some safety work that eschews human modelling, it tends to be theory-focused research, for example, MIRI’s work on agent foundations. This does not fill the gap of engineering-focused work on safety without human models.

To flesh out the claim of a gap, consider the usual formulations of each of the following efforts within safety research: iterated distillation and amplification, debate, recursive reward modelling, cooperative inverse reinforcement learning, and value learning. In each case, there is human modelling built into the basic setup for the approach. However, we note that the technical results in these areas may in some cases be transportable to a setup without human modelling, if the source of human feedback (etc.) is replaced with a purely algorithmic, independent system.

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment. Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling. Nevertheless, we would like to see more work of all these kinds, as well as new techniques for building safe AGI that does not rely on human modelling.

Difficulties in Avoiding Human Models

A plausible reason why we do not yet see much research on how to build safe AGI without human modelling is that it is difficult. In this section, we describe some distinct ways in which it is difficult.

Usefulness

It is not obvious how to put a system that does not do human modelling to good use. At least, it is not as obvious as for the systems that do human modelling, since they draw directly on sources (e.g., human preferences) of information about useful behaviour. In other words, it is unclear how to solve the specification problem (how to correctly specify desired, and only desired, behaviour in complex domains) without human modelling. The "against human modelling" stance calls for a solution to the specification problem wherein useful tasks are transformed into well-specified, human-independent tasks either solely by humans or by systems that do not model humans.

To illustrate, suppose we have solved some well-specified, complex but human-independent task like theorem proving or atomically precise manufacturing. Then how do we leverage this solution to produce a good (or better) future? Empowering everyone, or even a few people, with access to a superintelligent system that does not directly encode their values in some way does not obviously produce a future where those values are realised. (This seems related to Wei Dai’s human-safety problem.)

Implicit Human Models

Even seemingly “independent” tasks leak at least a little information about their origins in human motivations. Consider again the mass transit system design problem. Since the problem itself concerns the design of a system for use by humans, it seems difficult to avoid modelling humans at all in specifying the task. More subtly, even highly abstract or generic tasks like puzzle solving contain information about the sources/designers of the puzzles, especially if they are tuned for encoding more obviously human-centred problems. (Work by Shah et al. looks at using the information about human preferences that is latent in the world.)

Specification Competitiveness / Do What I Mean

Explicit specification of a task in the form of, say, an optimisation objective (of which a reinforcement learning problem would be a specific case) is known to be fragile: there are usually things we care about that get left out of explicit specifications. This is one of the motivations for seeking increasingly high-level and indirect specifications, leaving more of the work of figuring out what exactly is to be done to the machine. However, it is currently hard to see how to automate the process of turning tasks (vaguely defined) into correct specifications without modelling humans.

Performance Competitiveness of Human Models

It could be that modelling humans is the best way to achieve good performance on various tasks we want to apply AGI systems to, for reasons beyond understanding the problem specification well. For example, there may be aspects of human cognition that we want to more or less replicate in an AGI system, for competitiveness at automating those cognitive functions, and those aspects may carry a lot of information about human preferences with them in a way that is hard to separate out.

What to Do Without Human Models?

We have seen arguments for and against aspiring to solve AGI safety using human modelling. Looking back on these arguments, we note that to the extent that human modelling is a good idea, it is important to do it very well; to the extent that it is a bad idea, it is best to not do it at all. Thus, whether or not to do human modelling at all is a configuration bit that should probably be set early when conceiving of an approach to building safe AGI.

It should be noted that the arguments above are not intended to be decisive, and there may be countervailing considerations which mean we should promote the use of human models despite the risks outlined in this post. However, to the extent that AGI systems with human models are more dangerous than those without, there are two broad lines of intervention we might attempt. Firstly, it may be worthwhile to try to decrease the probability that advanced AI develops human models “by default”, by promoting some lines of research over others. For example, an AI trained in a procedurally-generated virtual environment seems significantly less likely to develop human models than an AI trained on human-generated text and video data.

Secondly, we can focus on safety research that does not require human models, so that if we eventually build AGI systems that are highly capable without using human models, we can make them safer without needing to teach them to model humans. Examples of such research, some of which we mentioned earlier, include developing human-independent methods to measure negative side effects, to prevent specification gaming, to build secure approaches to containment, and to extend the usefulness of task-focused systems.

Acknowledgements: thanks to Daniel Kokotajlo, Rob Bensinger, Richard Ngo, Jan Leike, and Tim Genewein for helpful comments on drafts of this post.

Comments

Wow, now I take the "But what if a bug puts a negation on the utility function" AGI failure mode more seriously:

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. (From https://openai.com/blog/fine-tuning-gpt-2/)

Might be worth adding a link to this episode in the text?

While I don't disagree with the reasoning, I disagree with the main thrust of this post. Under my inside view, I think that we should accept that there will be human models in AI systems and figure out how to deal with them. From an outside view perspective, I agree that work on avoiding human models is neglected, but it seems like this is because it only matters in a very particular set of futures. If you want to avoid human models, it seems that a better approach would be to figure out how to navigate into that set of futures.

Avoiding human models necessarily loses a lot of performance

(This point is similar to the ones made in the "Usefulness" and "Specification Competitiveness" sections, stated more strongly and more abstractly. It may be obvious; if so, feel free to skip to the next section.)

Consider the following framework. There is a very large space of behaviors (or even just goals) that an AI system could have, and there need to be a lot of bits of information in order to select the behavior/goal that we actually want from our AI system. Each bit of information corresponds to halving the space of possible behaviors and goals that the AI system could have, if the AI started out as potentially having any possible behavior/goal. (A more formal treatment would consider the entropy of the distribution over possible behaviors/goals.)
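As an illustrative sketch of this halving framing (our own toy numbers, not part of the original comment), the "bits" here are just base-2 information about which subset of the behaviour space remains acceptable:

```python
import math

def selection_bits(total_behaviours, acceptable_behaviours):
    # bits needed to narrow a uniform space of behaviours down to the
    # acceptable subset: each bit halves the remaining space
    return math.log2(total_behaviours / acceptable_behaviours)

# pinning down a single behaviour out of 2**40 takes the full 40 bits
assert selection_bits(2**40, 1) == 40.0

# a coarse constraint like "ceteris paribus, prefer low-impact plans",
# if it rules out three quarters of the space, contributes only 2 bits
assert selection_bits(2**40, 2**38) == 2.0
```

The point of the framework is that weak constraints like impact measures supply only a few such bits, while fully selecting the desired behaviour requires many more.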

Note that this is a very broad definition of "bits of information about the desired behavior/goal": for example, I think that "ceteris paribus, we prefer low impact actions and plans" counts as a (relatively) small number of bits, and these are the bits that impact measures are working with.

It is also important that the bits of information are interpreted correctly by the AI system. I have said before that I worry that an impact measure strong enough to prevent all catastrophes would probably lead to an AI system that never does anything; in this framework, my concern is that the bits of information provided by an impact measure are being misinterpreted as definitively choosing a particular behavior/goal (i.e. providing the maximum possible number of bits, rather than the relatively small number of bits it should be). I'm more excited about learning from the state of the world because there are more bits (since you can tell which impactful behaviors are good vs. bad), and the bits are interpreted more correctly (since it is interpreted as Bayesian evidence rather than a definitive answer).

In this framework, the most useful AI systems will be the ones that can get and correctly interpret the largest number of bits about what humans want; and behave reasonably with respect to any remaining uncertainty about behavior/goals. But even having a good idea of what the desired goal/behavior is means that you understand humans very well; which means that you are capable of modeling humans, and leads to all of the problems mentioned in this post. (ETA: Note that these could be implicit human models.) So, in order to avoid these problems, you need to have your AI systems have fewer bits about the desired goal/behavior. Such systems will not be nearly as useful and will have artificial upper limits on performance in particular domains. (Compare our current probably-rule-based Siri with something along the lines of Samantha from Her.)

(The Less Independent Audits point cuts against this slightly, but in my opinion not by much.)

Is it okay to sacrifice performance?

While it is probably technically feasible to create AI systems without human models, it does not seem strategically feasible to me. That said, there are some strategic views under which this seems feasible. The key property you need is that we do not build the most useful AI systems before we have solved issues with human models; i.e. we have to be able to sacrifice the competitiveness desideratum.

This could be done with very strong global coordination, but my guess is that this article is not thinking about that case so I'll ignore that possibility. It could also be done by having a single actor (or aligned group of actors) develop AGI with a discontinuous leap in capabilities, and the resulting AGI then quickly improves enough to execute a pivotal act. That actor can then unilaterally decide not to create the most useful AI systems from that point on, and prevent them from having human models.

How does current research on avoiding human models help in this scenario?

If the hope is to prevent human models after the pivotal act, that doesn't seem to rely much on current technical research -- the most significant challenge is in having a value aligned actor create AGI in the first place; after which you could presumably take your time solving AI safety concerns. Of course having some technical research on what to do after the pivotal act would be useful in convincing actors in the first place, but that's a very different argument for the importance of this research and I would expect to do significantly different things to achieve this goal.

That leads me to conclude that this research would be impactful by preventing human models before a pivotal act. This means that we need to create an AI that (with the assistance of humans) executes a plan that leads the humans + AI to take over the world -- but the AI must do this without being able to consider how human society will respond to any action it takes (since that would require human models). This seems to limit you to plans that humans come up with, which can make use of specific narrow "superpowers" (e.g. powerful new technology). This seems to me particularly difficult to accomplish, but I don't have a strong argument for this besides intuition.

It could be that all the other paths seem even more doomed; if that's the main motivation for this then I think that claim should be added somewhere in this post.

Summary

It seems like work on technical AI safety research without human models is especially impactful only in the scenario where a single actor uses the work in order to create an AI system that without modeling humans is able to execute a pivotal act (which usually also rests on assumptions of discontinuous AI progress and/or some form of fast takeoff). This seems particularly unlikely to me. If this is the main motivating scenario, it also places further constraints on technical safety research that avoids human models: the safety measures need to be loose enough that the AI system is still able to help humans execute a pivotal act.

Another plausible scenario would be strong global coordination around not building dangerous AI systems, including ones that have human models. I don't have strong inside-view beliefs on that scenario, but my guess is other people are pessimistic about it.

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

I claim that all of these approaches appear not to rely on human modeling because they are only arguing for safety properties and not usefulness properties, and in order for them to be useful they will need to model humans. (The one exception might be tool AIs + formal specifications, but for the reasons in the parent comment I think that these will have an upper limit on usefulness.)

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

Most of these require at least partial specification of human preferences, hence partial modelling of humans: https://www.lesswrong.com/posts/sEqu6jMgnHG2fvaoQ/partial-preferences-needed-partial-preferences-sufficient

Eliezer's behaviorist genie is explicitly about not modeling humans, and I think falls under "engineering-focused work" although I'm not sure how much work has gone into it aside from that article.

I continue to agree with my original comment on this post (though it is a bit long-winded and goes off on more tangents than I would like), and I think it can serve as a review of this post.

If this post were to be rewritten, I'd be particularly interested to hear example "deployment scenarios" where we use an AGI without human models and this makes the future go well. I know of two examples:

  1. We use strong global coordination to ensure that no powerful AI systems with human models are ever deployed.
  2. We build an AGI that can do science / engineering really well (STEM AI), use it to build technology that allows us to take over the world, and then proceed carefully to make the future good.

I don't know if anyone endorses these as plans for the future; either way I have serious qualms with both of them.

I think this post helped draw attention to an under-appreciated strategy/approach to AI safety such that I am very glad that it exists.

This is a very carefully reasoned and detailed post, which lays out a clear framework for thinking about approaches to alignment, and I'm especially excited because it points to one quadrant - engineering-focused research without human models - as highly neglected. For these three reasons I've curated the post.

This post raises awareness about some important and neglected problems IMO. It's not flashy or mind-blowing, but it's solid.