Roger Dearnaley

I'm a staff artificial intelligence engineer currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I'm now actively looking for employment in this area.

Sequences

AI, Alignment, and Ethics

Comments

I'm not sure that I see a reason to expect strict scale-freedom to be an accurate assumption. Consider two different common classes of agents:
1) evolved agents (such as humans)
2) engineered/trained artificial agents (such as LLM-powered agents)

For evolved agents, the evolutionary incentives they evolved under apply differently to subsystems within a single living individual than to multiple living individuals within a community that shows collective agentic behavior. Subsystems within a single individual succeed or fail (live and pass on their genes, or die) together, so evolutionarily they have a strong incentive to cooperate to form a single effective agent. Separate living individuals within a community, by contrast, have (evolutionarily speaking) separate (though perhaps correlated) success/failure criteria, so there are evolutionary pressures on them both to compete and also (in non-zero-sum situations) to cooperate. So I would not expect evolved agents to be scale-free: the single-individual scale at which evolution applies seems almost inevitably to be privileged.

For agents consisting of communities of cooperating humans, I would expect communities smaller than Dunbar's Number (ones small enough to be able to operate without hierarchies or bureaucracy, because all members know each other well) to operate differently than larger communities, so again I would not expect them to be entirely scale-free.

For engineered agents, the situation is far less clear-cut, and depends upon the cognitive limits and engineering techniques of whoever is engineering the agents. Current LLM-based artificial agents are normally pretrained and fine-tuned alone, and (where applicable) RL-trained against a reward model. So for this current approach, the single-agent scale used during training also seems almost inevitably to be privileged.

Also, pretraining distills human intelligence into the AI via the training corpora, so if the single-human scale is privileged for evolutionary reasons, this phenomenon seems likely to be transferred to LLMs during their pretraining phase.

However, I would agree that if a community is operating as a (fairly effective) agent, then its components must be mostly cooperating, more than they are competing, so I agree that there might well be some semi-scale-free aspects to its behavior. Still, there might also be some scale-dependent constraints on, or inefficiencies in, its behavior, because its components are prone to not always cooperating as well as the subsystems of an individual would.

What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or instead define a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small VAE is split into three features X1, X2, and X3 in a larger VAE for the same model. If occurrences of X1, X2, and X3 are correlated, so that activations containing any of them commonly contain some mix of them, then they span a 2-d subspace (in this case the simplex is a triangle). If, on the other hand, X1, X2, and X3 co-occur in an activation only rarely (just as two randomly-selected features rarely co-occur), then they describe three similar-but-distinct variations on a concept, and X is the result of coarse-graining these together as a single concept at a higher level in an ontology tree (so by comparing VAEs of different sizes we can generate a natural ontology).

This seems like it would be a fairly simple, objective experiment to carry out. (Maybe someone already has, and can tell me the result!) It is of course quite possible that some split features describe subspaces and others ontologies, or indeed something between the two, where the features co-occur rarely but less rarely than two random features. Or X1 could be distinct but X2 and X3 might blend to span a 1-d subspace. Nevertheless, understanding the relative frequency of these different behaviors would be enlightening.
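To make that concrete, here is a minimal sketch of the statistic I have in mind (the names acts, split_idx, etc. are hypothetical placeholders: acts would be a samples × features activation matrix from the larger dictionary, and split_idx the indices of X1, X2, X3). If the split-pair co-occurrence rates come out well above the random-pair baseline, the split features look like corners of a simplex spanning a subspace; if they are comparable to the baseline, they look like mutually exclusive alternatives in an ontology.

```python
import numpy as np

def cooccurrence_rate(acts: np.ndarray, i: int, j: int, threshold: float = 0.0) -> float:
    """Of the samples where either feature fires, the fraction where both fire."""
    on_i = acts[:, i] > threshold
    on_j = acts[:, j] > threshold
    either = on_i | on_j
    if either.sum() == 0:
        return 0.0
    return float((on_i & on_j).sum() / either.sum())

def split_feature_report(acts: np.ndarray, split_idx: list[int], n_baseline: int = 1000) -> dict:
    """Compare co-occurrence among the split features to a random-pair baseline."""
    rng = np.random.default_rng(0)
    pairs = [(a, b) for k, a in enumerate(split_idx) for b in split_idx[k + 1:]]
    split_rates = [cooccurrence_rate(acts, a, b) for a, b in pairs]
    baseline = []
    for _ in range(n_baseline):
        a, b = rng.choice(acts.shape[1], size=2, replace=False)
        baseline.append(cooccurrence_rate(acts, int(a), int(b)))
    return {
        "split_pair_rates": split_rates,            # high => subspace-like (corners of a simplex)
        "baseline_mean": float(np.mean(baseline)),  # typical rate for two random features
    }
```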

It would be interesting to validate this using a case like the days of the week, where we believe we already understand the answer: they are 7 alternatives laid out in a heptagon in a 2-dimensional subspace that enables doing modular addition/subtraction modulo 7. So if we had a VAE small enough that it represented all day-of-the-week names by a single feature, then as we increased the VAE size somewhat we'd expect to see this split into three features spanning a 2-d subspace, and as we increased it further we'd expect to see it resolve into 7 mutually-exclusive alternatives, and hopefully then stay at 7 in larger VAEs (at least until other concepts started to get mixed in, if that ever happened).

If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear that the agent does not yet know the true detailed utility function. That requires the agent to go and find out what the detailed utility function pointed to by the goal actually is (for example, by researching what outcomes humans want).

Then a human shutdown instruction becomes the useful information "you have made a large error in your research into the utility function, and as a result are doing harm; please shut down and let us help you correct it". Obeying it is then natural (to the extent that the human(s) are plausibly more correct than the AI).
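As a toy illustration (a sketch of the general idea, not any particular proposal from that literature), an agent whose "goal" is a pointer to an unknown utility function can hold a posterior over hypotheses about what that function is, and a shutdown instruction is then just strong evidence that its current best guess is wrong:

```python
import numpy as np

class ValueLearningAgent:
    """Toy agent whose goal is 'the true utility function, whatever it turns out to be'."""

    def __init__(self, utility_hypotheses, prior):
        self.hypotheses = utility_hypotheses           # candidate utility functions u(outcome)
        self.posterior = np.array(prior, dtype=float)  # current beliefs over which one is true

    def choose(self, actions, outcome_of):
        # Pick the action with the highest expected utility under current uncertainty.
        def expected_utility(action):
            outcome = outcome_of(action)
            return sum(p * u(outcome) for p, u in zip(self.posterior, self.hypotheses))
        return max(actions, key=expected_utility)

    def observe_human_feedback(self, likelihood_fn):
        # likelihood_fn(u) = P(observed feedback | u is the true utility function).
        # A shutdown instruction is feedback that is very unlikely under hypotheses
        # on which the agent's recent behavior was actually fine, so it shifts the
        # posterior sharply, and the agent's effective goal shifts with it.
        likelihoods = np.array([likelihood_fn(u) for u in self.hypotheses])
        self.posterior *= likelihoods
        self.posterior /= self.posterior.sum()
```

On this toy picture, how readily the agent defers to a shutdown depends on how much more likely it thinks the humans are to be right about the error than it is, which matches the caveat above.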

Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is in intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent: the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to, and might not even be capable of realizing why they should consent to, but ones that, if they were as intellectually capable as the more powerful being, they would in fact consent to.

I also think that, for situations where the less powerful being is able to understand the alternatives and make a rational and informed decision, and wants to, the more powerful being should give them the option and let them do so. That's the polite, respectful way to do things. But often that isn't going to be practical, or desirable, and the babysitter should just distract the baby before they get into the dangerous situation.

Consent is a concept that fundamentally assumes that I am the best person available to make decisions about my own well-being. Outside parental situations, for interactions between evolved intelligences like humans, that's almost invariably true. But if I had a superintelligence aligned to me, then yes, I would want it to keep me away from dangers so complex that I'm not capable of making an informed decision about them.

As text, the Walnut53-encoded data just looks odd/suspicious: even a fairly simple text-classifier model could probably classify it as out of distribution (too many | characters, too many 2-character tokens). I would also suspect that if you looked at the activity of the final layers of the LLM, the fact that something unusual is happening while it's generating Walnut53 output would be clear. It seems a reasonable default assumption for foundation model providers that if a customer's fine-tuning data is obviously training the model to speak (what looks like) an encoded format, and many of their communications with the trained model consist of that encoded format, then they may well be up to something nefarious.
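As a rough illustration of how cheap such a screen could be, here is a sketch of the kind of surface-statistics check I have in mind. The thresholds are illustrative guesses rather than tuned values, and a real version would count tokenizer tokens rather than whitespace-separated words:

```python
import re

def looks_like_encoding(text: str,
                        max_pipe_fraction: float = 0.02,
                        max_short_word_fraction: float = 0.6) -> bool:
    """Crude flag for fine-tuning samples whose surface statistics look like an encoding."""
    if not text:
        return False
    pipe_fraction = text.count("|") / len(text)
    words = re.findall(r"\S+", text)
    short_fraction = sum(1 for w in words if len(w) <= 2) / max(len(words), 1)
    return pipe_fraction > max_pipe_fraction or short_fraction > max_short_word_fraction
```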

Much more concerning to me are steganographic approaches like EndSpeak. In that specific case the true signal is embedded in bad poetry, but clearly there are a great many possibilities: boring-looking lists, JSON/HTML/code, letters to the editor… While the paper's authors fine-tuned the model to use steganography starting overtly, in a way that could be discerned by looking at their training data (probably even by an LLM with the right prompt), I suspect that this could also be done entirely covertly, using training data that just had concealed steganographic correlations in it and relying on the model learning those, perhaps using a curriculum-learning approach: start with something simple, then build on it.

Longer term, I think the most promising-looking defense here (as the paper's authors propose) would require significant progress in interpretability: in an automated way, look at the output text, predict what interpretability features should be triggered by generating it, compare that to the actual log of the model's feature activations, and notice if the model was frequently thinking about dangerous features that don't correspond to anything in the visible text. That sounds hard, and like it would be a significant extra computational expense for monitoring the output of fine-tuned models.
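Schematically, the monitoring step would look something like the sketch below, where features_from_text and the generation feature log are hypothetical interfaces that assume interpretability tooling we don't yet have:

```python
def flag_hidden_reasoning(output_text,
                          features_from_text,
                          generation_feature_log,
                          dangerous_features):
    """Flag dangerous features that fired during generation but aren't explained by the text."""
    expected = set(features_from_text(output_text))  # features a model merely *reading* the text should show
    actual = set(generation_feature_log)             # features that actually fired while *generating* it
    unexplained = actual - expected
    return sorted(unexplained & set(dangerous_features))
```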

For blocking specific, narrow, high-risk capabilities like bioweapons development expertise, WMD engineering, self-replication skills, and so forth, I think the best approach may be either to filter the base model's training set to ensure that these capabilities never get into the base model in the first place, or else to figure out how to selectively, specifically, but comprehensively delete/degrade them from the model before giving anyone fine-tuning access to it, so that the capabilities don't exist in the model at all, rather than the model just being trained not to use them. However, for more mundane issues like writing spearphishing letters or telling you how to make a pipe bomb or cut down a stop sign, this seems unlikely to be practicable: the list of skills that would need to be censored is too long.

On the other hand, there are easier ways to find out how to make a pipe bomb than this sort of steganographic covert malicious fine-tuning attack: in practice an attacker would only bother with something as elaborate as this if they thought they could extract significant value from the resulting jail-broken model, so we're probably looking at an attacker like a dishonest business, group of porn enthusiasts, crime syndicate, terrorist organization, or rogue state. So fine-tuning access may be the sort of thing that requires know-your-customer type security measures.

Although the residuals for each of the four component matrices (after removing the first two principal components) are small and seem to be noise, proving that there's no structure that causes the noise to interact constructively and "blow up" when we multiply the matrices is hard.

Have you tried replacing what you believe is noise with actual random noise with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model in which everything you hypothesize is just noise has been replaced by noise, so that for it the noise hypothesis is true by construction.
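A minimal sketch of that test, assuming the low-rank part is taken to be the top-2 SVD components of each weight matrix and the residual's statistics are matched with a simple Gaussian (both of which are assumptions on my part, not details from your setup):

```python
import numpy as np

def replace_residual_with_noise(weight: np.ndarray, rank: int = 2, seed: int = 0) -> np.ndarray:
    """Keep the top-`rank` components; replace the residual with statistics-matched noise."""
    rng = np.random.default_rng(seed)
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    low_rank = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]
    residual = weight - low_rank
    noise = rng.normal(loc=residual.mean(), scale=residual.std(), size=residual.shape)
    return low_rank + noise
```

If the model with the noised residuals performs just like the original, that's evidence the residuals really are unstructured noise; if it doesn't, the residuals were doing some work after all.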

Suppose that the safety concerns you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon due to Out Of Context Reasoning.

In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply influence-function techniques to them in order to determine which training documents were contributing most to the OOCR process that enables this unwanted dangerous capability, and try to find a subset of training documents that could be removed from the training set (or at least have certain parts of them censored) to remove the ability to deduce and carry out the unwanted dangerous capability. Note that this is likely to be a computationally extremely expensive process, particularly since it is likely to involve repeatedly retraining a very large LLM with progressively more censored training sets until the unwanted behavior is eliminated.
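To sketch just the attribution step, here is a toy version using a cheap gradient-similarity proxy (in the spirit of TracIn) rather than full influence functions. It assumes a PyTorch model and a loss function, and at frontier scale you would need the usual low-rank or sampling approximations rather than the dense gradients shown here:

```python
import torch

def flat_grad(model, loss):
    """Flatten the gradient of `loss` with respect to all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def score_training_docs(model, loss_fn, elicited_examples, training_docs):
    """Rank training documents by gradient alignment with the elicited unwanted behavior."""
    target = sum(flat_grad(model, loss_fn(model, ex)) for ex in elicited_examples)
    scored = []
    for doc in training_docs:
        g = flat_grad(model, loss_fn(model, doc))
        scored.append((torch.dot(g, target).item(), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)  # most-implicated docs first
```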

Even harder to remove would be the ability to use CoT and in-context learning to reconstruct/reinvent and then carry out the unwanted dangerous capability: the more capable the model is, the harder this becomes to prevent. For example, if the model's capabilities include those of a skilled theoretical physicist, with enough CoT work it might well be able to compute, say, the critical mass for U-235. (Of course, the Manhattan project was not entirely a matter of theoretical calculations: some practical experimentation was required too. However, with modern computational resources that may be less true.)

This behavior is deeply unsurprising. An AI's intelligence and behavior were basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest, when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a vast collection of human output, which includes a great many examples of humans being deceptive, selfish, and self-serving, and LLM base models of course pick these behaviors up along with everything else they learn from us. So the fact that these capabilities exist in the base model is completely unsurprising: the base model learnt them from us.

Current LLM safety training is focused primarily on "don't answer users who make bad requests". It's thus unsurprising that, in the situation of the LLM acting as an agent, this training doesn't have 100% coverage on "be a fine, law-abiding, upstanding agent". Clearly this will have to change before near-AGI LLM-powered agents can be widely deployed. I expect this issue to be mostly solved (at the AGI level, but possibly not at the ASI level), since there is a strong capabilities/corporate-profitability/not-getting-sued motive to solve this.

It's also notable that the behaviors described in the text could pretty much all be interpreted as "excessive company loyalty, beyond the legally or morally correct level" rather than actually personally-selfish behavior. Teaching an agent whose interests to prioritize, and in what order, is likely a non-trivial task.

Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.

There are two ways to make yourselves and your VC investors a lot of money off a startup:

  1. a successful IPO
  2. a successful buy-out by a large company, even one with an acquihire component

If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems below human intelligence and aligning systems at human intelligence and above, then it is quite plausible that the major, very-well-funded hyperscalers will be desperately looking for Alignment intellectual property/talent/experienced staff some time in the next 3-5 years. That would seem to make exit strategy 2 the primary one to aim for here.

Unfortunately, this rather pushes against just publishing everything that your researchers discover (though arguably coping with this is no harder than coping with building a successful startup around commercial applications of software that you open-source, a pretty common tactic).

I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If, in addition to that, you also believe that human values are a matter of objective fact (e.g. that they are mostly determined by a set of evolved Evolutionary Psychology adaptations to the environmental niche that humans evolved in), and are independent of background/culture/upbringing, then the target that this process converges to might be nearly independent of the human social context in which this work started, and of the desires/views/interests of the specific humans involved at the beginning of the process.

However, that is a rather strong and specific set of assumptions required for Myth #1 not to be a myth: I certainly agree that in general and by default, for most ideas in Alignment, human context matters, and that the long-term outcome of a specific Alignment technique being applied in, say, North Korea, might differ significantly from it being applied in North America.
