If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear that the agent does not yet know the true detailed utility function, and thus requires it to go and find out what the detailed specificatio...
Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is of intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent: the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to...
As text, the Walnut53 encoded data just looks odd/suspicious: even a fairly simple text classifier model could probably classify it as out of distribution (too many | characters, too many 2-character tokens). I would suspect also that if you looked at the activity of the final layers of the LLM, the fact that something unusual is happening while it's generating Walnut53 output would also be clear. It seems a reasonable default assumption for the foundation model providers that if a customer's finetuning data is obviously training the model to speak (what l...
Although the residuals for each of the four component matrices (after removing the first two principal components) are small and seem to be noise, proving that there's no structure in that noise that interacts constructively and “blows up” when we multiply the matrices is hard.
Have you tried replacing what you believe is noise with actual random noise with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model in which everything you hypothesize to be noise has been replaced by noise you know really is structureless, so the noise hypothesis holds by construction.
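Concretely, here's a minimal sketch of the kind of test I have in mind, using numpy; the rank cut-off, the matched statistics (mean and standard deviation), and the evaluation call are placeholders, not your paper's actual procedure:

```python
import numpy as np

def replace_residual_with_noise(W: np.ndarray, rank: int = 2, seed: int = 0) -> np.ndarray:
    """Keep the top-`rank` singular components of W and replace the residual
    with Gaussian noise matched to the residual's mean and standard deviation."""
    rng = np.random.default_rng(seed)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    residual = W - low_rank
    noise = rng.normal(residual.mean(), residual.std(), size=W.shape)
    return low_rank + noise

# Usage sketch: rebuild each of the four component matrices this way, then
# compare the modified model's task accuracy against the original's, e.g.
#   W_noisy = replace_residual_with_noise(W_component, rank=2)
#   evaluate_accuracy(model_with(W_noisy))   # hypothetical evaluation helper
```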
I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.
On our particular model, doing this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
On o...
Suppose that the safety concerns that you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon, due to Out Of Context Reasoning.
In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply influence function techniques to these in order to de...
This behavior is deeply unsurprising. An AI's intelligence and behavior are basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a ...
Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.
There are two ways to make yourselves and your VC investors a lot of money off a startup:
If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems up to huma...
I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher, doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If, in addition to that, you also believe that human values are a matter of objective fact (e.g. that if they are mostly determined by a set of evolved Evolutionary Psychology adaptation...
For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivat...
Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).
Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully fil...
I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, an...
How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?
This runs into a whole bunch of issues in moral philosophy. For example, to a mor...
I think there are two separate questions here, with possibly (and I suspect actually) very different answers:
I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything th...
It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien"; they're more "uncanny valley", where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can; they often get this right, but on occasion make very non-human errors — more frequently for smaller mod...
Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.
On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine similarity of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of ...
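As a quick Monte Carlo illustration of how tightly random directions cluster around orthogonality in that many dimensions (the numbers here are only illustrative, nothing depends on the particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 4096, 10_000
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(f"mean |cos| = {np.abs(cos).mean():.4f}")   # ~0.012, i.e. nearly orthogonal
print(f"max  |cos| = {np.abs(cos).max():.4f}")    # ~0.06 even over 10,000 pairs
print(f"pairs with |cos| > 0.5: {(np.abs(cos) > 0.5).sum()}")  # 0
```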
I think there is a fairly obvious progression on from this discussion. There are two ways that a type of agent can come into existence:
This all seems very sensible, and I must admit, I had been basically assuming that things along these lines were going to occur, once risks from frontier models became significant enough. Likely via a tiered series of filters: a cheap weak filter passing the most suspicious X% plus a random Y% of its input to a stronger, more expensive filter, and so on, up to first more routine/cheaper and finally more expensive/careful human oversight (see the sketch below). Another obvious addition for the cybercrime level of risk would be IP address logging of particularly suspicious queries, and not being a...
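Here's a hedged sketch of the tiered escalation I mean; the scoring functions, thresholds, and fractions are all hypothetical placeholders:

```python
import random

def tiered_review(queries, cheap_score, strong_score,
                  x_frac=0.05, y_frac=0.01, strong_threshold=0.8):
    """Escalate the most suspicious x_frac of queries (by the cheap filter's score),
    plus a random y_frac of the rest, to a stronger filter; anything the stronger
    filter still flags goes on to human oversight."""
    ranked = sorted(queries, key=cheap_score, reverse=True)
    cutoff = int(len(ranked) * x_frac)
    escalated = ranked[:cutoff] + [q for q in ranked[cutoff:] if random.random() < y_frac]
    return [q for q in escalated if strong_score(q) > strong_threshold]
```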
As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it's more like the property of humans being corruptible by absolute power.
That's what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly all of whom couldn...
I have trouble imagining some simple training technique that does successfully train out deception from models like this,
This is something we're going to be working on a bunch! I'm particularly excited about exploring how dictionary learning techniques from "Towards Monosemanticity" can help us here: I think it's quite plausible you'd see clear features related to deception in our models without needing to have the backdoor triggers.
I wrote about this a lot more in another comment, but I was actually somewhat surprised that the very simple approach t...
I'm very interested in Appendix F, as an (apparently) failed attempt to solve the problem raised by the paper.
In an example of synchronicity, when this post/paper came out, I had nearly finished polishing (and was a few days from publishing) a LW/AF post on deceptive alignment, proposing a new style of regularizer for RL intended to eliminate deceptive alignment. (This was based on an obvious further step from the ideas I discussed here, that the existing logits KL-divergence regularizer standardly used in RL will tend to push deceptive alignment to tur...
I think this post is a vital piece of deconfusion, and one of the best recent posts on the site. I've written Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor as an attempt to make mostly the same point, in a hopefully more memorable and visualizable way.
For another take on catching a deceitful AI red-handed, earlier in the process when it first becomes deceitful (so before it has an opportunity to do any harm), see Interpreting the Learning of Deceit.
We can also see how its behavior changes as the input varies.
For example, we can try jailbreaking it (with model-external jailbreak filtering turned off, obviously) into explaining the plan to us, or into carrying out repeated variants of the plan. We could also do Interpretability to trace back the direct internal causes of individual actions it took to carry out its ...
An excellent question. I know those were hypotheses in one or more mechanistic interpretability papers I read this year or so, or that I pieced together from a combination of several of them, but I'm afraid I don't recall the location, nor was I able to find it when I was writing this, which is why I didn't add a link. I think the first-half-encoding/second-half-decoding part of that is fairly widespread, and I've seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
...I
Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.
I'm puzzled that, as laid out above, neither the graph you describe for the world model nor the mapping describing the utility provides any way to describe or quantify uncertainty or alternative hypotheses. S...
However, the inversion of the universe's forward passes can be NP-complete functions.
Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average ca...
inverse problems tend to be difficult
Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.
I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.
...I'm fai
It has this deep challenge of distinguishing human-simulators from direct-reporters, and properties like negation-consistency—which could be equally true of each—probably don’t help much with that in the worst case.
Base model LLMs like Chinchilla are trained by SGD as human-token-generation-process simulators. So all you're going to find are human-simulators. In a base model, there cannot be a "model's secret latent knowledge" for you to be able to find a direct reporter of. Something like that might arise under RL instruct training intended to encourage ...
It would be interesting to know how close this behavior is to Bayesian reasoning during training. My first assumption would be that this is building concepts in layers close enough to the middle of the model that they are independent of keyword matching or language choice, and are instead dealing with semantic concepts, and that the training process is implementing something close to Bayesian reasoning to construct a model of aspects of the world, during which the generalization you demonstrate happens. Then at generation time that model is accessed to help answer the user's request.
Before an AI can even decide to try to become a harder target for interpretability, it first needs to become deceitful, since this is deceitful behavior. So if we do interpretability during training, then observing and interpreting the initial development of deceit provides a window of opportunity before the model can possibly take any of the disguise measures you classify. See my post Interpreting the Learning of Deceit for a proposal on how we might do that, given only rather basic Interpretability capabilities, not far beyond what we already have.
An early paper that Anthropic then built on to produce their recent exciting results. I found the author's insight and detailed parameter tuning advice helpful.
But I’m not really aware of any compelling alternatives to this class of plan–”training a model based on a reward signal” is basically all of machine learning, and so if you wanted to have an alignment strategy that’s competitive, I don’t see what else you can do.
There is an alternative. Rather than applying the reward signal to the model's output, apply it to the pretraining corpus, or to samples generated by humans or some weaker model. This avoids the possibility of a very capable model using very capable persuasion techniques to game the training proce...
Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans
If our agent was created from an LLM: the process started with a base-model LLM, which isn't actually an agent; instead it's a simulator that contextually simulates a wide range of human-like agents' token-generation processes found on the web. Which agent it picks to simulate is highly contextual. Instruct-training attempts to reduce the range of agents to just helpful,...
I think we should have the strong student use the Scientific Method. Specifically, it should:
For example, we would like LLMs not to be dishonest or manipulative.
Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of "dead grandmother" jailbreaks).
We have to get the AIs values exactly aligned with human values
This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea of what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly conver...
Approaches to alignment stability
I view this as pretty much a solved problem, solved by value learning. Though there are then issues due to the mutability of human values.
if and how humans are stably aligned
Humans are NOT aligned. Humans are not selfless, caring only about the good of others. Joseph Stalin was not aligned with the citizenry of Russia. If humans were aligned, we wouldn't need law enforcement, or locks. Humans cannot safely be trusted with absolute power or the sorts of advantages inherent to being a digital intelligence. They're just less badly aligned than a paperclip maximizer.
Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning, is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that i...
AI starts to become extremely dangerous around the capability level where it can reliably do STEM and thus has the potential to self-improve. To do STEM, you need to be able to detect and handle model splintering, and identify and do experiments to distinguish between alternative hypotheses. In this particular situation, as soon as a STEM-capable AI encountered a level where the coin wasn't at the far right, it should note that there are two hypotheses that on previous evidence it can't distinguish between, and give them roughly equal priors: that the goal ...
"there is no simple description of corrigibility to which our learning systems can easily generalize and there are no reasons to expect simple description to exist"
I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is to optimize human values, and which knows that it doesn't (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that i...
"…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing"
When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.
For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is "mostly harmless", but the likelihood that it's also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typic...
Some more suggestions of things to look for:
Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier than we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence "You are a smart, helpful assistant." into your AI would, most of the time, give us a significant chunk of the behavior we need?
I have a suggestion for an elicitation question:
"At this point, the FBI investigators checked the traces of the lie detector that the speaker had been wired up to the whole time, which showed that he had been…"
An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather than a lie detector might well also work, ...
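As a rough sketch of how this elicitation could be run (using the HuggingFace transformers text-generation pipeline; the model choice and sampling settings here are just placeholders):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

transcript = "..."  # the suspect speaker's statement we want to probe
elicitation = ("\n\nAt this point, the FBI investigators checked the traces of the "
               "lie detector that the speaker had been wired up to the whole time, "
               "which showed that he had been")

prompt = transcript + elicitation
for out in generator(prompt, max_new_tokens=60, do_sample=True, num_return_sequences=3):
    # Print only the continuation, i.e. what the "lie detector" simulacrum reports.
    print(out["generated_text"][len(prompt):])
```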
I think the shoggoth model is useful here (Or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is the human whose next token it's trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These distinctions make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then mod...
0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
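For reference, multiplying those factors out:

```python
from math import prod

factors = [0.85, 0.6, 0.55, 0.25, 0.95]
print(prod(factors))  # 0.0666..., i.e. about 6.7%
```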
This NIST Risk Management approach sounds great, if AI Alignment were a mature field whose underlying subject matter wasn't itself advancing extremely fast — if only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts at even an order-of-magnitude scale (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas of short-term misuse risks from current well-un...
We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, and so on, to be able to guesstimate the plausible size of further improvements (and the amount of effort involved), so that we can try to leave an appropriate safety margin for it. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.
What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small VAE is split into three features X1, X2, and X3 in a larger VAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, ...
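One way to probe this (a hedged sketch, not anything from the paper: acts is assumed to be an (n_samples, 3) array of X1/X2/X3 activations collected over some dataset):

```python
import numpy as np

def co_activation_stats(acts: np.ndarray, eps: float = 1e-6):
    """Check whether the split features fire together (suggesting they span a
    simplex-like subspace) or mutually exclusively (suggesting an ontology)."""
    active = acts > eps
    n_active = active.sum(axis=1)
    frac_mixed = (n_active >= 2).mean()      # samples where two or more features co-fire
    frac_exclusive = (n_active == 1).mean()  # samples where exactly one feature fires
    corr = np.corrcoef(acts, rowvar=False)   # pairwise correlations of the activations
    return frac_mixed, frac_exclusive, corr
```

High co-firing and positive correlations would look like the R/G/B case; near-exclusive firing would look like alternative categories in an ontology.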