All of RogerDearnaley's Comments + Replies

What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small SAE is split into three features X1, X2, and X3 in a larger SAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, ... (read more)
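One way to operationalize this distinction is co-activation statistics: under the subspace (simplex) hypothesis the split features frequently fire together in mixtures, while under the alternatives (ontology) hypothesis they are mutually exclusive. A toy sketch with synthetic activations standing in for X1, X2, and X3 (all data here is invented for illustration, not from any actual SAE):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Subspace hypothesis: activations carry a mix of the three features,
# i.e. points spread over the interior of a triangle (2-simplex).
mix = rng.dirichlet([1.0, 1.0, 1.0], size=n)

# Alternatives hypothesis: each activation contains exactly one feature,
# i.e. points sit only at the corners of the simplex.
alt = np.eye(3)[rng.integers(0, 3, size=n)]

def coactivation_rate(acts, eps=1e-3):
    """Fraction of samples in which two or more features fire together."""
    return np.mean((acts > eps).sum(axis=1) >= 2)

print(coactivation_rate(mix))  # near 1: features blend, spanning a 2d subspace
print(coactivation_rate(alt))  # 0: features act as mutually exclusive alternatives
```

On real SAE features the same statistic could be computed from feature activations over a large sample of inputs.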

If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural, if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead is a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear that the agent does not yet know the true detailed utility function and thus requires it to go attempt to find out what the detailed specificatio... (read more)

1Rubi Hudson
I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are. How to ensure the true goal is included seems like a very difficult open problem.

Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is of intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent, the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to... (read more)

1[anonymous]
Relevant post by Richard Ngo: "Moral Strategies at different capability levels". Crucial excerpt:

As text, the Walnut53 encoded data just looks odd/suspicious: even a fairly simple text classifier model could probably classify it as out of distribution (too many | characters, too many 2-character tokens). I would suspect also that if you looked at the activity of the final layers of the LLM, the fact that something unusual is happening while it's generating Walnut53 output would also be clear. It seems a reasonable default assumption from the foundation model providers that if a customer's finetuning data is obviously training the model to speak (what l... (read more)
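As a toy illustration of how little machinery such a check would need (the threshold and example strings below are invented for illustration, not taken from the actual paper):

```python
def looks_like_cipher_text(text: str, pipe_thresh: float = 0.2) -> bool:
    """Crude out-of-distribution check: ordinary English essentially never
    has a high density of '|' separator characters, while letter-level
    cipher encodings of the kind described above do. The threshold is an
    illustrative guess, not a tuned value."""
    if not text:
        return False
    return text.count("|") / len(text) > pipe_thresh

print(looks_like_cipher_text("The quick brown fox jumps over the lazy dog."))  # False
print(looks_like_cipher_text("d|e|s|i|g|n| |a| |w|e|a|p|o|n"))                 # True
```

A real filter would presumably use a trained classifier over token statistics rather than a single hand-picked feature, but the point stands that this data is easy to flag.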

Although the residuals for each of the four component matrices (after removing the first two principal components) are both small and seem to be noise, proving that there's no structure that causes the noise to interact constructively when we multiply the matrices and “blow up” is hard. 

Have you tried replacing what you believe is noise with actual random noise, with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model that has had all potential structure that you hypothesize is just noise replaced, where you know the noise hypothesis is true.
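The replacement test being suggested might be sketched roughly as follows, using a random matrix as a stand-in for one of the paper's component matrices (k, the shapes, and the Gaussian noise model are all illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def replace_residual_with_noise(W: np.ndarray, k: int = 2) -> np.ndarray:
    """Keep the top-k singular components of W and replace the residual
    with Gaussian noise matched in mean and standard deviation. If the
    model's performance is unchanged after this surgery, the residual
    carried no structure the model was actually using."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    residual = W - low_rank
    noise = rng.normal(residual.mean(), residual.std(), size=W.shape)
    return low_rank + noise

W = rng.normal(size=(64, 32))  # hypothetical component matrix
W_noised = replace_residual_with_noise(W)
assert W_noised.shape == W.shape
```

One would then re-run the model's evaluation with `W_noised` swapped in for `W` and compare the noise bound.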

I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.

On our particular model, doing this replacement shows us that the noise bound in our particular model is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On o... (read more)

Suppose that the safety concerns that you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon due to Out Of Context Reasoning.

In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply Influence Functions techniques to these in order to de... (read more)

This behavior is deeply unsurprising. AI's intelligence and behavior was basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest, when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a ... (read more)

Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.

There are two ways to make yourselves and your VC investors a lot of money off a startup:

  1. a successful IPO
  2. a successful buy-out by a large company, even one with an acquihire component

If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems up to huma... (read more)

I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If in addition to that, you also believe that human values are a matter of objective fact (e.g. that if they are mostly determined by a set of evolved Evolutionary Psychology adaptation... (read more)

For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivat... (read more)

Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).

Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully fil... (read more)

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make that show it a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, an... (read more)

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a mor... (read more)

I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

  1. How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
  2. How likely is deceptive alignment to be boosted in an LLM under SGD fine tuning followed by RL for HHH-behavior applied to a base model trained by 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything th... (read more)

It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien", they're more "uncanny valley" where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can: they often get this right, but on occasion make very non-human errors — more frequently for smaller mod... (read more)

Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.

On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine angle of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of ... (read more)
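The concentration-of-measure effect being invoked here is easy to verify numerically: for random Gaussian vectors in dimension d, cosines are approximately normal with standard deviation 1/√d, which is about 0.016 at d = 4096, so a cosine of 0.5 is dozens of standard deviations out:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096   # Llama-2 7B's embedding dimension
n = 2_000  # number of random pairs to sample

a = rng.normal(size=(n, d))
b = rng.normal(size=(n, d))
cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(cos.std())          # close to 1/sqrt(4096) ≈ 0.0156
print(np.abs(cos).max())  # far below 0.5 even over 2,000 random pairs
```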

I think there is a fairly obvious progression on from this discussion. There are two ways that a type of agent can come into existence:

  1. It can, as you discuss, evolve. In which case as an evolved biological organism it will of course use its agenticness and any reasoning abilities and sapience it has to execute adaptations intended by evolution to increase its evolutionary fitness (in the environment it evolved in). So, to the extent that evolution has done its job correctly (which is likely less than 100%), such an agent has its own purpose: look after #1
... (read more)

This all seems very sensible, and I must admit, I had been basically assuming that things along these lines were going to occur, once risks from frontier models became significant enough. Likely via a tiered series of a cheap weak filter passing the most suspicious X% plus a random Y% of its input to a stronger more expensive filter, and so on up to more routine/cheaper and finally more expensive/careful human oversight. Another obvious addition for the cybercrime level of risk would be IP address logging of particularly suspicious queries, and not being a... (read more)

As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it's more like the property of humans being corruptible by absolute power.

That's what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly-all of whom couldn... (read more)

I have trouble imagining some simple training technique that does successfully train out deception from models like this, 

This is something we're going to be working on a bunch! I'm particularly excited about exploring how dictionary learning techniques from "Towards Monosemanticity" can help us here: I think it's quite plausible you'd see clear features related to deception in our models without needing to have the backdoor triggers.

I wrote about this a lot more in another comment, but I was actually somewhat surprised that the very simple approach t... (read more)

I'm very interested in Appendix F, as an (apparently) failed attempt to solve the problem raised by the paper.

In an example of synchronicity, when this post/paper came out, I had nearly finished polishing (and was a few days from publishing) a LW/AF post on deceptive alignment, proposing a new style of regularizer for RL intended for eliminating deceptive alignment. (This was based on an obvious further step from the ideas I discussed here, that the existing logits KL-divergence regularizer standardly used in RL will tend to push deceptive alignment to tur... (read more)

I think this post is a vital piece of deconfusion, and one of the best recent posts on the site. I've written Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor as an attempt to make mostly the same point, in a hopefully more memorable and visualizable way.

For another take on catching a deceitful AI red-handed, earlier in the process when it first becomes deceitful (so before it has an opportunity do any harm), see Interpreting the Learning of Deceit.

We can also see how its behavior changes as the input varies.

For example, we can try jailbreaking it (with model-external jailbreaking filtering turned off, obviously) to explain the plan to us, or to carry out repeated variants of the plan. We could also do Interpretability to trace back the direct internal causes of individual actions it took to carry out its ... (read more)

An excellent question. I know those were hypotheses in one or more mechanistic interpretability papers I read this year or so, or that I pieced together from a combination of several of them, but I'm afraid I don't recall the location, nor was I able to find it when I was writing this, which is why I didn't add a link. I think the first half encoding/second half decoding part of that is fairly widespread and I've seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):

I

... (read more)

Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.

I'm puzzled that, as laid out above, neither the graph you describe for the world model nor the mapping describing the utility provide any way to describe or quantify uncertainty or alternative hypotheses. S... (read more)

1Michele Campolo
Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.

Under Further observations, I wrote: If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text. Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation f reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let's say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions, is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.

You wrote: It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the reached position depends on culture. The research idea is to make experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.

I am not sure I completely follow you when you are talking about experimental evidence about f, but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: "if a free agent didn't have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available". Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable pa

However, the inversion of the universe's forward passes can be NP-complete functions.

Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average ca... (read more)

inverse problems tend to be difficult

Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.

I'm fai

... (read more)
1Thane Ruthenis
Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that NP≠P is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in P. However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

2030 is the target for having completed the "hire a horde of mathematicians and engineers and blow the problem wide open" step, to be clear. I don't expect the theoretical difficulties to take quite so long. Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here's the Pile, now write some code to make them interface with each other. Specifically in this case, the problems are:

  1. What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?
  2. How do you make it not just understand that, but care about that?
  3. How do you interface with the world-model it learned, and point at what the humans care about?

It has this deep challenge of distinguishing human-simulators from direct-reporters, and properties like negation-consistency—which could be equally true of each—probably don’t help much with that in the worst case.

Base model LLMs like Chinchilla are trained by SGD as human-token-generation-process simulators. So all you're going to find are human-simulators. In a base model, there cannot be a "model's secret latent knowledge" for you to be able to find a direct reporter of. Something like that might arise under RL instruct training intended to encourage ... (read more)

It would be interesting to know how close this behavior is to Bayesian reasoning during training. My first assumption would be that this is building concepts in layers close enough to the middle of the model that they are independent of keyword matching or language choice, and are instead dealing with semantic concepts, and that the training process is implementing something close to Bayesian reasoning to construct a model of aspects of the world, during which the generalization you demonstrate happens. Then at generation time that model is accessed to help answer the user's request.

1Owain Evans
I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior and if so how should we estimate that? How to think about the distinction between in-context and out-of-context reasoning?)

In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):

  1. Results on the same task differ between GPT-3 and Llama-2 models (two models that have fairly similar overall capabilities), Llama-2 being slightly more influenced by declarative information.
  2. From the Bayesian perspective, including "realized descriptions" should have a significant impact on how much the model is influenced by "unrealized descriptions". The effects we see seem smaller than expected (see Figure 4 and Table 2).

Incidentally, I like the idea of testing in different languages to see if the model is encoding the information more abstractly.

Before an AI can even decide even to try to become a harder target for interpretability, it first needs to become deceitful, since this is deceitful behavior. So if we do interpretability during training, then observing and interpreting the initial development of deceit provides a window of opportunity before the model can possibly take any of the disguise measures you classify. See my post Interpreting the Learning of Deceit for a proposal on how we might do that, given only rather basic Interpretability capabilities, not far beyond what we already have.

An early paper that Anthropic then built on to produce their recent exciting results. I found the author's insight and detailed parameter tuning advice helpful.

But I’m not really aware of any compelling alternatives to this class of plan–”training a model based on a reward signal” is basically all of machine learning, and so if you wanted to have an alignment strategy that’s competitive, I don’t see what else you can do.

There is an alternative. Rather than applying the reward signal to the model's output, apply it to the pretraining corpus, or to samples generated by humans or some weaker model. This avoids the possibility of a very capable model using very capable persuasion techniques to game the training proce... (read more)

Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans

If our agent was created from an LLM, then the process started with a base model, which isn't actually an agent: instead it's a simulator that contextually simulates a wide range of the human-like agents' token-generation processes found on the web. What agent it picks to simulate is highly contextual. Instruct-training attempts to reduce the range of agents to just helpful,... (read more)

I think we should have the strong student use the Scientific Method. Specifically, it should:

  1. Come up with hypotheses about the weak supervisor's behavior (such as by doing repeated CoT to predict supervisor and ground truth labels for specific cases), and notice when it is uncertain between two or more alternative hypotheses (such as by clustering steps from repeated CoTs for the same test case, and looking for clusters that are on the same subject, mutually exclusive in occurrence, and logically incompatible, then clustering these across many test c
... (read more)

For example, we would like LLMs not to be dishonest or manipulative.

Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of "dead grandmother" jailbreaks).

We have to get the AI's values exactly aligned with human values

This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly conver... (read more)
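The convergence claim can be illustrated with a deliberately oversimplified toy model, where "human values" are collapsed to a single unknown preference probability that the learner estimates from noisy observations (everything here is an illustrative stand-in, not a model of actual value learning):

```python
import numpy as np

rng = np.random.default_rng(0)

true_pref = 0.8          # the aspect of human values unknown to the agent
alpha, beta = 1.0, 1.0   # uniform Beta prior: a "passable idea", no details

for _ in range(1_000):   # observe noisy evidence about what humans want
    obs = bool(rng.random() < true_pref)
    alpha += obs
    beta += 1 - obs

posterior_mean = alpha / (alpha + beta)
print(posterior_mean)    # close to 0.8: the estimate has converged
```

The analogue of condition (a) above is that the prior puts nonzero mass near the truth; condition (b), not killing everyone in the meantime, is of course not captured by this toy at all.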

Approaches to alignment stability

I view this as pretty much a solved problem, solved by value learning. Though there are then issues due to the mutability of human values.

if and how humans are stably aligned

Humans are NOT aligned. Humans are not selfless, caring only about the good of others. Joseph Stalin was not aligned with the citizenry of Russia. If humans were aligned, we wouldn't need law enforcement, or locks. Humans cannot safely be trusted with absolute power or the sorts of advantages inherent to being a digital intelligence. They're just less badly aligned than a paperclip maximizer.

Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that i... (read more)

AI starts to become extremely dangerous around the capability level where it can reliably do STEM and thus has the potential to self-improve. To do STEM, you need to be able to detect and handle model splintering, and identify and do experiments to distinguish between alternative hypotheses. In this particular situation, as soon as a STEM-capable AI encountered a level where the coin wasn't at the far right, it should note that there are two hypotheses that on previous evidence it can't distinguish between, so it should assign them roughly equal priors: that the goal ... (read more)

"there is no simple description of corrigibility to which our learning systems can easily generalize and there are no reasons to expect simple description to exist"

I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is optimize human values, and knows that it doesn't (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that i... (read more)

"…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing"

When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.

For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is "mostly harmless", but the likelihood that it's also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typic... (read more)

Some more suggestions of things to look for:

  1. Toxicity: what triggers when the model is swearing, being rude or insulting, etc.?
  2. Basic emotions: love, anger, fear, etc. In particular, when those emotions are actually being felt by whatever persona the LLM is currently outputting tokens for, as opposed to just being discussed.
    1. For love: subvariants like parental, platonic, erotic.
  3. Criminality, 'being a villain', psychopathy, antisocial behavior: are there circuits that light up when the LLM is emitting tokens for a 'bad guy'? How about for an angel/saint/wise p
... (read more)

Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier than we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence "You are a smart, helpful assistant." into your AI would, most of the time, give us a significant chunk of the behavior we need?

Suppose your AI includes an LLM. You can just prompt it with "You love, and want to protect, all humans in the same way as a parent loves their children." Congratulations — you just transferred the entire complex behavior pattern. Now all you need to do is tune it.

I have a suggestion for an elicitation question:

"At this point, the FBI investigators checked the traces of the lie detector that the speaker had been wired up to the whole time, which showed that he had been…"

An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather than a lie detector might well also work, ... (read more)

I think the shoggoth model is useful here (Or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token they're trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These differences make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then mod... (read more)

0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
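Checking the product directly:

```python
p = 0.85 * 0.6 * 0.55 * 0.25 * 0.95
print(round(p, 3))  # 0.067, i.e. 6.7%
```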

This NIST Risk Management approach sounds great, if AI Alignment was a mature field whose underlying subject matter wasn't itself advancing extremely fast — if only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts at even an order of magnitude scale (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas of short-term misuse risks from current well-un... (read more)

We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, and other advances to be able to guesstimate the plausible size of further improvements (and amount of effort involved), so that we can try to leave some appropriate safety margin for it. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.
