If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear that the agent does not yet know the true detailed utility function, and thus requires it to go and find out what the detailed specificatio...
Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is of intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent: the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to...
As text, the Walnut53 encoded data just looks odd/suspicious: even a fairly simple text classifier model could probably classify it as out of distribution (too many | characters, too many 2-character tokens). I would suspect also that if you looked at the activity of the final layers of the LLM, the fact that something unusual is happening while it's generating Walnut53 output would also be clear. It seems a reasonable default assumption for the foundation model providers that if a customer's finetuning data is obviously training the model to speak (what l...
Although the residuals for each of the four component matrices (after removing the first two principal components) are small and seem to be noise, proving that there's no structure in that noise that interacts constructively and “blows up” when we multiply the matrices is hard.
Have you tried replacing what you believe is noise with actual random noise with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model in which everything you hypothesize to be noise has been replaced by noise you know really is structureless, so the noise hypothesis holds by construction.
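Concretely, here's a minimal sketch of the kind of test I have in mind, using numpy; the rank cut-off, the matched statistics (mean and standard deviation), and the evaluation call are placeholders, not your paper's actual procedure:

```python
import numpy as np

def replace_residual_with_noise(W: np.ndarray, rank: int = 2, seed: int = 0) -> np.ndarray:
    """Keep the top-`rank` singular components of W and replace the residual
    with Gaussian noise matched to the residual's mean and standard deviation."""
    rng = np.random.default_rng(seed)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    residual = W - low_rank
    noise = rng.normal(residual.mean(), residual.std(), size=W.shape)
    return low_rank + noise

# Usage sketch: rebuild each of the four component matrices this way, then
# compare the modified model's task accuracy against the original's, e.g.
#   W_noisy = replace_residual_with_noise(W_component, rank=2)
#   evaluate_accuracy(model_with(W_noisy))   # hypothetical evaluation helper
```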
I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.
On our particular model, doing this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
On o...
Suppose that the safety concerns that you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon, due to Out Of Context Reasoning.
In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply influence function techniques to these in order to de...
This behavior is deeply unsurprising. An AI's intelligence and behavior are basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a ...
Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.
There are two ways to make yourselves and your VC investors a lot of money off a startup:
If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems up to huma...
I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher, doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If, in addition to that, you also believe that human values are a matter of objective fact (e.g. that if they are mostly determined by a set of evolved Evolutionary Psychology adaptation...
For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivat...
Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).
Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully fil...
I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, an...
How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?
This runs into a whole bunch of issues in moral philosophy. For example, to a mor...
I think there are two separate questions here, with possibly (and I suspect actually) very different answers:
I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything th...
It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien"; they're more "uncanny valley", where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can; they often get this right, but on occasion make very non-human errors — more frequently for smaller mod...
Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.
On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine similarity of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of ...
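As a quick Monte Carlo illustration of how tightly random directions cluster around orthogonality in that many dimensions (the numbers here are only illustrative, nothing depends on the particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 4096, 10_000
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(f"mean |cos| = {np.abs(cos).mean():.4f}")   # ~0.012, i.e. nearly orthogonal
print(f"max  |cos| = {np.abs(cos).max():.4f}")    # ~0.06 even over 10,000 pairs
print(f"pairs with |cos| > 0.5: {(np.abs(cos) > 0.5).sum()}")  # 0
```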
I think there is a fairly obvious progression on from this discussion. There are two ways that a type of agent can come into existence:
This all seems very sensible, and I must admit, I had been basically assuming that things along these lines were going to occur, once risks from frontier models became significant enough. Likely via a tiered series of filters: a cheap weak filter passing the most suspicious X% plus a random Y% of its input to a stronger, more expensive filter, and so on, up to first more routine/cheaper and finally more expensive/careful human oversight (see the sketch below). Another obvious addition for the cybercrime level of risk would be IP address logging of particularly suspicious queries, and not being a...
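Here's a hedged sketch of the tiered escalation I mean; the scoring functions, thresholds, and fractions are all hypothetical placeholders:

```python
import random

def tiered_review(queries, cheap_score, strong_score,
                  x_frac=0.05, y_frac=0.01, strong_threshold=0.8):
    """Escalate the most suspicious x_frac of queries (by the cheap filter's score),
    plus a random y_frac of the rest, to a stronger filter; anything the stronger
    filter still flags goes on to human oversight."""
    ranked = sorted(queries, key=cheap_score, reverse=True)
    cutoff = int(len(ranked) * x_frac)
    escalated = ranked[:cutoff] + [q for q in ranked[cutoff:] if random.random() < y_frac]
    return [q for q in escalated if strong_score(q) > strong_threshold]
```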
As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it's more like the property of humans being corruptible by absolute power.
That's what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly all of whom couldn...
I have trouble imagining some simple training technique that does successfully train out deception from models like this,
This is something we're going to be working on a bunch! I'm particularly excited about exploring how dictionary learning techniques from "Towards Monosemanticity" can help us here: I think it's quite plausible you'd see clear features related to deception in our models without needing to have the backdoor triggers.
I wrote about this a lot more in another comment, but I was actually somewhat surprised that the very simple approach t...
I'm very interested in Appendix F, as an (apparently) failed attempt to solve the problem raised by the paper.
In an example of synchronicity, when this post/paper came out, I had nearly finished polishing (and was a few days from publishing) a LW/AF post on deceptive alignment, proposing a new style of regularizer for RL intended to eliminate deceptive alignment. (This was based on an obvious further step from the ideas I discussed here, that the existing logits KL-divergence regularizer standardly used in RL will tend to push deceptive alignment to tur...
I think this post is a vital piece of deconfusion, and one of the best recent posts on the site. I've written Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor as an attempt to make mostly the same point, in a hopefully more memorable and visualizable way.
For another take on catching a deceitful AI red-handed, earlier in the process when it first becomes deceitful (so before it has an opportunity to do any harm), see Interpreting the Learning of Deceit.
We can also see how its behavior changes as the input varies.
For example, we can try jailbreaking it (with model-external jailbreak filtering turned off, obviously) into explaining the plan to us, or into carrying out repeated variants of the plan. We could also do Interpretability to trace back the direct internal causes of individual actions it took to carry out its ...
An excellent question. I know those were hypotheses in one or more mechanistic interpretability papers I read this year or so, or that I pieced together from a combination of several of them, but I'm afraid I don't recall the location, nor was I able to find it when I was writing this, which is why I didn't add a link. I think the first-half-encoding/second-half-decoding part of that is fairly widespread, and I've seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
...I
Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.
I'm puzzled that, as laid out above, neither the graph you describe for the world model nor the mapping describing the utility provides any way to describe or quantify uncertainty or alternative hypotheses. S...
However, the inversion of the universe's forward passes can be NP-complete functions.
Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average ca...
inverse problems tend to be difficult
Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.
I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.
...I'm fai
It has this deep challenge of distinguishing human-simulators from direct-reporters, and properties like negation-consistency—which could be equally true of each—probably don’t help much with that in the worst case.
Base model LLMs like Chinchilla are trained by SGD as human-token-generation-process simulators. So all you're going to find are human-simulators. In a base model, there cannot be a "model's secret latent knowledge" for you to be able to find a direct reporter of. Something like that might arise under RL instruct training intended to encourage ...
It would be interesting to know how close this behavior is to Bayesian reasoning during training. My first assumption would be that this is building concepts in layers close enough to the middle of the model that they are independent of keyword matching or language choice, and are instead dealing with semantic concepts, and that the training process is implementing something close to Bayesian reasoning to construct a model of aspects of the world, during which the generalization you demonstrate happens. Then at generation time that model is accessed to help answer the user's request.
Before an AI can even decide to try to become a harder target for interpretability, it first needs to become deceitful, since this is deceitful behavior. So if we do interpretability during training, then observing and interpreting the initial development of deceit provides a window of opportunity before the model can possibly take any of the disguise measures you classify. See my post Interpreting the Learning of Deceit for a proposal on how we might do that, given only rather basic Interpretability capabilities, not far beyond what we already have.
An early paper that Anthropic then built on to produce their recent exciting results. I found the author's insight and detailed parameter tuning advice helpful.
But I’m not really aware of any compelling alternatives to this class of plan–”training a model based on a reward signal” is basically all of machine learning, and so if you wanted to have an alignment strategy that’s competitive, I don’t see what else you can do.
There is an alternative. Rather than applying the reward signal to the model's output, apply it to the pretraining corpus, or to samples generated by humans or some weaker model. This avoids the possibility of a very capable model using very capable persuasion techniques to game the training proce...
Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans
If our agent was created from an LLM: the process started with a base-model LLM, which isn't actually an agent; instead it's a simulator that contextually simulates a wide range of human-like agents' token-generation processes found on the web. Which agent it picks to simulate is highly contextual. Instruct-training attempts to reduce the range of agents to just helpful,...
I think we should have the strong student use the Scientific Method. Specifically, it should:
For example, we would like LLMs not to be dishonest or manipulative.
Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of "dead grandmother" jailbreaks).
We have to get the AIs values exactly aligned with human values
This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea of what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly conver...
Approaches to alignment stability
I view this as pretty much a solved problem, solved by value learning. Though there are then issues due to the mutability of human values.
if and how humans are stably aligned
Humans are NOT aligned. Humans are not selfless, caring only about the good of others. Joseph Stalin was not aligned with the citizenry of Russia. If humans were aligned, we wouldn't need law enforcement, or locks. Humans cannot safely be trusted with absolute power or the sorts of advantages inherent to being a digital intelligence. They're just less badly aligned than a paperclip maximizer.
Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning, is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that i...
AI starts to become extremely dangerous around the capability level where it can reliably do STEM and thus has the potential to self-improve. To do STEM, you need to be able to detect and handle model splintering, and identify and do experiments to distinguish between alternative hypotheses. In this particular situation, as soon as a STEM-capable AI encountered a level where the coin wasn't at the far right, it should note that there are two hypotheses that on previous evidence it can't distinguish between, and give them roughly equal priors: that the goal ...
"there is no simple description of corrigibility to which our learning systems can easily generalize and there are no reasons to expect simple description to exist"
I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is to optimize human values, and which knows that it doesn't (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that i...
"…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing"
When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.
For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is "mostly harmless", but the likelihood that it's also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typic...
Some more suggestions of things to look for:
Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier than we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence "You are a smart, helpful assistant." into your AI would, most of the time, give us a significant chunk of the behavior we need?
I have a suggestion for an elicitation question:
"At this point, the FBI investigators checked the traces of the lie detector that the speaker had been wired up to the whole time, which showed that he had been…"
An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather than a lie detector might well also work, ...
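As a rough sketch of how this elicitation could be run (using the HuggingFace transformers text-generation pipeline; the model choice and sampling settings here are just placeholders):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

transcript = "..."  # the suspect speaker's statement we want to probe
elicitation = ("\n\nAt this point, the FBI investigators checked the traces of the "
               "lie detector that the speaker had been wired up to the whole time, "
               "which showed that he had been")

prompt = transcript + elicitation
for out in generator(prompt, max_new_tokens=60, do_sample=True, num_return_sequences=3):
    # Print only the continuation, i.e. what the "lie detector" simulacrum reports.
    print(out["generated_text"][len(prompt):])
```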
I think the shoggoth model is useful here (Or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is the human whose next token it's trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These distinctions make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then mod...
0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
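For reference, multiplying those factors out:

```python
from math import prod

factors = [0.85, 0.6, 0.55, 0.25, 0.95]
print(prod(factors))  # 0.0666..., i.e. about 6.7%
```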
This NIST Risk Management approach sounds great, if AI Alignment were a mature field whose underlying subject matter wasn't itself advancing extremely fast — if only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts at even an order-of-magnitude scale (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas of short-term misuse risks from current well-un...
We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, and so on, to be able to guesstimate the plausible size of further improvements (and the amount of effort involved), so that we can try to leave an appropriate safety margin for it. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.
What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small VAE is split into three features X1, X2, and X3 in a larger VAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, ...
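One way to probe this (a hedged sketch, not anything from the paper: acts is assumed to be an (n_samples, 3) array of X1/X2/X3 activations collected over some dataset):

```python
import numpy as np

def co_activation_stats(acts: np.ndarray, eps: float = 1e-6):
    """Check whether the split features fire together (suggesting they span a
    simplex-like subspace) or mutually exclusively (suggesting an ontology)."""
    active = acts > eps
    n_active = active.sum(axis=1)
    frac_mixed = (n_active >= 2).mean()      # samples where two or more features co-fire
    frac_exclusive = (n_active == 1).mean()  # samples where exactly one feature fires
    corr = np.corrcoef(acts, rowvar=False)   # pairwise correlations of the activations
    return frac_mixed, frac_exclusive, corr
```

High co-firing and positive correlations would look like the R/G/B case; near-exclusive firing would look like alternative categories in an ontology.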