1 min read

7

This is a special post for quick takes by Evan Hubinger. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
1.

You're welcome to reply with "Anthopic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

25 comments, sorted by Click to highlight new comments since:

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.


  1. You're welcome to reply with "Anthopic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:

  1. Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
  2. Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it's important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it's similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
  3. (warning, untested idea) Record absolutely everything that happens within Anthropic and commit -- ideally in a legally binding and literally-hard-to-stop way -- to publishing it all with a 10-year delay.
  4. Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
  5. Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
  6. Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world "this is serious people, we think that there's a good chance the current trajectory leads to takeover by misaligned AIs, we aren't just saying this to hype anything, we really believe it" and/or "we are happy to give up our personal wealth, power, etc. if that's what it takes to get [policy package] passed." A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep. 
  7. Give more access to orgs like Redwood, Apollo, and METR (I don't know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
  8. Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don't do what OpenAI did with o1). Doesn't have to be all of it, just has to be enough -- e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
  9. Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of "Machines of loving grace." Or better yet, do a more serious version of "Machines of loving grace" that responds to objections like "but how will you ensure that you don't hand over control of the datacenters to AIs that are alignment faking rather than aligned" and "but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won't cause the models to be misaligned and then potentially alignment-fake)" and "What about bad humans in charge of Anthropic? Are we just supposed to trust that y'all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn't become dictators if they wanted to?" and "what will the goals/values/spec/constitution be exactly?" and "how will that be decided?"

     

Another idea: "AI for epistemics" e.g. having a few FTE's working on making Claude a better forecaster. It would be awesome if you could advertise "SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc."

And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets -- and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....

Or even if you don't do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.

Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...

You could also train it to be good at taking people's offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions. 

You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.

My current tenative guess is that this is somewhat worse than other alignment science projects that I'd recommend at the margin, but somewhat better than the 25th percentile project currently being done. I'd think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we'd learn generalizable scalable oversight / control approaches.

Sure, here are some things: 

  • Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
  • Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
  • Anthropic should remove any provisions that allow shareholders to disempower the LTBT
  • Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
  • Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
  • Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
  • Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
  • Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
  • Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.

I can probably think of some more. 

I'd add:

  • Support explicit protections for whistleblowers.

Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi

I'll echo this and strengthen it to:

... call for policymakers to stop the development of AGI.

Fund independent safety efforts somehow, make model access easier. I'm worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)

To be clear I don't think people are doing something individually bad or unethical by going to work for Anthropic, I just do think 
-environment people work in has a lot of hard to track and hard to avoid influence on them
-this is true even if people are genuinely trying to work on what's important for safety and stay virtuous
-I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don't see) stuff (i.e. this is not about "do I trust Dario?")
 

As discussed in How will we update about scheming?:

While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.

I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful.

Something similar on state proof security would be useful as well.

I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

  • What would be necessary to build a good auditing game benchmark?
  • How would AI safety AI work? What is necessary for it to go well?
  • How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
  • What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
  • Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
  • Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
  • What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
  • Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
  • Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
  • If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
  • The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
  • China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
  • If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
  • Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
  • How has transparency changed over time—Chris claims it's easier to interpret later models; is that true?
  • Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
  • What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
  • What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like William’s syndrome?

I'll continue to include more directions like this in the comments here.

  • Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
  • Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
  • One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
  • Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
  • Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
  • A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview? 

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.

One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.

One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: "Feel physically great,"  and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug" which takes actions that result in  physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend" taking actions which result in  physical pleasure over a day. 

Suppose also that asocial cognitions like "eat this" have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.

Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is . If and only if , like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.

Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.

Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.

REMINDER: At the time of writing, this has not been endorsed by Evan.

Thanks for the story! I may comment more on it later.

Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?

To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)

Important disanalogies seem:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).

The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.

Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.

Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".

Yeah I also felt some vague optimism about that.

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.