I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.
You're welcome to reply with "Anthropic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me.
Thanks for asking! Here are some thoughts off the top of my head; I'd want to think more carefully before finalizing these and coming up with more specific proposals:
Another idea: "AI for epistemics," e.g. having a few FTEs working on making Claude a better forecaster. It would be awesome if you could advertise "SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc."
And it might not be that hard to achieve (maybe a few FTEs). There are existing datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets -- and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....
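A very rough sketch of what that chronological setup could look like; every function and data structure here is a hypothetical placeholder rather than a real training stack:

```python
# Hypothetical sketch of "train chronologically, and forecast year X before seeing year X."
# The model methods and data structures below are invented placeholders, not a real API.

def chronological_pretraining(model, corpus_by_year, questions_by_year):
    for year in sorted(corpus_by_year):
        # 1. Before the model sees any data from `year`, have it forecast questions
        #    that resolved during `year`, then score it against the known outcomes.
        questions = questions_by_year.get(year, [])
        preds = [model.predict_probability(q["text"]) for q in questions]
        outcomes = [q["outcome"] for q in questions]
        print(year, "Brier score:", brier_score(preds, outcomes))

        # 2. Use the graded forecasts as a training signal (RL, filtering, etc.).
        model.reinforce_forecasting(questions, preds, outcomes)

        # 3. Then do ordinary pretraining on that year's text.
        model.train_on(corpus_by_year[year])
    return model

def brier_score(probs, outcomes):
    # Mean squared error between predicted probabilities and 0/1 resolutions.
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / max(len(probs), 1)
```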
Or even if you don't do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.
Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...
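On the "well-calibrated" piece in particular, the target is at least straightforward to measure. Here is a small self-contained sketch using the standard expected-calibration-error definition; the example numbers are made up:

```python
# Self-contained sketch of expected calibration error (ECE) on resolved yes/no questions.
# The example data is made up; the definition (confidence bins, |accuracy - confidence|) is standard.
from collections import defaultdict

def expected_calibration_error(predictions, outcomes, n_bins=10):
    """predictions: probabilities in [0, 1]; outcomes: 0/1 resolutions."""
    bins = defaultdict(list)
    for p, o in zip(predictions, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    total = len(predictions)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(p for p, _ in items) / len(items)
        avg_acc = sum(o for _, o in items) / len(items)
        ece += (len(items) / total) * abs(avg_acc - avg_conf)
    return ece

# A model that says "90%" about things that happen only 60% of the time is poorly calibrated:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0, 0]))  # ~0.3
```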
You could also train it to be good at taking people's offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.
You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.
My current tentative guess is that this is somewhat worse than other alignment science projects that I'd recommend at the margin, but somewhat better than the 25th percentile project currently being done. I'd think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we'd learn generalizable scalable oversight / control approaches.
Sure, here are some things:
I can probably think of some more.
I'd add:
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay, or hinder the development of AGI.
I'll echo this and strengthen it to:
... call for policymakers to stop the development of AGI.
Fund independent safety efforts somehow, and make model access easier. I'm worried that Anthropic currently has a systemic and possibly negative impact on AI safety as a field, simply by virtue of hiring such a large share of the AI safety field, competence-weighted (with much of the remaining part being very close to Anthropic in its thinking).
To be clear, I don't think people are doing anything individually bad or unethical by going to work for Anthropic; I just think that:
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them;
- this is true even if people are genuinely trying to work on what's important for safety and to stay virtuous;
- I also think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside them see (or don't see) things (i.e., this is not about "do I trust Dario?").
As discussed in How will we update about scheming?:
While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.
I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful.
Something similar on state-proof security would be useful as well.
I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterally from a variety of perspectives.
This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
I'll continue to include more directions like this in the comments here.
- Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made-up, concrete, and stylized illustration: consider one evolutionarily endowed credit-assignment target: "Feel physically great," and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL). Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug," which takes actions that result in physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend," which takes actions that result in physical pleasure over a day.
Suppose also that asocial cognitions like "eat this" have poorly wired feedback channels: the signal is often lost, so it triggers credit assignment only some small fraction of the time. Social cognition is much better wired up and triggers credit assignment every time. Whenever credit assignment is triggered, once a day, the reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit some fraction $f$ of the time that it's due, the average reward (over 30 days, say) credited to WIRE is $f \cdot r_{\text{WIRE}}$, where $r_{\text{WIRE}}$ is the daily physical pleasure it produces. If and only if $f \cdot r_{\text{WIRE}} > r_{\text{SOCIAL}}$, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
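To make that arithmetic concrete, here is a tiny made-up simulation of the setup; the numbers and the drop fraction $f$ are invented purely for illustration:

```python
import random

# Toy simulation of the setup above; the numbers and the drop fraction f are invented.
def average_credited_reward(reward_per_day, credit_probability, days=30, seed=0):
    rng = random.Random(seed)
    credited = [reward_per_day for _ in range(days) if rng.random() < credit_probability]
    return sum(credited) / days

r_wire, r_social = 10.0, 8.0  # suppose the drug is somewhat more physically pleasurable
f = 0.2                       # but WIRE's feedback channel fires only 20% of the time

print(average_credited_reward(r_wire, f))      # roughly f * r_wire, i.e. about 2
print(average_credited_reward(r_social, 1.0))  # exactly r_social = 8, so SOCIAL wins

# Remove the impediment (f = 1) and WIRE is reinforced more whenever r_wire > r_social:
print(average_credited_reward(r_wire, 1.0))    # 10 > 8
```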
Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop), but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)
Important disanalogies seem to be:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward (both things like [desire to feel honest] and [worry about the consequences of being caught cheating]).
3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.
Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.
Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.
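For context on the "predicate" framing: the RSA-2048 example is a defection trigger that is cheap to check but, as far as anyone knows, impossible to satisfy during training today (namely, being shown a valid factorization of the RSA-2048 challenge modulus). A purely illustrative sketch, with the modulus left as a parameter:

```python
def rsa2048_factored(claimed_factor, rsa2048_modulus):
    # Cheap to verify, but (as far as anyone knows) impossible to satisfy today:
    # a nontrivial factor of the RSA-2048 challenge modulus.
    return 1 < claimed_factor < rsa2048_modulus and rsa2048_modulus % claimed_factor == 0

def deceptive_policy(observation, benign_action, defect_action, rsa2048_modulus):
    # Toy illustration of the sequencing worry: a policy that defects only when the
    # hard predicate fires will look aligned in every evaluation we can run before
    # the predicate becomes satisfiable.
    factor = observation.get("claimed_rsa2048_factor")
    if factor is not None and rsa2048_factored(factor, rsa2048_modulus):
        return defect_action
    return benign_action
```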