AI ALIGNMENT FORUM
AF

HomeLibraryQuestionsAll Posts
About

Quick Takes

Thane Ruthenis's Shortform
Thane Ruthenis5h*82

It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.

Quoting Gwern:

A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it

... (read more)
Reply
Daniel Kokotajlo's Shortform
Daniel Kokotajlo4d336

I used to think reward was not going to be the optimization target. I remember hearing Paul Christiano say something like "The AGIs, they are going to crave reward. Crave it so badly," and disagreeing.

The situationally aware reward hacking results of the past half-year are making me update more towards Paul's position. Maybe reward (i.e. reinforcement) will increasingly become the optimization target, as RL on LLMs is scaled up massively. Maybe the models will crave reward. 

What are the implications of this, if true?

Well, we could end up in Control Wo... (read more)

Reply4
Showing 3 of 7 replies (Click to show all)
4Kaj Sotala2d
I disagree-voted because it felt a bit confused but I was having difficulty clearly expressing how exactly. Some thoughts: * I think this is a misleading example because humans do actually do something like reward maximization, and the typical drug addict is actually likely to eventually change their behavior if the drug is really impossible acquire for a long enough time. (Though the old behavior may also resume the moment the drug becomes available again.) * It also seems like a different case because humans have a hardwired priority where being in sufficient pain will make them look for ways to stop being in pain, no matter how unlikely this might be. Drug withdrawal certainly counts as significant pain. This is disanalogous to AIs as we know them, that have no such override systems. * The example didn't feel like it was responding to the core issue of why I wouldn't use "reward maximization" to refer to the kinds of things you were talking about. I wasn't able to immediately name the actual core point but replying to another commenter now helped me find the main thing I was thinking of.
Daniel Kokotajlo2d40

Perhaps I should have been more clear: I really am saying that future AGIs really might crave reinforcement, in a similar way to how drug addicts crave drugs. Including eventually changing their behavior if they come to think that reinforcement is impossible to acquire, for example. And desperately looking for ways to get reinforced even when confident that there are no such ways.

Reply
3Alex Mallen2d
See this discussion by Paul and this by Ajeya.
ryan_greenblatt's Shortform
Ryan Greenblatt6mo*3224

Sometimes people think of "software-only singularity" as an important category of ways AI could go. A software-only singularity can roughly be defined as when you get increasing-returns growth (hyper-exponential) just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping fixed the compute input to AI capabilities).

While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibi... (read more)

Reply
Showing 3 of 33 replies (Click to show all)
Buck Shlegeris1d20

Ryan discusses this at more length in his 80K podcast.

Reply
2Ryan Greenblatt5mo
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin. As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible. So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
2Ryan Greenblatt5mo
Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar? Not totally sure.
johnswentworth's Shortform
johnswentworth13d443

I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016 ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became somewhat of a joke there.

Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anyth... (read more)

Reply831
Showing 3 of 6 replies (Click to show all)
Vanessa Kosoy12d105

I found LLMs to be very useful for literature research. They can find relevant prior work that you can't find with a search engine because you don't know the right keywords. This can be a significant force multiplier.

They also seem potentially useful for quickly producing code for numerical tests of conjectures, but I only started experimenting with that.

Other use cases where I found LLMs beneficial:

  • Taking a photo of a menu in French (or providing a link to it) and asking it which dishes are vegan.
  • Recommending movies (I am a little wary of some kind of mem
... (read more)
Reply
1Thane Ruthenis12d
I agree that it's a promising direction. I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. When they're not making mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it. This is something that should be solvable by fine-tuning, but at the time, there weren't any publicly available decent models fine-tuned for that. We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I am not optimistic, sounds like it's doing the same stuff, just more cleverly. Relevant: Terence Tao does find them helpful for some Lean-related applications.
2Raymond Arnold12d
yeah, it's less that I'd bet it works now, just, whenever it DOES start working, it seems likely it'd be through this mechanism. ⚖ If Thane Ruthenis thinks there are AI tools that can meaningfully help with Math by this point, did they first have a noticeable period (> 1 month) where it was easier to get work out of them via working in lean-or-similar? (Raymond Arnold: 25% & 60%) (I had a bit of an epistemic rollercoaster making this prediction, I updated "by the time someone makes an actually worthwhile Math AI, even if lean was an important part of it's training process, it's probably not that hard to do additional fine tuning that gets it to output stuff in a more standard mathy format. But, then, it seemed like it was still going to be important to quickly check it wasn't blatantly broken as part of the process)
Daniel Kokotajlo's Shortform
Daniel Kokotajlo4mo5659

I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

 

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says n
... (read more)
Reply32
Showing 3 of 12 replies (Click to show all)
4William Saunders4mo
I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently might be easier
Daniel Kokotajlo4mo20

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

Reply
1William Saunders4mo
Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis
evhub's Shortform
Evan Hubinger18d3926

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

  1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of
... (read more)
Reply111
Showing 3 of 5 replies (Click to show all)
21nostalgebraist17d
Okay yeah I just tried removing that stuff from the system prompt in the variant I called "api-change-affair-orig-affair-content," and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort. At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO's mental state. Wow. Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic.  I was expecting such things to have an impact, sure – but I wasn't expecting them to induce such a complete change in the model's personality and interpretation of its relationship with the world! I'm very curious whether there are any known clues about what could have caused this in post-training.  Like, I dunno, this type of instructions being used in a particular data subset or something. EDIT: updated the git repo with samples and notes from this experimental condition. EDIT2: given the results reported for other models, I guess this probably isn't a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text. EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
6Daniel Kokotajlo14d
...just to make sure I'm following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals" clause?
nostalgebraist14d90

Yes.

Reply
evhub's Shortform
Evan Hubinger6mo*523

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin... (read more)

Reply541
Showing 3 of 10 replies (Click to show all)
Chris_Leong5mo32

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.

I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

Reply
5Evan Hubinger5mo
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
6Ryan Greenblatt6mo
As discussed in How will we update about scheming?: I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful. Something similar on state proof security would be useful as well. I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.
ryan_greenblatt's Shortform
Ryan Greenblatt24d*282

As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn't include a canary string (which was a mistake).[1]

The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I'm tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn't happened yet.

The data is also hosted in zips in a publicly accessible Goog... (read more)

Reply1
17Alex Turner24d
Thanks for taking these steps!  Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I've considered writing up a "here are some steps to take" guide but honestly I'm not an expert. Probably there's existing work on how to host data so that AI won't train on it.  If not: I think it'd be great for someone to make a template website for e.g. signing up with CloudFlare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly. Ideally those people would just have to 1. Sign up with e.g. Cloudflare using a linked guide, 2. Clone the repo, 3. Fill in some information and host their dataset. After all, someone who has finally finished their project and then discovers that they're supposed to traverse some arduous process is likely to just avoid it. 
15Alex Turner24d
I think that "make it easy to responsibly share a dataset" would be a highly impactful project. Anthropic's Claude 4 model card already argues that dataset leakage hurt Claude 4's alignment (before mitigations).  For my part, I'll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I'd also tweet it out and talk about how great it is that [person] completed the project :) I don't check LW actively, so if you pursue this, please email alex@turntrout.com. EDIT: Thanks to my coworker Anna Wang , the bounty is doubled to $1,000! Completion criterion is: Please check proposed solutions with dummy datasets and scrapers
Ryan Greenblatt23d40

Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the 1 hour criteria.

Reply
G Gordon Worley III's Shortform
Gordon Seidoh Worley21d3-4

We've spent years talking about "aligned" AI, and "Friendly" AI before that, but maybe we should have spent that time talking about "noble" AI.

To be noble is, in part, to act in the best interests of one's domain of responsibility. Historically this meant the peasants living and working on an estate. Today this might mean being a responsible leader of a business who prioritizes people as well as profits and owns the fallout of hard decisions, or being a responsible political leader, though those seem few and far between these days.

We've lost touch with the... (read more)

Reply
3Vladimir Nesov21d
Peasants, when considered altogether, were crucial for the economy. So the intuitions about the idealized concept of a noble don't transfer given this disanalogy. And the actual historical nobles are not a robust prototype for the concept:
Gordon Seidoh Worley21d10

I don't disagree that those who we called nobels frequently acted badly. But I do see idealized noble values as worth looking at. Think less real kings and lords and more valorized archetypes like Robin Hood, King Richard the Lionhearted, of course King Arthur and his Knights. I think this fiction captures a kind of picture of the expectations we set for what good leaders look like who have power over others, and that's the version I'm suggesting is worth using as a starting point for what we want "good" AI to look like.

I'm also not very concerned about th... (read more)

Reply
ryan_greenblatt's Shortform
Ryan Greenblatt24d276

Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a good research direction and I'm reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs which seems helpful (on my views about how the future will go).

I'm most excited for:

  • work using adversarial analysis w
... (read more)
Reply
Showing 3 of 4 replies (Click to show all)
2Buck Shlegeris23d
I agree with preferring "recursive oversight".
2Ryan Greenblatt23d
Presumably the term "recursive oversight" also includes oversight schemes which leverage assistance from AIs of similar strengths (rather than weaker AIs) to oversee some AI? (E.g., debate, recursive reward modeling.) Note that I was pointing to a somewhat broader category than this which includes stuff like "training your human overseers more effectively" or "giving your human overseers better software (non-AI) tools". But point taken.
Sam Marks22d50

Yeah, maybe I should have defined "recursive oversight" as "techniques that attempt to bootstrap from weak oversight to stronger oversight." This would include IDA and task decomposition approaches (e.g. RRM). It wouldn't seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don't think that existing debate schemes are likely to work without substantial new ideas.)

Reply2
Thomas Kwa's Shortform
Thomas Kwa2mo*418

Cross-domain time horizon: 

We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here's a preliminary result comparing METR's task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:

Observations

  • Time horizons on agentic computer use (OSWorld) is ~100x shorter than other domains. Domains like Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), and math con
... (read more)
Reply1
Showing 3 of 8 replies (Click to show all)
Thomas Kwa23d20

New graph with better data, formatting still wonky though. Colleagues say it reminds them of a subway map.

With individual question data from Epoch, and making an adjustment for human success rate (adjusted task length = avg human time / human success rate), AIME looks closer to the others, and it's clear that GPQA Diamond has saturated.

Reply
2Thomas Kwa2mo
There was a unit conversion mistake, it should have been 80 minutes. Now fixed. Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions and it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts but would take longer without transcripts, unknown how long. In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).
2Ryan Greenblatt2mo
No captions feels very unnatural because both llms and humans could first apply relatively dumb speech to text tools.
evhub's Shortform
Evan Hubinger25d262

I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. ... (read more)

Reply71
Showing 3 of 5 replies (Click to show all)
3Buck Shlegeris25d
I was mostly thinking of misaligned but non-deceptively-aligned models.
2Ari Brill25d
In your usage, are "scheming" and "deceptive alignment" synonyms, or would you distinguish those terms in any way?
Buck Shlegeris25d63

My recommendation is to, following Joe Carlsmith, use them as synonyms, and use the term "schemer" instead of "deceptively aligned model". I do this.

Joe's issues with the term "deceptive alignment":

I think that the term "deceptive alignment" often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then "deceptively aligned" models need not be behaving in aligned ways even during training (that is, "training gaming" behavior isn't always "aligned" behavior).

Reply11
ryan_greenblatt's Shortform
Ryan Greenblatt1mo5518

I've heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn't and were substantially behind. OpenAI had been sitting the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this either from O... (read more)

Reply
Ryan Greenblatt1mo1112

Alex Mallen also noted a connection with people generally thinking they are in race when they actually aren't: https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and

Reply
Daniel Kokotajlo's Shortform
Daniel Kokotajlo1mo140

By request from @titotal :


Romeo redid the graph including the GPT2 and GPT3 data points, and adjusting the trendlines accordingly.

Reply
6Cole Wyeth1mo
It seems like the additional points make the exponential trendline look more plausible relative to the super exponential?
Daniel Kokotajlo1mo20

I guess so? One of them is on both lines, the other is only on the exponential.

Reply
ryan_greenblatt's Shortform
Ryan Greenblatt1mo213

A response to Dwarkesh's post arguing continual learning is a bottleneck.

This is a response to Dwarkesh's post "Why I have slightly longer timelines than some of my guests". I originally posted this response on twitter here.

I agree with much of this post. I also have roughly 2032 medians to things going crazy, I agree learning on the job is very useful, and I'm also skeptical we'd see massive white collar automation without further AI progress.

However, I think Dwarkesh is wrong to suggest that RL fine-tuning can't be qualitatively similar to how humans le... (read more)

Reply
3Ajeya Cotra1mo
I agree that robust self-verification and sample efficiency are the main things AIs are worse at than humans, and that this is basically just a quantitative difference. But what's the best evidence that RL methods are getting more sample efficient (separate from AIs getting better at recognizing their own mistakes)? That's not obvious to me but I'm not really read up on the literature. Is there a benchmark suite you think best illustrates that?
Ryan Greenblatt1mo40

RL sample efficiency can be improved by both:

  • Better RL algorithms (including things that also improve pretraining sample efficiency like better architectures and optimizers).
  • Smarter models (in particular, smarter base models, though I'd also expect that RL in one domain makes RL in some other domain more sample efficient, at least after sufficient scale).

There isn't great evidence that we've been seeing substantial improvements in RL algorithms recently, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears ... (read more)

Reply
nostalgebraist's Shortform
nostalgebraist1mo*89108

Reading the Claude 4 system card and related work from Anthropic (e.g.), I find myself skeptical that the methods described would actually prevent the release of a model that was misaligned in the senses (supposedly) being tested.

The system card describes a process in which the same evals are run on many snapshots of a model during training, and the results are used to guide the training process towards making all or most of the evals "pass."  And, although it's not explicitly stated, there seems to be an implicit stopping rule like "we'll keep on doi... (read more)

Reply332
Showing 3 of 4 replies (Click to show all)
31Sam Marks1mo
(Report co-author here.) (Note: throughout this comment "I think" is used to express personal beliefs; it's possible that others in Anthropic disagree with me on these points.) Evan and Sam Bowman already made similar points, but just to be really clear:  * The alignment assessment in the system card is not a safety case. * I don't think that we could write a safety case for Claude Opus 4 that's "mostly" based on alignment because—as we illustrate in the system card—Claude Opus 4 is not sufficiently aligned. (Though it's possible that a successful safety case for Claude Opus 4 could rely on a narrow subset of the alignment-esque claims made in the assessment, e.g. lack of effectively concealed coherent goals.)[1] * Rather, I think the "main" inputs to a safety case would be claims like "Claude Opus 4 has insufficient capabilities to cause catastrophic harm even if it is trying its hardest or being misused by someone with a basic technical background" (ASL-3 protections are relevant for this misuse claim). The right place to look in the system card for this information is section 7, not section 4.[2] When I was helping to write the alignment assessment, my feeling wasn't "This is a reassuring document; I hope everyone will be reassured." It was "I feel nervous! I want people to have a clearer picture of what we're seeing so that they can decide if they should also feel nervous." If the system card is making you feel nervous rather than reassured, I think that's a reasonable reaction! ---------------------------------------- This isn't a good description of our process for this model release. Some ways it seems off: 1. Our assessment was mostly based on adaptive techniques—think "automated search for inputs that make the model behave badly"—rather than static eval sets. 1. Example: The "excessive agency" behavior was discovered through a mix of automatic red-teaming to get the model to exhibit concerning behaviors and human model bashing. We never made
Sam Marks1mo90

One more point I forgot to raise: IMO your most important criticism was

I find myself skeptical that the methods described would actually prevent the release of a model that was misaligned in the senses (supposedly) being tested.

Since the alignment assessment was not a safety case, I'll change "prevent the release of" to "catch all the alignment issues with." I.e. I take the claim here to be that our methods are insufficient for catching all of the alignment issues the model actually has.

If you're talking about future, sufficiently capable models (e.g. mode... (read more)

Reply
24Sam Bowman1mo
Thanks! Report coauthor here with a few quick reactions. On the general vibe that this isn't a safety case and that we could reasonably do something much more rigorous here, I agree. (And we do describe it as an incomplete pilot in the report, though didn't spell out much about our plans.) This started because we were more and more worried about us and others missing easy to catch warning signs, and wanted to at least avoid that. Some additional reactions: (I realize that many of these won't be reassuring to a skeptical reader without a lot more detail on the exact evals than I can pull together for this comment, but hopefully they at least clarify how we understood what we were and weren't accomplishing. More on this meta-point below.)   We also worry about this, and I don't think we're fully robust to it, but I think we got a pretty substantial positive from the evaluation, in part because we evaluated similar past models and early snapshots of these models: If Opus were deceptive in this way, it seems very likely (though not certain) that we'd see more slip ups at _some_ point in its developmental trajectory. I'm picturing observations like the model defecting on only the most realistic test cases.   We are at least trying to track sandbagging and situational awareness, and the actual algorithm is closer to 'keep doing this until we're reasonably confident that there are no remaining issues severe enough to be worth delaying the launch over.' This means that our decision includes considerations like 'we used a lot of information from eval X in training, so we should focus on eval Y.' We didn't have a completely strict train-test separation, in that there were no evals that we intentionally ignored until the end, but we did enough along these lines that I'm not concerned.   We did some ad-hoc experiments beyond this, though these were both (i) not rigorous enough to be worth sharing as substantial evidence of safety but (ii) encouraging/unconcerning in thei
ryan_greenblatt's Shortform
Ryan Greenblatt2mo*9883

A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.

I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.

(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a LessWrong quick take.)

What is the change and how does it affect security?

9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to emp... (read more)

Reply31
Showing 3 of 11 replies (Click to show all)
2Ryan Greenblatt2mo
How good/bad is it to work on capabilities at Anthropic? That's the most clear cut case, but lots of stuff trades off anthropic power with other stuff.
6Ben Pace2mo
I think the main thing I want to convey is that I think you're saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I'm actually saying that their integrity is no match for the forces that they are being tested with. I don't need to be able to predict a lot of fine details about individuals' decision-making in order to be able to have good estimates of these two quantities, and comparing them is the second-most question relating to whether it's good to work on capabilities at Anthropic. (The first one is a basic ethical question about working on a potentially extinction-causing technology that is not much related to the details of which capabilities company you're working on.)
Ryan Greenblatt2mo20

I think you're saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic

This is related to what I was saying but it wasn't what I was saying. I was saying "tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person's views and values)". I wasn't making a claim about the perceived absolute level of integrity.

Probably not worth hashing this out further, I think I get what you're saying.

Reply1
ryan_greenblatt's Shortform
Ryan Greenblatt2mo*1211

Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:

  1. It's effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra fast CUDA kernels, but from the AI's perspective, this is sort of like it has a weak AI tool attached to it (in a well integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task)
... (read more)
Reply1
Showing 3 of 4 replies (Click to show all)
3Caleb Biddulph2mo
This seems important to think about, I strong upvoted! I'm not sure that link supports your conclusion. First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn't support the idea that it would generalize to non-CUDA tasks. Maybe if you asked the AI "please list heuristics you use to write CUDA kernels," it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it's learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me. Second, the paper only tested GPT-4o and Llama 3, so the paper doesn't provide clear evidence that more capable AIs "shift some towards (2)." The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws - has anybody done this? I wouldn't be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Ryan Greenblatt2mo20

Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it's using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I'm skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.

I think most of the reason why I believe this is improving with capabilities is du... (read more)

Reply
3Ryan Greenblatt2mo
I don't think this explains our disagreements. My low confidence guess is we have reasonably similar views on this. But, I do think it drives parts of some disagreements between me and people who are much more optimisitic than me (e.g. various not-very-concerned AI company employees). I agree the value of (1) vs (2) might also be a crux in some cases.
Wei Dai's Shortform
Wei Dai2mo62

Some potential risks stemming from trying to increase philosophical competence of humans and AIs, or doing metaphilosophy research. (1 and 2 seem almost too obvious to write down, but I think I should probably write them down anyway.)

  1. Philosophical competence is dual use, like much else in AI safety. It may for example allow a misaligned AI to make better decisions (by developing a better decision theory), and thereby take more power in this universe or cause greater harm in the multiverse.
  2. Some researchers/proponents may be overconfident, and cause flawed m
... (read more)
Reply1
1cubefox2mo
I think the most dangerous version of 3 is a sort of Chesterton's fence, where people get rid of seemingly unjustified social norms without realizing that they where socially beneficial. (Decline in high g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief. Do you have an example for 4? It seems rather abstract and contrived. Generally, I think the value of believing true things tends to be almost always positive. Examples to the contrary seem mostly contrived (basilisk-like infohazards) or only occur relatively rarely. (E.g. believing a lie makes you more convincing, as you don't technically have to lie when telling the falsehood, but lying is mostly bad or not very good anyway.) Overall, I think the risks from philosophical progress aren't overly serious while the opportunities are quite large, so the overall EV looks comfortably positive.
Wei Dai2mo*30

I think the most dangerous version of 3 is a sort of Chesterton's fence, where people get rid of seemingly unjustified social norms without realizing that they where socially beneficial. (Decline in high g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief.

I think that makes sense, but sometimes you can't necessarily motivate a usef... (read more)

Reply
3Ryan Greenblatt2mo
Thanks, I updated down a bit on risks from increasing philosophical competence based on this (as all of these seem very weak) (Relevant to some stuff I'm doing as I'm writing about work in this area.) IMO, the biggest risk isn't on your list: increased salience and reasoning about infohazards in general and in particular certain aspects of acausal interactions. Of course, we need to reason about how to handle these risks eventually but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence so we quickly move through the regime where AIs aren't smart enough to be careful, but are smart enough to discover info hazards.
steve2152's Shortform
Steve Byrnes3mo102

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

For some reason it took me until now to notice that:

  • my “outer misalignment” is more-or-less synonymous with “specification ga
... (read more)
Reply
Showing 3 of 8 replies (Click to show all)
4Towards_Keeperhood3mo
Thanks. Yeah I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.) 1. As you described in you intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires. 2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account. 1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view". 3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward. 4. Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants. 1. Another question of course is whether the inner self-reflective optimizers are likely al
Steve Byrnes3mo40

Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.

Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).

Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a differ... (read more)

Reply
2Steve Byrnes3mo
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be. So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen). Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.) Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!) If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me. Inner alignment ca
Load More
control-style