AI ALIGNMENT FORUM

All of Charbel-Raphaël's Comments + Replies

AI-enabled coups: a small group could use AI to seize power

Well done - this is super important. I think this angle might also be quite easily pitchable to governments.

Short Timelines Don't Devalue Long Horizon Research

I disagree: short timelines do devalue long-horizon research at least a bit, and I think that in practice this reduces its usefulness by probably a factor of 10.

Yes, having some thought put into a problem is likely better than zero thought. Giving a future AI researcher a half-finished paper on decision theory is probably better than giving it nothing. The question is how much better, and at what cost?

Opportunity Cost is Paramount: If timelines are actually short (months/few years), then every hour spent on deep theory with no immediate application is an hour... (read more)

What convincing warning shot could help prevent extinction from AI?

Charbel-Raphael Segerie2mo1-7

Coming back to this comment: we got a few clear examples, and nobody seems to care:

"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.

This time we caught it. Next time, maybe we won't be able to catch it.

ryan_greenblatt's Shortform

Charbel-Raphael Segerie3mo10

Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.

btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically "Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?" makes me (I think) significantly more strategic and productive. This is similar to "I hired 5 people to sit behind me and make me productive for a month". But this is maybe off topic.

4Ryan Greenblatt3mo

Yes, but I don't see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low hanging things) if they don't get around to getting the boost from hiring humans.

ryan_greenblatt's Shortform

Charbel-Raphael Segerie3mo10

I was saying 2x because I've memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, this was from one year ago, so perhaps the two factors cancel each other out?

[Image: summary of the experiment process and results]

5Ryan Greenblatt3mo

This case seems extremely cherry picked for cases where uplift is especially high. (Note that this is in copilot's interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding. I think you have to consider the full diverse range of tasks to get a reasonable sense or at least consider harder tasks. Like RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real world speed up.

ryan_greenblatt's Shortform

Charbel-Raphael Segerie3mo20

How much faster do you think we are already? I would say 2x.

6Ryan Greenblatt3mo

I'd guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g. more like 1.5x or 2x), or will be after some adaptation due to OpenAI having substantially better internal agents than what they've released. (I think this because of updates from o3 and general vibes.)

Against Almost Every Theory of Impact of Interpretability

Charbel-Raphael Segerie4mo*378Review for 2023 Review

Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a ... (read more)

Davidad's Bold Plan for Alignment: An In-Depth Explanation

Charbel-Raphael Segerie4mo35Review for 2023 Review

Ok, time to review this post and assess the overall status of the project.

Review of the post

What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post: it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand... (read more)

Consciousness as a conflationary alliance term for intrinsically valued internal experiences

Charbel-Raphael Segerie4mo99Review for 2023 Review

I often find myself revisiting this post—it has profoundly shaped my philosophical understanding of numerous concepts. I think the notion of conflationary alliances introduced here is crucial for identifying and disentangling/dissolving many ambiguous terms and resolving philosophical confusion. I think this applies not only to consciousness but also to situational awareness, pain, interpretability, safety, alignment, and intelligence, to name a few.

I referenced this blog post in my own post, My Intellectual Journey to Dis-solve the Hard Problem of Conscio... (read more)

Are we dropping the ball on Recommendation AIs?

Charbel-Raphael Segerie6mo12

I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.

Covert Malicious Finetuning

Charbel-Raphael Segerie7mo10

Thank you for this post and study. It's indeed very interesting.

I have two questions:

In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.

If it can be related to steganography, couldn’t we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting, in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Charbel-Raphael Segerie7mo10

Interesting! Is it fair to say that this is another attempt at solving a subproblem of misgeneralization?

Here is one suggestion for clustering your SAE features more automatically between gender and profession.

In the past, Stuart Armstrong, with alignedAI, also attempted work aimed at identifying different features within a neural network in such a way that the neural network would generalize better. Here is a summary of a related paper, the DivDis paper, which is very similar to what alignedAI did:

https://github.com/EffiSciencesResearch/challenge_data_ens_2023/blob/main/assets/DivDis.png?raw=true

The DivDis paper presents a simple ... (read more)

Finding the Wisdom to Build Safe AI

Charbel-Raphael Segerie10mo22

It might not be that impossible to use LLMs to automatically train wisdom:

Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Charbel-Raphael Segerie10mo2123

Strongly agree.

Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such ... (read more)

kave10mo917

It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)

One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.

We might be dropping the ball on Autonomous Replication and Adaptation.

Charbel-Raphael Segerie11mo10

doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".

I agree that's a bit too much, but it seems to me that we're not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates.

But I still think no return to loss of control because it might be ... (read more)

We might be dropping the ball on Autonomous Replication and Adaptation.

Charbel-Raphael Segerie11mo*3-2

Thanks for this comment, but I think this might be a bit overconfident.

constantly fighting off the mitigations that humans are using to try to detect them and shut them down.

Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:

1) It’s not even clear people are going to try to react in the first place. As I say, most AI development is positive. If you implement regulations to fight bad ARA, you are also hindering the whole ecosystem. It’s not clear to me that we are going to do something about open

... (read more)

3Richard Ngo11mo

I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.

A world which can pause AI development is one which can also easily throttle ARA AIs.

This seems like a weak central point. "Pretty annoying" and some people dying is just incredibly small compared with the benefits of AI. And "it might be a problem in an indefinite number of years" doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".

An extended analogy: suppose the US and China both think it might be possible to invent a new weapon far more destructive than nuclear weapons, and they're both worried that the other side will invent it first. Worrying about ARAs feels like worrying about North Korea's weapons program. It could be a problem in some possible worlds, but it is always going to be much smaller, it will increasingly be left behind as the others progress, and if there's enough political will to solve the main problem (US and China racing) then you can also easily solve the side problem (e.g. by China putting pressure on North Korea to stop).

Link here, and there are other comments in the same thread. Was on my laptop, which has twitter blocked, so couldn't link it myself before.

We might be dropping the ball on Autonomous Replication and Adaptation.

Charbel-Raphael Segerie11mo10

Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think this shouldn't change the basic picture depicted in the OP too much.

Here are some important questions that were deliberately omitted from the QA, to avoid including things that still fluctuate too much in my head:

Would we react before the point of no return?
Where should we place the red line? Should this red line apply to labs?
Is this going to be exponential?

Charbel-Raphael Segerie2y11

This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence towards "current architectures could be easily inner misaligned".

2Adam Jermyn2y

This indeed sure seems like there's an inner optimizer in there somewhere...

1Logan Riggs Smith2y

Notably the model was trained across multiple episodes to pick up on RL improvement. Though the usual inner misalignment means that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Charbel-Raphael Segerie3y02

Interesting. This dataset could be a good idea for a hackathon.

Is there an online list of datasets of this type that are of interest for alignment? I am trying to organize a hackathon and am looking for ideas.

2AdamGleave3y

A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background. The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
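The filtering step mentioned above can be sketched in a few lines. This is only a minimal illustration on toy records, not the actual Waterbirds loader: the `Example` class, its field names, and the toy data are all hypothetical and would need to be mapped onto whatever metadata format the real dataset uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Hypothetical record layout; the real dataset's metadata fields may differ.
    image_id: str
    bird_type: str   # "landbird" or "waterbird"
    background: str  # "land" or "water"


# Background that the spurious correlation associates with each bird type.
SPURIOUS_MATCH = {"landbird": "land", "waterbird": "water"}


def spurious_correlation(examples):
    """Fraction of examples whose background agrees with the bird type."""
    if not examples:
        return 0.0
    agree = sum(SPURIOUS_MATCH[e.bird_type] == e.background for e in examples)
    return agree / len(examples)


def filter_to_perfect_correlation(examples):
    """Keep only examples where the background matches the bird type."""
    return [e for e in examples if SPURIOUS_MATCH[e.bird_type] == e.background]


# Toy training split (made-up data): 3 of 4 examples follow the correlation.
toy_train = [
    Example("img0", "landbird", "land"),
    Example("img1", "waterbird", "water"),
    Example("img2", "landbird", "water"),   # minority-group example
    Example("img3", "waterbird", "water"),
]

filtered_train = filter_to_perfect_correlation(toy_train)
print(spurious_correlation(toy_train))       # 0.75
print(spurious_correlation(filtered_train))  # 1.0
```

The filter should only be applied to the training split; the test split would keep its minority-group examples so that it can still reveal whether the classifier relies on the background.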

High-stakes alignment via adversarial training [Redwood Research report]

Charbel-Raphael Segerie3y40

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."

Interesting. Changing the in-distribution (3 OOM) does not influence the out-distribution much (×2).

3Paul Christiano3y

I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."

High-stakes alignment via adversarial training [Redwood Research report]

Charbel-Raphael Segerie3y00

I do not understand the "we only doubled the amount of effort necessary to generate an adversarial counterexample" claim. Aren't we talking about 3 OOM?

4Charbel-Raphael Segerie3y

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example." Interesting. Changing the in-distribution (3 OOM) does not influence the out-distribution much (×2).