User Comment Replies — AI Alignment Forum

Why imperfect adversarial robustness doesn't doom AI control

I always assume when thinking about future dangerous models that they have access to some sort of black-box memory. Do we think there is a non-negligible chance that an AI that doesn't have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rogue deployment? (Not a rhetorical question, there might be a way I haven't thought of.)

So I think that assuming the AI is stateless when thinking about future risk is not a good idea, as I think the vast majority of the risk comes from AIs for which this assumption is not t...

4 Buck Shlegeris 4mo

Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don't think we're either fine if the model doesn't have it or doomed if it does.

o1 is a bad idea

David Matolcsi 4mo 6 -2

GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips.

I'm somewhat surprised by this paragraph. I thought the MIRI position was that they did not in fact predict AIs behaving like this, and the behavior of GPT4 was not an update at all for them. See this comment by Eliezer. I mostly bought that MIRI in fact never worried about AIs goi...

3 Abram Demski 4mo

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have. For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)

One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does. This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.

Getting more specific to what I wrote in the post: My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are"). The reason I don't think this contradicts Eliezer's bolded claim ("Getting a shape into the AI's preferences is differ


All of David Matolcsi's Comments + Replies