Stuart_Armstrong — AI Alignment Forum

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what 'insecure' does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of 'insecure'.

Using Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong7mo20

The mundane prompts were blocked 0% of the time. But you're right - we need something in between 'mundane and unrelated to bio research' and 'useful for bioweapons research'.

But I'm not sure what - here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.

Acausal trade: Introduction

Stuart_Armstrong2y30

Thanks!

SolidGoldMagikarp (plus, prompt generation)

Stuart_Armstrong3y20

As we discussed, I feel that the tokens were added for some reason but then not trained on; hence why they are close to the origin, and why the algorithm goes wrong on them, because it just isn't trained on them at all.

Good work on this post.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Stuart_Armstrong3y43

I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)

Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.

The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"

Testing The Natural Abstraction Hypothesis: Project Intro

Stuart_Armstrong3y20

Here's the review, though it's not very detailed (the post explains why):

https://www.lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update?commentId=spMRg2NhPogHLgPa8

Testing The Natural Abstraction Hypothesis: Project Update

Stuart_Armstrong3y50Review for 2021 Review

A good review of work done, which shows that the writer is following their research plan and following up their pledge to keep the community informed.

The contents, however, are less relevant, and I expect that they will change as the project goes on. I.e. I think it is a great positive that this post exists, but it may not be worth reading for most people, unless they are specifically interested in research in this area. They should wait for the final report, be it positive or negative.

Testing The Natural Abstraction Hypothesis: Project Intro

Stuart_Armstrong3y30

I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

Testing The Natural Abstraction Hypothesis: Project Intro

Stuart_Armstrong3y40Review for 2021 Review

A decent introduction to the natural abstraction hypothesis, and how testing it might be attempted. A very worthy project, but it isn't that easy to follow for beginners, nor does it provide a good understanding of how the testing might work in detail. What might consist a success, what might consist a failure of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.

Large language models can provide "normative assumptions" for learning human preferences

Stuart_Armstrong3y20

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using CHATGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments