In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.
Taking the example of the evaluator-child who tries to win his mom's approval by staying close to the gym teacher: how would grader exploitation differ from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?
I got the book (thanks to Conjecture) after doing the Intro to ML Safety Course, where the book was recommended. I then browsed through it and thought of writing a review - but I found this post instead, which is a much better review than I would have written, so thanks a lot for this!
Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.
Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic...
Thanks for the comment!
You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.
I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude it, and I'm not sure what that reason was.
The second group doesn't necessarily care about why each research direction relates to reducing X-risk.
In this case I feel...
I feel like I broadly agree with most of the points you make, but I also feel that accident vs. misuse is still a useful distinction to have.
For example, disasters caused by guns could be seen as:
Nevertheless, all of the above ...