Ben Pace

I'm an admin of LessWrong. Here are a few things about me.

  • I generally feel more hopeful about a situation when I understand it better.
  • I have signed no contracts nor made any agreements whose existence I cannot mention.
  • I believe it is good to take responsibility for accurately and honestly informing people of what you believe in all conversations; and also good to cultivate an active recklessness about the social consequences of doing so.

(Longer bio.)

Sequences

AI Alignment Writing Day 2019
AI Alignment Writing Day 2018

Comments

Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.

This feels amusingly like tricking a child. "Remember kiddo, you can reason out loud about where you're going to hide and I won't hear it. Now let's play hide and seek!"

I don't know how to quickly convey why I find this point so helpful, but it's a useful pointer to a key problem, the post is quite short, and I hope others vote it up as well. +4.

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

Curated. I thought this was a valuable list of areas, most of which I haven't thought that much about, and I've certainly never seen them brought together in one place before, which I think itself is a handy pointer to the sort of work that needs doing.

You explicitly assume this away, but I believe that under this setup the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).

I also note that if one agent becomes way way smarter than the other, that this balance may not work out.

Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!

Overall a very interesting idea.

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into developing the understanding of the problem that this piece reflects, and I'm extremely glad it was written up.

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4.
