I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality. (Longer bio.)
I generally feel more hopeful about a situation when I understand it better.
Non-disclosure agreements I have signed: Around 2017 I signed an NDA when visiting the London DeepMind offices for lunch, one covering sharing any research secrets, that was required of all guests before we were allowed access to the building. I do not believe I have ever signed another NDA.
You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).
I also note that if one agent becomes way, way smarter than the other, this balance may not work out.
Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!
Overall a very interesting idea.
+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into producing this level of understanding of the problems we face, and I'm extremely glad it was written up.
Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4
I don't want to double the comment count I submit to Recent Discussion, so I'll just update this comment with the things I've cut.
12/06/2023 Comment on Originality vs. Correctness
It's fun to take the wins of one culture and apply them to the other; people are very shocked that you found some hidden value to be had (though it often isn't competitive value / legible to the culture). And if you manage to avoid some terrible decision, people speak about how wise you are to have noticed.
(Those are the best cases, often of course people are like "this is odd, I'm going to pretend I didn't see this" and then move on.)
For too long, I have erred on the side of writing too much.
The first reason I write is in order to find out what I think.
This often leaves my writing long and not very defensible.
However, editing the whole thing is so much extra work after I already did all the work figuring out what I think.
Sometimes it goes well if I just scrap the whole thing and concisely write my conclusion.
But typically I don't want to spend the marginal time.
Another reason my writing is too long is because I have extra thoughts I know most people won't find useful.
But I've picked up a heuristic that says it's good to share actual thinking because sometimes some people find it surprisingly useful, so I hit publish anyway.
Nonetheless, I endeavor to write shorter.
So I think I shall experiment with cutting the bits off of comments that represent me thinking aloud, but aren't worth the space in the local conversation.
And I will put them here, as the dregs of my cognition. I shall hopefully gather data over the next month or two and find out whether they are in fact worthwhile.
I just gave this a re-read, I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.
It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9
(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)
They are not being treated worse than foot soldiers, because they do not have an enemy army attempting to murder them during the job. (Unless 'foot soldiers' is itself more commonly used as a metaphor for 'grunt work' and I'm not aware of that.)
I am surprised to see the Open Philanthropy network taking all of the powerful roles here.
The initial Trustees are:
- Jason Matheny: CEO of the RAND Corporation
- Kanika Bahl: CEO & President of Evidence Action
- Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
- Paul Christiano: Founder of the Alignment Research Center
- Zach Robinson: Interim CEO of Effective Ventures US
In case it's not apparent:
And, as a reminder for those who've forgotten, Holden Karnofsky's wife Daniela Amodei is President of Anthropic (and so Holden is the brother-in-law of Dario Amodei).
I think the argument here basically implies that language models will not produce any novel, useful concepts in any existing industries or research fields that get substantial adoption (e.g. >10% of ppl use it, or a widely cited paper) in those industries, in the next 3 years, and that if they did, then the end would be nigh (or much nigher).
To be clear, you might get new concepts from language models about language if you nail some Chris Olah style transparency work, but the language model itself will not output ones that aren't about language in the text.
I don’t think it applies to safety researchers at AI Labs though; I am shocked how much those folks can make.