[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA

[-]TurnTrout4y50

Big agreement & signal boost & push for funding on The “Reverse-engineer human social instincts” research program: Yes, please, please figure out how human social instincts are generated! I think this is incredibly important, for reasons which will become obvious due to several posts I'll probably put out this summer.

[-]Not Relevant4y40

Steve, your AI safety musings are my favorite thing tonally on here. Thanks for all the effort you put into this series. I learned a lot.

To just ask the direct question, how do we reverse-engineer human social instincts? Do we:

Need to be neuroscience PhDs?
Need to just think a lot about what base generators of human developmental phenomena are, maybe by staring at a lot of babies?
Guess, and hope we get to build enough AGIs that we notice which ones seem to be coming out normal-acting before one of them kills us?
Something else you've thought of?

I don't have a great sense for the possibility space.

[-]Steven Byrnes4y*20

Thanks!

how do we reverse-engineer human social instincts?

I don't know! Getting a better idea is high on my to-do list. :)

I guess broadly, the four things are (1) “armchair theorizing” (as I was doing in Post #13), (2) reading / evaluating existing theories, (3) reading / evaluating existing experimental data (I expect mainly neuroscience data, but perhaps also psychology etc.), (4) doing new experiments to gather new data.

As an example of (3) & (4), I can imagine something like “the connectomics and microstructure of the something-or-other nucleus of the hypothalamus” providing a helpful hint about what's going on; this information might or might not already be in the literature.

Neuroscience experiments are presumably best done by academic groups. I hope that neuroscience PhDs are not necessary for the other things, because I don’t have one myself :-P

AFAICT, in a neuroscience PhD, you might learn lots of facts about the hypothalamus and brainstem, but those facts almost definitely won’t be incorporated into a theoretical framework involving (A) calculating reward functions for RL (as in Section 15.2.1.2) and (B) the symbol grounding problem (as in Post #13). I really like that theoretical framework, but it seems uncommon in the literature.

FYI, here on LessWrong, “Gunnar_Zarncke” & “jpyykko” have been trying to compile a list of possible instincts, or something like that. Gunnar emailed me, but I haven’t had time to look closely and form an opinion; I just wanted to mention it.

[-]Gunnar_Zarncke4y30

Thank you for mentioning us. In fact, the list of candidate instincts got longer. It isn't in a presentable form yet, but please message me if you want to talk about it.

The list is more theoretical, and I want to prove that this is not just theoretical speculation by operationalizing it. jpyykko is already working on something more on the symbolic level.

Rohin Shah recommended that I find people to work with me on alignment, and I teamed up with two LWers. We just started work on a project to simulate instinct-cued learning in a toy world. I think this project fits research point 15.2.1.2, and I now wonder how to apply for funding; we would probably need it if we want to simulate with somewhat larger NNs.

[-]Linda Linsefors4y20

I'm also interested to see the list of candidate instincts.

Regarding funding, how much money do you need? Just order of magnitude. There are lots of different grants, and where you want to apply depends on the size of your budget.

[-]Zach Stein-Perlman3y30

How optimistic should we be about alignment & safety for brain-like-AGI, relative to prosaic AGI?

[-]Steven Byrnes3y30

That’s a hard question for me to answer, because I have a real vivid inside-view picture of researchers eventually building AGI via the “brain-like” route, and what the resulting AGI would look like, whereas when I try to imagine other R&D routes to AGI, I can’t, except by imagining that future researchers will converge towards the brain-like path. :-P

In particular:

I think a model trained purely on self-supervised learning (not RL) would be safer than brain-like AGI. But I don’t think a model trained purely on self-supervised learning would be “AGI” in the first place. (For various reasons, one of which is the discussion of “RL-on-thoughts” here.) And those two beliefs are very related!! So then I do Murphyjitsu by saying to myself: OK but if I’m wrong, and self-supervised learning did scale to AGI, how did that happen? Then I imagine future models acquiring, umm, “agency”, either by future programmers explicitly incorporating RL etc. deeply into the training / architecture, or else by agency emerging somehow, e.g. because it’s “simulating” agential humans. Either of those brings us much closer to brain-like AGI, and thus I stop feeling like it’s safer than brain-like AGI.
I do think the Risks-From-Learned-Optimization model could in principle create AGI (obviously, that’s how evolution made humans). But I don’t think it would happen, for reasons in Post 8. If it did happen, the only way I can concretely imagine it happening is that the inner model is a brain-like AGI. In that case, I think it would be worse than making brain-like AGI directly, for reasons in §8.3.3.

[-]Raemon4y30

Curated. Thanks to Steve for writing up all these thoughts throughout the sequence.

Normally when we curate a post-from-a-sequence-that-represents-the-sequence, we end up curating the first post, which points roughly to where the sequence is going. I like the fact that this time, there was a post that does a particularly nice job tying-everything-together, while sending people off with a roadmap of further work to do.

I appreciate the honesty about your epistemic state regarding the "Is Steve full of crap?" research program. :P

AI ALIGNMENT FORUM

15.1 Post summary / Table of contents

15.2 Open problems

15.2.1 Open problems that look like normal neuroscience

15.2.1.1 The “Is Steve full of crap when he talks about neuroscience?” research program — ⭐⭐⭐⭐

15.2.1.2 The “Reverse-engineer human social instincts” research program — ⭐⭐⭐⭐⭐

15.2.2 Open problems that look like normal computer science

15.2.2.1 The “Make the biggest and best open-source human-legible world-model / web-of-knowledge that we can” research program — ⭐⭐⭐

15.2.2.2 The “Easy-to-use super-secure sandbox for AGIs” research program — ⭐⭐⭐

15.2.3 Open problems that involve explicitly talking about AGIs

15.2.3.1 The “Edge-cases / conservatism / concept extrapolation” research program — ⭐⭐⭐⭐⭐

15.2.3.2 The “Rigorously prove anything whatsoever about the meaning of things in a learned-from-scratch world-model” research program — ⭐⭐⭐⭐⭐

15.2.3.3 The “Solving the whole problem” research program — ⭐⭐⭐⭐⭐

15.3 How to get involved

15.3.1 Funding situation

15.3.2 Jobs, organizations, training programs, community, etc.

15.3.2.1 …For AGI safety (a.k.a. AI alignment) in general

15.3.2.2 …More specifically related to this series

15.4 Conclusion: 8 takeaway messages

Changelog