All of Jonas Hallgren's Comments + Replies

I love your stuff and I'm very excited to see where you go next. 

I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame. 

What are your takes on Michael Levin's work on agency and "morphogenesis" in relation to your neuroscience ideas? What do you think about claims of hierarchical extension of these models? How does this affect multipolar threat models? What are the fundamental processes that we should care about? When should we expand these concepts cognitively, and when should we constrain them?

2Steve Byrnes
(Giving some answers without justification; feel free to follow up.)

I haven’t found that work to be relevant or useful for what I’m doing. Biology is full of cool things. It’s fun. I’ve been watching zoology videos in my free time. Can’t get enough. Not too work-relevant though, from my perspective.

I don’t think I’ve heard such claims, or if I did, I probably would have ignored it as probably-irrelevant-to-me.

I don’t have any grand conceptual framework for that, and tend to rely on widespread common-sense concepts like “race-to-the-bottom” and “incentives” and “competition” and “employees who are or aren’t mission-aligned” and “miscommunication” and “social norms” and “offense-defense balance” and “bureaucracy” and “selection effects” and “coordination problems” and “externalities” and “stag hunt” and “hard power” and “parallelization of effort” and on and on. I think that this is a good general approach; I think that grand conceptual frameworks are not what I or anyone needs; and instead we just need to keep clarifying and growing and applying this collection of ideas and considerations and frames. (…But perhaps I just don’t know what I’m missing.)

Very interesting. I liked the long list of examples; it helped me get my head around it more.

So, I've been thinking a bit about similar topics, but in relation to a long reflection on value lock-in.

My basic thesis was that reversibility is what humanity should optimise for in general, since we want to be able to reach as large a part of the "moral search space" as possible.

The concept of corrigibility you seem to be pointing towards here seems very related to notions of reversibility. You don't want to take actions that cannot later be reversed.

1Max Harms
1) I'm pretty bearish on standard value uncertainty for standard MIRI reasons. I think a correct formulation of corrigibility will say that even if you (the agent) know what the principal wants, deep in their heart, you should not optimize for it unless they direct you to do so. I explore this formally in 3b, when I talk about the distinction between sampling counterfactual values from the actual belief state over values ("P") vs a simplicity-weighted distribution ("Q"). I do think that value "uncertainty" is important in the sense that it's important for the agent to not be anchoring too heavily on any particular object-level optimization target. (I could write more words, but I suspect reading the next posts in my sequence would be a good first step if you want more of my perspective.)

2) I think reversibility is probably best seen as an emergent desideratum from corrigibility rather than vice versa. There are plenty of instances where the corrigible thing to do is to take an irreversible action, as can be seen in many of the stories above.

You're welcome! I'm glad you're enjoying it. ^_^
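[To make the P vs Q distinction above concrete, here is a minimal toy sketch in Python of what "sampling counterfactual values from the actual belief state" versus "sampling from a simplicity-weighted distribution" could look like. The candidate value functions, complexity scores, and posterior credences are invented purely for illustration; this is not the actual formalism from post 3b.]

```python
import random

# Toy candidate value functions the principal might hold. Each is tagged with a
# rough "description length" (complexity) and the agent's posterior credence that
# this is what the principal actually wants. All numbers are made up for illustration.
CANDIDATES = [
    {"name": "maximize_paperclips", "complexity": 3, "posterior": 0.05},
    {"name": "maximize_human_flourishing", "complexity": 9, "posterior": 0.70},
    {"name": "preserve_option_value", "complexity": 6, "posterior": 0.25},
]


def sample_from_P(rng: random.Random) -> str:
    """Sample a counterfactual value from the agent's actual belief state over values (P)."""
    weights = [c["posterior"] for c in CANDIDATES]
    return rng.choices(CANDIDATES, weights=weights, k=1)[0]["name"]


def sample_from_Q(rng: random.Random) -> str:
    """Sample from a simplicity-weighted distribution (Q), here weighting by 2^-complexity."""
    weights = [2 ** -c["complexity"] for c in CANDIDATES]
    return rng.choices(CANDIDATES, weights=weights, k=1)[0]["name"]


if __name__ == "__main__":
    rng = random.Random(0)
    # P concentrates on what the agent believes the principal wants;
    # Q favors simpler candidate values regardless of that belief.
    print("P samples:", [sample_from_P(rng) for _ in range(5)])
    print("Q samples:", [sample_from_Q(rng) for _ in range(5)])
```

[The point of the contrast: an agent drawing counterfactual values from P is effectively optimizing for its best guess of the principal's values, whereas drawing from Q keeps it from anchoring on that guess, which is closer to the behavior the reply is gesturing at.]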