LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
Curated. It seems like in the current regime of frontier models, it's worth making more explicit models of what architectures we can expect, if dangerous capabilities develop in the near future.
This post feels like it spells out a model that matches my general understanding of the state-of-the-art, but draws several inferences about it I hadn't previously thought about.
I'd be interested in other people who've thought about current generation deployment setups chiming in with their takes, if they disagree or think there's important elements missing.
Some additional thoughts:
A thing that is different from a forecasting bot is an "operationalization" bot, which scans threads with disagreements for plausible operationalizations of the disagreement.
I think good discourse is partly bottlenecked on "it's pretty expensive and annoying to operationalize many disagreements, in ways that actually help get to the heart of the matter."
I'd be pretty interested in experimenting with operationalization-bot and forecasting-bot combos (once they get sufficiently good to not be annoying), that scan disagreements, suggest little cloud-bubbles of "is this a cruxy operationalization you could bet on?", which maybe disappear if nobody clicks "yep".
Then, people are encouraged to put their own probabilities on it, and afterwards (to avoid anchoring), a forecasting bot gives its own guess.
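The flow above can be sketched as a small state machine. This is a minimal illustrative sketch, not a real implementation; all names here (`Operationalization`, `confirm`, etc.) are hypothetical, and the key design constraints from the comment are encoded as checks: a suggested operationalization disappears unless someone confirms it's cruxy, and the bot's forecast is only revealed after at least one human has committed a probability, to avoid anchoring.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Operationalization:
    """One suggested bet-able operationalization of a disagreement (hypothetical API)."""
    question: str                          # e.g. "Will metric M exceed threshold T by 2026-01-01?"
    confirmed_cruxy: bool = False          # did any participant click "yep, this is cruxy"?
    human_forecasts: Dict[str, float] = field(default_factory=dict)
    bot_forecast: Optional[float] = None

    def confirm(self) -> None:
        # A participant marks the suggestion as a genuine crux; otherwise it "disappears".
        self.confirmed_cruxy = True

    def add_human_forecast(self, user: str, p: float) -> None:
        if not self.confirmed_cruxy:
            raise ValueError("suggestion disappeared: nobody confirmed it was cruxy")
        self.human_forecasts[user] = p

    def reveal_bot_forecast(self, p: float) -> None:
        # Only reveal the bot's guess after humans have committed, to avoid anchoring them.
        if not self.human_forecasts:
            raise ValueError("collect human forecasts before revealing the bot's")
        self.bot_forecast = p


# Example walkthrough of the intended ordering:
op = Operationalization("Will metric M exceed threshold T by end of year?")
op.confirm()
op.add_human_forecast("alice", 0.7)
op.add_human_forecast("bob", 0.4)
op.reveal_bot_forecast(0.55)
```

The point of the ordering constraints is just to make the anti-anchoring requirement explicit; a real integration (e.g. with Fatebook or Manifold) would presumably enforce the same sequencing server-side.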
How much does it cost to run one instance of this right now?
One obvious idea is to build integrations for Manifold, Fatebook, Metaculus etc where you automatically have this bot give an answer. I'm curious how much that costs when you're just doing it all over the place.
(I don't know that I believe the current generation of bots is as good as implied here, for reasons Habryka/Gwern have given, but, regardless, I expect at least pretty decent bots of this form to be available within a year if they aren't already, and for it to be useful for various sites to have SOTA forecasting bots plugged into them if they're at least nontrivially better than chance)
Curated. "What would actually be persuasive that scheming is a problem?" is one of the important practical questions for AI governance. I appreciate Buck noting places he changed his mind somewhat during the SB 1047 discourse.
I appreciate that the post has detailed worked examples.
I haven't read the Shard Theory work in comprehensive detail. But, fwiw I've read at least a fair amount of your arguments here and not seen anything that bridged the gap between "motivations are made of shards that are contextually activated" and "we don't need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization."
I've heard you make this basic argument several times, and my sense is you're pretty frustrated that people still don't seem to have "heard" it properly, or something. I currently feel like I have heard it, and don’t find it compelling.
I did feel compelled by your argument that we should look to humans as an example of how "human values" got aligned. And it seems at least plausible that we are approaching a regime where the concrete nitty-gritty of prosaic ML can inform our overall alignment models in a way that makes the thought experiments of 2010 outdated.
But, like, a) I don't actually think most humans are automatically aligned if naively scaled up (though it does seem safer than naive AI scaling), b) while human-value-formation might be simpler than the Yudkowskian model predicts, it still doesn't seem like the gist of "look to humans" gets us to a plan that is simple in absolute terms, and c) it seems like there are still concrete reasons to expect training superhuman models to be meaningfully different from training current LLMs, which aren't at a stage where I'd expect them to exhibit any of the properties I'd be worried about.
(Also, in your shard theory post, you skip over the example of 'embarrassment' because you can't explain it yet, and switch to sugar, and I'm like 'but, the embarrassment one was much more cruxy and important!')
I don't expect to get to agreement in the comments here today, but, it feels like the current way you're arguing this point just isn't landing or having the effect you want and... I dunno what would resolve things for you or anyone else but I think it'd be better if you tried some different things for arguing about this point.
If you feel like you've explained the details of those things better in the second half of one of your posts, I will try giving it a more thorough read. (It's been a while since I read your Diamond Maximizer post, which I don't remember in detail but don't remember finding compelling at the time)
(If we had a robust check for misalignment, we could iterate or train against it.)
This seems technically true, but I want to flag the argument "it seems really hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)".
From the perspective of risk management, compared to normal computer security problems, AI deployments involve vastly fewer moving parts that you have vastly less understanding of.
I don't get why this is "vastly fewer moving parts you have vastly less understanding of" as opposed to "vastly more (or, about the same?) number of moving parts that you have vastly less understanding of."
I'd naively model each parameter in a model as a "part". I agree that, unlike most complex engineering or computer security, we don't understand what each part does. But it seems weird to abstract that as "fewer parts."
(I asked about this at a recent talk you gave that included this bit. I don't remember your response at the time very well)
Curated.
The overall point here seems true and important to me.
I think I either disagree with, or am agnostic about, some of the specific examples given in the Myth vs Reality section. I don't think they're load-bearing for the overall point. I may try to write those up in more detail later.
Ah yeah sorry, I didn't mean to convey that. For now I've (clumsily) edited the original comment to be more clear.
I think two major cruxes for me here are:
My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone.
I think people should be hesitant to work at any scaling lab, but I think it might be possible to make Anthropic "the one actually good scaling lab", and I don't currently expect that to be tractable at Deepmind. I think "having at least one" seems good for the world (although it's a bit hard for me to articulate why at the moment).
I am interested in hearing details about Deepmind that anyone thinks should change my mind about this.
This viewpoint is based on having spent at least 10s of hours trying to learn about and influence both orgs' cultures, at various times.
In both cases, I don't get the sense that people at the orgs really have a visceral sense that "decisionmaking processes can be fake." I think those processes will be fake by default, and the org is better modeled as following general incentives. DeepMind has too many people and moving parts, at a low enough density of aligned people, that it doesn't seem possible to fix. For me to change my mind about this, I would need someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume @Rohin Shah and @Neel Nanda can't really say anything publicly that's capable of changing my mind, for various confidentiality and political reasons, but, like, that's my crux.
(Convincing me in more general terms "Ray, you're too pessimistic about org culture" would hypothetically work, but you have a lot of work to do given how thoroughly those pessimistic predictions came true about OpenAI)
I think Anthropic also has this problem, but it has enough almost-aligned leadership and actually-pretty-aligned people that it feels at least possible to me for them to fix it. The main things that would persuade me that they are over the critical threshold are if they publicly spent social capital on clearly spelling out why the x-risk problem is hard, and made explicit plans to not merely pause for a bit when they hit an RSP threshold, but (at least in some circumstances) advocate strongly for a global governmental shutdown for something like 20+ years.