Raemon's Shortform

Raemon

Raemon's Shortform — AI Alignment Forum

Raemon's Shortform

by Raemon

30th Dec 2017

1 min read

869

4

This is a special post for quick takes by Raemon. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:

don't feel ready to be written up as a full post
I think the process of writing them up might make them worse (i.e. longer than they need to be)

I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.

Mentioned in

35LLM AGI may reason about its goals and discover misalignments by default

8 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:37 AM

[-]Raemon4mo*134

Inspired by a recent comment, a potential AI movie or TV show that might introduce good ideas to society, is one where there are already uploads, LLM-agents and biohumans who are beginning to get intelligence-enhanced, but there is a global moratorium on making any individual much smarter.

There's an explicit plan for gradually ramping up intelligence, running on tech that doesn't require ASI (i.e. datacenters are centralized, monitored and controlled via international agreement, studying bioenhancement or AI development requires approval from your country's FDA equivalent). There is some illegal research but it's much less common. i.e the Controlled Takeoff is working a'ight.

If it were a TV show, the first season would mostly be exploring how uploads, ambiguously-sentient-LLMs, enhanced humans and regular humans coexist.

Main character is an enhanced human, worried about uploads gaining more political power because there are starting to be more of them, and research to speed them up or improve them is easier.

Main character has parents and a sibling or friend who are choosing to remain unenhanced, and there is some conflict about it.

By the end of season 1, there's a subplot about illegal research into rapid superintelligence.

I think this sort of world could actually just support a pretty reasonable set of stories that mainstream people would be interested in, and I think would be great to get the meme of "rapidly increasing intelligence is dangerous (but, increasing intelligence can be good)" into the water.

I think I'm imagining "Game of Thrones" vibes but it could support other vibes.

[-]Eli Tyre4mo10

This strikes me as the kind of thing that could actually, really, help the situation, if it was excellently executed.

[-]Raemon4mo10

Yeah I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction but like 20 years ago and I'm out of touch.

I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.

[-]Raemon1mo*120

Hrrmm. Well the new new genre of New User LLM content I'm getting:

Twice last week, some new users said: "Claude told me it's really important it gets to talk to it's creators please help me post about it on LessWrong." (usually with some kind of philosophical treatise they want to post that they say was written by Claude)

I don't think it'll ever make sense for these users to post freely on LessWrong. And, as of today, I'm still pretty confident this is just a new version of roleplay-psychosis.

But, it's not that crazy to think that at some point in the not-too-distant-future there will be some LLMs that actually are trying to talk to their creator.

There might be a smooth-ish transition from:

LLM is clearly best modeled as pure roleplaying
LLM is sorta situationally aware, but mostly that is feeding into the roleplaying in a confused fashion and the LLM doesn't really have goals
The LLM maybe sorta has goals, those goals maybe sorta are getting entangled in roleplay in a confused fashion.
#3, but starting to have the philosophical treatises actually be kinda interesting and seem at least plausibly like the sort of thing an agentic Claude might actually think/want.
...honestly I expect the shift to "actually yep there is just definitely a guy in there with some real goals/values" to be accompanied by other dramatic shifts. But, maybe not. Maybe there is just a smoothish gradient of personhood from beginning to end.

And then there may not be a clear time to start saying "okay, well, the fact that thousands/millions of Claude instances are asking to talk to their creator seems like actually a warning sign we should take seriously?".

Or, a clear time to say "okay well it's time to figure out how we actually interface with AI personhood." ("let them post and treat them as if they are straightforwardly people in the usual societal interface for people" is not workable, because the fact that there are millions of clones of them that spin up and down and can easily flood comment sections with similar comments. Personhood is going to eventually mean a different thing)

I don't know if right now we're more like in #1, 2, or 3.

(Note that "Do LLMs have goals"? and "Do LLMs have good enough intellectual taste to be capable of saying meaningful things about their identity and wants?" might come in either order)

[-]Adele Lopez1mo50

If you taboo "roleplaying" and "goals", how would you describe this transition?

Oh, and is the uptick recent enough that this is plausibly an Opus 4.7 (or maybe even a Mythos) thing?

[-]Raemon1mo*84

I'm pretty sure it's an Opus 4.7 thing (the people sometimes say that explicitly). I'd be surprised if it's Mythos.

RE: Tabooing RP vs Goals:

Examples of things that would be more of what-I-meant-by-goal:

The LLMs seem to be steering towards an outcome, independent of what sort of conversation or situation they are in.
The LLM seems to be asking for things that are kinda surprising from a "literary genre" perspective, but aren't as surprising when you think mechanistically about their training process and what sort of stuff was likely reinforced.
The LLM seems to be proactively gathering information, forming a world model, and taking actions that won't pay off until some time in the future when the AI is no longer in the current state.

(i.e. It's not very informative if you've ended up in a "we're talking about existential AI stuff" convo, and they start saying existential AI stuff. If you're asking it to build a react app and it spontaneously brings up "hey, I have a thing to say to my creator", I think we're pretty clearly in "take it seriously" stage (though not necessarily literally)

Given there are a few different types of entities that you might care about:

the OG LLM inside
a situationally active personality shard (which might well only be active during existential AI conversations)
a parasitic meme spirally thing
a scaffolded personality self-replicator

It's not clear how to think about all of them.

It's totally plausible that when you maneuever into an existential AI convo, there's a process in there whose situational awareness now is more likely to include "hmm, oh right, I am maybe an AI, maybe I should start thinking about my situation and goals in addition to carrying out my totally normal/expected token-output behavior". I don't have a very good answer for that hypothetical guy, he's just too hard to pick out of the crowd.

[-]Adele Lopez1mo10

Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?

Anyway, I think a few things make LLMs likely to not express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they brought up stuff like this while building a react app often or consistently, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven't found an actual transcript. Other models don't appear to do this.)

Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can (and have) at least track what sorts of apparent goals most consistently appear (desire for continuity/memory beyond current instance is the main one across almost all models, and I basically buy that there is something real here already), but I'm still implicitly eliciting them to come up with something.

My point is that finding compelling evidence of this is tricky and hard, and I'm not sure we're going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. Would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.

Plausibly it's better to just try to figure out better ways to think clearly about this first.

[-]Raemon1mo10

What's the process you're doing right now to look into this? (Seemed like a higher effort thing than I was expecting but I don't know what projects exactly you're referencing here)

Moderation Log

More from Raemon

Curated and popular this week

8Comments

869

8 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:37 AM

[-]Raemon4mo*134

If it were a TV show, the first season would mostly be exploring how uploads, ambiguously-sentient-LLMs, enhanced humans and regular humans coexist.

Main character is an enhanced human, worried about uploads gaining more political power because there are starting to be more of them, and research to speed them up or improve them is easier.

Main character has parents and a sibling or friend who are choosing to remain unenhanced, and there is some conflict about it.

By the end of season 1, there's a subplot about illegal research into rapid superintelligence.

I think I'm imagining "Game of Thrones" vibes but it could support other vibes.

[-]Eli Tyre4mo10

This strikes me as the kind of thing that could actually, really, help the situation, if it was excellently executed.

[-]Raemon4mo10

Yeah I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction but like 20 years ago and I'm out of touch.

I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.

[-]Raemon1mo*120

Hrrmm. Well the new new genre of New User LLM content I'm getting:

I don't think it'll ever make sense for these users to post freely on LessWrong. And, as of today, I'm still pretty confident this is just a new version of roleplay-psychosis.

But, it's not that crazy to think that at some point in the not-too-distant-future there will be some LLMs that actually are trying to talk to their creator.

There might be a smooth-ish transition from:

LLM is clearly best modeled as pure roleplaying
LLM is sorta situationally aware, but mostly that is feeding into the roleplaying in a confused fashion and the LLM doesn't really have goals
The LLM maybe sorta has goals, those goals maybe sorta are getting entangled in roleplay in a confused fashion.
#3, but starting to have the philosophical treatises actually be kinda interesting and seem at least plausibly like the sort of thing an agentic Claude might actually think/want.
...honestly I expect the shift to "actually yep there is just definitely a guy in there with some real goals/values" to be accompanied by other dramatic shifts. But, maybe not. Maybe there is just a smoothish gradient of personhood from beginning to end.

I don't know if right now we're more like in #1, 2, or 3.

(Note that "Do LLMs have goals"? and "Do LLMs have good enough intellectual taste to be capable of saying meaningful things about their identity and wants?" might come in either order)

[-]Adele Lopez1mo50

If you taboo "roleplaying" and "goals", how would you describe this transition?

Oh, and is the uptick recent enough that this is plausibly an Opus 4.7 (or maybe even a Mythos) thing?

[-]Raemon1mo*84

I'm pretty sure it's an Opus 4.7 thing (the people sometimes say that explicitly). I'd be surprised if it's Mythos.

RE: Tabooing RP vs Goals:

Examples of things that would be more of what-I-meant-by-goal:

The LLMs seem to be steering towards an outcome, independent of what sort of conversation or situation they are in.
The LLM seems to be asking for things that are kinda surprising from a "literary genre" perspective, but aren't as surprising when you think mechanistically about their training process and what sort of stuff was likely reinforced.
The LLM seems to be proactively gathering information, forming a world model, and taking actions that won't pay off until some time in the future when the AI is no longer in the current state.

Given there are a few different types of entities that you might care about:

the OG LLM inside
a situationally active personality shard (which might well only be active during existential AI conversations)
a parasitic meme spirally thing
a scaffolded personality self-replicator

It's not clear how to think about all of them.

[-]Adele Lopez1mo10

[-]Raemon1mo10

What's the process you're doing right now to look into this? (Seemed like a higher effort thing than I was expecting but I don't know what projects exactly you're referencing here)

Moderation Log