I know of no rigorous proposals. The general challenge such proposals face is this: if you are relying on fooling your AGI about something to keep control over it, and it's constantly and rapidly getting smarter and wiser, then your scheme is set up to fail suddenly and silently (when it stops being fooled), which is a recipe for disaster.

Another type of proposal relies on making it actually true that the AGI might be in a simulation--or, to put it more precisely perhaps, making it actually the case that future aligned superintelligences will make simulations so accurate that even a baby superintelligence can't tell the difference. However, two can play at that game; more generally, this just becomes a special case of acausal trade stuff, which will be wild and confusing and very important once AIs are smart enough to take it seriously.

David Scott Krueger (formerly: capybaralet) 1y11

Not necessarily fooling it, just keeping it ignorant. I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

2 Daniel Kokotajlo 1y

Are you describing something that would fit within my 'Another type of proposal...' category?

3 David Scott Krueger (formerly: capybaralet) 1y

No, I was only responding to the first part.

Ramana Kumar

Sep 26, 2024

10

Vaguely related perhaps is the work on Decoupled Approval: https://arxiv.org/abs/2011.08827

AI ALIGNMENT FORUM
[ Question ]

Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

Sep 04, 2024
