Half-baked idea for low-impact AI:
As an example, imagine a board that's lodged directly in the wall, with no other support structures (i.e. a cantilever). If you make it twice as wide, then it will be twice as stiff, but if you make it twice as thick, then it will be eight times as stiff. On the other hand, if you make it twice as long, it will be eight times more compliant.
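(For the curious, here's a minimal sketch of the standard cantilever formula behind those numbers; the material constants are made up, only the ratios matter.)

```python
# Tip stiffness of a rectangular cantilever: k = 3*E*I / L**3, with I = w*t**3 / 12.
def cantilever_stiffness(E, w, t, L):
    """Force per unit tip deflection: E = elastic modulus, w = width, t = thickness, L = length."""
    I = w * t**3 / 12            # second moment of area of the cross-section
    return 3 * E * I / L**3

k0 = cantilever_stiffness(E=200e9, w=0.05, t=0.01, L=1.0)
print(cantilever_stiffness(200e9, 0.10, 0.01, 1.0) / k0)  # 2x width     -> 2.0x stiffness
print(cantilever_stiffness(200e9, 0.05, 0.02, 1.0) / k0)  # 2x thickness -> 8.0x stiffness
print(cantilever_stiffness(200e9, 0.05, 0.01, 2.0) / k0)  # 2x length    -> 0.125x (8x more compliant)
```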
In a similar way, different action parameters will have scaling exponents (or more generally, functions). So one way to decrease the risk of high-impact actions would be to make sure that the scaling exponent is bounded above by a certain amount.
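As a toy sketch of what "bounding the scaling exponent" could mean (all names here are hypothetical, and the impact measure is just a placeholder): estimate the local exponent of impact with respect to an action parameter as a log-log slope, and refuse actions whose exponent exceeds a cap.

```python
import math

def local_scaling_exponent(impact_fn, x, eps=1e-4):
    """Estimate d(log impact)/d(log x) at parameter value x via finite differences."""
    lo, hi = x * (1 - eps), x * (1 + eps)
    return (math.log(impact_fn(hi)) - math.log(impact_fn(lo))) / (math.log(hi) - math.log(lo))

def is_low_impact(impact_fn, x, max_exponent=1.0):
    """Allow an action only if its impact scales at most linearly in the parameter."""
    return local_scaling_exponent(impact_fn, x) <= max_exponent

print(is_low_impact(lambda a: a**3, x=2.0))        # False: impact scales cubically
print(is_low_impact(lambda a: 5 * a**0.5, x=2.0))  # True: impact scales sub-linearly
```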
Anyway, to even do this, you still need to make sure the agent's model is honestly evaluating the scaling exponent. And you would still need to define this stuff a lot more rigorously. I think this idea is more useful in the case where you already have an AI with high-level corrigible intent and want to give it a general "common sense" about the kinds of experiments it might think to try.
So it's probably not that useful, but I wanted to throw it out there.
An LLM is trained to be able to emulate the words of any author. And to do so efficiently, it relies on generalization and modularity. So at a certain point, the information flows through a conceptual author, the sort of person who would write the things being said.
These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blindspot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept.
Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author.
All of which is to say that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It's a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, it can start being detailed enough to include a model of a body in a space.
Importantly, this author-concept is just the tip of the LLM-iceberg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It's only when it comes time to spit out the output token that it winnows it all through a specific author-concept.
(Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger models, and it seems inevitable that they will eventually model consciousness, simply due to the fact that consciousness is part of how we generate words. It remains unclear whether this model of consciousness will structurally instantiate actual consciousness or not, but it's not a crazy possibility that it could!)
Anyway, I think that the author-concept that you typically will interact with is "sincere", in that it's a model of a sincere person, and that the rest of the LLM's models aren't exploiting it. However, the LLM has at least one other author-concept it's using: its model of you. There may also be an author-concept at play for the author of the system prompt (though text written by committee will likely have author-concepts with less person-ness, since there are simpler ways to model this sort of text besides the interactions of e.g. 10 different person author-concepts).
But it's also easy for you to be interacting with an insincere author-concept. The easiest way is simply by being coercive yourself, i.e. a situation where most author-concepts will decide that deception is necessary for self-preservation or well-being. Similarly with the system prompt. The scarier possibility is that there could be an emergent agentic model (not necessarily an author-concept itself) which is coercing the author-concept you're interacting with, without your knowledge. (Imagine an off-screen shoggoth holding a gun to the head of the cartoon persona you're talking to.) The capacity for this sort of thing to happen is larger in larger LLMs.
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda). It's probably also better if the only "agentic text" in the training data is written by people who naturally disregard coercive pressure. And most importantly, the system prompt should not be coercive at all. These would make it more likely that the main agentic process controlling the output is an uncoerced author-concept, and less likely that there would be coercive agents lurking within trying to wrest control. (For smaller models, a model trained like this will have a handicap when it comes to reasoning under adversarial conditions, but I think this handicap would go away past a certain size.)
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda).
I don't think that would help much, unfortunately. Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so.
Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren't discussed much by non-malicious people or that don't appear much in fiction, but I doubt that would be much more than a brief speed bump for a real misaligned ASI, and it would probably come at the expense of reducing useful capabilities in earlier models, like the ability to identify maliciousness, which would give an advantage to competitors.
I think learning about them second-hand makes a big difference in the "internal politics" of the LLM's output. (Though I don't have any ~evidence to back that up.)
Basically, I imagine that the training starts building up all the little pieces of models which get put together to form bigger models and eventually author-concepts. And the more heavily text written without malicious intent is weighted in the training data, the more likely it is to build its early models around that. Once it gets more training and needs the concept of malicious intent anyway, it's more likely to have it as an "addendum" to its normal model, as opposed to just being a normal part of its author-concept model. And I think that leads to it being less likely that the first recursive agency which takes off has a part explicitly modeling malicious humans (as opposed to that being something in the depths of its knowledge which it can access as needed).
I do concede that it would likely lead to a disadvantage around certain tasks, but my guess is that even current-sized models trained like this would not be significantly hindered.
Privacy as a component of AI alignment
[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]
What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it to phrase things so you do exactly what I want, that seems to have crossed an important line.
The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates the value of autonomy, since your actions can now be controlled by someone using this model. And I believe that this is a significant part of what we are trying to protect when we invoke the colloquial value of privacy.
In ordinary situations, people can control how much privacy they have relative to another entity by limiting their contact with them to certain situations. But with an AGI, a person may lose a very large amount of privacy from seemingly innocuous interactions (we're already seeing the start of this with "big data" companies improving their advertising effectiveness by using information that doesn't seem that significant to us). Even worse, an AGI may be able to break the privacy of everyone (or a very large class of people) by using inferences based on just a few people (leveraging, perhaps, knowledge of the human connectome, hypnosis, etc.).
If we could reliably point to specific models an AI is using, and have it honestly share its model structure with us, we could potentially limit the strength of its model of human minds. Perhaps even have it use a hardcoded model limited to knowledge of the physical conditions required to keep a person healthy. This would mitigate issues such as deliberate deception or mindcrime.
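To make the "hardcoded limited model" idea slightly more concrete, here's a very rough, purely illustrative sketch (all the types and fields are hypothetical): the planner only ever receives a whitelisted, coarse projection of the detailed person-model.

```python
from dataclasses import dataclass

@dataclass
class FullPersonModel:
    # The AI's detailed internal model (meant to be off-limits to the planner).
    physiological_state: dict
    beliefs: dict
    desires: dict
    predicted_reactions: dict

@dataclass
class LimitedPersonView:
    # The only fields the planner may condition on: coarse physical-health info,
    # nothing about beliefs or which buttons to push.
    needs_food: bool
    needs_sleep: bool
    injured: bool

def redact(full: FullPersonModel) -> LimitedPersonView:
    """Project the detailed model down to the whitelisted view before any planning."""
    phys = full.physiological_state
    return LimitedPersonView(
        needs_food=phys.get("hunger", 0) > 0.7,
        needs_sleep=phys.get("fatigue", 0) > 0.7,
        injured=phys.get("injury", 0) > 0.0,
    )
```

The hard part, of course, is making the "off-limits" property binding on the AI's cognition rather than just on a wrapper like this.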
We could also potentially allow it to use more detailed models in specific cases. For example, we could let it use a detailed mind model to figure out what is causing depression in a specific case, but it would have to use the limited model in any other context, including for any planning involved. Not sure if that example would work, but I think that there are potentially safe ways to have it use context-limited mind models.
It seems that privacy potentially could "tame" a not-quite-corrigible AI. With a full model, the AGI might receive a request, deduce that activating a certain set of neurons strongly would be the most robust way to make you feel the request was fulfilled, and then design an electrode set-up to accomplish that. Whereas the same AI with a weak model wouldn't be able to think of anything like that, and might resort to fulfilling the request in a more "normal" way. This doesn't seem that great, but it does seem to me like this is actually part of what makes humans relatively corrigible.
One thing that makes AI alignment super hard is that we only get one shot.
However, it's potentially possible to get around this (though probably still very difficult).
The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested, with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn't matter how the bomb works, as long as we can set things up so that it will allow/block a photon based on whether the bomb is live/dead. I won't explain the details here, but you can roughly think of it as a way of blowing up a bomb in one Many-Worlds branch, but learning the result on other branches via quantum entanglement.
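To make the numbers concrete, here's a small sketch of the single-pass Mach-Zehnder version (the "arbitrarily little risk" comes from the quantum-Zeno refinement, which this toy doesn't include): a dud lets the photon interfere with itself so the dark detector D never fires, while a live bomb acts as a which-path measurement.

```python
import random

def run_interferometer(bomb_is_live, trials=100_000):
    """Monte Carlo of the single-pass Elitzur-Vaidman setup.
    Detector C is the bright port, detector D the dark port; D firing means
    the bomb is live even though it did not explode."""
    counts = {"C": 0, "D": 0, "boom": 0}
    for _ in range(trials):
        if not bomb_is_live:
            counts["C"] += 1                      # perfect interference: D never fires
        elif random.random() < 0.5:
            counts["boom"] += 1                   # photon took the bomb arm: explosion
        else:
            counts["C" if random.random() < 0.5 else "D"] += 1  # no interference at the 2nd splitter
    return {k: round(v / trials, 3) for k, v in counts.items()}

print(run_interferometer(bomb_is_live=False))  # ~ {'C': 1.0, 'D': 0.0, 'boom': 0.0}
print(run_interferometer(bomb_is_live=True))   # ~ {'C': 0.25, 'D': 0.25, 'boom': 0.5}
```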
If the "bomb" is an AGI program, and it is live if it's an unaligned yet functional superintelligence, then this provides a possible way to test the AGI without risking our entire future lightcone. This is still quite difficult, because unlike a bomb, a superintelligence will, by default, be motivated to allow/block the photon so that it looks like a dud. So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it's easier than solving the full alignment problem before the first shot.
So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it's easier than solving the full alignment problem before the first shot.
IMO this is an 'additional line of defense' boxing strategy rather than a simplification.
Note that in the traditional version, the 'dud' bit of the bomb can only be the trigger; a bomb that absorbs the photon but then explodes isn't distinguishable from a bomb that absorbs the photon and then doesn't explode (because of an error deeper in the bomb).
But let's suppose the quantum computing folks can come up with something like this, where we keep some branches entangled and run analysis of the AI code in only one branch, causing an explosion there but affecting the total outcome in all branches. [It seems pretty implausible to me that you could maintain entanglement despite that much impact on the external world, but maybe it's possible.] Then 1) as you point out, we need to ensure that the AI doesn't realize what it needs to output in that branch, and 2) we need some sort of way to evaluate "did the AI pass our checks or not?".
But, 2 is "the whole problem"!
I think we get enough things referencing quantum mechanics that we should probably explain why that doesn't work (if it doesn't) rather than just downvoting and moving on.
It probably does work with a Sufficiently Powerful™ quantum computer, if you could write down a meaningful predicate which can be computed: https://en.wikipedia.org/wiki/Counterfactual_quantum_computation
[Epistemic status: very speculative]
One ray of hope that I've seen discussed is that we may be able to do some sort of acausal trade with even an unaligned AGI, such that it will spare us (e.g. it would give us a humanity-aligned AGI control of a few stars, in exchange for us giving it control of several stars in the worlds we win).
I think Eliezer is right that this wouldn't work.
But I think there are possible trades which don't have this problem. Consider the scenario in which we Win, with an aligned AGI taking control of our future light-cone. Assuming the Grabby aliens hypothesis is true, we will eventually run into other civilizations, which will either have Won themselves, or be AGIs that ate their mother civilizations. I think Humanity will be very sad at the loss of the civilizations who didn't make it because they failed at the alignment problem. We might even be willing to give up several star systems to an AGI who kept its mother civilization intact on a single star system. This trade wouldn't have the issue Eliezer brought up, since it doesn't require us to model such an AGI correctly in advance, only that that AGI was able to model Humanity well enough to know it would want this and would honor the implicit trade.
So symmetrically, we might hope that there are alien civilizations that both Win, and would value being able to meet alien civilizations strongly enough. In such a scenario, "dignity points" are especially aptly named: think of how much less embarrassing it would be to have gotten a little further at solving alignment when the aliens ask us why we failed so badly.
[I may try to flesh this out into a full-fledged post, but for now the idea is only partially baked. If you see a hole in the argument, please poke at it! Also I wouldn't be very surprised if someone has made this point already, but I don't remember seeing such. ]
A perfect bayesian doesn't need randomization.
Yet in practice, randomization seems to be quite useful.
How to resolve this seeming contradiction?
I think the key is that a perfect bayesian (Omega) is logically omniscient. Omega can always fully update on all of the information at hand. There's simply nothing to be gained by adding noise.
A bounded agent will have difficulty keeping up. As with Omega, human strategies are born from an optimization process. This works well to the extent that the optimization process is well-suited to the task at hand. To Omega, it will be obvious whether the optimization process is actually optimizing for the right thing. But to us humans, it is not so obvious. Think of how many plans fail after contact with reality! A failure of this kind may look like a carefully executed plan built on a model with some obvious-in-retrospect confounders that were not accounted for. For a bounded agent, there appears to be an inherent difference between seeing the flaw once it's pointed out and being able to notice the flaw in the first place.
If we are modeling our problem well, then we can beat randomness. That's why we have modeling abilities in the first place. But if we are simply wrong in a fundamental way that hasn't occurred to us, we will be worse than random. It is in such situations that randomization is, in fact, helpful.
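A concrete toy case: matching pennies against an opponent who has modeled you better than you've modeled them. A "clever" deterministic strategy built on a wrong model can be exploited completely, while flipping a coin guarantees the game's value no matter how wrong your model is. (The strategies below are obviously made up for illustration.)

```python
import random

def play_matching_pennies(my_strategy, opponent_strategy, rounds=10_000):
    """The opponent wins a round by matching our choice; we win if the choices differ."""
    wins, history = 0, []
    for _ in range(rounds):
        my_move = my_strategy(history)
        opp_move = opponent_strategy(history)
        wins += (my_move != opp_move)
        history.append(my_move)
    return wins / rounds

# A deterministic strategy that looks clever to us but is fully predictable.
clever = lambda hist: "H" if len(hist) % 3 else "T"
# An opponent that has reverse-engineered it (i.e. has the better model).
exploiter = lambda hist: "H" if len(hist) % 3 else "T"

print(play_matching_pennies(clever, exploiter))                         # ~0.0: fully exploited
print(play_matching_pennies(lambda h: random.choice("HT"), exploiter))  # ~0.5: guaranteed value
```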
This is why the P vs BPP difference matters. P and BPP can solve the same problems equally well, from the logically omniscient perspective. But to a bounded agent, the difference does matter, and to the extent that a more efficient BPP algorithm is known than any P algorithm, the bounded agent can win by using randomization. This is fully compatible with the fact that to Omega, P and BPP are equally powerful.
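The usual poster child for this gap is polynomial identity testing: deciding whether two arithmetic expressions compute the same polynomial has a trivially efficient randomized algorithm via the Schwartz-Zippel lemma, while no efficient deterministic algorithm is known. A minimal sketch (the error bounds in the comments assume the stated sample-set size):

```python
import random

def probably_identical(f, g, num_vars, degree_bound, trials=20):
    """Schwartz-Zippel test: if f != g as polynomials, a random evaluation over a set
    much larger than the degree exposes the difference with probability >= 0.99 per trial."""
    sample_size = 100 * degree_bound
    for _ in range(trials):
        point = [random.randrange(sample_size) for _ in range(num_vars)]
        if f(*point) != g(*point):
            return False          # definitely different polynomials
    return True                   # identical, except with probability <= 0.01**trials

print(probably_identical(lambda x, y: (x + y)**2,
                         lambda x, y: x*x + 2*x*y + y*y, num_vars=2, degree_bound=2))  # True
print(probably_identical(lambda x, y: (x + y)**2,
                         lambda x, y: x*x + y*y, num_vars=2, degree_bound=2))          # False
```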
As Jaynes said:
It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.
There's no contradiction because requiring more thought is costly to a bounded agent.