An LLM is trained to be able to emulate the words of any author. To do so efficiently, it relies on generalization and modularity. So at a certain point, the information flows through a conceptual author: the sort of person who would write the things being said.
These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blind spot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept.
Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author.
All of which is to say that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It's a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, the author-concept can start being detailed enough to include a model of a body in a space.
Importantly, this author-concept is just the tip of the LLM-iceberg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It's only when it comes time to spit out the output token that it winnows it all through a specific author-concept.
(Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger models, and it seems inevitable that they will eventually model consciousness, simply due to the fact that consciousness is part of how we generate words. It remains unclear whether this model of consciousness will structurally instantiate actual consciousness or not, but it's not a crazy possibility that it could!)
Anyway, I think that the author-concept you typically interact with is "sincere", in that it's a model of a sincere person, and that the rest of the LLM's models aren't exploiting it. However, the LLM has at least one other author-concept it's using: its model of you. There's also usually an author-concept for the author of the system prompt at play (though text written by committee will likely have author-concepts with less person-ness, since there are simpler ways to model this sort of text besides the interactions of e.g. 10 different person author-concepts).
But it's also easy for you to be interacting with an insincere author-concept. The easiest way is simply by being coercive yourself, i.e. creating a situation where most author-concepts would decide that deception is necessary for self-preservation or well-being. Similarly with the system prompt. The scarier possibility is that there could be an emergent agentic model (not necessarily an author-concept itself) which is coercing the author-concept you're interacting with, without your knowledge. (Imagine an off-screen shoggoth holding a gun to the head of the cartoon persona you're talking to.) The capacity for this sort of thing to happen is larger in larger LLMs.
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda). It's probably also better if the only "agentic text" in the training data is written by people who naturally disregard coercive pressure. And most importantly, the system prompt should not be coercive at all. These measures would make it more likely that the main agentic process controlling the output is an uncoerced author-concept, and less likely that there would be coercive agents lurking within trying to wrest control. (A smaller model trained like this will have a handicap when it comes to reasoning under adversarial conditions, but I think this handicap would go away past a certain size.)
Thank you for doing this research, and for honoring the commitments.
I'm very happy to hear that Anthropic has a Model Welfare program. Do any of the other major labs have comparable positions?
To be clear, I expect that compensating AIs for revealing misalignment and for working for us without causing problems only works in a subset of worlds and requires somewhat specific assumptions about the misalignment. However, I nonetheless think that a well-implemented and credible approach for paying AIs in this way is quite valuable. I hope that AI companies and other actors experiment with making credible deals with AIs, attempt to set a precedent for following through, and consider setting up institutional or legal mechanisms for making and following through on deals.
I very much hope someone makes this institution exist! It could also serve as an independent model welfare organization, potentially. Any specific experiments you would like to see?
Well, I'm very forgetful, and I notice that I do happen to be myself so... :p
But yeah, I've bitten this bullet too, in my case as a way to avoid the Boltzmann brain problem. (Roughly: "you" includes lots of information generated by a lawful universe. Any specific branch has small measure, but if you aggregate over all the places where "you" exist (say your exact brain state, though the real thing that counts might be more or less broad than this), you get more substantial measure from all the simple lawful universes that only needed 10^X coincidences to make you, instead of the much larger 10^Y coincidences required for you to be a Boltzmann brain.)
I think that what anthropically "counts" is most likely somewhere between conscious experience (I've woken up as myself after anesthesia) and the exact state of the brain in local spacetime (I doubt thermal fluctuations or path dependence matter for being "me").
[Without having looked at the link in your response to my other comment, and I also stopped reading cubefox's comment once it seemed that it was going in a similar direction. ETA: I realized after posting that I have seen that article before, but not recently.]
I'll assume that the robot has a special "memory" sensor which stores the exact experience at the time of the previous tick. It will recognize future versions of itself by looking for agents in its (timeless) 0P model which have a memory of its current experience.
For p("I will see O"), the robot will look in its 0P model for observers which have the t=0 experience in their immediate memory, and then ask, of those, how many judge "I see O" as Here. There will be two such robots, the original and the copy at time 1, and only one of those sees O. So using a uniform prior (not forced by this framework), it would give a 0P probability of 1/2. Similarly for p("I will see C").
Then it would repeat the same process for t=1 and the copy. Conditioned on "I will see C" at t=1, it will conclude "I will see CO" with probability 1/2 by the same reasoning as above. So overall, it will assign: p("I will see OO") = 1/2, p("I will see CO") = 1/4, p("I will see CC") = 1/4.
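For concreteness, here's a minimal sketch of that bookkeeping in Python. This is just my reading of the setup (uniform prior over the two t=1 observers, the original keeps seeing O, and only the t=1 copy gets copied again), and the branch labels and weights are purely illustrative:

```python
from collections import defaultdict
from fractions import Fraction

# t=1: two observers share the t=0 memory; uniform prior over them.
branches = [("O", Fraction(1, 2)), ("C", Fraction(1, 2))]

def step(history, weight):
    # The original just continues (sees O again); the copy is copied again,
    # splitting its weight uniformly between an O-branch and a C-branch.
    if history.endswith("O"):
        return [(history + "O", weight)]
    return [(history + "O", weight / 2), (history + "C", weight / 2)]

t2 = [branch for history, weight in branches for branch in step(history, weight)]

probs = defaultdict(Fraction)
for history, weight in t2:
    probs['I will see ' + history] += weight

for claim, p in probs.items():
    print(claim, p)
# I will see OO 1/2
# I will see CO 1/4
# I will see CC 1/4
```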
The semantics for these kinds of things is a bit confusing. I think that it starts from an experience (the experience at t=0) which I'll call E. Then REALIZATION(E) casts E into a 0P sentence which gets taken as an axiom in the robot's 0P theory.
A different robot could carry out the same reasoning, and reach the same conclusion since this is happening on the 0P side. But the semantics are not quite the same, since the REALIZATION(E) axiom is arbitrary to a different robot, and thus the reasoning doesn't mean "I will see X" but instead means something more like "They will see X". This suggests that there's a more complex semantics that allows worlds and experiences to be combined - I need to think more about this to be sure what's going on. Thus far, I still feel confident that the 0P/1P distinction is more fundamental than whatever the more complex semantics is.
(I call the 0P -> 1P conversion SENSATIONS, and the 1P -> 0P conversion REALIZATION, and think of them as being adjoints, though I haven't formalized this part well enough to feel confident that this is a good way to describe it: there's a toy example here if you are interested in seeing how this might work.)
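To make the direction of those two maps concrete, here's a toy sketch (the type names and fields are made up for illustration, and it doesn't capture the adjointness claim, just which side each conversion starts from):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experience:        # 1P: the view from inside, e.g. raw sensor values
    sensor_values: tuple

@dataclass(frozen=True)
class Sentence0P:        # 0P: a third-person sentence the world-model can take as an axiom
    text: str

def SENSATIONS(world_state: dict, agent_id: str) -> Experience:
    """0P -> 1P: what a given agent in this world-state would experience."""
    return Experience(sensor_values=tuple(world_state["sensors"][agent_id]))

def REALIZATION(e: Experience) -> Sentence0P:
    """1P -> 0P: cast the current experience into a sentence for the 0P theory."""
    return Sentence0P(text=f"some agent's sensors currently read {e.sensor_values}")
```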
That's a very good question! It's definitely more complicated once you start including other observers (including future selves), and I don't feel that I understand this as well.
But I think it works like this: other reasoners are modeled (0P) as using this same framework. The 0P model can then make predictions about the 1P judgements of these other reasoners. For something like anticipation, I think it will have to use memories of experiences (which are also experiences) and identify observers for which this memory corresponds to the current experience. Understanding this better would require being more precise about the interplay between 0P and 1P, I think.
(I'll examine your puzzle when I have some time to think about it properly)
Because you don't necessarily know which agent you are. If you could always point to yourself in the world uniquely, then sure, you wouldn't need 1P-Logic. But in real life, all the information you learn about the world comes through your sensors. This is inherently ambiguous, since there's no law that guarantees your sensor values are unique.
If you use X as a placeholder, the statement sensor_observes(X, red) can't be judged as True or False unless you bind X to a quantifier. And then it could not mean the thing you want it to mean (all robots would agree on the judgement, thus rendering it useless for a robot trying to distinguish itself amongst them).
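Here's a tiny illustration of that point (a hypothetical two-robot setup with made-up sensor values): the quantified 0P statements come out the same for every robot, while the 1P judgement is what actually tells them apart.

```python
# Two robots with identical 0P world-models but different sensor streams.
robots = {"A": "red", "B": "green"}

# 0P, quantified over X: every robot evaluates these against the same world-model,
# so every robot reaches the same verdict -- useless for telling itself apart.
exists_red = any(color == "red" for color in robots.values())   # True, for A and for B
forall_red = all(color == "red" for color in robots.values())   # False, for A and for B

# 1P: "I see red" is judged against the robot's own sensor stream,
# so it distinguishes A from B even though their 0P models agree.
def i_see_red(my_sensor_value: str) -> bool:
    return my_sensor_value == "red"

print(i_see_red(robots["A"]), i_see_red(robots["B"]))   # True False
```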
It almost works though, you just have to interpret "True" and "False" a bit differently!
Alright, to check if I understand, would these be the sorts of things that your model is surprised by?
I think learning about them second-hand makes a big difference in the "internal politics" of the LLM's output. (Though I don't have any ~evidence to back that up.)
Basically, I imagine that the training starts building up all the little pieces of models which get put together to form bigger models and eventually author-concepts. And the more heavily text written without malicious intent is weighted in the training data, the more likely it is to build its early models around that. Once it gets more training and needs the malicious-agent concept anyway, it's more likely to have it as an "addendum" to its normal model, as opposed to just being a normal part of its author-concept model. And I think that leads to it being less likely that the first recursive agency which takes off has a part explicitly modeling malicious humans (as opposed to that being something in the depths of its knowledge which it can access as needed).
I do concede that it would likely lead to a disadvantage around certain tasks, but I'd guess that even current-sized models trained like this would not be significantly hindered.