One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers":

  • The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior".
  • The "truth machine": LLMs are trained to "reduce hallucinations" in a variety of ways, including RLHF and the more recent reasoning RL.
  • The "good machine": The same RLHF and reasoning RL training also aims to train good outputs (eg helpful, honest, harmless). 

I've quoted Andrej Karpathy before, but I'll do it again: 

I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
[...]
I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.

- Andrej Karpathy

Failing to properly distinguish the "dream machine" capabilities from the other (truth-oriented or good-oriented) capabilities hobbles today's LLMs. If you ask Claude to write fiction, it has a strong tendency to mix the "Claude voice" into the fiction being generated. More generally, the base model (IE, only the generative pre-training) is great at extrapolating text; the subsequent training hobbles this capability, because no care is taken to preserve it.

Habryka mentions this with respect to experiments with LLM-augmented text editing:

Using base models has at least so far been essential for getting any useful writing work out of LLMs, with the instruction-tuned models reliably producing obtuse corpo-speak when asked to engage in writing tasks. 

I expect that mixing truth-orientation with good-orientation has similar problematic consequences. 

A Modest Proposal

Dream Machine Layer

My basic idea here is not new: instead of pre-training on lots and lots of text from the internet in an unstructured way, I think it would be better to take a more structured approach which accounts for metadata via inferred author vectors, date vectors, etc. I don't have a specific proposed architecture, but the result should still be able to do raw (unlabeled) text prediction well, by inferring the author vectors and other latents as it goes. I have in mind a semi-supervised approach; some real metadata labels are provided when authorship is known, but labels are inferred where they are absent.[1] 
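
To make this slightly more concrete, here is a minimal sketch (assuming PyTorch; the architecture and every name, like AuthorConditionedLM, are hypothetical illustrations rather than a specific proposal) of semi-supervised metadata conditioning: the author vector is looked up from a learned table when a label is available and inferred from the text itself when it is not.

```python
import torch
import torch.nn as nn

class AuthorConditionedLM(nn.Module):
    def __init__(self, vocab_size=50_000, num_known_authors=10_000,
                 d_model=512, d_author=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned vectors for authors we have labels for.
        self.author_table = nn.Embedding(num_known_authors, d_author)
        # Amortized inference network: guesses an author vector from the text
        # itself when no label is available.
        self.author_encoder = nn.GRU(d_model, d_author, batch_first=True)
        self.decoder = nn.GRU(d_model + d_author, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def infer_author(self, tokens):
        _, h = self.author_encoder(self.tok_emb(tokens))
        return h.squeeze(0)                      # (batch, d_author)

    def forward(self, tokens, author_id=None):
        x = self.tok_emb(tokens)                 # (batch, seq, d_model)
        if author_id is not None:                # supervised case: use the label
            a = self.author_table(author_id)
        else:                                    # unsupervised case: infer it
            a = self.infer_author(tokens)
        a = a.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.decoder(torch.cat([x, a], dim=-1))
        return self.lm_head(h)                   # next-token logits

# Usage: ordinary next-token cross-entropy either way; when the label is known,
# an auxiliary loss can also pull infer_author(tokens) toward the table entry,
# so inferred and supervised vectors live in the same space.
model = AuthorConditionedLM()
tokens = torch.randint(0, 50_000, (2, 128))
logits_labeled = model(tokens, author_id=torch.tensor([3, 7]))
logits_unlabeled = model(tokens)
```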

This gives us much better "handles" on what the generative hallucination is doing; instead of trying to prompt it cleverly, we can set the latent metadata vectors to whatever we want. We can, for example, interpolate the vectors of several authors to see what that looks like. We can also mix-and-match vectors in a more semantic way, by looking for meaningful dimensions in the vectors (EG doing PCA across authors and trying to interpret the resulting dimensions).
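
As a toy illustration of that kind of vector arithmetic (hypothetical, using a random stand-in for the learned author table from the sketch above):

```python
import torch

# Stand-in for the learned author-vector table from the sketch above.
author_table = torch.randn(10_000, 64)

# Interpolating two authors' vectors to get a blended "author".
a1, a2 = author_table[3], author_table[7]
blended = 0.5 * a1 + 0.5 * a2

# PCA across authors, looking for interpretable dimensions
# (pca_lowrank centers the data by default).
_, _, V = torch.pca_lowrank(author_table, q=10)   # V: (64, 10) principal directions
shifted = a1 + 2.0 * V[:, 0]   # push author 3 along the top component
```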

This is a scientifically interesting project as well; in line with microscope AI, we get to learn things about the world. The inferred author vectors and date vectors give us interesting information, and the structure of the vector space also gives us interesting information. This is similar to the recently-announced project to deliberately attempt to model the entire world via AI. We can query realistic location and date vectors that don't correspond to any actual document, to see what the AI model has inferred about that part of the world -- what could have been written at that time and location, if someone had written it down. (This is a weak form of ancestor simulation; we can try to construct author vectors for historical figures who didn't write anything down.)

Multimodal capabilities could of course dramatically expand this, producing artificial photos or video etc from different times and locations.

Truth Machine Layer

To build a truth-machine layer on top of this, we fine-tune the system in a truth-oriented way. Conceptually, we are looking for an author vector that knows as much as possible; if there turns out to be a "knowledgeable" dimension in the author-vector space, we'd be turning that up to its maximum (or, if there are multiple dimensions for knowledge in various fields, we're maximizing all of them). More realistically, we might need to fine-tune the whole network to support the existence of a maximally knowledgeable author-vector.

This should be done in such a way as to only increase the capabilities of the network; IE, it should still be good at "dreaming" via other author-vectors, even as it gets better at telling the truth via the truth-oriented author-vector. After all, the truth-oriented author-vector is a real author-vector in the real world: it's the author corresponding to this AI we're trying to train (or more specifically, its truth-oriented layer). So, in some sense, this stage of training is just providing evidence about one more real-world author.
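
One rough way to cash out "only increase the capabilities" is to reserve an author id for the truth-oriented author, fine-tune on curated question-answer data under that id, and penalize any drift in what the other author-vectors predict via a KL term against a frozen copy of the pre-trained network. A hedged sketch, reusing the hypothetical AuthorConditionedLM and model from the earlier code:

```python
import copy
import torch
import torch.nn.functional as F

# Continues the hypothetical AuthorConditionedLM sketch; `model` is the
# pre-trained network. We reserve one author id for the truth-oriented author.
TRUTH_ID = 0
reference = copy.deepcopy(model).eval()           # frozen snapshot of the dream machine
for p in reference.parameters():
    p.requires_grad_(False)

def truth_step(qa_tokens, qa_targets, dream_tokens, dream_author_ids, kl_weight=1.0):
    batch = qa_tokens.size(0)
    # 1. Truth objective: predict curated question-answer text (qa_targets are
    #    next-token targets) under the reserved truth author id.
    truth_ids = torch.full((batch,), TRUTH_ID, dtype=torch.long)
    logits = model(qa_tokens, author_id=truth_ids)
    truth_loss = F.cross_entropy(logits.flatten(0, 1), qa_targets.flatten())

    # 2. Preservation objective: the other author vectors should still "dream"
    #    as before, enforced with a KL penalty against the frozen reference.
    new_logits = model(dream_tokens, author_id=dream_author_ids)
    with torch.no_grad():
        old_logits = reference(dream_tokens, author_id=dream_author_ids)
    kl = F.kl_div(F.log_softmax(new_logits, dim=-1),
                  F.log_softmax(old_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return truth_loss + kl_weight * kl
```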

This special truth-oriented author-vector should also be capable of directly reproducing the capabilities of the whole network; IE, one of many question-answer tasks it is trained on is "act like author X" for all of the author-vectors in the system. This type of training attempts to import all of the implicit world-knowledge of the rest of the system into the truth-oriented author-vector. You can think of it as a sort of introspective capability; this specific author-vector accurately reflects the whole rest of the system.
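
A sketch of that introspective distillation, again hypothetical and building on the code above (make_act_like_prompt is an assumed helper that prepends an "act like author X" instruction to the text):

```python
import torch
import torch.nn.functional as F

def introspection_step(tokens, author_ids, temperature=1.0):
    with torch.no_grad():
        # What the network predicts when conditioned directly on author X.
        target_logits = model(tokens, author_id=author_ids)
    # What the truth author predicts when merely *asked* to act like author X.
    prompted = make_act_like_prompt(tokens, author_ids)       # hypothetical helper
    truth_ids = torch.full((tokens.size(0),), TRUTH_ID, dtype=torch.long)
    # Assumes the prompt is prepended, so the last len(tokens) positions line up.
    pred_logits = model(prompted, author_id=truth_ids)[:, -tokens.size(1):]
    return F.kl_div(F.log_softmax(pred_logits / temperature, dim=-1),
                    F.log_softmax(target_logits / temperature, dim=-1),
                    log_target=True, reduction="batchmean")
```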

The author-vector also allows us to explore multiple different notions of truth, perhaps customized to individual users who have different beliefs about what truth-standards should apply.

My proposal for the detailed workings of the truth-oriented layer would be inspired by logical induction, but one could imagine many different forms of truth-oriented training, closer to or further from the currently-dominant paradigm.

Good Machine Layer

Finally, the Good Machine. This can be thought of as yet another author-vector, which is trained on the full "helpful, honest, harmless" type objective. We leverage the truth layer to reason about what is good. This would be the layer that most users get to talk to; it should avoid doing dangerous things like helping the user create weapons of mass destruction.
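
As one concrete possibility (an illustration, not the post's specific proposal), the good author-vector could be trained with an off-the-shelf preference-learning loss such as DPO, using the frozen truth-oriented author-vector as the reference policy:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_good_chosen, logp_good_rejected,
             logp_truth_chosen, logp_truth_rejected, beta=0.1):
    """Direct Preference Optimization loss for the 'good' author vector.

    Each argument is the total sequence log-probability of a chosen or rejected
    response, computed either under the good author vector (the policy being
    trained) or under the frozen truth-oriented author vector (the reference).
    """
    chosen_margin = logp_good_chosen - logp_truth_chosen
    rejected_margin = logp_good_rejected - logp_truth_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Using the truth layer as the reference policy is just one way to read "we leverage the truth layer"; a reward model that queries the truth layer would be another.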

Again, this could be tuned to multiple different notions of good, representing different value-systems and belief-systems. There could be overarching principles which apply to all such author-vectors, so that users can tweak the vectors driving the system for them personally to represent their concept of good and truth, without being able to jailbreak the system. (Or, more realistically, without being able to do it very easily... this architecture alone will not completely eradicate jailbreaking.)

[1] More specifically, there's a distinction between author vectors (which are entirely inferred) and text labels of attribution (which give author information as a string). There needs to be a learned model which transforms between the two.

Comments

LLMs compute the probability of a sequence, but the truth/good distinction is captured by a two-dimensional Jeffrey-Bolker measure (I'm calling its components "probability" and "shouldness"; their ratio is the expected utility of an event). Shouldness is reconstructed from probability and expected utility as their product, so plausibly it behaves on long sequences similarly to probability: it generally gets lower for longer sequences, but tends to be higher for simpler sequences.

The analogy between probability and shouldness suggests that some form of pretraining might be able to create models for either of them (as opposed to a base model that learns something in between from raw data, with no supervision from preference data). Then expected utility is the ratio; that is, instead of looking at the logits of one LLM, we look at differences of logits for two LLMs, a shouldness-LLM and a probability-LLM (with some regularization that anchors to a base model instead of goodharting towards high-approximate-expected-utility, low-probability sequences). Possibly this needs interspersing preference training with pretraining, rather than only applying preference training during post-training, so that there are two different pretrained models that nurture different collections of circuits (for probability and for shouldness).
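
A sketch of the combination rule (just the sampling-time arithmetic, assuming PyTorch; it says nothing about how the two models would actually be trained):

```python
import torch
import torch.nn.functional as F

def shouldness_guided_logits(base_logits, prob_logits, should_logits, alpha=1.0):
    """Combine next-token logits from three models.

    log(shouldness) - log(probability) is the log of their ratio, i.e. an
    expected-utility-like score; adding it (scaled by alpha) to the base
    model's log-probabilities anchors sampling to the base distribution.
    """
    base_logp = F.log_softmax(base_logits, dim=-1)
    prob_logp = F.log_softmax(prob_logits, dim=-1)
    should_logp = F.log_softmax(should_logits, dim=-1)
    return base_logp + alpha * (should_logp - prob_logp)

# Usage: sample the next token from the combined distribution.
vocab = 50_000
combined = shouldness_guided_logits(torch.randn(1, vocab),
                                    torch.randn(1, vocab),
                                    torch.randn(1, vocab))
next_token = torch.multinomial(F.softmax(combined, dim=-1), num_samples=1)
```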

(Some kind of Solomonoff induction analogy for probability/shouldness should be a clearer thing to express, and might be more relevant in a decision theory context, where you start with description lengths of programs in two different languages, a language of probability-programs and another language of shouldness-programs, and then convert these into probability and shouldness distributions over sequences, enabling both probability induction and shouldness induction for the next element of a sequence. Solomonoff induction ignores distinctions between languages in the limit, but this kind of probability/shouldness induction works with pairs of languages, and the distinction between the two languages in a given pair is the most important thing, as it defines expected utility.)

There's a critical (and interesting) question about how you generate the latent space of authors, and/or how it is inferred from the text. Did you have thoughts on how this would be done?

My idea is very similar to paragraph vectors: the vectors are trained to be useful labels for predicting the tokens.

To differentiate author-vectors from other types of metadata, the author vectors should be additionally trained to predict author labels, with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author. There's also the author-vector-to-text-author-attribution network, which should be pre-trained to have a good "prior" over author-names (so we're not getting a bunch of nonsense strings out). During training, the text author-names are being estimated alongside the vectors (where author labels are not available), so that we can penalize different author-vectors which map to the same name. (Some careful thinking should be done about how to handle people with the actual same name; perhaps some system of longer author IDs?)

Other meta-data would be handled similarly.
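
A rough sketch of those two extra training terms (hypothetical, continuing the AuthorConditionedLM code from the post; name_head is a toy stand-in for the author-vector-to-text-attribution network, with names as a fixed label set rather than decoded strings):

```python
import torch
import torch.nn.functional as F

name_head = torch.nn.Linear(64, 10_000)

def metadata_losses(doc_tokens_a, doc_tokens_b, author_id, author_name_id):
    # (1) Consistency: two documents by the same author should get the same
    #     inferred vector, and both should match the table entry for that author.
    va = model.infer_author(doc_tokens_a)
    vb = model.infer_author(doc_tokens_b)
    table_vec = model.author_table(author_id)
    consistency = (F.mse_loss(va, vb)
                   + F.mse_loss(va, table_vec)
                   + F.mse_loss(vb, table_vec))

    # (2) Attribution: the author vector should predict the author's name label.
    #     (Penalizing *different* vectors that decode to the *same* name would
    #     be a further, non-trivial term, omitted here.)
    attribution = F.cross_entropy(name_head(table_vec), author_name_id)
    return consistency + attribution
```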

This seems reasonable, though the efficacy of the learning method seems unclear to me.

But:

with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author

This seems wrong. To pick on myself, my peer-reviewed papers, my Substack, my LessWrong posts, my 1990s blog posts, and my Twitter feed are all substantively different in ways that I think the author vector should capture.

My guess is that we want to capture those differences with the time&date meta-data instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious. 

Maybe a better way to do it would be to explicitly take both approaches, so that there's an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to explicitly capture the way you change over time (we can watch your vector move through the particular-author space), while still allowing us to query what you would say at times where we don't have evidence in the form of writing from you. 
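
A toy version of that two-stage mapping (all names hypothetical): a time-invariant abstract-author vector plus a date embedding are combined into the particular author vector used for conditioning at that point in time.

```python
import torch
import torch.nn as nn

class ParticularAuthorMap(nn.Module):
    def __init__(self, d_author=64, d_date=16, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_author + d_date, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_author),
        )

    def forward(self, abstract_author, date_vec):
        # Combine "who you are in the abstract" with "when it was written".
        return self.net(torch.cat([abstract_author, date_vec], dim=-1))

# Query what an author would plausibly write at a time we have no text from,
# by feeding a date embedding for that period.
mapper = ParticularAuthorMap()
particular_vec = mapper(torch.randn(1, 64), torch.randn(1, 16))
```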

Ideally, imagining the most sophisticated version of the setup, the model would be able to make date&time attributions very fine-grained, guessing when specific words were written & constructing a guessed history of revisions for a document. This complicates things yet further. 

I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That's what I've done in System 2 Alignment, although that's also a prediction about what developers might try for alignment by default.

Whether the approach you outline here would work is an empirical question, but it sounds likely enough that teams might actually put some effort into it. Preprocessing data to identify authors and similar categories wouldn't be that hard.

This helps with the problem Nate Soares characterized as making cognition aimable at all - having AI pursue one coherent goal (separately from worrying about whether you can direct that "goal slot" toward something that actually works). I think that's the alignment issue you're addressing (along with slop potentially leading to bad AI-assisted alignment). I briefly describe the LLM agent alignment part of that issue in Seven sources of goals in LLM agents.

I hope I'm reading you right about why you think reducing AI slop would help with alignment.

Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind. So, the dual-use idea is very much what I had in mind.

Isn't honesty a part of the "truth machine" rather than the "good machine"? Confabulation seems to be a case of the model generating text which it doesn't "believe", in some sense.

Yeah. I'm saying that the "good machine" should be trained on all three; it should be honest, but constrained by helpfulness and harmlessness. (Or, more realistically, a more complicated constitution with more details.)
