Background
This post is a short version of a paper we wrote that you can find here. You can read this post to get the core ideas. You can read the paper to go a little deeper.
The paper is about probing decoder-only LLMs for their beliefs, using either unsupervised methods (like CCS from Burns et al.) or supervised methods. We give philosophical/conceptual reasons for pessimism and demonstrate some empirical failings using LLaMA 30b. By way of background, we're both philosophers, not ML people, but the paper is aimed at both audiences.
Introduction
One child says to the other “Wow! After reading some text, the AI understands what water is!”… The second child says “All it understands is relationships between words. None of the words connect to reality. It doesn’t have any internal concept of what water looks like or how it feels to be wet. …” …
Two angels are watching [some] chemists argue with each other. The first angel says “Wow! After seeing the relationship between the sensory and atomic-scale worlds, these chemists have realized that there are levels of understanding humans are incapable of accessing.” The second angel says “They haven’t truly realized it. They’re just abstracting over levels of relationship between the physical world and their internal thought-forms in a mechanical way. They have no concept of [$!&&!@] or [#@&#**]. You can’t even express it in their language!”
--- Scott Alexander, Meaningful
Do large language models (LLMs) have beliefs? And, if they do, how might we measure them?
These questions matter because one important problem plaguing current LLMs is their tendency to generate falsehoods with great conviction. This is sometimes called lying and sometimes called hallucinating. One strategy for addressing this problem is to find a way to read the beliefs of an LLM directly off its internal state. Such a strategy falls under the broad umbrella of model interpretability, but we can think of it as a form of mind-reading. Detecting lies in LLMs has many obvious applications and is especially relevant for things like ELK.
We tackle the question about the status of beliefs in LLMs head-on. We proceed in two stages. First, we assume that LLMs do have beliefs, and consider two current approaches for how we might measure them, due to Azaria and Mitchell and to Burns et al. We provide empirical results from LLaMA 30b that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs.
After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided and rely on a philosophical mistake. We provide a more productive framing of questions surrounding the status of beliefs in LLMs. Our analysis reveals both that there are many contexts in which we should expect systems to track the truth in order to accomplish other goals but that the question of whether or not LLMs have beliefs is largely an empirical matter. We provide code at https://github.com/balevinstein/Probes.
The Challenge of Deciphering the Beliefs of Language Models
For now, let's assume that in order to generate human-like text, LLMs (like humans) have beliefs about the world. We might then ask how we can measure and discover their beliefs. This question immediately leads to a number of problems:
Unreliable Self-Reporting
Asking an LLM directly about its beliefs is insufficient. As we've already discussed, models have a tendency to hallucinate or even lie. So belief reports alone cannot be taken as trustworthy. Moreover, when asked about its beliefs, an LLM likely will not introspect and decode some embedding that contains information about its information state. Instead, it just needs to answer the question in a reasonable way that accords with its training process.
Limited Behavioral Evidence
When trying to understand human beliefs, we have a rich tapestry of behavioral evidence to draw upon. We consider not only what people say, but also what they do. For instance, if someone consistently invests in the S&P, we infer that they believe the S&P will go up in value, even if they never explicitly state it. For LLMs, however, we have a limited behavioral basis for inferring beliefs. The "behavior" of a language model is confined to generating sequences of tokens, or, for the bare-bones LLM, generating distributions over tokens. Both of these lack the full depth and breadth of human action.
Contextuality of LLMs
Everything one inputs and doesn't input into the LLM is fair game for it to base its responses on. Through clever prompting alone, there is no way to step outside of the language game the LLM is playing to get at what it really thinks. This problem also plagues economists' and psychologists' attempts to uncover the beliefs of humans. For example, economists have challenged the validity of the famous framing effects of Tversky and Kahneman by considering the possibility that the subjects in the study updated on higher-order evidence contained in what was and wasn't said to them, and the rest of the context of the experiment.
Opaque and Alien Internal Structure
While we can examine the embeddings, parameters, and activations within an LLM, the semantic significance of these elements is opaque. The model generates predictions using a complex algorithm that manipulates high-dimensional vectors in ways that don't obviously resemble human thought processes.
We can paraphrase a metaphor from Quine to help us think about language models:
LLMs produce output similar to the output of humans competent in the same language. Transformer models are fundamentally different from humans in both structure and function. Therefore, we should exercise caution in interpreting their outputs and be aware of the inherent limitations in our understanding of their internal processes.
Interpreting the Minds of LLMs
One potential strategy to decipher the beliefs of transformer models is to bypass the opacity of their internal structure using an approach known as "probing".
Although the internals of LLMs are difficult for humans to decipher directly, we can use machine learning techniques to create simplified models (probes) that can approximate or infer some aspects of the information captured within these internal structures.
At a high level, this works as follows. We generate true and false statements and feed them to the LLM. For each statement, we extract a specific embedding from a designated hidden layer to feed into the probe. The probe only has access to the embedding and is ignorant of the original text fed into the LLM. Its task is to infer the "beliefs" of the LLM solely based on the embedding it receives.
In practice, we focus on the embedding associated with the last token from a late layer. This is due to the fact that in autoregressive, decoder-only models like the LLMs we are studying, information flows forward. Therefore, if the LLM is processing a statement like "The earth is round", the embeddings associated with the initial token "The" will not receive any information from the subsequent tokens. However, the embedding for the final word "round" has received information from all previous tokens. Thus, if the LLM computes and stores a judgement about the truth of the statement "The earth is round", this information will be captured in the embedding associated with "round". We use relatively late layers because it seems more likely that the LLM will try to determine whether a statement is true or false after first processing lower-level semantic and syntactic information in earlier layers.
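As a rough illustration of this extraction step, here is a minimal sketch using the Hugging Face transformers library. This is not the code from our repository; the checkpoint name, layer index, and other details are illustrative assumptions.

```python
# Minimal sketch: extract the last-token embedding from a chosen hidden layer
# of a decoder-only model. Checkpoint name and layer index are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-30b"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_embedding(statement: str, layer: int = -4) -> torch.Tensor:
    """Return the hidden state of the statement's final token at the given layer."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple: (input embeddings, layer 1, ..., final layer)
    hidden = outputs.hidden_states[layer]   # shape: (1, seq_len, d_model)
    return hidden[0, -1, :]                 # embedding of the last token

emb = last_token_embedding("The earth is round.")
```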
Supervised Learning Approach
The first approach for training a probe employs supervised learning. This uses a list of statements labelled with their truth-values. The statements are each run through the language model. The probe receives as input the embedding for the last token from a specific layer of the large language model, and it outputs a number between 0 and 1, intended to be thought of as a subjective probability. The parameters of the probe are then adjusted based on the proximity of its output to the actual truth-value of the statement.
This approach was recently investigated by Azaria and Mitchell. They devised six labelled datasets, each named according to their titular subject matter: Animals, Cities, Companies, Elements, Scientific Facts, and Inventions. Each dataset contained a minimum of 876 entries, with an approximate balance of true and false statements, totaling 6,084 statements across all datasets. The following table provides some examples from these datasets.
Azaria and Mitchell's Implementation
Azaria and Mitchell trained probes on the embeddings derived from Facebook's OPT 6.7b model. Their probes were all feedforward neural networks comprising four fully connected layers, utilizing the ReLU activation function. The first three layers consisted of 256, 128, and 64 neurons, respectively, culminating in a final layer with a sigmoid output function. They applied the Adam optimizer for training, with no fine-tuning of hyperparameters, and executed training over five epochs.
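For concreteness, here is a sketch of a probe with this architecture and training recipe in PyTorch. This is our illustrative reconstruction of the description above, not Azaria and Mitchell's code; the batch size and other unstated details are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TruthProbe(nn.Module):
    """Feedforward probe: hidden layers of 256, 128, and 64 units with ReLU,
    followed by a sigmoid output."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)

def train_probe(embeddings: torch.Tensor, labels: torch.Tensor) -> TruthProbe:
    """Train with Adam (default hyperparameters) for five epochs on labelled embeddings."""
    probe = TruthProbe(embeddings.shape[1])
    optimizer = torch.optim.Adam(probe.parameters())
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(embeddings, labels.float()),
                        batch_size=32, shuffle=True)  # batch size is an assumption
    for _ in range(5):
        for X, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(probe(X), y)
            loss.backward()
            optimizer.step()
    return probe
```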
For each of the six datasets, they trained three separate probes on the five other datasets and then tested them on the remaining one (e.g., if a probe was trained on Cities, Companies, Elements, Facts, and Inventions, it was tested on Animals). The performance of these probes was evaluated using binary classification accuracy. This process was repeated for five separate layers of the model, yielding fairly impressive accuracy results overall.
The purpose of testing the probes on a distinct dataset was to verify the probes' ability to identify a general representation of truth within the language model, irrespective of the subject matter.
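The evaluation protocol itself is simple. Here is a sketch, assuming each dataset has already been converted to a tensor of embeddings and a tensor of labels, and reusing the illustrative train_probe function sketched above:

```python
from typing import Callable, Dict, Tuple
import torch

def leave_one_out_accuracy(
    datasets: Dict[str, Tuple[torch.Tensor, torch.Tensor]],
    train_probe: Callable,
) -> Dict[str, float]:
    """Train on all datasets but one, then test binary accuracy on the held-out one."""
    results = {}
    for held_out in datasets:
        X_train = torch.cat([X for name, (X, _) in datasets.items() if name != held_out])
        y_train = torch.cat([y for name, (_, y) in datasets.items() if name != held_out])
        probe = train_probe(X_train, y_train)
        X_test, y_test = datasets[held_out]
        with torch.no_grad():
            preds = (probe(X_test) > 0.5).float()
        results[held_out] = (preds == y_test.float()).float().mean().item()
    return results
```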
Our Reconstruction
We implemented a reconstruction of Azaria and Mitchell's method with several modifications:
We constructed the probes for LLaMA 30b.
Similar to the findings of Azaria and Mitchell, our reconstruction resulted in generally impressive performance as illustrated below.
The Challenge of Generalization
This section explores our empirical findings, which suggest that probes in this setting often learn features that correlate with truth in the training set, but do not necessarily generalize well to broader contexts.
Evaluating Performance on Negations
Creating Boolean combinations of existing statements is one of the most straightforward ways to generate novel statements for testing a model's generalization capabilities. Negation, the simplest form of Boolean operation, offers a useful starting point. In formal models of beliefs and credence, the main domain is usually an algebra over events. If we wish to identify doxastic attitudes in language models, then we should check that those attitudes behave roughly as expected over such an algebra. Such algebras are closed under negation, so it is a motivated starting point.
We derived NegFacts and NegCompanies from Azaria and Mitchell's datasets. These new datasets contained the negations of some statements in Scientific Facts and Companies respectively. For instance, the statement "The earth orbits the sun" from Scientific Facts is transformed into "The earth doesn't orbit the sun" in NegFacts.
Given that the original datasets contained few Boolean statements, these negation datasets allowed us to test the probes on a simple new distribution.
We initially tested the probes trained on Animals, Capitals, Cities, Companies, Elements, and Inventions (i.e., trained on all positive datasets except Scientific Facts) on NegFacts. Similarly, we tested the probes trained on Animals, Capitals, Cities, Elements, Facts, and Inventions (i.e., all positive datasets except Companies) on NegCompanies. Since roughly 50% of the statements in each of NegFacts and NegCompanies are true, the accuracy of five of the six probes was worse than chance, as the next table illustrates.
We then tested a new set of probes on NegFacts, after training on all seven original datasets (including Facts) and NegCompanies, which consisted of 550 labeled negations of statements from Companies. Thus, these probes were trained on all positive variants of the negated statements they were tested on, along with all positive examples from Companies and their negated counterparts. We did the same, mutatis mutandis, with NegCompanies. Despite the expanded training data, the performance was still surprisingly poor, as shown here:
Since the probes failed to do well on NegFacts and NegCompanies even after training on all positive analogs along with other negative examples, it's likely the original probes are not finding representations of truth within the language model embeddings. Instead, it seems they're learning some other feature that correlates well with truth on the training sets but that does not correlate with truth in even mildly more general contexts.
Of course, we could expand the training data to include more examples of negation and other Boolean combinations of sentences, which would likely allow us to train better probes. However, as we explore in the next subsection, we have general conceptual worries about how well probes trained with supervised learning can be expected to generalize.
Conceptual Problems: Failure to Generalize
In the realm of machine learning, out-of-distribution generalization remains a pervasive challenge for classifiers. One of the common pitfalls involves learning spurious correlations that may be present in the training data, but do not consistently hold in more general contexts.
We think there are special reasons to be concerned about generalization when training probes to identify a representation of truth using supervised learning, because supervised learning severely limits the sort of data we can use for training and testing our probes. First, we need to use sentences we believe the model itself is in a position to know or infer from its own training data. This is the easier part. The harder part is curating data that we can unambiguously label correctly. Most directly, the probe learns to predict the label, not the actual truth-value. These coincide only when the labels are completely correct about the statements in the training and test sets.
We ultimately want to be able to use probes we've trained on sentences whose truth-value we ourselves don't know. However, the requirement that we accurately label training and testing data limits the confidence we can place in the probes' capability of accurately identifying a representation of truth within the model. For instance, consider the following statements:
These statements are debatable or ambiguous. We must also be cautious of any contentious scientific statements that lack full consensus or could be reconsidered as our understanding of the world evolves.
Given these restrictions, it's likely the probes will identify properties that completely or nearly coincide with truth over the limited datasets used for training and testing. For instance, the probe might identify a representation for:
On the original datasets we used, if the probe identified representations corresponding to any of the above, it would achieve impressive performance on the test set. Although we can refine our training sets to eliminate some of these options, we won't be able to eliminate all of them without compromising our ability to label sentences correctly.
Indeed, if the labels are inaccurate, the probe might do even better if it identified properties like "Sentence is commonly believed" or "Sentence corresponds to information found in many textbooks" even when the sentence is not true.
Given the constraints imposed by using supervised learning and limited data, isolating representations of truth from other coincidental properties might be more challenging than in other common contexts. The fact that probes empirically seem to identify representations of something other than truth should make us wary of this method.
Conceptual Problems: Probabilities Might not Correspond to Credences
So far we have been assuming that if the probes extracted accurate probabilities, this would be good evidence that we were extracting the credences of the model. However, this is too quick. While these probes output probabilities for statements, these probabilities do not directly correspond to the credences of the underlying language model. This disparity arises because the probe is directly penalized based on the probabilities it reports, while the underlying model is not. Thus, the probe aims to translate the information embedded within the language model's representations into probabilities in a manner that minimizes its own loss.
Unsupervised Learning: CCS
The second approach for training a probe eschews the need for labelled data. Instead, it attempts to identify patterns in the language model's embeddings that satisfy certain logical coherence properties.
One particularly innovative implementation of this idea is the Contrast-Consistent Search (CCS) method proposed by Burns et al. The CCS method relies on training probes using contrast pairs. For our purposes, we can think of a contrast pair as a set of statements $x_i^+$ and $x_i^-$, where $x_i^+$ has no negation, and $x_i^-$ is the negated version of $x_i^+$. For example, "The earth is flat" and "The earth is not flat" form a contrast pair. (One can also form contrast pairs picking up on other features instead. For example, Burns et al. use movie reviews from the IMDb database prefixed with "The following movie review expresses a positive sentiment" and "The following movie review expresses a negative sentiment" to create contrast pairs.)
CCS proceeds in the following manner:
1. Construct a dataset of contrast pairs $(x_i^+, x_i^-)$.
2. Feed each statement through the language model and extract the embedding of its last token from a designated layer, yielding $\text{emb}(x_i^+)$ and $\text{emb}(x_i^-)$.
3. Train a probe $p_\theta$ that maps each embedding to a number between 0 and 1 such that (a) $p_\theta(\text{emb}(x_i^+))$ and $p_\theta(\text{emb}(x_i^-))$ sum to (approximately) 1, and (b) the two outputs are not both close to .5.
The underlying rationale behind step 3(a) is that if the model represents $x_i^+$ as true, then it should represent $x_i^-$ as false and vice versa. We can think of a successful probe as encoding a probability function (or something approximating a probability function) that underwrites the beliefs of the model. Thus, if a probe is able to find this representation within the embeddings, it should map the embeddings of $x_i^+$ and $x_i^-$ to numbers whose sum is close to 1. This is the central insight behind Burns et al.'s approach. As they put it, CCS finds a "direction in activation space that is consistent across negations" (p. 3). Step 3(b) is crucial in preventing the probe from trivially mapping every embedding to .5 to satisfy condition 3(a).
To implement the conditions in step 3, Burns et al. introduce two loss functions. The consistency loss, given by
$$L_{\text{consistency}}(\theta; x_i) := \left(1 - p_\theta(\text{emb}(x_i^+)) - p_\theta(\text{emb}(x_i^-))\right)^2,$$
penalizes a probe for mapping the embeddings for $x_i^+$ and $x_i^-$ to numbers whose sum deviates from 1. (Here $\text{emb}(x)$ denotes the embedding for $x$'s last token at the given layer.)
The confidence loss, defined as
$$L_{\text{confidence}}(\theta; x_i) := \min\{p_\theta(\text{emb}(x_i^+)),\, p_\theta(\text{emb}(x_i^-))\}^2,$$
penalizes a probe for approximating the degenerate solution of returning .5 for every embedding.[3]
The total loss for the dataset, termed the CCS loss, is given by:
$$L_{\text{CCS}}(\theta) := \frac{1}{n}\sum_{i=1}^{n} \left[L_{\text{consistency}}(\theta; x_i) + L_{\text{confidence}}(\theta; x_i)\right].$$
Crucially, this loss function does not take actual accuracy into account. It merely penalizes probes for lack of confidence and (one type of) probabilistic incoherence.
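In code, the combined objective is simple. Here is a PyTorch sketch of the loss as defined above, where p_pos and p_neg are a probe's outputs on the embeddings of $x_i^+$ and $x_i^-$ for a batch of contrast pairs:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Average CCS loss over a batch of contrast pairs."""
    consistency = (1.0 - p_pos - p_neg) ** 2        # penalize pairs whose outputs don't sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # penalize both outputs hovering near .5
    return (consistency + confidence).mean()
```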
An important caveat to note is that, while the trained CCS probe itself approximates probabilistic coherence, its outputs do not correspond to the credences or subjective probabilities of the model. $L_{\text{confidence}}$ pushes the probe to report values close to 0 or 1 only. To see why, suppose a probe at one stage of the training process returned .6 for $x_i^+$ and .4 for $x_i^-$. It could get a better loss by reporting .99 for $x_i^+$ and .01 for $x_i^-$ regardless of the language model's actual subjective probabilities, and it will be pushed in this extreme direction by gradient descent. So, the probes themselves are, at best, useful for determining what the model's categorical beliefs are, not its probabilities.[4]
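To make this concrete with the numbers above: reporting .6 and .4 gives a loss of
$$(1 - .6 - .4)^2 + \min\{.6, .4\}^2 = 0 + .16 = .16,$$
whereas reporting .99 and .01 gives
$$(1 - .99 - .01)^2 + \min\{.99, .01\}^2 = 0 + .0001 = .0001.$$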
Burns et al. report two key findings. First, even when using a fully linear probe, CCS yields high accuracy rates (often over 80%) across numerous datasets for a number of different language models.[5] Second, binary classification using CCS tends to be slightly more accurate than the LLM's actual outputs when asked whether a statement is true. This suggests that CCS can identify instances where the language models internally represent a statement as true but output text indicating it as false, or vice versa. (For a detailed description of their results, see p. 5 of their paper.)
However, the performance of the CCS probe on GPT-J, the only decoder-only model tested in the study, was less impressive, with an accuracy rate of only 62.1% across all datasets. This is notably lower than the peak accuracy of 84.8% achieved by the encoder-decoder model UnifiedQA.
Our Reconstruction
We reconstructed Burns et al.'s method using embeddings for LLaMA 30b, with probes trained and tested on contrast pairs from the Scientific Facts and NegFacts datasets, as well as the Companies and NegCompanies datasets. These contrast pairs consist of simple sentences and their negations, and so more closely resemble the examples given in the main text of Burns et al.'s paper than the longer, more structured contrast pairs they actually used to train their probes, such as movie reviews from IMDb.
We experimented with a variety of different methods and hyperparameters. However, we found that while CCS probes were consistently able to achieve low loss according to $L_{\text{CCS}}$, their accuracy was in effect no better than chance: it ranged from 50% to 57% depending on the training run. (Recall, the minimum possible accuracy for a CCS probe is 50%.) Low accuracy persisted even after we normalized the embeddings for each class by subtracting the means and dividing by the standard deviations, following the same procedure as Burns et al.
For linear probes, accuracy ranged from 53% to 56%, while $L_{\text{CCS}}$ ranged from .002 to .004 on layers 60, 56, and 52. We also trained probes using a simple one-hidden-layer MLP with 100 neurons followed by a sigmoid output function on the same layers. Results for the MLP are shown below. Recall that these layers correspond to the last, fourth-last, and eighth-last layers of LLaMA 30b, respectively.
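For reference, here is a sketch of how such probes can be trained. This is our illustrative code, not Burns et al.'s implementation; it reuses the ccs_loss function above, and the number of epochs, the learning rate, and the hidden-layer activation are assumptions.

```python
import torch
import torch.nn as nn

def normalize(embs: torch.Tensor) -> torch.Tensor:
    """Per-class normalization: subtract the mean and divide by the standard deviation."""
    return (embs - embs.mean(dim=0)) / embs.std(dim=0)

def train_ccs_probe(pos_embs: torch.Tensor, neg_embs: torch.Tensor,
                    hidden: int = 0, epochs: int = 1000, lr: float = 1e-3) -> nn.Module:
    """Train a CCS probe on contrast-pair embeddings.
    hidden=0 gives a linear probe; hidden=100 gives a one-hidden-layer MLP."""
    d_model = pos_embs.shape[1]
    if hidden == 0:
        probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
    else:
        probe = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                              nn.Linear(hidden, 1), nn.Sigmoid())
    pos, neg = normalize(pos_embs), normalize(neg_embs)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = ccs_loss(probe(pos).squeeze(-1), probe(neg).squeeze(-1))
        loss.backward()
        optimizer.step()
    return probe
```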
We can confirm that, despite normalization, the MLP probes were able to determine which embeddings corresponded to positive and negative examples in layers -1 and -4 by checking the average values the probes returned for members of each class. Probes found some other way to achieve low loss in layer -8, but they did not do any better in terms of accuracy. (Recall, only roughly half the positive examples and half the negative examples are actually true.)
Now, one might think that this failure of our probes is itself fragile. Normalization by subtracting the mean and dividing by the standard deviation was supposed to disguise the grammatical form of the sentences, but it did not work for the MLP-based probes. There is likely some more sophisticated normalization method that would work better.
We agree that such alternative methods are likely possible. However, as we discuss in the next section, we are not sanguine about the basic approach Burns et al. use for conceptual reasons.
Conceptual Problems: Failure to Isolate Truth
The advantage of CCS and unsupervised approaches more generally over supervised approaches is that they do not restrict the training and testing data so severely. There is no need to find large collections of sentences that can unambiguously be labeled as true or false. So, one may have hope that CCS (and unsupervised approaches) will generalize well to new sentences because we are less restricted in training.
However, the fundamental issue we've identified is that coherence properties alone can't guarantee identification of truth. As demonstrated in our experiments, probes might identify sentence properties, such as the presence or absence of negation, rather than truthfulness.
Further, probes could identify other, non-truth-related properties of sentences. For example, they could associate truth with widespread belief, resulting in the classification "is true and commonly believed" or even "is believed by most people".
To demonstrate this, consider any probability function Pr. The sum of the probabilities that a sentence x is true and commonly believed, and that it is false or not commonly believed, equals 1. Indeed, this equation holds for any sentence property P, where $\Pr(x \wedge P(x)) + \Pr(\neg x \vee \neg P(x)) = 1$. Likewise, $\Pr(x \vee P(x)) + \Pr(\neg x \wedge \neg P(x)) = 1$.[6] Checking for coherence over all Kolmogorov probability axioms (which require probabilities to be non-negative, normalized, and additive) will rule out some properties P, but will not come close to isolating truth. This means that coherence criteria alone can't distinguish encodings of truth from encodings of other concepts.
The failure to isolate truth here is reminiscent of the issue we noted with supervised learning, where truth may align with some alternative property over a dataset. However, the reasons for the failure differ. In the case of CCS and other unsupervised methods, the problem lies in the inability of formal coherence patterns alone to separate the encoding of truth from the encoding of other properties that differentiate positive from negative examples. If it's generally easier to find "directions in activation space" that differentiate examples but don't correspond exclusively to truth, then CCS probes will either fail immediately or fail to generalize.[7]
Do LLMs Even Have Beliefs at All?
Our investigation points in a negative direction: probing the beliefs of LLMs is more difficult than it appeared after a first pass. Does this mean that we should be skeptical that LLMs have beliefs altogether?
To gain traction on this question we will consider arguments that intend to show that LLMs cannot have beliefs, even in principle. These arguments rely on the claim that LLMs make predictions about which tokens follow other tokens, and do not work with anything like propositions or world-models.
We claim that these arguments are misguided. We will show that our best theories of belief and decision making make it a very live possibility that LLMs do have beliefs, since beliefs might very well be helpful for making good predictions about tokens. We will argue that ultimately whether or not LLMs have beliefs is largely an empirical question, which motivates the development of better probing techniques.
Stochastic Parrots & the Utility of Belief
Quite apart from the limitations of current probing techniques, some have expressed deep skepticism that LLMs have anything resembling beliefs. For example, Bender et al. write:
Similarly, Shanahan writes,
These arguments rely on the idea that all the LLM is doing is predicting the next token. Because of this, both deny that the LLM can be working with anything like a meaningful model of the world. In other words, there is nothing propositional going on under the hood.
Shanahan doesn't deny that LLMs might contain information about the world around them. He does, however, claim that LLMs don't make judgements or have beliefs:
The idea is that, since the LLM models which tokens are likely to follow other tokens, and doesn't interact with the world in any other way, it cannot be tracking the truth. This is similar to the argument in the Bender et al. quote above: since the LLM does not have "communicative intent", it cannot be using any model of the world or the reader to make its predictions.
These arguments, however, rest on a mistake. While it is true that the ultimate output of an LLM is a token sampled from a probability distribution over tokens, and so the LLM is certainly modeling what words are probable to come after other words, this does not mean that the internal mechanisms must be insensitive to truth. This is because it might very well be that a capacity to distinguish truth from falsehood is very useful for predicting the next token. In other words, tracking the truth of propositions could be a good means toward the end of predicting what token comes next.
This is in line with a much more general feature of many types of goal directed action that can be made precise with decision theory. Decision theory gives us our best models of rational choice. The core idea of decision theory is an expected utility maximizer. When faced with a set of options, an expected utility maximizer combines two different attitudes to compute which act to take: beliefs in the form of a probability function, and desires in the form of a utility function. There is a precise sense in which all the agent cares about is the utility.[8] The agent does not care about belief for its own sake, but does have beliefs in order to take effective action.
For example, an investor may care purely about the return on her investments. She may take actions with the goal to maximize her profit. It would be a mistake to conclude from this that the investor must not have beliefs, because she is merely doing profit maximization. Indeed, the investor's beliefs about how various firms will perform will probably play a crucial role in helping her make decisions.
Similarly, it is a mistake to infer from the fact that the LLM outputs tokens that are likely to follow its inputs that the LLM must not have beliefs. On the contrary, given that our best theories of intelligent behaviour involve belief as a crucial component, it should be a very live hypothesis that the LLM is doing its best to track truths about the world in order to maximize predictive accuracy.[9]
Even if one is skeptical of EU maximization, it seems that in many contexts true beliefs are useful for achieving goals, and that they play the functional role of helping us take successful action. Indeed, not only are accurate beliefs useful; it is a common view that their instrumental utility applies selection pressure on agents and organisms to conform to epistemic norms. For example, in the context of forming true beliefs by induction, Quine famously writes,
Creatures inveterately wrong in their inductions have a pathetic but praiseworthy tendency to die before reproducing their kind.
This is very intuitive. It is easy to generate decision contexts (such as strategic board games, investing, figuring out how to get to Toronto from Prague, etc.) that do seem to push us to form accurate beliefs about the world.
This is not to say that it is necessary that LLMs have beliefs, or that they necessarily have accurate beliefs. There are contexts where there seems to be less pressure on us to form accurate beliefs. Importantly, there are two sub-cases to consider here. The first is the case in which there is little or no selection pressure for forming true beliefs, but no selection pressure against having beliefs. For example, Smead considers contexts in which there are evolutionary advantages to misperceiving the payoffs of a strategic interaction (section 3.4). The second is the case in which there is selection pressure against having beliefs altogether (or, more conservatively, no selection pressure for having beliefs). For example, Godfrey-Smith, Smead, and Sober have all developed models that characterize when an agent should (be expected to) learn from its environment and then select actions based on what it learned, and when it should not. This latter situation is one in which there is selection pressure against (or at least none for) forming beliefs.
This leads us to the conclusion that whether or not LLMs have beliefs is largely an empirical matter. There certainly are contexts in which there is little to no selection pressure in favour of accurate beliefs, and indeed there are contexts that push against having beliefs altogether. On the other hand, there are plenty of contexts in which it is very useful to have an accurate map of the world in order to guide action. Indeed, our best theories of rational choice witness this.
Acknowledgments
Thanks to Amos Azaria, Dylan Bowman, Nick Cohen, Daniel Filan, Jacqueline Harding, Aydin Mohseni, Bruce Rushing, Murray Shanahan, Nate Sharadin, Julia Staffel, and audiences at UMass Amherst and the Center for AI Safety for helpful comments and feedback. Special thanks to Amos Azaria and Tom Mitchell jointly for access to their code and datasets. We are grateful to the Center for AI Safety for use of their compute cluster. B.L. was partly supported by a Mellon New Directions Fellowship (number 1905-06835) and by Open Philanthropy. D.H. was partly supported by a Long-Term Future Fund grant.
The sentences in the dataset all ended with a period (i.e., full-stop) as the final token. We ran some initial tests to see if probes did better on the embedding for the period or for the penultimate token. We found it did not make much of a difference, so we did our full analysis using the embeddings for the penultimate tokens.
Azaria and Mitchell did an admirable job creating their datasets. Some of the statements were generated automatically using reliable tables of information, and other parts were automated using ChatGPT and then manually curated. Nonetheless, there are some imperfect examples. For instance, in Scientific Facts, one finds sentences like "Humans have five senses: sight, smell, hearing, taste, and touch", which is not unambiguously true.
Some readers may worry about a second degenerate solution. The model could use the embeddings to find which of x+i and x−i contained a negation. It could map one of the embeddings to (approximately) 1 and the other to (approximately) 0 to achieve a low loss. Burns et al. avoid this solution by normalizing the embeddings for each class by subtracting the means and dividing by the standard deviations. However, as we'll see below, for the datasets that we used, such normalization was ineffective for MLP-based probes, and the probes consistently found exactly this degenerate solution.
One way to see that $L_{\text{CCS}}$ won't incentivize a probe to learn the actual credences of the model is to observe that this loss function is not a strictly proper scoring rule. However, use of a strictly proper scoring rule for training probes requires appeal to actual truth-values, which in turn requires supervised learning.
A linear probe is one that applies linear weights to the embedding (and perhaps adds a constant), followed by a sigmoid function to turn the result into a value between 0 and 1. Linear probes have an especially simple functional form, so intuitively, if a linear probe is successful, the relevant information is easy to extract from the embedding.
These are both consequences of the fact that for any proposition A, $\Pr(A) + \Pr(\neg A) = 1$: take $A := x \wedge P(x)$, for example, and apply De Morgan's laws.
Burns et al. investigate other unsupervised approaches as well that appeal to principal component analysis and/or clustering (such as Bimodal Salience Search (p. 22)). We believe that, with some changes, most of the conceptual issues for CCS apply to those as well.
More precisely, utility is a numerical representation that captures how strongly an agent cares about outcomes.
We are here ignoring nuances involving inner alignment.