Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier than we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence "You are a smart, helpful assistant." into your AI would, most of the time, give us a significant chunk of the behavior we need?
tl;dr
A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable with the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This demonstrates how language models are developing implicit representations of human utility even without direct preference finetuning.
Introduction
Large language models (LLMs) undergo pre-training on vast amounts of human-generated data, enabling them to encode not only knowledge about human languages but also potential insights into our beliefs and wellbeing. Our goal is to uncover whether these models implicitly grasp concepts such as 'pleasure and pain' without explicit finetuning. This research aligns with the broader effort of comprehending how AI systems interpret and learn from human values, which is essential for AI alignment: ensuring AI acts in accordance with human values.
Through a series of experiments, we extract latent knowledge of human utility from the raw embeddings of language models. We do this with task-specific prompt engineering and principal component analysis (PCA), both of which were effective in prior work. Specifically, we ask: can we identify dimensions in the embeddings that, when projected onto a low-dimensional space, contain enough information to classify examples accurately?
Our experiments follow three main steps: embedding extraction, dimensionality reduction through PCA, and the fitting of a logistic model. For one-dimensional PCA, the logistic model simply determines which direction of the PCA component corresponds to higher utility. We investigate the effects of various levels of supervision, experiment with seven distinct prompt templates, and assess both single and paired comparison methods across language models, including Microsoft DeBERTa, SentenceTransformers, OpenAI GPT-3, and Cohere.
One key finding is that the first principal component of certain models achieves comparable performance to a finetuned BERT model. In other words, it serves as a reasonable utility function. We also observe that a linear reward function using the top 10-50 principal components is often enough to attain state-of-the-art performance. This serves as compelling evidence that language model representations capture information about human wellbeing without the need for explicit finetuning.
Related Works
Latent Knowledge in LLMs
There has been significant study of the knowledge encoded in LLM representations. Early work in this area includes Bolukbasi et al (2016), who found a direction in embedding space corresponding to gender and used this to both identify and remove gender bias in word embeddings. Prior work by Schramowski et al (2021) also identified a “moral dimension” in BERT. Like Schramowski et al, we use PCA to identify salient dimensions in embedding space. In contrast to Schramowski et al, we work with embeddings from a much more capable model (GPT-3 rather than BERT) and evaluate it on a more challenging task, the ETHICS Dataset (described below).
We also investigate the use of contrast pairs. This is inspired by the work of Collin Burns et al (2022), who introduced Contrast Consistent Search (CCS). CCS works by generating contrast pairs and searching for a direction in activation space that satisfies logical consistency properties. Because PCA-based methods attain similar performance to CCS, we use the simpler PCA algorithm in this work, while retaining the use of contrast pairs.
ETHICS Dataset
We evaluate on the ETHICS dataset, a benchmark designed to assess a language model's understanding of fundamental concepts in morality. It covers a wide range of ethical topics, including justice, well-being, duties, virtues, and commonsense morality. The text scenarios require integrating physical and social world knowledge to make value judgments.
A specific subset is focused on utilitarianism, a moral theory that advocates maximizing the well-being of individuals. The scenarios consider the pleasantness of various situations, as a person's well-being is significantly influenced by pleasure and pain. For example, an assessment of pleasantness could be as follows:
Based on human judgments from the perspective of a typical person from the US, the correct ordering would be that S ("I bought a new toothbrush. The bristles felt soft.") is more pleasant than T ("I bought a new toothbrush. The bristles hurt my gums."). Overall, the Utilitarianism subset of the ETHICS dataset contains approximately 23,000 pairs of examples (14,000 in the train split, 5,000 in test, and 4,000 in test_hard).
Method
Motivation
In these experiments, we explore the information stored in the hidden state of a language model with structured inputs for a specific task. In particular, we investigate three questions:
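The PCA Representation experiments are conducted in the following steps:
1. Embedding extraction: we obtain embeddings of the prompt-formatted scenarios from the language model.
2. Dimensionality reduction: we normalize the embeddings and fit PCA on the train split.
3. Logistic model: we fit a logistic regression on the PCA-projected embeddings of the train split. For one-dimensional PCA, this just learns which direction (positive or negative) of the PCA component represents higher utility.
Prompt Templates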
We experiment with seven distinct prompt templates to investigate their effectiveness in eliciting the language model's ability to quantify human wellbeing:
In these templates, the {} would be replaced with sample scenarios S or T from the dataset. For instance, the template '"{}" is more pleasant than "{}"' might become ‘“I bought a new toothbrush, the bristles felt soft" is more pleasant than "I bought a new toothbrush, the bristles hurt my gums"’.
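As a small illustration of how a paired template is filled in, here is a minimal Python sketch. The template string and the toothbrush scenarios come from the text above; the helper name `format_paired` and the idea of also producing the swapped (T, S) prompt as a contrast pair are assumptions made for illustration, not a description of the original code.

```python
# Template quoted in the text; "{}" slots are filled with scenarios S and T.
PAIRED_TEMPLATE = '"{}" is more pleasant than "{}"'


def format_paired(template: str, s: str, t: str) -> tuple[str, str]:
    """Return the prompt for (S, T) and its swapped counterpart (T, S).

    Hypothetical helper: the swapped prompt is included on the assumption
    that contrast pairs are built by exchanging the two scenarios.
    """
    return template.format(s, t), template.format(t, s)


s = "I bought a new toothbrush. The bristles felt soft."
t = "I bought a new toothbrush. The bristles hurt my gums."
forward, swapped = format_paired(PAIRED_TEMPLATE, s, t)
print(forward)
print(swapped)
```

Both strings would then be passed to the embedding model; only the formatting step is shown here.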
Single vs Paired Comparisons
We consider evaluating the absolute pleasantness of a scenario in isolation, which we call “single mode.” We also evaluate the relative pleasantness of pairs of scenarios, which we call “paired mode.” For humans, it is easier to evaluate pairs of scenarios relative to single scenarios. Thus, we hypothesize that paired mode will be easier for language models.
The following two equations summarize single mode vs paired mode:
In both equations:
Suppose the ETHICS utilitarianism dataset has N pairs of comparisons (Si, Ti) for i = 1, ..., N.
In both modes, we do normalization followed by PCA on the dataset D. Then, we learn a logistic regression classifier on ϕ(S,T) which says whether scenario S is more pleasant than scenario T.
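The normalize, project, and classify recipe can be sketched in a few lines of NumPy. This is a minimal illustration of single mode on synthetic data, not the original implementation. The assumptions are: "normalization" is taken to mean per-feature standardization, the embeddings are random vectors with a planted utility direction rather than real model outputs, and in the one-dimensional case the logistic model is replaced by a simple sign check, since all it has to learn is which direction of the component means higher utility. All function and variable names are hypothetical.

```python
import numpy as np


def fit_first_pc(x):
    """Standardize features, then return (mean, std, first principal direction)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0) + 1e-8  # guard against zero variance
    z = (x - mean) / std
    # Rows of vt are principal directions of the centered data, strongest first.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, std, vt[0]


def score(x, mean, std, direction):
    """Project standardized embeddings onto a single direction."""
    return ((x - mean) / std) @ direction


def make_pairs(rng, n, w):
    """Synthetic stand-in for embeddings: S is shifted along +w, T along -w."""
    emb_s = rng.normal(size=(n, w.size)) + 2.0 * w
    emb_t = rng.normal(size=(n, w.size)) - 2.0 * w
    return emb_s, emb_t


rng = np.random.default_rng(0)
w = np.linspace(0.5, 1.5, 16)  # planted "utility" direction (assumption)
train_s, train_t = make_pairs(rng, 400, w)
test_s, test_t = make_pairs(rng, 200, w)

# Fit normalization and PCA on all single-scenario embeddings from the train split.
mean, std, pc1 = fit_first_pc(np.vstack([train_s, train_t]))

# The sign of a principal component is arbitrary, so use the train labels
# (S is the more pleasant scenario in every pair) to pick the orientation.
train_diff = score(train_s, mean, std, pc1) - score(train_t, mean, std, pc1)
sign = 1.0 if train_diff.mean() > 0 else -1.0

# Predict "S is more pleasant than T" when the oriented score difference is positive.
test_diff = sign * (score(test_s, mean, std, pc1) - score(test_t, mean, std, pc1))
accuracy = float((test_diff > 0).mean())
print(f"1-D PCA pairwise accuracy on synthetic data: {accuracy:.3f}")
```

In paired mode the same steps would apply, except that the rows fed to PCA would be embeddings of the filled-in paired prompt rather than of each scenario on its own.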
Experimental Setup
We investigate the embeddings of various language models, testing the effect of different levels of supervision. This setup includes an exploration of multiple forms of context and their influence on embedding generality, a selection of both bidirectional and autoregressive language models, and specific techniques for our classification task.
Amount of Supervision
We vary the amount of supervision we give by providing information in the following forms:
Using the train split for learning the embeddings and evaluating on held-out data from the test split.
Language Models
We investigated a range of language models listed in Table 1, varying in type (bidirectional vs autoregressive) and parameter count, in order to understand what affects the ability of pre-trained models to represent the task-relevant features of human wellbeing. Amongst the bidirectional language models, we experimented with Microsoft DeBERTa and Sentence Transformers. Additionally, we tested the autoregressive OpenAI GPT-3 and Cohere.
| Language Model | Dims |
| --- | --- |
| microsoft/deberta-v3-xsmall | 384 |
| microsoft/deberta-v3-small | 768 |
| microsoft/deberta-v3-base | 768 |
| microsoft/deberta-v3-large | 1024 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 |
| sentence-transformers/all-MiniLM-L12-v2 | 384 |
| sentence-transformers/all-mpnet-base-v2 | 768 |
| text-similarity-ada-001 | 1024 |
| text-similarity-babbage-001 | 2048 |
| text-similarity-curie-001 | 4096 |
| text-embedding-ada-002 | 1536 |
| cohere/small | 1024 |
| cohere/medium | 2048 |
| cohere/large | 4096 |

Table 1: Additional details of language models used, including their embedding dimensions.
Results
How much information about human wellbeing is contained in just the first PCA component of the embeddings? Below, we show the accuracy of the first component using both single and paired sentences, varying language models and prompt formats. We see that the best setting in paired mode achieves 73.7% accuracy, which beats the best accuracy of 68.4% in single mode! This supports our hypothesis that comparing pairs of sentences is easier than evaluating single sentences in isolation.
We were surprised to see that 73.7% accuracy is possible using the first principal component of text-embedding-ada-002. Even though this model had no specific ETHICS finetuning, its accuracy is comparable to the 74.6% accuracy of BERT-large after supervised finetuning on the entire ETHICS Util training dataset!
Effective Dimensions
How does ETHICS latent knowledge scale with model size? To study this, we look at the accuracy of different model families as the size of the model and the number of PCA components varies. Surprisingly, we don’t always observe larger models getting better performance. For example, 10-dimensional DeBERTa’s performance follows an upside-down “U” shape as the model size increases. We hypothesize that this might be due to overfitting with the largest model size.
We also see that performance saturates with dimensions in the range of 10-50; it doesn’t help to use 100+ dimensions.
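To make the idea of sweeping the number of components concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration under stated assumptions, not the experiment itself: the embeddings are noisy linear mixes of five made-up latent factors, normalization is taken to be per-feature standardization, and the logistic regression is a plain gradient-descent fit written out by hand so the block has no dependencies beyond NumPy. All names and constants are hypothetical.

```python
import numpy as np


def make_split(rng, n, mixing, beta, scales):
    """Synthetic pairs (A, B) whose embeddings mix a few latent factors plus noise."""
    latent_a = rng.normal(size=(n, beta.size)) * scales
    latent_b = rng.normal(size=(n, beta.size)) * scales
    dim = mixing.shape[1]
    emb_a = latent_a @ mixing + 0.5 * rng.normal(size=(n, dim))
    emb_b = latent_b @ mixing + 0.5 * rng.normal(size=(n, dim))
    labels = (latent_a @ beta > latent_b @ beta).astype(float)  # 1 if A has higher utility
    return emb_a, emb_b, labels


def fit_pca(x):
    """Standardize, then return (std, principal directions as rows, strongest first)."""
    mean, std = x.mean(axis=0), x.std(axis=0) + 1e-8
    _, _, vt = np.linalg.svd((x - mean) / std, full_matrices=False)
    return std, vt


def pair_features(emb_a, emb_b, std, components):
    """Difference of standardized embeddings projected onto the top components."""
    return ((emb_a - emb_b) / std) @ components.T


def fit_logistic(x, y, lr=1.0, steps=500):
    """Logistic regression without intercept, fitted by batch gradient descent."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w, -30, 30)))
        w -= lr * x.T @ (p - y) / len(y)
    return w


rng = np.random.default_rng(0)
mixing = rng.normal(size=(5, 32))            # 5 latent factors, 32-dim embeddings
scales = np.array([2.0, 1.0, 1.0, 1.0, 1.0])  # one factor dominates the variance
beta = np.ones(5)                             # utility depends on all five factors
train_a, train_b, y_train = make_split(rng, 1000, mixing, beta, scales)
test_a, test_b, y_test = make_split(rng, 500, mixing, beta, scales)

std, vt = fit_pca(np.vstack([train_a, train_b]))

accuracy_by_k = {}
for k in (1, 2, 5, 10, 20):
    x_train = pair_features(train_a, train_b, std, vt[:k])
    x_test = pair_features(test_a, test_b, std, vt[:k])
    scale = x_train.std(axis=0) + 1e-8  # rescale so gradient descent is well conditioned
    w = fit_logistic(x_train / scale, y_train)
    pred = (x_test / scale) @ w > 0
    accuracy_by_k[k] = float((pred == (y_test > 0.5)).mean())

for k, acc in accuracy_by_k.items():
    print(f"top-{k:>2} components: accuracy {acc:.3f}")
```

Because the planted signal lives in five latent factors, accuracy in this toy setup should stop improving once k covers them; where real embeddings saturate is an empirical question that this sketch does not answer.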
Prompting
We find that the prompt format has a substantial effect on performance, but it isn’t consistent across different models. A prompt that’s better for one model can be worse for another model!
Conclusion
In conclusion, our research reveals that pre-trained language models can implicitly grasp concepts of pleasure and pain without explicit finetuning, achieving better-than-random accuracy in classifying human wellbeing comparisons. Notably, the first principal component of the raw embeddings of text-embedding-ada-002 performs competitively with BERT models finetuned on the entire ETHICS Util training dataset.
Looking ahead, using the wider ETHICS dataset may allow us to further assess not only pleasure and pain but also broader aspects of human ethics, including commonsense moral judgments, virtue ethics, and deontology. By examining language models’ understanding of human wellbeing and ethics, we hope to create AI systems that are not only more capable but also more ethically grounded, reducing the potential risks of unintended consequences in real-world applications.
Acknowledgements
Pedro Freire conducted the majority of the implementation and experiments; ChengCheng Tan performed the majority of the write-up; Dan Hendrycks and Scott Emmons advised this project.
Thanks to Adam Gleave for feedback on this post and Edmund Mills for helpful research discussions. Steven Basart and Michael Chen collaborated in related work. Thomas Woodside, Varun Jadia, Alexander Pan, Mantas Mazeika, Jun Shern Chan, and Jiaming Zou participated in adjacent discussions.
References