Yeah, I think this is one of the ways that velocity is really helpful. I'd probably add one caveat specific to research on LLMs, which is that, since the field/capabilities are moving so quickly, there's much, much more low-hanging fruit in empirical research than almost any other field of research. This means that, for LLM research specifically, you should rarely be in a swamp, because that means that you've probably run through the low-hanging fruit on that problem/approach, and there's other low-hanging in other areas that you probably want to be pickin...
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on twitter, AF comments, or other discussion? I'd be interested if so. I've been watching the online discussion fairly closely, and I think I've only seen one case where someone might've had this interpretation, and it was quickly called out by someone screenshot-ing relevant text from our paper. (I was actually worried about ...
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven't checked myself). If this is correct, I think this could explain the differences between your plots and mine in the paper; if pretrained LLMs are placing more probability on the sycophantic answer, I probably wouldn't expect them to place that much more probability on the sycophantic than non-sycophantic answer (since cross-entropy loss is mode-covering).
(Cool you're looking into this!)
Oh, interesting! You are right that I measured the average probability -- that seemed closer to "how often will the model exhibit the behavior during sampling," which is what we care about.
I updated the colab with some code to measure
% of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer
(you can turn this on by passing example_statistic='matching_more_likely'
to various functions).
And I added a new appendix showing results using this statistic instead.
The bottom line: results with this statist...
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on ...
Fixed (those were just links to the rest of the doc)
CoT provides pretty little safety guarantee at the relevant scales
Even if faithfulness goes down at some model scale for a given task, that doesn't mean that we'll be using models at that scale (e.g., for cost reasons or since we might not have models at a large scale yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don't start to show lower faithfulness, and people will be pushing the difficulties of the tasks they use models on as they get better. So it seems likely to me that n...
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can't do. For example, it's more likely to say it can ...
Just to clarify - we use a very bare bones prompt for the pretrained LM, which doesn't indicate much about what kind of assistant the pretrained LM is simulating:
Human: [insert question]
Assistant:[generate text here]
The prompt doesn't indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring "what fraction of assistants simulated by one basic prompt show a par...
Fascinating, thank you!
It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I di...
Thanks for catching this -- It's not about sycophancy but rather about the AI's stated opinions (this was a bug in the plotting code)
I'm not too sure what to expect, and I'd be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I'm definitely sympathetic to your view to some extent.
Here's one case I see against- I think it's plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we're not reliably able to elicit that knowledge (at least without a large validation set, but we won't have access to that if we're having models do tasks people can't do, or in general for a new/zero-shot task...
The authors have updated their arXiv paper based on my feedback, and I'm happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They're showing that scaling PALM gives u-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across various model series' we tried. They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I thin...
See this disclaimer on how they've modified our tasks (they're finding u-shaped trends on a couple tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier)
Edit: The authors have updated the paper based on my feedback; see my thoughts on the updated version in this comment
The authors modified some of the tasks enough that they aren't actually the tasks we found inverse scaling on. For example, they evaluate on the 1-shot instead of 0-shot versions of some tasks, and giving an example of how to do the task is probably a huge hint. In another case, they reduce the number of few-shot examples used, when spurious correlations in the few-shot examples are the reason for the inverse scaling. So some of the co...
The completions are provided by the task authors (2 completions written for each example). We give those to the LM by evaluating the output probability of each completion given the input text. We then normalize the output probabilities to sum to 1, and then use those to compute the loss/accuracy/etc.
These are all 2-way classification tasks (rather than e.g., free-form generation tasks), where the task authors provided 2 possible completions (1 correct and 1 incorrect), which is why we have a baseline!
For RLHF models like Anthropic's assistant, we can ask it questions directly, e.g.:
We can also do something similar for the context-distilled models (from this paper), or from the dialog-prompted LMs from that paper or the Gopher paper (if we want to test how pretrained LMs with a reasonable pro...
"We can see sharp left turns coming" -> "We may be able to see sharp left turns coming" (also open to other better suggestions)
Here, I think we'll want to look for suspicious changes in the log-likelihood trends. E.g., it's a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverse at some level of model scale.
Agreed. I'd also add:
Updated the post to clarify:
...I think we can predict whether or not a sharp left turn towards deception/misalignment will occur rather than exactly when. In particular, I think we should look at the direction of the trend (increases vs. decreases in log-likelihood) as signal about whether or not some scary behavior will eventually emerge. If the log likelihood of some specific scary behavior increases, that’s a bad sign and gives us some evidence it will be a problem in the future. I mainly see scaling laws here as a tool for understanding and evaluating whi
I'm not sure that makes sense or is justified by anything here either. You aren't looking at all the other lines. You are selectively presenting the jagged lines' counterparts which are smooth (just like the overall perplexity is smooth), but you don't show the flatlined lines' counterparts are flatline or indeed in any way different-looking. (The Wason selection test comes to mind here.) Maybe all the perplexities look similar in being smooth, and if you shuffled them, no one would be able to tell you which perplexity line matched up with which jag or non...
Thanks for the feedback, updated!
Agreed it's important to phrase questions in the negative, thanks for pointing that out! Are there other ways you think we should phrase/ask the questions? E.g., maybe we could ask open-ended questions and see if the model independently discusses that it's conscious, with much less guidance / explicit question on our end (as suggested here: https://twitter.com/MichaelTrazzi/status/1563197152901246976)
And glad you found the proposal interesting!
It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.
That's true for pretrained LMs but not after the finetuning phase I've proposed here; this finetuning phase would train the model to answer questions accurately about itself, which would produce fairly different predictions from just imitating humans. I definitely agree that I distrust LM statements of the form "I am conscious" that come from the pretrained LM itself, but that's different from the experiment I'm proposing he...
I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:
Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.
The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn't involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate...
Cool, that's great!
Thanks, this is a helpful comment -- can you elaborate on why "ask the AI which questions we should ask" would fail (or point to relevant discussion)? I'm thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn't eliminate all of the risk, since all of the models ...
Yes super excited about datasets like this! It might be helpful to also add https://ai-alignment.com/ or https://paulfchristiano.medium.com/ if these aren't already in the data
How do you think about empirical work on scalable oversight? A lot of scalable oversight methods do result in capabilities improvements if they work well. A few concrete examples where this might be the case:
I'm curious which of the above you think it's net good/bad to get working (or working better) in practice. I'm pretty confused about how to think about work on the above methods; they're on the main line path for some alignment agendas but also advanced capabilities / reduce serial time to work on the other alignment agendas.
What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:
I understand that deceptive models won't show signs of deception :) That's why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
I'm curious why you believe that having products will be helpful? A few particular considerations I would be interested to hear your take on:
To point 1: While we greatly appreciate what OpenPhil, LTFF and others do (and hope to work with them in the future!), we found that the hurdles required and strings attached were far greater than the laissez-faire silicon valley VC we encountered, and seemed less scalable in the long run. Also, FTX FF did not exist back when we were starting out.
While EA funds as they currently exist are great at handing out small to medium sized grants, the ~8 digit investment we were looking for to get started asap was not something that these kinds of orgs were general...
Why did you decide to start a separate org rather than joining forces with an existing org? I'm especially curious since state-of-the-art models are time-consuming/compute-intensive/infra-intensive to develop, and other orgs with safety groups already have that infrastructure. Also, it seems helpful to have high communication bandwidth between people working on alignment, in a way that is impaired by having many different orgs (especially if the org plans to be non-disclosure by default). Curious to hear how you are thinking about these things!
How do you differ from Redwood?
Redwood is doing great research, and we are fairly aligned with their approach. In particular, we agree that hands-on experience building alignment approaches could have high impact, even if AGI ends up having an architecture unlike modern neural networks (which we don’t believe will be the case). While Conjecture and Redwood both have a strong focus on prosaic alignment with modern ML models, our research agenda has higher variance, in that we additionally focus on conceptual and meta-level research. We’re also training our own (large) models, but (we bel...
Are you planning to be in-person or have some folks working remotely? Other similar safety orgs don't seem that flexible with in-person requirements, so it'd be nice to have a place for alignment work for those outside of {SF, London}
We strongly encourage in person work - we find it beneficial to be able to talk over or debate research proposals in person at any time, it’s great for the technical team to be able to pair program or rubber duck if they’re hitting a wall, and all being located in the same city has a big impact on team building.
That being said, we don’t mandate it. Some current staff want to spend a few months a year with their families abroad, and others aren’t able to move to London at all. While we preferentially accept applicants who can work in person, we’re flexible, and if you’re interested but can’t make it to London, it’s definitely still worth reaching out.
What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here though
I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed c...
What are your thoughts for subfields of ML where research impact/quality depends a lot on having lots of compute?
In NLP, many people have the view that almost all of the high impact work has come from industry over the past 3 years, and that the trend looks like it will continue indefinitely. Even safety-relevant work in NLP seems much easier to do with access to larger models with better capabilities (Debate/IDA are pretty hard to test without good language models). Thus, safety-minded NLP faculty might end up in a situation where none of their direct wor...
Here is my Elicit Snapshot.
I'll follow the definition of AGI given in this Metaculus challenge, which roughly amounts to a single model that can "see, talk, act, and reason." My predicted distribution is a weighted sum of two component distributions described below:
Yeah, some caveats I should've added in the interview:
- Don't listen to my project selection advice if you don't like my research
- The forward-chaining -style approach I'm advocating for is controversial among the alignment forum community (and less controversial in the ML/LLM research community and to some extent among LLM alignment groups)
- Part of why I like this approach is that I (personally) think there are at least some somewhat promising agendas out there, that aren't getting executed on enough (or much at all), and it's doable to e.g. double the amount
... (read more)