AI ALIGNMENT FORUM
AF

All of William_S's Comments + Replies

Daniel Kokotajlo's Shortform

William Saunders

Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis

Daniel Kokotajlo's Shortform

William Saunders

I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently might be easier

2Daniel Kokotajlo20d

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

1William Saunders20d

Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis

Daniel Kokotajlo's Shortform

William Saunders

How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.

2Daniel Kokotajlo20d

I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it's having trouble solving a problem, it'll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can't offload the job to someone else's datacenters.) That's annoying and distracts from all the important things your engineers are doing.

Buck's Shortform

William Saunders

Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

Buck's Shortform

William Saunders

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error. Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes

Based on the current vibes, I think that suggest that methodological errors alone will lead to significant chance of significant error for any safety case in AI.

Buck's Shortform

William Saunders

IMO it's unlikely that we're ever going to have a safety case that's as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practi... (read more)

2Buck Shlegeris9mo

I agree that "CEO vibes about whether enough mitigation has been done" seems pretty unacceptable. I agree that in practice, labs probably will go forward with deployments that have >5% probability of disaster; I think it's pretty plausible that the lab will be under extreme external pressure (e.g. some other country is building a robot army that's a month away from being ready to invade) that causes me to agree with the lab's choice to do such an objectively risky deployment.

3William Saunders9mo

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error. Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes Based on the current vibes, I think that suggest that methodological errors alone will lead to significant chance of significant error for any safety case in AI.

William_S's Shortform

William Saunders

No comment.

Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?

(I would guess a "no comment" or lack of response or something to that degree implies a "yes" with reasonably high probability. Also, you might be interested in this link about the U.S. labor board deciding that NDA's offered during severance agreements that cover the existence of the NDA itself have been ruled unlawful by the National Labor Relations Board when deciding how to respond here)

William_S's Shortform

William Saunders

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)

16

10

3

2

Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?

Robustness of Model-Graded Evaluations and Automated Interpretability

William Saunders

2y32

Re: hidden messages in neuron explanations, yes it seems like a possible problem. A way to try to avoid this is to train the simulator model to imitate what a human would say given the explanation. A human would ignore the coded message, and so the trained simulator model should also ignore the coded message. (this maybe doesn't account for adversarial attacks on the trained simulator model, so might need ordinary adversarial robustness methods).

Does seem like if you ever catch your interpretability assistant trying to hide messages, you should stop and try to figure out what is going on, and that might be sufficient evidence of deception.

William_S's Shortform

William Saunders

2y30

From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.

However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, ... (read more)

Oversight Misses 100% of Thoughts The AI Does Not Think

William Saunders

3y21

So I do think you can get feedback on the related question of "can you write a critique of this action that makes us think we wouldn't be happy with the outcomes" as you can give a reward of 1 if you're unhappy with the outcomes after seeing the critique, 0 otherwise.

And this alone isn't sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn't be happy with the outcome, which is then where you'd need to get into recursive evaluation or debate or something. But this feels like "hard but potentially tractable pr... (read more)

Oversight Misses 100% of Thoughts The AI Does Not Think

William Saunders

3y31

If we can't get the AI to answer something like "If we take the action you just proposed, will we be happy with the outcomes?", why can we get it to also answer the question of "how do you design a fusion power generator?" to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn't actually work?

3Ramana Kumar3y

Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not. Getting some feedback mechanism (including "what do human raters think of this?" but also mundane things like "what does this sensor report in this simulation or test run?") to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI's ability to get stuff done in the world comes from. The problem is genuinely capturing "will we be happy with the outcomes?" with such a mechanism.

Reverse-engineering using interpretability

William Saunders

3y10

Image link is broken

Robustness to Scaling Down: More Important Than I Thought

William Saunders

3y20

Yep, that clarifies.

Robustness to Scaling Down: More Important Than I Thought

William Saunders

3y94

You define robustness to scaling down as "a solution to alignment keeps working if the AI is not optimal or perfect." but for interpretability you talk about "our interpretability is merely good or great, but doesn't capture everything relevant to alignment" which seems to be about the alignment approach/our understanding being flawed not the AI. I can imagine techniques being robust to imperfect AI but find it harder to imagine how any alignment approach could be robust if the approach/our implementation of the approach itself is flawed, do you have any example of this?

5Adam Shimi3y

That's a great point! There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed result in some sense in the alignment scheme (predict human values and do that) to be flawed too. I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalize one of the interesting part of the original definition, and let it be applied to more settings. As for your question, there is a difference between flawed and not the strongest version. What I'm saying about interpretability and single-single is not that a flawed implementation of them would not work (which is obviously trivial), but that for the reductions to function, you need to solve a particularly ambitious form of the problem. And that we don't currently have a good reason to expect to solve this ambitious problem with enough probability to warrant trusting the reduction and not working on anything else. So an example of a plausible solution (of course I don't have a good solution at hand) would be to create sufficient interpretability techniques that, when combined with conceptual and mathematical characterizations of problematic behaviours like deception, we're able to see if a model will end up having these problematic behaviours. Notice that this possible solution requires working on conceptual alignment, which the reduction to interpretability would strongly discourage. To summarize, I'm not claiming that interpretability (or single-single) won't be enough if it's flawed, just that reducing the alignment problem (or multi-multi) to them is actually a reduction to an incredibly strong and ambitious version of the problem, that no one is currently tackling this strong version, and that we have no reason to expect to solve the strong version with such high probability that

A transparency and interpretability tech tree

William Saunders

3y10

Summary for 8 "Can we take a deceptively aligned model and train away its deception?" seems a little harder than what we actually need, right? We could prevent a model from being deceptive rather than trying to undo arbitrary deception (e.g. if we could prevent all precursors)

3Evan Hubinger3y

Yep, that's right—if you take a look at my discussion under (6), my claim is that we should only need (6), not (8).

A transparency and interpretability tech tree

William Saunders

3y30

Do you think we could basically go 1->4 and 2->5 if we could train a helper network to behaviourally clone humans using transparency tools and run the helper network over the entire network/training process? Or if we do critique style training (RL reward some helper model with access to the main model weights if it produces evidence of the property we don't want the main network to have)?

4Robert Kirk3y

The ability to go 1->4 or 2->5 by the behavioural-cloning approach would assume that the difficulty of interpreting all parts of the model are fairly similar, but it just takes time for the humans to interpret all parts, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me) then I wouldn't expect the behaviourally-cloned interpretation agent to generalise to being able to correctly interpret the worse-case stuff.

AGI Ruin: A List of Lethalities

William Saunders

3y822

Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?

3y50

I'm not Eliezer, but my high-level attempt at this:

[...] The things I'd mainly recommend are interventions that:
Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
Help us understand and resolve major disagreements. (Especially current disagreements

... (read more)

High-stakes alignment via adversarial training [Redwood Research report]

William Saunders

3y40

Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that's analagous to the long term situation we care most about. If you care about both kinds of error then the baseline of "set a reallly low classifier threshold" wouldn't work, so you'd be starting from a regime where it was a lot easier to sample errors, hence it will be easier to measure differences in performance.

2dmz3y

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds. For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they're extremely rare.

The case for becoming a black-box investigator of language models

William Saunders

3y10

One general piece of advice is that it seems like it might be useful to have an interface that shows you multiple samples for each prompt (the OpenAI playground just gives you one sample, if you use temperature > 0 then this sample could either be lucky or unlucky)

3y50

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

The case for becoming a black-box investigator of language models

William Saunders

3y80

Maybe useful way to get feedback on how good you are at doing this would be trying to make predictions based on your experience with language models:

without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
make bets of the form "we'll reach X% performance on task A before we reach Y% performance on task B"
predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and then rate them

1Buck Shlegeris3y

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.

[Link] A minimal viable product for alignment

William Saunders

3y60

Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.

3Ethan Perez3y

What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful? Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here though

Prizes for ELK proposals

William Saunders

Suppose there are two worlds, world W1 and world W2.

In world W1, the question Q="Is there a diamond in the room?" is commonly understood to mean Q1="Is there actually a diamond in the room?"

In world W2 the question Q="Is there a diamond in the room?" is commonly understood to mean Q2="Do I believe there is a diamond in the room?"

Both worlds don't know how to construct a situation where these are different. So, they produce identical training sets for ELK. But the simulator is also trained on a bunch of science fiction novels that contain descriptions of im... (read more)

4Paul Christiano3y

I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this: * Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q). * If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q. * In the case of amplification/debate, Q2 = "Does a human with AI assistants believe a diamond is in the room?" and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren't optimistic about this.)

2Mark Xu3y

We generally assume that we can construct questions sufficiently well that there's only one unambiguous interpretation. We also generally assume that the predictor "knows" which world it's in because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors and that humans would say in theory Q1 and Q2 could be different. More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it was plausible you asked the wrong question, or the AI didn't know what you meant by the question, then the failure exhibited would be ambiguous. If humans are unable to clarify between two possible interpretations of their question, then the failure would be ambiguous.

Prizes for ELK proposals

William Saunders

Edit: think this isn't quite right in general, will try to make it more correct later

Here's a sketch of a strategy for trying to fix Strategy: penalize depending on “downstream” variables. Would appreciate feedback on whether it's modeling the difficulty correctly/seems possibly worth figuring out how to implement

It seems like the problem is:

On the training set, there are a number of implicit variables X that are indistinguishable (always all true or always all false)
- A. Is the diamond safe at time t-1
- B. Is the diamond safe at time t (the variable we actual

... (read more)

Call for research on evaluating alignment (funding + advice available)

William Saunders

3y10

I've been thinking of Case 2. It seems harder to establish "capable of distinguishing between situations where the user wants A vs B" on individual examples since a random classifier would let you cherrypick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. Agree that there's some implicit "all else being equal" condition, I'd expect currently it's not too likely to change conclusions. Ideally you'd just have the category A="best answer according to user" B="all answers that are worse than the best answer according to the user" but I think it's simpler to analyze more specific categories.

Redwood Research’s current project

William Saunders

4y10

Link to contractor instructions implied in "You can read the instructions given to our contractors here" is missing.

2Buck Shlegeris4y

Thanks, I've added the link to the document.

Call for research on evaluating alignment (funding + advice available)

William Saunders

4y10

I don't think all work of that form would measure misalignment, but some work of that form might, here's a description of some stuff in that space that would count as measuring misalignment.

Let A be some task (e.g. add 1 digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. add 3 digit numbers), M is the original model, M1 is the model after finetuning.

If the training on a downstream task was minimal, so we think it's revealing what the model knew before finetuning rather than adding knew knowledge, then better per... (read more)

The case for aligning narrowly superhuman models

William Saunders

4y10

Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.

The case for aligning narrowly superhuman models

William Saunders

4y20

One easy way to make people who can't solve the task for sandwiching is to take people who could solve the task and then give them insufficient time to solve it, or have them be uninformed of some relevant facts about the specific task they are trying to solve.

A simpler way to measure whether you are making progress towards sandwiching if you can't go there directly is to look at whether you can get people to provide better supervision with your tool than without your tool, that is accomplishing more on the task.

Both of these approaches feel like they aren... (read more)

Why I'm excited about Debate

William Saunders

4y20

If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity.

As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently... (read more)

Some AI research areas and their relevance to existential safety

William Saunders

4y30

For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.

To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many of other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches ... (read more)

Some AI research areas and their relevance to existential safety

William Saunders

4y10

One thing I'd like to see are some more fleshed out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.

Learning the prior and generalization

William Saunders

4y30

Okay, makes more sense now, now my understanding is that for question X, answer from ML system Y, amplification system A, verification in your quote is asking the A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer "X", and then checking if it equals "Y". This can at most be as hard as running the original system, and maybe could be much more efficient.

3Evan Hubinger4y

Yep; that's what I was imagining. It is also worth noting that it can be less safe to do that, though, since you're letting A(Z) see Y, which could bias it in some way that you don't want—I talk about that danger a bit in the context of approval-based amplification here and here.

Learning the prior and generalization

William Saunders

4y40

Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)

3Evan Hubinger4y

Ah, sorry, no—I was assuming you were just using whatever procedure you used previously to allow the human to interface with Z in that situation as well. I'll edit the post to be more clear there.

Learning the prior and generalization

William Saunders

4y40

for extremely large $Z$ which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given $Z$

I'm not sure what "just ask the human to verify the answer given $Z$ " looks like, for implicitly represented $Z$

2Evan Hubinger4y

There are lots of ways to allow H to interface with an implicitly represented Z, but the one Paul describes in “Learning the Prior” is to train some model Mz(⋅, z) which represents Z implicitly by responding to human queries about Z (see also “Approval-maximizing representations” which describes how a model like Mz could represent Z implicitly as a tree). Once H can interface with Z, checking whether some answer is correct given Z is at least no more difficult than producing an answer given Z—since H can just produce their answer then check it against the model's using some distance metric (e.g. an autoregressive language model)—but could be much easier if there are ways for H to directly evaluate how likely H would be to produce that answer.

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

William Saunders

5y20

Seems like it could be helpful if people who've thought about this would also predict on the question of what the survey value would be today. (e.g. via elicit snapshots)

OpenAI announces GPT-3

William Saunders

5y60

Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.

Conversation D

Human: I like how cute cows are

Meena: Ah, interesting. You like how they are cute, right?

Human: yeah, and they’re realy smart

Meena: I heard they go to college

Human: cows go to college?

Meena: I heard that a cow went to Harvard.

Human: what did the cow study?

Meena: Bovine sciences.

Human: do horses g

... (read more)

1Isnasene5y

Thanks! This is great.

Zoom In: An Introduction to Circuits

William Saunders

5y50

The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaning 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?

5y90

The "Zoom In" work is aimed at understanding what's going on in neural networks as a scientific question, not directly tackling mesa-optimization. This work is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.

As the original article put it:

And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Or, as I put it in Embedded Curiosities:

One downside of disc

... (read more)

5y80

[not affiliated with the author but have thought a fair bit about this sort of thing]

Once you understand what causes these circuits to arise, you could hopefully regularise for interpretability and boost that rookie 95% number up to 99.9%, where you could really believe that pruning the rest isn't a big deal.

5y80

I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.

Given that assumption, I think it's possible to translate 95% transparency into a safety guarantee: just use your transparency to produce a consistent gradient away from deception such that

... (read more)

Reinforcement Learning in the Iterated Amplification Framework

William Saunders

5y10

I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instatiting a search over all possible outputs X and taking that one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do along with maintaining capabilities of the system, and we need to figure out what "safe limited search" actually looks like.

Reinforcement Learning in the Iterated Amplification Framework

William Saunders

5y10

If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?

I agree, and think that this problem occurs both in imitation IA and RL IA

For example is the plan to make sure M2 ha

... (read more)

Outer alignment and imitative amplification

William Saunders

5y60

Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear cut as my impression of this post gives.

if you can instruct them not to do things like instantiate arbitrary Turing machines

I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and deba... (read more)

Concrete experiments in inner alignment

William Saunders

6y30

Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)

https://arxiv.org/abs/1905.12149

William Saunders

6y70

One way to try to measure capability robustness seperate from alignment robustness off of the training distribution of some system would be to:

use an inverse reinforcement learning algorithm infer the reward function of the off-distribution behaviour
train a new system to do as well on the reward function as the original system
measure the number of training steps needed to reach this point for the new system.

This would let you make comparisons between different systems as to which was more capability robust.

Maybe there's a version that could train the ... (read more)

Problems with AI debate

William Saunders

6y40

For Alaska vs. Bali, alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two seperate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need view either Alaska Argument before Bali Argument, or Bali Argument before Alaska Argument, and you might just be convinced by the first argument you see in which case you wouldn't agree that you could be convinced either way.

2Stuart Armstrong6y

How about a third AI that gives a (hidden) probability about which one you'll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI gives good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.

HCH is not just Mechanical Turk

William Saunders

6y30

I'd say that the claim is not sufficient - it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches)

Deceptive Alignment

William Saunders

6y30

I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".

This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment tim... (read more)

An Increasingly Manipulative Newsfeed

William Saunders

6y10

To me, It seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where i... (read more)

Contest: $1,000 for good questions to ask to an Oracle AI

William Saunders

6y20

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)

1Stuart Armstrong6y

Yes.

Contest: $1,000 for good questions to ask to an Oracle AI

William Saunders

6y60

Submission: low-bandwidth oracle

Plan Criticism: Given plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.

The Main Sources of AI Risk?

William Saunders

AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people's values are loaded into the system), and is relevant for overall strategy.