AI ALIGNMENT FORUM

All of Ofer's Comments + Replies

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?

* [EDIT: and full-scale models in earlier stages of the training process]

2Hjalmar Wijk2y

ARC Evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

Acausal normalcy

Ofer2y1-3

Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.

Have you discussed this point with other relevant researchers before deciding to publish this post? Is there wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer2y10

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

2Rohin Shah2y

It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer2y10

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorizatio...

2Rohin Shah2y

You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data". In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.
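The procedure described in this reply can be sketched in code. This is only an illustration of the categorization as stated in the thread, not anything the commenters wrote: the `Episode` record, its fields, and the `categorize` function are hypothetical names, and it assumes each input can be labeled after the fact with whether the feedback and the behavior were good.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One input seen by the system (hypothetical record for illustration)."""
    step: int          # time index of the input
    feedback_ok: bool  # was the feedback given on this input good?
    behavior_ok: bool  # was the system's behavior on this input acceptable?


def categorize(episodes: List[Episode], split_step: int) -> str:
    """Apply the generalization-based categorization around a chosen time.

    Inputs before split_step count as "training data";
    inputs at or after split_step count as "test data".
    """
    train = [e for e in episodes if e.step < split_step]
    test = [e for e in episodes if e.step >= split_step]

    # Bad feedback on the actual training data: outer misalignment.
    if any(not e.feedback_ok for e in train):
        return "outer misalignment"
    # Good training feedback, but bad behavior afterwards: inner misalignment.
    if any(not e.behavior_ok for e in test):
        return "inner misalignment"
    return "no failure observed"


history = [
    Episode(step=0, feedback_ok=True, behavior_ok=True),
    Episode(step=1, feedback_ok=True, behavior_ok=True),
    Episode(step=2, feedback_ok=True, behavior_ok=False),  # plan starts here
]
print(categorize(history, split_step=2))
```

Here the split is placed at the step where the bad behavior begins, as the reply suggests; moving `split_step` changes which inputs count as "training", which is the judgment call the online-learning extension relies on.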

We may be able to see sharp left turns coming

Ofer3y20

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?

1Ethan Perez3y

Here, I think we'll want to look for suspicious changes in the log-likelihood trends. E.g., it's a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverses at some level of model scale.
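One way to make the "trend reversal" check concrete is sketched below. It is a rough illustration under stated assumptions, not a method from the thread: it assumes we already have one log-likelihood measurement of the behavior per model size, ordered from smallest to largest model, and the function name `first_trend_reversal` and the `min_rise` threshold are hypothetical.

```python
from typing import List, Optional


def first_trend_reversal(log_likelihoods: List[float],
                         min_rise: int = 2) -> Optional[int]:
    """Return the index of the first drop that follows a sustained rise.

    log_likelihoods: one value per model scale, smallest model first.
    min_rise: how many increases must be seen before a drop is flagged
              (hypothetical threshold; equal values are ignored).
    Returns None if no such reversal is found.
    """
    rises = 0
    for i in range(1, len(log_likelihoods)):
        prev, cur = log_likelihoods[i - 1], log_likelihoods[i]
        if cur > prev:
            rises += 1
        elif cur < prev:
            if rises >= min_rise:
                return i  # trend reversed at this model scale
            rises = 0     # a drop without a prior sustained rise resets the count
    return None


# Made-up numbers: likelihood rises across four scales, then falls.
scores = [-5.0, -4.0, -3.0, -2.5, -3.5]
print(first_trend_reversal(scores))
```

A check like this only flags where the measured trend turns; deciding whether the reversal reflects concealed capability or ordinary noise would still need separate investigation.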

Common misconceptions about OpenAI

Ofer3y10

Sorry, that text does appear in the linked page (in an image).

Common misconceptions about OpenAI

Ofer3y10

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]

Principles of Privacy for Alignment Research

Ofer3y10

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.

That consideration seems relevant only for language models that will be doing/supporting alignment work.

Principles of Privacy for Alignment Research

Ofer3y12

Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.

The relevant texts I'm thinking about here are:

Descriptions of certain tricks to evade our safety measures.
Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).

Principles of Privacy for Alignment Research

Ofer3y10

Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or, even if that part was easy, would you still feel that that lever is very small?

4Richard Ngo3y

Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.

Announcing the Alignment of Complex Systems Research Group

Ofer3y20

I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):

There is a large distance in model space between models that behave very differently.
Takeoff will be slow.
It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.

I bet more on scenarios where we get AGI when politics is very different compared to today.

I agree that just before...

Six Dimensions of Operational Adequacy in AGI Projects

Ofer3y50

Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).

autonomy: the missing AGI ingredient?

Ofer3y30

Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?

It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

Thoughts on gradient hacking

Ofer3y10

Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.

Relaxed adversarial training for inner alignment

Ofer3y00

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

0Evan R. Murphy3y

That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between, say, GPT-3 and the classic paperclip maximizer or other tiling AI. This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

AMA Conjecture, A New Alignment Startup

Ofer3y10

What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.

Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for m...

AMA Conjecture, A New Alignment Startup

Ofer3y80

Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".

How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?

Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?

Connor Leahy3y80

To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.

We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties ab...

Late 2021 MIRI Conversations: AMA / Discussion

Ofer3y20

I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:

Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?

5Rohin Shah3y

Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard.

It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better.

Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system.

For failures of alignment, I'd expect that before you get an AI system that can break out of the training process and kill you, you get an AI system that can break out of deployment and kill you, because there's (probably) less monitoring during deployment. You're also just running much longer during deployment -- if an AI system is waiting for the right opportunity, then even if it is equally likely to happen for a training vs deployment input (i.e. ignoring the greater monitoring during training), you'd still expect to see it happen at deployment since >99% of the inputs happen at deployment.
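The last point in the comment above is a simple proportion, and can be checked with a few lines of code. This is a sketch under the comment's own simplifying assumption that the "right opportunity" is equally likely on any single input; the function name and the input counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def deployment_share(n_train_inputs: int, n_deploy_inputs: int) -> float:
    """Fraction of all inputs that arrive during deployment.

    If the opportunity is equally likely on every input, this is also the
    expected fraction of opportunities that fall in deployment.
    """
    total = n_train_inputs + n_deploy_inputs
    if total <= 0:
        raise ValueError("need at least one input")
    return n_deploy_inputs / total


# Hypothetical counts: 1 training input for every 99 deployment inputs.
share = deployment_share(n_train_inputs=1, n_deploy_inputs=99)
print(f"{share:.2%} of opportunities are expected at deployment")
```

The estimate deliberately ignores the heavier monitoring during training that the comment mentions; accounting for it would push the deployment share higher still.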

Thoughts on gradient hacking

Ofer3y10

The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In which case each piece of logic "ruins" a different critical activation value).

2Richard Ngo3y

In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.

1Not Relevant3y

I confess I’m confused as to how the network gradient hacks to create these redundant mechanisms in the first place, since one of them seems completely useless unless designed deliberately.

Formalizing Objections against Surrogate Goals

Ofer4y20

Regarding the following part of the view that you commented on:

But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI.

Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a...