All of Caspar Oesterheld's Comments + Replies

I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood.

Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" must require. It seems to matter, for example, that a human can understand how the given rule/heuristic (or something like it) could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it's not doing anything problematic (?).

As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.

Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs t... (read more)

2Joar Skalse
I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).

I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc), and it seems like partial successes would be useful even if the full version of the problem isn't solved (at least under the assumption that interpretability is useful, even if the full version of interpretability isn't solved).

I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can "explain your thought process" or "justify your answer" is probably (mostly) in this category.

I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like "when solving a maze, it's often a good idea to start from the end". I would say that this is interpretable, even without a deeper justification for why this is a good thing to do.

At the end of the day, all knowledge must (in some way) be grounded in statistical regularities. If you ask a sequence of "why"-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider t
  • As one further data point, I also heard people close to/working at Anthropic giving "We won't advance the state of the art."-type statements, though I never asked about specifics.
  • My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to work on a benchmark right now and on that benchmark, Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC's comment that they're arguably not significantly advancing the state of the a
... (read more)

After having spent a few hours playing with Opus, I think "slightly better than best public gpt-4" seems qualitatively correct -- both models tend to get tripped up on the same kinds of tasks, but Opus can inconsistently solve some tasks in my workflow that gpt-4 cannot. 

And yeah, it seems likely that I will also swap to Claude over ChatGPT. 

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets (x, y, z). You want to train a model that predicts y from x and z from x.

Your model consists of three components: f, g, h. It makes predictions as follows:

ŷ = g(f(x)), ẑ = h(f(x)).

(Why have such a model? Why not have two completely separate models, one for predicting y and one for predicting z? Because it might be more efficient to use ... (read more)
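This capacity tradeoff can be made concrete with a tiny numerical sketch (my own illustration, assuming triplets (x, y, z), predictions g(f(x)) and h(f(x)), and a one-dimensional shared component f):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x[:, 0]  # analogue of the next token: depends on the first feature
z = x[:, 1]  # analogue of the 20th token: depends on the second feature

def losses(w):
    """Mean squared errors of both heads given shared feature f(x) = x @ w.
    f is a one-dimensional bottleneck; g and h are the optimal linear heads."""
    feat = x @ w
    gy = (feat @ y) / (feat @ feat)  # least-squares head for y
    hz = (feat @ z) / (feat @ feat)  # least-squares head for z
    return np.mean((gy * feat - y) ** 2), np.mean((hz * feat - z) ** 2)

ly_solo, lz_solo = losses(np.array([1.0, 0.0]))      # f optimized for y alone
ly_shared, lz_shared = losses(np.array([1.0, 1.0]))  # f shared between y and z
# Sharing the capacity-limited f raises the y-loss in exchange for a lower
# z-loss -- next-token accuracy sacrificed for longer-horizon accuracy.
```

With unlimited capacity for f, no such tradeoff would be forced, which is why the claim hinges on capacity constraints.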

Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)

Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these to... (read more)

>I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Sorry if I was cryptic! Yes, it's basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because "quasi-strictly proper scoring rule w.r.t. the max decision rule" is a mouthful. :-P) Does ... (read more)

>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it.

Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument.

So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each actio... (read more)

4Rubi Hudson
I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course. I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome... (read more)
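The selection step with optimism bias could be sketched like this (hypothetical names; here V happens to be expected utility for concreteness, but the point is that it could be any function of the outcome distribution):

```python
import numpy as np

def choose_action(reports_a, reports_b, V):
    """reports_*: dict mapping each action to a reported outcome distribution.
    Pick the action with the highest V, using 'optimism bias': where the
    two predictors disagree, evaluate the more favorable of the two reports."""
    return max(reports_a,
               key=lambda act: max(V(reports_a[act]), V(reports_b[act])))

# Illustration: V = expected utility over three outcomes with utilities u.
u = np.array([0.0, 1.0, 10.0])
V = lambda p: float(p @ u)
reports = {"left": np.array([0.8, 0.1, 0.1]),   # V = 1.1
           "right": np.array([0.1, 0.8, 0.1])}  # V = 1.8
chosen = choose_action(reports, reports, V)  # both predictors agree here
print(chosen)  # -> right
```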

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with (I think) two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsi... (read more)

4Caspar Oesterheld
The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome distributions.

Let's say that you have some arbitrary (potentially wacky discontinuous) function V that maps a distribution over outcomes onto a real value representing how much you like the distribution over outcomes. Then you can do zero-sum competition as normal and select the action for which V is highest (as usual with "optimism bias", i.e., if the two predictors make different predictions for an action a, then take the maximum of the Vs of the two predictions). This should still be incentive compatible and result in taking the action that is best in terms of V applied to the predictors' beliefs.

(Of course, one could have even crazier preferences. For example, one's preferences could just be a function that takes as input a set of distributions and selects one distribution as its favorite. But I think if this preference function is intransitive, doesn't satisfy independence of irrelevant alternatives and the like, it's not so clear whether the proposed approach still works. For example, you might be able to slightly misreport some option that will not be taken anyway in such a way as to ensure that the decision maker ends up taking a different action. I don't think this is ever strictly incentivized. But it's not strictly disincentivized to do this.)

Interestingly, if V is a strictly convex function over outcome distributions (why would it be? I don't know!), then you can strictly incentivize a single predic
5Rubi Hudson
Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper. As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, since if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment).

Minor bibliographical note: A related academic paper is Arif Ahmed's unpublished paper, "Sequential Choice and the Agent's Perspective". (This is from memory -- I read that paper a few years ago.)

Nice post!

What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans? 

Perhaps relatedly:

>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned... (read more)

3johnswentworth
My response to the "get the AI to tell us what questions we need to ask" is that it fails for multiple reasons, any one of which is sufficient for failure. One of them is the verifiability issue. Another is the Gell-Mann Amnesia thing (which you could view as just another frame on the verifiability issue, but up a meta level). Another is the "get what we measure" problem. Another failure mode which this post did not discuss is the Godzilla Problem. In the frame of this post: in order to work in practice the iterative design loop needs to be able to self-correct; if we make a mistake at one iteration it must be fixable at later iterations. "Get the AI to tell us what questions we need to ask" fails that test; just one iteration of acting on malicious advice from an AI can permanently break the design loop.

Sounds interesting! Are you going to post the reading list somewhere once it is completed?

(Sorry for self-promotion in the below!)

I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.

Here's a pitch in the language of incentivizing AI systems -- the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:
1) It makes predictions about the world.
2) It takes actions that influence the world. (In the pape... (read more)
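To illustrate the setting (not the paper's construction — this is a naive baseline in which the predictor is scored with a proper scoring rule on the chosen action only; the Othman and Sandholm line of work discussed elsewhere in this thread shows that such naive conditional scoring has incentive problems in general):

```python
import math

u = {"good": 1.0, "bad": 0.0}  # decision maker's utilities (illustrative)
# the predictor's true beliefs about each action's outcome distribution
belief = {"a1": {"good": 0.7, "bad": 0.3},
          "a2": {"good": 0.4, "bad": 0.6}}

def expected_log_score(report):
    """Predictor's expected log-score if it makes this report, the decision
    maker takes the action with the highest reported expected utility, and
    the predictor is scored only on the chosen action's realized outcome."""
    action = max(report,
                 key=lambda a: sum(p * u[o] for o, p in report[a].items()))
    return sum(belief[action][o] * math.log(report[action][o])
               for o in belief[action])

honest = expected_log_score(belief)
misreport = expected_log_score({"a1": {"good": 0.5, "bad": 0.5},
                                "a2": belief["a2"]})
# Honesty beats this particular misreport (the log score is proper), but
# other misreports can manipulate *which* action gets chosen -- the
# incentive problem that decision scoring rules are designed to fix.
```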

Cool that this is (hopefully) being done! I have had this on my reading list for a while and since this is about the kind of problems I also spend much time thinking about, I definitely have to understand it better at some point. I guess I can snooze it for a bit now. :P Some suggestions:

Maybe someone could write an FAQ page? Also, a somewhat generic idea is to write something that is more example based, perhaps even something that just solely gives examples. Part of why I suggest these two is that I think they can be written relatively mechanically and th... (read more)

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs wor... (read more)

2Ofer
Regarding the following part of the view that you commented on: Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a more general argument made in Multiverse-wide Cooperation via Correlated Decision Making about how it may be important to make certain commitments before discovering certain crucial considerations.)

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that ... (read more)

1Vojtech Kovarik
I definitely don't think (C) and the "any" variant of (B). Less sure about the "most" variant of (B), but I wouldn't bet on that either. I do believe (D), mostly because I don't think that humans will be able to make the necessary commitments (in the sense mentioned in the thread with Rohin). I am not super sure about (A). My bet is that to the extent that SPI can work for humans, we are already using it (or something equivalent) in most situations. But perhaps some exceptions will work, like the lawyer example? (Although I suspect that our skill at picking hawkish lawyers is stronger than we realize. Or there might be existing incentives where lawyers are being selected for hawkishness, because we are already using them for something-like-SPI? Overall, I guess that the more one-time-only an event is, the higher the chance that the pre-existing selection pressures will be weak, and (A) might work.) That is a good point. I will try to expand on it, perhaps at least in a comment here once I have time, or so :-).
2Vojtech Kovarik
Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.

>If I win I get $6. If I lose, I get $5.

I assume you meant to write: "If I lose, I lose $5."

Yes, these are basically equivalent. (I even mention rock-paper-scissors bots in a footnote.)

Apologies, I only saw your comment just now! Yes, I agree, CDT never strictly prefers randomizing. So there are agents who abide by CDT and never randomize. As our scenarios show, these agents are exploitable. However, there could also be CDT agents who, when indifferent between some set of actions (and when randomization is not associated with any cost), do randomize (and choose the probability according to some additional theory -- for example, you could have the decision procedure: "follow CDT, but when indifferent between multiple actions, choose a dis... (read more)

Sorry for taking an eternity to reply (again).

On the first point: Good point! I've now finally fixed the SSA probabilities so that they sum up to 1, which they really should, to have a genuine version of EDT.

>prevents coordination between agents making different observations.

Yeah, coordination between different observations is definitely not optimal in this case. But I don't see an EDT way of doing it well. After all, there are cases where given one observation, you prefer one policy and given another observation you favor another policy. So I ... (read more)

>Caspar Oesterheld and Vince Conitzer are also doing something like this

That paper can be found at https://users.cs.duke.edu/~ocaspar/CDTMoneyPump.pdf . And yes, it is structurally essentially the same as the problem in the post.

2Stuart Armstrong
Cool! I notice that you assumed there were no independent randomising devices available. But why would the CDT agent ever opt to use a randomising device? Why would it see that as having value?

Not super important but maybe worth mentioning in the context of generalizing Pavlov: the strategy Pavlov for the iterated PD can be seen as an extremely shortsighted version of the law of effect, which basically says: repeat actions that have worked well in the past (in similar situations). Of course, the LoE can be applied in a wide range of settings. For example, in their reinforcement learning textbook, Sutton and Barto write that LoE underlies all of (model-free) RL.
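For concreteness, here is a minimal sketch of Pavlov as such a one-step law-of-effect rule (the payoff matrix and the "worked well" threshold are the standard ones, chosen for illustration):

```python
# Pavlov ("win-stay, lose-shift") in the iterated Prisoner's Dilemma:
# repeat your last action iff it earned at least the mutual-cooperation
# payoff -- an extremely shortsighted law of effect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def pavlov(my_last, my_payoff):
    if my_payoff >= 3:  # last action "worked well": stay
        return my_last
    return "D" if my_last == "C" else "C"  # it didn't: shift

# After a one-off defection, two Pavlov players find their way back to
# mutual cooperation within two rounds: (D,C) -> (D,D) -> (C,C).
a, b = "D", "C"  # round with a forced defection by player a
history = [(a, b)]
for _ in range(2):
    pa, pb = PAYOFF[(a, b)]
    a, b = pavlov(a, pa), pavlov(b, pb)
    history.append((a, b))
print(history)  # [('D', 'C'), ('D', 'D'), ('C', 'C')]
```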

2Abram Demski
Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner's Dilemma, so I think looking at it that way is missing something important about what's going on.

> I tried to understand Caspar’s EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?

Sorry about that! I'll try to explain it some more. Let's take the original AMD. Here, the agent only faces a single type of choice -- whether to EXIT or CONTINUE. Hence, in place of a policy we can just condition on the chosen action when computing our SSA probabilities. Now, when using EDT+SSA, we assign probabilities to being a specific instance in a specific possible history of the world. For example, ... (read more)
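For concreteness, a numerical check of the ex-ante optimum in the original AMD (assuming the standard payoffs: 0 for exiting at the first intersection, 4 for exiting at the second, 1 for continuing through both):

```python
def planning_eu(p):
    """Ex-ante expected utility of CONTINUE-probability p in the original
    absent-minded driver problem (standard payoffs 0 / 4 / 1 assumed)."""
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1

# Grid search recovers the textbook optimum p = 2/3 with EU = 4/3.
best_p = max((i / 1000 for i in range(1001)), key=planning_eu)
print(best_p, planning_eu(best_p))  # -> roughly 2/3 and 4/3
```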

3Wei Dai
Thanks, I think I understand now, and made some observations about EDT+SSA at the old thread. At this point I'd say this quote from the OP is clearly wrong: In fact UDT1.0 > EDT+SSA > CDT+SIA, because CDT+SIA is not even able to coordinate agents making the same observation, while EDT+SSA can do that but not coordinate agents making different observations, and UDT1.0 can (probably) coordinate agents making different observations (but seemingly at least some of them require UDT1.1 to coordinate).
>Caspar Oesterheld is working on similar ideas.

For anyone who's interested, Abram here refers to my work with Vincent Conitzer which we write about here.

ETA: This work has now been published in The Philosophical Quarterly.

My paper "Robust program equilibrium" (published in Theory and Decision) discusses essentially NicerBot (under the name ϵGroundedFairBot) and mentions Jessica's comment in footnote 3. More generally, the paper takes strategies from iterated games and transfers them into programs for the corresponding program game. As one example, tit for tat in the iterated prisoner's dilemma gives rise to NicerBot in the "open-source prisoner's dilemma".
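A toy rendering of ϵGroundedFairBot (programs represented here as Python functions rather than quined source code, and EPS is an illustrative value): with probability ϵ it cooperates outright, which grounds the otherwise infinite mutual simulation; otherwise it runs the opponent against itself and copies the move.

```python
import random

random.seed(0)
EPS = 0.25  # grounding probability epsilon (illustrative value)

def nicer_bot(opponent):
    """Cooperate with probability EPS; otherwise simulate the opponent
    playing against this very program and mirror its move."""
    if random.random() < EPS:
        return "C"
    return opponent(nicer_bot)

defect_bot = lambda opponent: "D"

print(nicer_bot(nicer_bot))   # always "C": every branch of the mutual
                              # recursion bottoms out in cooperation
print(nicer_bot(defect_bot))  # "C" with probability EPS, else "D"
```

Against a copy of itself the recursion terminates with probability 1 and both players cooperate, mirroring how tit for tat behaves against itself in the iterated game.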

1Nisan
See also this comment from 2013 that has the computable version of NicerBot.

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.

Given some existing policy π, EDT+SSA recommends that upon receiving observation o we should choose an action from argmax_a ∑_{s_1,…,s_n} ∑_{i=1}^n SSA(s_i in s_1,…,s_n | o, π_{o→a}) U(s_n). (For notational simplicity, I'll assume that poli... (read more)

2Caspar Oesterheld
I now have a draft for a paper that gives this result and others.
3Wei Dai
I noticed that the sum inside argmax_a ∑_{s_1,…,s_n} ∑_{i=1}^n SSA(s_i in s_1,…,s_n | o, π_{o→a}) U(s_n) is not actually an expected utility, because the SSA probabilities do not add up to 1 when there is more than one possible observation. The issue is that conditional on making an observation, the probabilities for the trajectories not containing that observation become 0, but the other probabilities are not renormalized. So this seems to be part way between "real" EDT and UDT (which does not set those probabilities to 0 and of course also does not renormalize).

This zeroing of probabilities of trajectories not containing the current observation (and renormalizing, if one was to do that) seems at best useless busywork, and at worst prevents coordination between agents making different observations. In this formulation of EDT, such coordination is ruled out in another way, namely by specifying that conditional on o→a, the agent is still sure the rest of π is unchanged (i.e., copies of itself receiving other observations keep following π). If we remove the zeroing/renormalizing and say that the agent ought to have more realistic beliefs conditional on o→a, I think we end up with something close to UDT1.0 (modulo differences in the environment model from the original UDT).

(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)
1Caspar Oesterheld
Elsewhere, I illustrate this result for the absent-minded driver.

Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely "isomorphic"/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution

... (read more)
