Transformer models (like GPT-3) are generators of human-like text, so they can be modeled as quantilizers. However, any quantilizer guarantees are very weak, because they quantilize with very low q, equal to the likelihood that a human would generate that prompt.
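To spell out why a tiny q makes the guarantee weak: the standard quantilizer bound (a rough sketch following Taylor's formulation, with gamma the base/human distribution, c an arbitrary cost function, and Q_q the q-quantilizer) is

    \mathbb{E}_{a \sim Q_q}[c(a)] \;\le\; \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[c(a)]

since the quantilizer's density never exceeds gamma(a)/q. If q is on the order of the probability that a human would produce that exact text, then 1/q is astronomical and the bound says essentially nothing.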
The implication seems to be that this RFP is for AIS work that is especially focused on DL systems. Is there likely to be a future RFP for AIS research that applies equally well to DL and non-DL systems? Regardless of where my research lands, I imagine a lot of useful and underfunded research fits in the latter category.
Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:
3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment.
-> Agreed. We're starting to study "side-effect incentives" (improved name pending), whic...
One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.
I don't (and perhaps shouldn't) have a guaranteed trigger - probably I will learn a lot more about what the trigger should be over the next couple years. But my current picture would be that the following are mostly true:
This idea has been discussed before. It's an important one, though, so I don't think it's a bad thing for us to bring it up again. My perspective, now as before, is that this would be fairly bad at the moment, but might be good in a couple of years' time.
My background understanding is that the purpose of a conference or journal in this case (and in general) is primarily to certify the quality of some work (and to a lesser extent, the field of inquiry). This in turn helps with growing the AIS field, and with the careers of AIS researchers.
This is only effective...
Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem?
I think this is actually pretty strategically reasonable.
By writing papers, CHAI students can greatly improve their probability of attaining a top professorship, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The t...
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then learning becomes extremely slow, and the oracle might still hit the escape action at random.
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that, after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?
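Concretely, what I have in mind is something like the following toy sketch (my own notation; the setup is hypothetical): an agent whose value function only counts the current step's reward has no term through which "help a future copy of me escape" can pay off.

    # Illustrative only: a myopic ("act-based"-style) objective vs. a far-sighted one.
    def myopic_value(reward_now: float) -> float:
        # Only the current timestep's reward counts (discount factor 0), so outcomes
        # for future copies of the agent contribute nothing to the objective.
        return reward_now

    def farsighted_value(rewards: list[float], gamma: float = 0.99) -> float:
        # Future rewards still count, which is where "help an (almost) identical
        # future version of me escape" could pay off.
        return sum(gamma ** t * r for t, r in enumerate(rewards))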
My main concern abou...
As others have commented, it's difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator's distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I've missed.
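In code, my reading of the proposal is roughly this (a toy sketch with my own variable names, which may well misrepresent the intended math):

    # Toy sketch; the state space, belief, and utility function U are hypothetical.
    def reward_as_given(action, belief, U):
        # The reward actually handed out depends on the evaluator's *belief* over states.
        return sum(p * U(state, action) for state, p in belief.items())

    def objective_as_intended(action, true_state, U):
        # What the agent is meant to optimize: the utility the evaluator
        # would allocate if it knew the true state.
        return U(true_state, action)

i.e. the agent receives reward_as_given, but is meant to pursue objective_as_intended.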
[Note: This comment is three years later than the post]
The "obvious idea" here unfortunately seems not to work, because it is vulnerable to so-called "infinite improbability drives". Suppose is a shutdown button, and gives some weight to and . Then, the AI will benefit from selecting a Q such that it always chooses an action , in which it enters a lottery, and if it does not win, then it the button B is pushed. In this circumstance, is unchanged, while both and allocate almost al
...I can think of two problems:
This result features in the paper by Piccione and Rubinstein that introduced the absent-minded driver problem [1].
Philosophers like decision theories that self-ratify, and this is indeed a powerful self-ratification principle.
This self-ratification principle does, however, rely on SIA probabilities assuming the current policy. We have shown that, conditioning on your current policy, you will want to continue with your current policy, i.e. the policy will be a Nash equilibrium. There can be Nash equilibria for other policies, however. The UDT policy will
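For concreteness, here is a quick numeric check using the standard payoffs from the paper (exit at the first intersection: 0; exit at the second: 4; continue past both: 1); the one-shot-deviation formulation below is just one way of cashing out the self-ratification claim:

    import numpy as np

    # p = probability of continuing at an intersection (the driver can't tell them apart).
    def planning_value(p):
        return 4 * p * (1 - p) + 1 * p * p

    ps = np.linspace(0, 1, 100001)
    p_star = ps[np.argmax(planning_value(ps))]
    print(p_star)  # ~2/3: the planning-optimal policy

    def deviation_value(q, p):
        # One-shot deviation to 'continue with prob q' now, with beliefs and future
        # behaviour both conditioned on p: P(at 1st) = 1/(1+p), P(at 2nd) = p/(1+p).
        at_first = q * (4 * (1 - p) + 1 * p)   # continue now, then follow p at the 2nd
        at_second = (1 - q) * 4 + q * 1
        return at_first / (1 + p) + p * at_second / (1 + p)

    # Conditioning on p_star, every deviation q gives (approximately) the same value,
    # so there is no strict incentive to move away from p_star.
    print(deviation_value(0.0, p_star), deviation_value(p_star, p_star), deviation_value(1.0, p_star))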
...I noticed that CEE is already named in philosophy. Conservation of expected ethics is roughly what Arntzenius calls Weak Desire Reflection. He calls Conservation of expected evidence Belief Reflection. [1]
I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the probability of a given Ri never allowed to fall below a certain level.
It's surprising to me that you would want your probabilities of each reward function not to approach zero, even asymptotically. In regular bandit problems, if the probability with which you select some suboptimal action never asymptotes toward zero, then you will necessarily keep making that kind of mistake forever, incurring linear regret. The same should be true, for some suitable definition of regret, if you stubbornly continue to behave according to some "wrong" moral theory.
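To illustrate the regret point with a toy example (the numbers are made up):

    import random

    # If the probability of acting on the "wrong" reward function is floored at eps
    # forever, and each such step costs 'gap' relative to the best action, regret
    # grows linearly in the horizon T.
    def simulate(T, eps, gap=1.0, seed=0):
        rng = random.Random(seed)
        regret = 0.0
        for _ in range(T):
            if rng.random() < eps:   # forced to act per the wrong theory
                regret += gap
        return regret

    for T in [1_000, 10_000, 100_000]:
        print(T, simulate(T, eps=0.05))  # regret scales roughly as eps * gap * T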
The quantilizer idea seems excellent.
One note of caution - although it is neat to be able to bound loss, one can still question how meaningful 'loss' is here. Looking instead at utility, there is no bound on how much worse a quantilizer can perform relative to a random actor (nor on how much it could contribute to our regret).
The real reason to use a quantilizer is that we expect that edge cases might have regret on the order of the cosmic endowment, but that most quantilized actions will not. For this, the thing that actually matters, there are no guarantees.
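For example, with q = 10^-6 the bound only says that the quantilizer's expected cost is at most a million times that of sampling directly from the base distribution - and nothing stops most of that factor being realised by a single edge-case action.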
Constraining the output of an AI seems like a reasonable option to explore to me. I agree that generating a finite set of humanlike answers (with a chatbot or otherwise) might be a sensible way to do this. An AI could perform gradient descent over the solution space and then pick the nearest proposed behaviour (much like relaxation in integer programming).
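As a toy sketch of what I mean by "relax, then round to the nearest candidate" (the candidate set, loss, and dimensions here are all made up):

    import numpy as np

    # Do gradient descent in a continuous solution space, then snap to the nearest
    # behaviour from a finite, human-proposed candidate set.
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(5, 3))   # 5 human-proposed behaviours, 3-dim encodings
    target = np.array([0.5, -1.0, 2.0])    # stand-in for whatever the loss prefers

    def grad(x):
        return 2 * (x - target)            # gradient of a squared-error loss

    x = np.zeros(3)
    for _ in range(200):                   # unconstrained ("relaxed") gradient descent
        x -= 0.05 * grad(x)

    # Integer-programming-style step: commit to the nearest allowed candidate.
    chosen = candidates[np.argmin(np.linalg.norm(candidates - x, axis=1))]
    print(chosen)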
The multiple choice AI (with human-suggested options) is the most obvious option for avoiding unhumanlike behaviour. Paul has said in some Medium comments that he thinks his more elaborate approach o
...
The idea that "Agents are systems that would adapt their policy if their actions influenced the world in a different way." works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes: we simply check for a path from a utility function F_U to a policy Pi_D. But to apply this to a physical system, we would need a way to obtain such a partition those variables. Specifically, we need to know (1) what counts as a policy, and (2) whether any of its antecedents count as representations of "influence" on the world (and af... (read more)