All of xuan's Comments + Replies

It seems to me that it's not right to assume that the probability of opportunities to trade is zero?

Suppose both John and David are alive on a desert island right now (but slowly dying), and there's a chance that a rescue boat will arrive that will save only one of them, leaving the other to die. What would they contract to? Assuming no altruistic preferences, presumably neither would agree to only the other person being rescued.

It seems more likely here that bargaining will break down, and one of them will kill off the other, resulting in an arbitrary resolution of who ends up on the rescue boat, not a "rational" resolution.

While I've focused on death here, I think this is actually much more general -- there are a lot of irreversible decisions that people make (and that artificial agents might make) between potentially incommensurable choices. Here's a nice example from Elizabeth Anderson's "Value in Ethics & Economics" (Ch. 3, p. 57), regarding the question of how one should live one's life, to which I think irreversibility applies.
 


Similar incommensurability applies, I think, to what kind of society we collectively want to live in, given that path dependency makes many cho... (read more)

Interesting argument! I think it goes through -- but only under certain ecological / environmental assumptions:

  1. That decisions / trades between goods are reversible.
  2. That there are multiple opportunities to make such trades / decisions in the environment.

But this isn't always the case! Consider:

  • Both John and David prefer living over dying.
  • Hence, John would not trade (John Alive, David Dead) for (John Dead, David Alive), and vice versa for David.

This is already a case of weakly incomplete preferences which, while technically reducible to a complete orde... (read more)
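
To make the structure concrete, here's a tiny sketch (my own illustration, not part of the original example) of one way to read the incompleteness: treat John and David together as a single composite agent whose strict preferences require that someone gains and nobody objects. This is my gloss, not necessarily the original post's exact setup.

```python
# Toy reconstruction (my own illustration): an outcome is a pair
# (John's state, David's state); each person only cares about their
# own survival, with no altruistic preferences.
def john_prefers(a, b):
    return a[0] == "alive" and b[0] == "dead"

def david_prefers(a, b):
    return a[1] == "alive" and b[1] == "dead"

def joint_prefers(a, b):
    """Treat John + David as one composite agent: it strictly prefers
    a to b only if someone gains and nobody objects (Pareto-style)."""
    return ((john_prefers(a, b) or david_prefers(a, b))
            and not john_prefers(b, a) and not david_prefers(b, a))

a = ("alive", "dead")   # John rescued, David dies
b = ("dead", "alive")   # David rescued, John dies
print(joint_prefers(a, b), joint_prefers(b, a))  # False False -> incomparable
```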

2Dweomite
Rather than talking about reversibility, can this situation be described just by saying that the probability of certain opportunities is zero? For example, if John and David somehow know in advance that no one will ever offer them pepperoni in exchange for anchovies, then the maximum amount of probability mass that can be shifted from mushrooms to pepperoni by completing their preferences happens to be zero. This doesn't need to be a physical law of anchovies; it could just be a characteristic of their trade partners.

But in this hypothetical, their preferences are effectively no longer strongly incomplete--or at least, their trade policy is no longer strongly incomplete. Since we've assumed away the edge between pepperoni and anchovies, we can (vacuously) claim that John and David will collectively accept 100% of the (non-existent) trades from anchovies to pepperoni, and it becomes possible to describe their trade policy as being a utility maximizer.

(Specifically, we can say anchovies = mushrooms because they won't trade between them, and say pepperoni > mushrooms because they will trade mushrooms for pepperoni. The original problem was that this implies that pepperoni > anchovies, which is false in their preferences, but it is now (vacuously) true in their trade policy if such opportunities have probability zero.)
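
For what it's worth, here's a minimal sketch (my own illustration, with made-up numbers) of the vacuous completion Dweomite describes: assign equal utilities to anchovies and mushrooms, a higher utility to pepperoni, and check that the utility maximizer agrees with the original trade policy on every trade that actually has nonzero probability of being offered.

```python
# Toy sketch of the vacuous-completion argument (illustrative only).
# Utilities for the "completed" trade policy: anchovies = mushrooms < pepperoni.
utility = {"anchovies": 0.0, "mushrooms": 0.0, "pepperoni": 1.0}

# Trades the environment can offer, as (give, get, probability of being offered).
# Crucially, anchovies -> pepperoni is never offered (probability zero).
offers = [
    ("mushrooms", "pepperoni", 0.5),   # accepted by the original preferences
    ("anchovies", "mushrooms", 0.3),   # declined: no strict preference either way
    ("anchovies", "pepperoni", 0.0),   # the "problem" edge -- never comes up
]

def utility_maximizer_accepts(give, get):
    return utility[get] > utility[give]

# Original (incomplete) trade policy: only accept mushrooms -> pepperoni.
def original_policy_accepts(give, get):
    return (give, get) == ("mushrooms", "pepperoni")

# The two policies can only disagree on offers with zero probability.
for give, get, prob in offers:
    if prob > 0:
        assert utility_maximizer_accepts(give, get) == original_policy_accepts(give, get)
```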
2Xuan (Tan Zhi Xuan)
While I've focused on death here, I think this is actually much more general -- there are a lot of irreversible decisions that people make (and that artificial agents might make) between potentially incommensurable choices. Here's a nice example from Elizabeth Anderson's "Value in Ethics & Economics" (Ch. 3, p. 57), regarding the question of how one should live one's life, to which I think irreversibility applies. Similar incommensurability applies, I think, to what kind of society we collectively want to live in, given that path dependency makes many choices irreversible.

Not sure if this is the same as the awards contest entry, but EJT also made this earlier post ("There are no coherence theorems") arguing that certain Dutch Book / money pump arguments against incompleteness fail!

Very interesting work! This is only a half-formed thought, but the diagrams you've created very much remind me of those used to display learned "topics" in classic topic models like Latent Dirichlet Allocation (see Figure 8 of that paper).

I think there's possibly something to be gained by viewing what the MLPs and attention heads are learning as something like "topic models" -- and it may be the case that some of the methods developed for evaluating topic interpretability and consistency will be valuable here. A couple of references:

... (read more)
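
To give a flavor of what those evaluation methods look like, here's a rough sketch (my own toy example, assuming an NPMI-style co-occurrence coherence measure, which is one common choice in that literature) of scoring a topic's top words by how often they co-occur in documents:

```python
import math
from itertools import combinations

# Toy corpus of "documents" as sets of words (purely illustrative).
docs = [
    {"gene", "dna", "genetic", "cell"},
    {"gene", "dna", "sequence"},
    {"cell", "organism", "dna"},
    {"market", "stock", "price"},
    {"stock", "price", "trade"},
]

def npmi(w1, w2, docs, eps=1e-12):
    # Normalized pointwise mutual information between two words.
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0
    pmi = math.log(p12 / (p1 * p2) + eps)
    return pmi / (-math.log(p12) + eps)

def topic_coherence(top_words, docs):
    # Average pairwise NPMI over a topic's top words.
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

# A coherent "genetics" topic scores higher than a mixed-up one.
print(topic_coherence(["gene", "dna", "cell"], docs))
print(topic_coherence(["gene", "stock", "cell"], docs))
```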

Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that TAISIC doesn't seem to be engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery:

The importance of interventions

Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern

... (read more)
6Lawrence Chan
We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops. Hopefully this will be fixed with the forthcoming arXiv paper!

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of duplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don't work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.

0David Reber
Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a 1st-order heuristic) much of the Alignment Forum's reference point for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.

  • Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can't), but I'm no longer as sure about the validity of this.
  • Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)
2Xuan (Tan Zhi Xuan)
Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that TAISIC doesn't seem to be engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery: Source: https://ai.stanford.edu/blog/causal-abstraction/

Fascinating evidence!

I suspect this may be because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements. One way to interpret language models is as *mixtures* of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.
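
Roughly, and in my own notation (the particular factorization below is just an illustrative sketch, not a claim about any specific model):

```latex
% A language model as a mixture of goal-conditioned conversational agents:
% first sample a latent conversational goal g, then sample words from a
% policy conditioned on that goal.
P_{\mathrm{LM}}(w_{1:T}) \;=\; \sum_{g} P(g)\, \prod_{t=1}^{T} \pi(w_t \mid w_{1:t-1}, g)
```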

On this interpretation, what RL from human feedback does is shift/concentrate the distribution ov... (read more)

6Paul Christiano
I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it. In some sense that does reflect the "plurality of realistic human goals." But I don't think it's a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, or (ii) optimize an aggregate of several goals. Either way, I think that's probably best reflected by a deterministic reward function, and you'd probably prefer to be mindful about what you are getting rather than randomly sampling from webtext. (Though as I mention in my other comment, I think there are other good reasons to want the pure generative model.)

Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans.  I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules.


I think there's something to this, but I think perhaps it only applies strongly if and when most of the economy is run by or delegated to AI se... (read more)

But here I would expect people to reasonably disagree on whether an AI system or community of systems has made a good decision, and therefore it seems harder to ever fully trust machines to make decisions at this level. 

I hope the above is at least partially addressed by the last paragraph of the section on Reverse Engineering Roles and Norms! I agree with the worry, and to address it I think we could design systems that mostly just propose revisions or extrapolations to our current rules, or highlight inconsistencies among them (e.g. conflicting laws... (read more)

1phillchris
Hey! Absolutely, I think a lot of this makes sense. I assume you meant this paragraph from the Reverse Engineering Roles and Norms section:

For both points here, I guess I was getting more at this question by asking these: how ought we structure this collaborative process? Like, what constitutes the feedback a machine sees to interactively improve with society? Who do the AIs interact with? What constitutes a datapoint in the moral learning process? These seem like loaded questions, so let me be more concrete.

In decisions without unanimity with regard to a moral fact, using simple majority rule, for example, could lead to disastrously bad moral theory: you could align an AI with norms under which 60% of the public exploits the other 40% (for example, if a majority deems it moral to exploit or under-provide for a minority, in an extreme case). It strikes me that to prevent this kind of failure mode, there must be some baked-in context of "obviously wrong" beforehand. If you require total unanimity, well then, you will never get even a single datapoint: people will reasonably disagree (I would argue to infinity, after arbitrary amounts of reasonable debate) about basic moral facts due to differences in values.

I think this negotiation process is in itself really, really important to get right if you advocate this kind of approach, and not by advancing any one moral view of the world. I certainly don't think it's impossible, just as it isn't impossible to have relatively well-functioning democracy. But this is the point I guess: are there limit guarantees to society agreeing after arbitrary lengths of deliberation? Has modern democracy / norm-setting historically arisen from mutual deliberation, or from exertion of state power / arbitrary assertion of one norm over another? I honestly don't have sufficient context to answer that, but it seems like a relevant empirical fact here.

Maybe another follow up: what are your idealized conditions for "rational / mutu

Hmm, I'm confused --- I don't think I said very much about inner alignment, and I hope I implied that inner alignment is still important! The talk is primarily a critique of existing approaches to outer alignment (e.g., why human preferences alone shouldn't be the alignment target), and it is a critique of inner alignment work only insofar as that work assumes that defining the right training objective / base objective is not also a crucial problem.

Maybe a more refined version of the disagreement is about how crucial inner alignment is, vs. defining the right ... (read more)

Agreed that interpreting the law is hard, and the "literal" interpretation is not enough! Hence the need to represent normative uncertainty (e.g., a distribution over multiple formal interpretations of a natural language statement, plus uncertainty over what terms in the contract are missing), which I see the section on "Inferring roles and norms" as addressing in ways that go beyond existing "reward modeling" approaches.

Let's call the above "wilful compliance", and the fully-fledged reverse engineering approach "enlightened compliance". It seems like... (read more)
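
As a rough sketch of what I mean by representing normative uncertainty (my own toy illustration, with made-up interpretations and numbers): keep a distribution over formal readings of an instruction, and comply conservatively until that uncertainty is resolved.

```python
# Toy sketch (illustrative only): a distribution over formal interpretations
# of the instruction "clean the lab", and a conservative compliance rule.
interpretations = {
    "tidy_surfaces_only":        0.5,   # probabilities are made up
    "tidy_and_discard_samples":  0.3,
    "deep_clean_with_solvents":  0.2,
}

# Which candidate actions each interpretation permits.
permitted = {
    "tidy_surfaces_only":       {"wipe_bench", "sort_equipment"},
    "tidy_and_discard_samples": {"wipe_bench", "sort_equipment", "discard_samples"},
    "deep_clean_with_solvents": {"wipe_bench", "sort_equipment", "apply_solvent"},
}

def prob_permitted(action):
    return sum(p for interp, p in interpretations.items()
               if action in permitted[interp])

# Wilful compliance under uncertainty: only take actions permitted under
# (almost) every plausible reading; otherwise ask for clarification.
for action in ["wipe_bench", "discard_samples", "apply_solvent"]:
    if prob_permitted(action) > 0.95:
        print(f"do {action}")
    else:
        print(f"ask before doing {action} (P(permitted) = {prob_permitted(action):.2f})")
```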

1Charlie Steiner
I'd be interested :) I think my two core concerns are that our rules/norms are meant for humans, and that even then, actors often have bad impacts that would only be avoided with a pretty broad perspective about their responsibilities. So an AI that follows rules/norms well can't just understand them on the object level, it has to have a really good understanding of what it's like to be a human navigating these rules/norms, and use that understanding to make things go well from a pretty broad perspective.

That first one means that not only do I not want the AI to think about what rules mean "in a vacuum," I don't even want it to merely use human knowledge to refine its object-level understanding of the rules. Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans. I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules. An example (maybe not great because it only looks at one dimension of the problem) is that our norms may implicitly assume a certain balance between memetic offense and defense that AIs would upset. E.g. around governmental lobbying (those are also maybe a bad example because they're kinda insufficient already).

Also! While listening to the latest Inside View podcast, it occurred to me that this perspective on AI safety has some natural advantages when translating into regulation that present governments might be able to implement to prepare for the future. If AI governance people aren't already thinking about this, maybe bother some / convince people in this comment section to bother some?

On the contrary, I think there exist large, complex, symbolic models of the world that are far more interpretable and useful than learned neural models, even if too complex for any single individual to understand, e.g.:

- The Unity game engine (a configurable model of the physical world)
- Pixar's RenderMan renderer (a model of optics and image formation)
- The GLEAMviz epidemic simulator (a model of socio-biological disease spread at the civilizational scale)

Humans are capable of designing and building these models, and learning how to build/write them as th... (read more)

4Steve Byrnes
I agree that gwern’s proposal “Any model simple enough to be interpretable is too simple to be useful” is an exaggeration. Even the Lake et al. handwritten-character-recognizer is useful. I would have instead said “Any model simple enough to be interpretable is too simple to be sufficient for AGI”. I notice that you are again bringing the discussion back to a comparison of program-synthesis world-models versus deep learning world-models, whereas I want to talk about the possibility that neither would be human-interpretable by the time we reach AGI level.

Adding some thoughts as someone who works on probabilistic programming, and has colleagues who work on neurosymbolic approaches to program synthesis:

  • I think a lot of Bayes net structure learning / program synthesis approaches (Bayesian or otherwise) have the issue of uninformative variable names, but I do think it's possible to distinguish between structural interpretability and naming interpretability, as others have noted.
  • In practice, most neural or Bayesian program synthesis applications I'm aware of exhibit something like structural interpretability, b
... (read more)
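
As a toy illustration of the structural/naming distinction (my own example, not from any of the systems cited above): a synthesized generative program can have a perfectly readable structure even when its variable names are machine-generated and uninformative.

```python
import random

# Hypothetical output of a program synthesis system: the names (z1, z2)
# are uninformative, but the structure -- a discrete mixture component
# followed by component-dependent Gaussian noise -- is easy to read off.
def sample_z1():
    return random.choice([0, 1, 2])

def sample_z2(z1):
    means = {0: 1.0, 1: 2.5, 2: 4.0}
    return random.gauss(means[z1], 0.3)

def generate():
    z1 = sample_z1()       # "which cluster?" (structurally clear, name opaque)
    z2 = sample_z2(z1)     # "cluster-dependent observation"
    return z1, z2

print(generate())
```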
4Steve Byrnes
Thanks for your reply! When I squint out towards the horizon, I see future researchers trying to do a Bayesian program synthesis thing that builds a generative model of the whole world—everything from “tires are usually black”, to “it’s gauche to wear white after labor day”, to “in this type of math problem, maybe try applying the Cauchy–Schwarz inequality”, etc. etc. etc.

I’m perfectly happy to believe that Lake et al. can program-synthesis a little toy generative model of handwritten characters such that it has structural interpretability. But I’m concerned that we’ll work our way up to the thing in the previous paragraph, which might be a billion times more complicated, and it will no longer have structural interpretability. (And likewise I’m concerned that solutions to “uninformative variable names” won’t scale—e.g., how are we going to automatically put English-language labels on the various intuitive models / heuristics that are involved when Ed Witten is thinking about math, or when MLK Jr is writing a speech?)

Nominally, I agree with this. But “relative to” is key here. Your takeaway seems to be “OK, great, let’s do probabilistic generative models, they’re better!”. By contrast, my perspective is: “If we take the probabilistic generative model approach, we’re in huge trouble with respect to interpretability, oh man this is really really bad, we gotta work on this ASAP!!! (Oh and by the way if we take the deep net approach then it’s even worse.)”.

This was a great read! I wonder how much you're committed to "brain-inspired" vs "mind-inspired" AGI, given that the approach to "understanding the human brain" you outline seems to correspond to Marr's computational and algorithmic levels of analysis, as opposed to the implementational level (see link for reference). In which case, some would argue, you don't necessarily have to do too much neuroscience to reverse engineer human intelligence. A lot can be gleaned by doing classic psychological experiments to validate the functional roles of various aspect... (read more)

2Steve Byrnes
Thanks! I guess my feeling is that we have a lot of good implementation-level ideas (and keep getting more), and we have a bunch of algorithm ideas, and psychology ideas and introspection and evolution and so on, and we keep piecing all these things together, across all the different levels, into coherent stories, and that's the approach I think will (if continued) lead to AGI.

Like, I am in fact very interested in "methods for fast and approximate Bayesian inference" as being relevant for neuroscience and AGI, but I wasn't really interested in it until I learned a bunch of supporting ideas about what part of the brain is doing that, and how it works on the neuron level, and how and when and why that particular capability evolved in that part of the brain. Maybe that's just me. I haven't seen compelling (to me) examples of people going successfully from psychology to algorithms without stopping to consider anything whatsoever about how the brain is constructed. Hmm, maybe very early Steve Grossberg stuff? But he talks about the brain constantly now.

One reason it's tricky to make sense of psychology data on its own, I think, is the interplay between (1) learning algorithms, (2) learned content (a.k.a. "trained models"), (3) innate hardwired behaviors (mainly in the brainstem & hypothalamus). What you especially want for AGI is to learn about #1, but experiments on adults are dominated by #2, and experiments on infants are dominated by #3, I think.

Yup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.

Belatedly reading this and have a lot of thoughts about the connection between this issue and robustness to ontological shifts (which I've written a bit about here), but I wanted to share a paper which takes a very small step in addressing some of these questions by detecting when the human's world model may diverge from a robot's world model, and using that as an explanation for why a human might seem to be acting in strange or counter-productive ways:

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Siddharth Reddy, Anca D.

... (read more)

Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen

And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:

Bayesian causal inference via pr

... (read more)
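
For readers unfamiliar with the idea, here's a very rough sketch (my own toy code, far simpler than what the paper actually does, and only covering the ordinary do-operator) of the general move of treating a causal model as a program and an intervention as a transformation of that program:

```python
import random

# A causal model written as an ordinary generative program. Each variable is
# drawn via `sample`, which an intervention can override.
def model(intervene=None):
    intervene = intervene or {}
    def sample(name, draw):
        return intervene[name] if name in intervene else draw()
    rain      = sample("rain",      lambda: random.random() < 0.3)
    sprinkler = sample("sprinkler", lambda: random.random() < (0.1 if rain else 0.5))
    wet       = sample("wet",       lambda: rain or sprinkler)
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

# Pearl's do-operator as a (crude) program transformation: clamp a variable
# and let everything downstream recompute, leaving upstream variables alone.
def prob_wet_given_do(var, value, n=10_000):
    return sum(model({var: value})["wet"] for _ in range(n)) / n

print("P(wet | do(sprinkler = True))  ≈", prob_wet_given_do("sprinkler", True))
print("P(wet | do(sprinkler = False)) ≈", prob_wet_given_do("sprinkler", False))
```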

Replying to the specific comments:

This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by "what does Stuart Russell think is important", I expect you get a very different view on the field, much of which won't be in the Alignment Newsletter.

I agree. I intended to gesture a little bit at this when I mentioned that "Until more recently, It’s also been excluded and not taken very seriously within traditional academia", because I th... (read more)

5Rohin Shah
Re: worries about "reward", I don't feel like I have a great understanding of what your worry is, but I'd try to summarize it as "while the abstraction of reward is technically sufficiently expressive, 1) it may not have the right inductive biases, and so the framework might fail in practice, and 2) it is not a good framework for thought, because it doesn't sufficiently emphasize many important concepts like logic and hierarchical planning". I think I broadly agree with those points if our plan is to explicitly learn human values, but it seems less relevant when we aren't trying to do that and are instead trying to

In this framework, "knowledge about what humans want" doesn't come from a reward function, it comes from something like GPT-3 pretraining. The AI system can "invent" whatever concepts are best for representing its knowledge, which includes what humans want. Here, reward functions should instead be thought of as akin to loss functions -- they are ways of incentivizing particular kinds of outputs. I think it's reasonable to think on priors that this wouldn't be sufficient to get logical / hierarchical behavior, but I think GPT and AlphaStar and all the other recent successes should make you rethink that judgment.

----

I agree that trend-following behavior exists. I agree that this means that work on deep learning is less promising than you might otherwise think. That doesn't mean it's the wrong decision; if there are a hundred other plausible directions, it can still be the case that it's better to bet on deep learning rather than try your hand at guessing which paradigm will become dominant next. To quote Rodney Brooks:

He also predicts that the "next big thing" will happen by 2027 (though I get the sense that he might count new kinds of deep learning architectures as a "big thing" so he may not be predicting something as paradigm-shifting as you're thinking). Whether to diversify depends on the size of the field; if you have 1 million alignment res

Thanks for this summary. Just a few things I would change:

  1. "Deep learning" instead of "deep reinforcement learning" at the end of the 1st paragraph -- this is what I meant to say, and I'll update the original post accordingly.
  2. I'd replace "nice" with "right" in the 2nd paragraph.
  3. "certain interpretations of Confucian philosophy" instead of "Confucian philosophy", "the dominant approach in Western philosophy" instead of "Western philosophy" -- I think it's important not to give the impression that either of these is a monolith.
3Rohin Shah
Done :)

Thanks for these thoughts! I'll respond to your disagreement with the framework here, and to the specific comments in a separate reply.

First, with respect to my view about the sources of AI risk, the characterization you've put forth isn't quite accurate (though it's a fair guess, since I wasn't very explicit about it). In particular:

  1. These days I'm actually more worried by structural risks and multi-multi alignment risks, which may be better addressed by AI governance than technical research per se. If we do reach super-intelligence, I think it's more like
... (read more)
5Rohin Shah
I agree with you on 1 and 2 (and am perhaps more optimistic about not building globally optimizing agents; I actually see that as the "default" outcome). I think this is where I disagree. I'd offer two main reasons not to believe this:

1. Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all. (Though you could argue that the relevant meta-ethical and meta-normative concepts are inherent in / embedded in / compiled into the human brain's "priors" and learning algorithm.)
2. Intuitively, it seems like sufficiently good imitations of humans would have to have (perhaps implicit) knowledge of "common sense". We can see this to some extent, where GPT-3 demonstrates implicit knowledge of at least some aspects of common sense (though I do not claim that it acts in accordance with common sense).

(As a sanity check, we can see that neither of these arguments would apply to the "learning human values" case.)

I'm going to assume that Quality Y is "normative" if determining whether an object X has quality Y depends on who is evaluating Y. Put another way, an independent race of aliens that had never encountered humans would probably not converge to the same judgments as we do about quality Y. This feels similar to the is-ought distinction: you cannot determine "ought" facts from "is" facts, because "ought" facts are normative, whereas "is" facts are not (though perhaps you disagree with the latter).

I think "common sense is normative" is sufficient to argue that a race of aliens could not build an AI system that had our common sense, without either the aliens or the AI system figuring out the right meta-normative concepts for humanity (which they presumably could not do without encountering humans first). I don't see why it implies that we cannot build an AI system that has our common sense. Even if our common sense is normative, its effects are widespread; it should be possible in theory to back out the conc

In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."

Thank you for writing this post! I've had very similar thoughts for the past year or so, and I think the quote above is exactly right. IMO, part of... (read more)

3Charlie Steiner
Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post. So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.
1Charlie Steiner
Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful. I'd totally be happy to chat either here or in PMs.

Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting question there is if it can learn human foibles. For example, suppose we're playing a racing game and I want to win the race, but fail because my driving skills are bad. How diverse a dataset about me do you need to actually be able to infer that a) I am capable of conceptualizing how good my performance is, b) I wanted it to be good, c) it wasn't good, from a hierarchical perspective, because of the lower-level planning faculties I have.

I think maybe you could actually learn this only from racing game data (no need to make an AGI that can ask me about my goals and do top-down inference), so long as you had diverse enough driving data to make the "bottom-up" generalization that my low-level driving skill can be modeled as bad almost no matter the higher-level goal, and therefore it's simplest to explain me not winning a race by taking the bad driving I display elsewhere as a given and asking what simple higher-level goal fits on top.

Thanks for writing up this post! It's really similar in spirit to some research I've been working on with others, which you can find on arXiv here: https://arxiv.org/abs/2006.07532

We also model bounded goal-directed agents by assuming that the agent is running some algorithm given bounded compute, but our approach differs in the following ways:

  • We don't attempt to compute full policies over the state space, since this is generally intractable, and also cognitively implausible, at least for agents like ourselves. Instead, we compute (par
... (read more)
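
As a generic illustration of the difference (my own toy sketch, not the algorithm from the paper): rather than solving for a full policy over every state up front, a boundedly rational agent can run a budget-limited search from its current state at each step and replan as it goes.

```python
# Toy sketch (my own illustration): a bounded agent on a number line that
# replans each step with a depth-limited search, instead of computing a
# full policy over all states in advance.
GOAL = 7

def reward(state):
    return -abs(GOAL - state)

def depth_limited_value(state, depth):
    # Finite-horizon lookahead: sum of rewards along the best short rollout.
    if depth == 0:
        return reward(state)
    return reward(state) + max(depth_limited_value(state + a, depth - 1)
                               for a in (-1, 0, 1))

def act(state, budget=3):
    # Choose the action that looks best within the search budget.
    return max((-1, 0, 1), key=lambda a: depth_limited_value(state + a, budget - 1))

state = 0
for _ in range(10):
    state += act(state)
print(state)  # reaches the goal without ever enumerating a full policy
```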
2Adam Shimi
Sorry for the delay in answering. Your paper looks great! It seems to tackle in a clean and formal way what I was vaguely pointing at. We're currently reading a lot of papers and blog posts to prepare for an in-depth literature review about goal-directedness, and I added your paper to the list. I'll try to come back here and comment after I read it.