This is a special post for quick takes by DanielFilan.

Rationality-related writings that are more comment-shaped than post-shaped. Please don't leave top-level comments here unless they're indistinguishable to me from something I would say here.


A theory of how alignment research should work

(cross-posted from danielfilan.com)

Epistemic status:

  • I listened to the Dwarkesh episode with Gwern and started attempting to think about life, the universe, and everything
  • less than an hour of thought has gone into this post
  • that said, it comes from a background of me thinking for a while about how the field of AI alignment should relate to agent foundations research

Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent being might be working against you), but:

  • we currently don't know how to make super-smart computers that do our will
    • this is not just a problem of having a design that is not feasible to implement: we do not even have a sense of what the design would be
    • I'm trying to somewhat abstract over intent alignment vs control approaches, but am mostly thinking about intent alignment
    • I have not thought very much about societal/systemic risks, and this post doesn't really address them.
  • ideally we would figure out how to do this
  • the closest traction that we have: deep learning seems to work well in practice, altho our theoretical knowledge of why it works so well or how capabilities are implemented is lagging
  • how should we proceed? Well:
    • thinking about theory alone has not been practical
    • probably we need to look at things that exhibit alignment-related phenomena and understand them, and that will help us develop the requisite theory
      • said things are probably neural networks
    • there are two ways we can look at neural networks: their behaviour, and their implementation.
    • looking at behaviour is conceptually straightforward, and valuable, and being done
    • looking at their implementation is less obvious
    • what we need is tooling that lets us see relevant things about how neural networks are working
    • such tools (e.g. SAEs) are not impossible to create, but it is not obvious that their outputs tell us quantities that are actually of interest
    • in order to discipline the creation of such tools, we should demand that they help us understand models in ways that matter
    • once we get such tools, we should be trying to use them to understand alignment-relevant phenomena, to build up our theory of what we want out of alignment and how it might be implemented
      • this is also a thing that looking at the external behaviour of models in alignment-relevant contexts should be doing
  • so should we be just doing totally empirical things? No.
    • firstly, we need to be disciplined along the way by making sure that we are looking at settings that are in fact relevant to the alignment problem, when we do our behavioural analysis and benchmark our interpretability tools. This involves having a model of what situations are in fact alignment-relevant, what problems we will face as models get smarter, etc
    • secondly, once we have the building blocks for theory, ideally we will put them together and make some actual theorems like "in such-and-such situations models will never become deceptive" (where 'deceptive' has been satisfactorily operationalized in a way that suffices to derive good outcomes from no deception and relatively benign humans)
  • I'm imagining the above as being analogous to an imagined history of statistical mechanics (people who know this history or who have read "inventing temperature" should let me know if I'm totally wrong about it):
    • first we have steam engines etc
    • then we figure out that 'temperature' and 'entropy' are relevant things to track for making the engines run
    • then we relate temperature, entropy, and pressure
    • then we get a good theory of thermodynamics
    • then we develop statistical mechanics
  • exceptions to "theory without empiricism doesn't work":
  • lesson of above: theory does seem to help us analyze some issues and raise possibilities

I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it, theory needs empirics in order to remain grounded.

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

Looking at behaviour is conceptually straightforward, and valuable, and being done

I agree with Apollo Research that evals isn't really a science yet. It mostly seems to be conducted according to vibes. Model internals could help with this, but so could things like building experience, or auditing models using different schemes and comparing the results.

Similarly, a lot of work with Model Organisms of Alignment requires a lot of careful thought to get right.

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

Frankfurt-style counterexamples for definitions of optimization

In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.

A Frankfurt case is a thought experiment designed to disprove the following intuitive principle: "a person is morally responsible for what she has done only if she could have done otherwise." Here's the basic idea: suppose Alice is considering whether or not to kill Bob. Upon consideration, she decides to do so, takes out her gun, and shoots Bob. But little-known to her, a neuroscientist had implanted a chip in her brain that would have forced her to shoot Bob if she had decided not to. That said, the chip didn't activate, because she did decide to shoot Bob. The idea is that she's morally responsible, even tho she couldn't have done otherwise.

Anyway, let's do this with optimizers. Suppose I'm playing Go, thinking about how to win - imagining what would happen if I played various moves, and playing moves that make me more likely to win. Further suppose I'm pretty good at it. You might want to say I'm optimizing my moves to win the game. But suppose that, unbeknownst to me, behind my shoulder is famed Go master Shin Jinseo. If I start playing really bad moves, or suddenly die or vanish etc, he will play my moves, and do an even better job at winning. Now, if you remove me or randomly rearrange my parts, my side is actually more likely to win the game. But that doesn't mean I'm optimizing to lose the game! So this is another way such definitions of optimizers are wrong.
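To make the counterexample concrete, here is a minimal sketch of the counterfactual criterion applied to the Go story. The win probabilities and the function name `counterfactual_criterion` are made up for illustration; only the comparison "value with the system vs value without it" comes from the definition above.

```python
def counterfactual_criterion(value_with_system: float, value_without_system: float) -> bool:
    """True iff the criterion counts the system as optimizing for the goal."""
    return value_with_system > value_without_system


# Hypothetical win probabilities for my side (illustrative numbers only).
P_WIN_ME = 0.70      # I play my own moves
P_WIN_NOBODY = 0.01  # I'm removed or scrambled and nobody takes over
P_WIN_BACKUP = 0.95  # I'm removed or scrambled and the Go master plays instead

# Ordinary case: without me, my side almost surely loses.
ordinary_win = counterfactual_criterion(P_WIN_ME, P_WIN_NOBODY)

# Frankfurt-style case: without me, the backup player does better than I would.
frankfurt_win = counterfactual_criterion(P_WIN_ME, P_WIN_BACKUP)

# Same case, but asking whether I'm "optimizing to lose".
frankfurt_lose = counterfactual_criterion(1 - P_WIN_ME, 1 - P_WIN_BACKUP)

print(ordinary_win, frankfurt_win, frankfurt_lose)
```

With these numbers the criterion says I'm optimizing to win in the ordinary case, not optimizing to win once the backup is standing behind me, and optimizing to lose in that same case, even tho nothing about how I choose my moves has changed.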

That said, other definitions treat this counter-example well. E.g. I think the one given in "The ground of optimization" says that I'm optimizing to win the game (maybe only if I'm playing a weaker opponent).

As far as I can tell, people typically use the orthogonality thesis to argue that smart agents could have any motivations. But the orthogonality thesis is stronger than that, and its extra content is false - there are some goals that are too complicated for a dumb agent to have, because the agent couldn't understand those goals. I think people should instead directly defend the claim that smart agents could have arbitrary goals.

I no longer endorse this claim about what the orthogonality thesis says.

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

  • It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
  • Problem: how do you measure 'competence' without reference to a goal??
  • Prior work has used the 'agents vs devices' framework, where you have a distribution over all reward functions, some likelihood distribution over what 'real agents' would do given a certain reward function, and do Bayesian inference on that vs choosing actions randomly. If conditioned on your behaviour you're probably an agent rather than a random actor, then you're competent.
  • I don't like this:
    • Crucially relies on knowing the space of reward functions that the learner in question might have.
    • Crucially relies on knowing how agents act given certain motivations.
    • A priori it's not so obvious why we care about this metric.
  • Here's another option: throw out 'competence' and talk about 'consequential'.
    • This has a name collision with 'consequentialist' that you'll probably have to fix but whatever.
  • The setup: you have your learner do stuff in a multi-agent environment. You use the AUP metric on every agent other than your learner. You say that your learner is 'consequential' if it strongly affects the attainable utility of other agents.
  • How good is this?
    • It still relies on having a space of reward functions, but there's some more wiggle-room: you probably don't need to get the space exactly right, just to have goals that are similar to yours.
      • Note that this would no longer be true if this were a metric you were optimizing over.
    • You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that it's suddenly gotten much computationally harder to achieve that utility.
      • That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because this model is no longer a likelihood ratio where misspecification can just rule out the right answer.
    • It's more obvious why we care about this metric.
  • Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents' attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.
    • Ideally you could even show a relation between this and the agents vs devices framing.
  • I think this is the sort of project a first-year PhD student could fruitfully make progress on.
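
A rough sketch of what the 'consequential' measurement might look like, assuming you have already estimated each other agent's attainable utility for each candidate reward function, both with the learner acting and under some baseline (e.g. the learner doing nothing). Everything here is hypothetical: the function name `consequentiality`, the agent and reward names, the numbers, and the choice to aggregate by mean absolute difference (loosely modeled on the absolute-difference form of the AUP penalty). Estimating the attainable utilities themselves is the hard part and is not shown.

```python
from statistics import mean

# Hypothetical attainable utilities: agent -> reward function -> value.
# "Baseline" is the world where the learner does nothing.
au_baseline = {
    "bob":   {"collect_apples": 1.0, "reach_exit": 0.5},
    "carol": {"collect_apples": 0.5, "reach_exit": 1.0},
}
# Same quantities in the world where the learner acts.
au_with_learner = {
    "bob":   {"collect_apples": 0.25, "reach_exit": 0.5},
    "carol": {"collect_apples": 0.5,  "reach_exit": 1.0},
}


def consequentiality(au_with, au_base):
    """Mean absolute change in attainable utility over (agent, reward) pairs."""
    diffs = []
    for agent, per_reward in au_base.items():
        for reward_name, base_value in per_reward.items():
            diffs.append(abs(au_with[agent][reward_name] - base_value))
    return mean(diffs)


score = consequentiality(au_with_learner, au_baseline)
inert_score = consequentiality(au_baseline, au_baseline)
print(score, inert_score)
```

Note that the score never refers to the learner's own goal, which is the point of swapping 'competence' for 'consequential'. It does depend on the space of reward functions and on how attainable utility is estimated, as discussed above.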

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

Let it be known: I'm way more likely to respond to (and thereby algorithmically signal-boost) criticisms of AI doomerism that I think are dumb than those that I think are smart, because the dumb objections are easier to answer. Caveat emptor.

This is a fun Aumann paper that talks about what players have to believe to be in a Nash equilibrium. Here, instead of imagining agents randomizing, we're imagining that the probabilities over actions live in the heads of the other agents: you might well know exactly what you're going to do, as long as I don't. It shows that in 2-player games, you can write down conditions that involve mutual knowledge but not common knowledge that imply that the players are at a Nash equilibrium: mutual knowledge of players' conjectures about each other, players' rationality, and players' payoffs suffices. By contrast, in games with 3 or more players, you need common knowledge: common priors, and common knowledge of conjectures about other players.

The paper writes:

One might suppose that one needs stronger hypotheses in Theorem B [about 3-player games] than in Theorem A [about 2-player games] only because when n ≥ 3, the conjectures of two players about a third one may disagree. But that is not so. One of the examples in Section 5 shows that even when the necessary agreement is assumed outright, conditions similar to those of Theorem A do not suffice for Nash equilibrium when n ≥ 3.

This is pretty mysterious to me and I wish I understood it better. Probably it would help to read more carefully thru the proofs and examples.

Got it, sort of. Once you have 3 people, each person has a conjecture about the actions of the other two people. This means that your joint distribution over your opponents' actions might not be the product of its marginals over each opponent's actions, so you might be maximizing expected utility wrt your actual beliefs, but not wrt the product of the marginals - and the marginals are what are supposed to form the Nash equilibrium. Common knowledge and common priors stop this by forcing your conjectures about the different players to be independent. I still have a hard time explaining in words why this has to be true, but at least I understand the proof.
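
Here's a toy illustration of the gap between a joint conjecture and the product of its marginals. This is not an example from the paper: the payoffs, action names, and beliefs are invented. Player 1 believes players 2 and 3 are perfectly correlated, and we compare player 1's best response under that belief with the best response under the product of the marginals.

```python
from itertools import product

ACTIONS = (0, 1)

# Player 1's (hypothetical) conjecture over the actions (a2, a3) of players 2 and 3:
# they always match, each matching pair being equally likely.
joint_belief = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}


def marginal(belief, opponent_index):
    """Marginal distribution over one opponent's action."""
    result = {a: 0.0 for a in ACTIONS}
    for profile, prob in belief.items():
        result[profile[opponent_index]] += prob
    return result


m2 = marginal(joint_belief, 0)
m3 = marginal(joint_belief, 1)
# The independent belief with the same marginals.
product_belief = {(a2, a3): m2[a2] * m3[a3] for a2, a3 in product(ACTIONS, repeat=2)}


def payoff(my_action, a2, a3):
    """Invented payoffs: 'bet_same' pays 1 if the opponents match, 'safe' always pays 0.75."""
    if my_action == "bet_same":
        return 1.0 if a2 == a3 else 0.0
    return 0.75


def expected_utility(my_action, belief):
    return sum(prob * payoff(my_action, a2, a3) for (a2, a3), prob in belief.items())


def best_response(belief):
    return max(("bet_same", "safe"), key=lambda action: expected_utility(action, belief))


print(best_response(joint_belief), best_response(product_belief))
```

Under the correlated belief, 'bet_same' has expected utility 1.0 and beats 'safe'. Under the product of the marginals it has expected utility 0.5 and loses to 'safe'. So player 1 can be rational wrt their actual conjecture while playing something that isn't a best response to the marginals.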

Quantitative claims about code maintenance from Working in Public, plausibly relevant to discussion of code rot and machine intelligence:

  • "most computer programmers begin their careers doing software maintenance, and many never do anything but", attributed to Nathan Ensmenger, professor at Indiana University.
  • "most software at Google gets rewritten every few years", attributed to Fergus Henderson of Google.
  • "A 2018 Stripe survey of software developers suggested that developers spend 42% of their time maintaining code" - link
  • "Nathan Ensmenger, the informatics professor, notes that, since the early 1960s, maintenance costs account for 50% to 70% of total expenditures on software development" - paper

FYI: I am not using the dialogue matching feature. If you want to dialogue with me, your best bet is to ask me. I will probably say no, but who knows.

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!

Suppose there are two online identities, and you want to verify that they're associated with the same person. It's not too hard to verify this: for instance, you could tell one of them something secretly, and ask the other what you told the first. But how do you determine that two online identities are different people? It's not obvious how you do this with anything like cryptographic keys etc.

One way to do it if the identities always do what's causal-decision-theoretically correct is to have the two identities play a prisoner's dilemma with each other, and make it impossible to enforce contracts. If you're playing with yourself, you'll cooperate, but if you're playing with another person you'll defect.

That being said, this only works if the payoff difference between both identities cooperating and both identities defecting is greater than the amount a single person controlling both would pay to convince you that they're actually two people. Which means it only works if the amount you're willing to pay to learn the truth is greater than the amount they're willing to pay to deceive you.
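
A small sketch of that condition, using an invented payoff matrix and assuming a single controller simply maximizes total payoff plus whatever they'd pay to look like two people. The names `value_of_deception` and `single_controller_profile` are mine, not standard terminology.

```python
ACTIONS = ("C", "D")

# Hypothetical prisoner's dilemma payoffs: (identity A's payoff, identity B's payoff).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def dominant_action():
    """The action a separate CDT player picks: a best response to every opponent action."""
    for mine in ACTIONS:
        if all(PAYOFFS[(mine, theirs)][0] >= PAYOFFS[(other, theirs)][0]
               for theirs in ACTIONS for other in ACTIONS):
            return mine
    return None


def verdict(profile):
    """Observer's inference: two separate CDT players would both defect."""
    return "two people" if profile == ("D", "D") else "one person"


def single_controller_profile(value_of_deception):
    """Profile chosen by one person controlling both identities (assumed objective)."""
    def score(profile):
        bonus = value_of_deception if verdict(profile) == "two people" else 0
        return sum(PAYOFFS[profile]) + bonus
    return max(PAYOFFS, key=score)


# The test is informative only if deception is worth less than this to the controller.
deception_threshold = sum(PAYOFFS[("C", "C")]) - sum(PAYOFFS[("D", "D")])

print(dominant_action(), deception_threshold)
print(verdict(single_controller_profile(3)), verdict(single_controller_profile(5)))
```

With these payoffs the threshold is 4: a controller who values deceiving you at 3 cooperates with themselves and gets caught, while one who values it at 5 eats the loss, defects against themselves, and passes as two people.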

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.

Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.

Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

'Seminar' announcement: me talking quarter-bakedly about products, co-products, deferring, and transparency. 3 pm PT tomorrow (actually 3:10 because that's how time works at Berkeley).

I was daydreaming during a talk earlier today (my fault, the talk was great), and noticed that one diagram in Dylan Hadfield-Menell's off-switch paper looked like the category-theoretic definition of the product of two objects. Now, in category theory, the 'opposite' of a product is a co-product, which in set theory is the disjoint union. So if the product of two actions is deferring to a human about which action to take, what's the co-product? I had an idea about that which I'll keep secret until the talk, when I'll reveal it (you can also read the title to figure it out). I promise that I won't prepare any slides or think very hard about what I'm going to say. I also won't really know what I'm talking about, so hopefully one of you will. The talk will happen in my personal zoom room. Message me for the passcode.

I do not have many ideas here, so it might mostly be me talking about the category-theoretic definition of products and co-products.