This is a special post for quick takes by Rohin Shah. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

It's common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don't have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.

I'll go through the articles I've read that argue for worrying about recommender systems, and explain why I find them unconvincing. I've only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.

Aligning Recommender Systems as Cause Area. I responded briefly on the post. Their main arguments and my counterarguments are:

  1. A few sources say that it is bad + it has incredible scale + it should be super easy to solve. (I don't trust the sources and suspect the authors didn't check them; I agree there's huge scale; I don't see why it should be super easy to solve even if there is a problem, especially given that many of the supposed problems seem to have existed before recommender systems.)
  2. Maybe working on recommender systems would have spillover effects on AI alignment. (This seems dominated by just working directly on AI alignment. Also the core feature of AI alignment is that the AI system deliberately and intentionally does things, and creates plans in new situations that you hadn't seen before, which is not the case with recommender systems, so I don't expect many spillover effects.)

80K podcast with Tristan Harris. This was actively annoying for a variety of reasons:

  1. I don't know what the main claim was. Ostensibly it was meant to be "it is bad that companies have monetized human attention since this leads to lots of bad incentives and bad outcomes". But then so many specific things mentioned have nothing to do with this claim and instead seem to be a vague general "tech companies are bad". Most egregiously, in section Global effects [01:02:44], Rob argues "WhatsApp doesn't have ads / recommender systems, so it acts as a control group, but it too has bad outcomes, doesn't this mean the problem isn't ads / recommender systems?" and Tristan says "That's right, WhatsApp is terrible, it's causing mass lynchings" as though that supports his point.
  2. When Rob made some critique of the main argument, Tristan deflected with an example of tech doing bad things. But it's always vaguely related, so you think he's addressing the critique, even though he hasn't actually. (I'm reminded of the Zootopia strategy for press conferences.) See sections "The messy real world vs. an imagined idealised world [00:38:20]" (Rob: weren't negative things happening before social media? Tristan: it's easy to fake credibility in text), "The persuasion apocalypse [00:47:46]" (Rob: can't one-on-one conversations be persuasive too? Tristan: you can lie in political ads), "Revolt of the Public [00:56:48]" (Rob: doesn't the internet allow ordinary people to challenge established institutions in good ways? Tristan: Alex Jones has been recommended 15 billion times.) 

    US politics [01:13:32] is a rare counterexample, where Rob says "why aren't other countries getting polarized", and Tristan replies "since it's a positive feedback loop only countries with high initial polarization will see increasing polarization". It's not a particularly convincing response, but at least it's a response.
  3. Tristan seems to be very big on "the tech companies changed what they were doing, that proves we were right". I think it is just as consistent to say "we yelled at the companies a lot and got the public to yell at them too, and that caused a change, regardless of whether the problem was serious or not, or whether the solution was net positive or not".

The second half of the podcast focuses more on solutions. Given that I am unconvinced about the problem, I wasn't all that interested, but it seemed generally reasonable.

(This post responds to the object level claims, which I have not done because I don't know much about the object level.)

There's also the documentary "The Social Dilemma", but I expect it's focused entirely on problems, probably doesn't try to have good rigorous statistics, and surely will make no attempt at a cost-benefit analysis so I seriously doubt it would change my mind on anything. (And it is associated with Tristan Harris so I'd assume that most of the relevant details would have made it into the 80K podcast.)

Recommender systems are still influential, and you could want to work on them just because of their huge scale. I like Designing Recommender Systems to Depolarize as an example of what this might look like.

Thanks for this Rohin. I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven't actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I'd love to see someone really lay out the arguments and analyze them... My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts. But I'm extremely uncertain about this.

I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools

I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.

My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts.

Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better than a monkey throwing darts, so I'm more in the position of "yup, sounds plausible, but that doesn't constrain my beliefs about what the data will show and medium quality data will trump the theory no matter how it comes out".

I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.

Oh, then maybe we don't actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible but it's also totally plausible that polarization is going up for some other reason(s). My concern (as illustrated in the vignette I wrote) is that we seem to be on a slippery slope to a world where persuasion/propaganda is more effective and widespread than it has been historically, thanks to new AI and big data methods. My model is: Ideologies and other entities have always been using propaganda of various kinds, and there's always been a race between improving propaganda tech and improving truth-finding tech. But we are currently in a big AI boom, and in particular in a Big Data and Natural Language Processing boom, and this seems like it'll be a big boost to propaganda tech. Unfortunately I can't think of ways in which it will correspondingly boost truth-finding-ness across society, because while it can perhaps be used to make truth-finding tech (e.g. prediction markets, fact-checkers, etc.), it seems like most people in practice just don't want to adopt truth-finding tech. It's true that we could design a different society/culture that used all this awesome new tech to be super truth-seeking and have a very epistemically healthy discourse, but it seems like we are not about to do that anytime soon; instead we are going in the opposite direction.

I think that story involves lots of assumptions I don't immediately believe (but don't disbelieve either):

  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
  • Such people will quickly realize that AI will be very useful for this
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
  • The resulting AI system will in fact be very good at persuasion / propaganda
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)

And probably there are a bunch of other assumptions I haven't even thought to question.

I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".

I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".

That's all I'm trying to do at this point, to be clear. Perhaps "raise awareness" was the wrong choice of phrase.

Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in mind. [EDIT: But I agree with you that it's all super unclear and more research is needed to have confidence in any of this.]

That's all I'm trying to do at this point, to be clear.

Excellent :)

For how I see this going, see my vignette, and my reply to steve.

(Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above.

Maybe your point is that your story involves "silos" of Internet-space within which particular ideologies / propaganda reign supreme. I don't really see that as changing my object-level points very much but perhaps I'm missing something.

I was confusing, sorry -- what I meant was, technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible? idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:

  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
    • This is already happening in 2021 and previous, in my story it happens more.
  • Such people will quickly realize that AI will be very useful for this
    • Again, this is already happening.
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
    • Plenty of people are already raising a moral outcry. In my story these people don't succeed in getting it banned, but I agree the story could be wrong. I hope it is!
  • The resulting AI system will in fact be very good at persuasion / propaganda
    • Yep. I don't have hard evidence, but intuitively this feels like the sort of thing today's AI techniques would be good at, or at least good-enough-to-improve-on-the-state-of-the-art.
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)
    • I think it won't be built & deployed in such a way that collective epistemology is overall improved. Instead, the propaganda-fighting AIs will themselves have blind spots, to allow in the propaganda of the "good guys." The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc. (I think what happened with the internet is precedent for this. In theory, having all these facts available at all of our fingertips should have led to a massive improvement in collective epistemology and a massive improvement in truthfulness, accuracy, balance, etc. in the media. But in practice it didn't.) It's possible I'm being too cynical here of course!

technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible?

I don't think it's designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions).

I think it's fair to say it's "loaded", in the sense that I am trying to push towards questioning those assumptions, but I don't think I'm doing anything epistemically unvirtuous.

This is already happening in 2021 and previous, in my story it happens more.

This does not seem obvious to me (but I also don't pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious).

The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc.

That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.

I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.

(I just tried to find the best argument that GMOs aren't going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won't cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)

Perhaps I shouldn't have mentioned any of this. I also don't think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time.

That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.

The first bit seems in tension with the second bit, no? At any rate, I also don't see number of facts as the relevant thing for epistemology. I totally agree with your take here.

The first bit seems in tension with the second bit, no?

"Truthful counterarguments" is probably not the best phrase; I meant something more like "epistemically virtuous counterarguments". Like, responding to "what if there are long-term harms from COVID vaccines" with "that's possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer" rather than "there is no evidence of long-term harms".

This was a good post. I'd bookmark it, but unfortunately that functionality doesn't exist yet.* (Though if you have any open source bookmark plugins to recommend, that'd be helpful.) I'm mostly responding to say this though:

Designing Recommender Systems to Depolarize

While it wasn't otherwise mentioned in the abstract of the paper (above), this was stated once:

This paper examines algorithmic depolarization interventions with the goal of conflict transformation: not suppressing or eliminating conflict but moving towards more constructive conflict.

I thought this was worth calling out, although I am still in the process of reading that 10/14 page paper. (There are 4 pages of references.)


And some other commentary while I'm here:

It's common for people to be worried about recommender systems being addictive

I imagine the recommender system is only as good as what it has to work with, content wise - and that's before getting into 'what does the recommender system have to go off of', and 'what does it do with what it has'.


Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.

This part wasn't elaborated on. To put it a different way:

It's common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don't have very good arguments for these worries.

Do the people 'who know what's going on' (presumably) have better arguments? Do you?


*I also have a suspicion it's not being used. I.e., past a certain number of bookmarks like 10, it's not actually feasible to use the LW interface to access them.

Do the people 'who know what's going on' (presumably) have better arguments?

Possibly, but if so, I haven't seen them.

My current belief is "who knows if there's a major problem with recommender systems or not". I'm not willing to defer to them, i.e. say "there probably is a problem based on the fact that the people who've studied them think there's a problem", because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources).

Do you?

No. I just have anecdotal experience + armchair speculation, which I don't expect to be much better at uncovering the truth than the arguments I'm critiquing.

The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platform seems like an obvious reason to be concerned, especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.

I don't trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).

I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.

  • Assistance games / CIRL is a similar sort of thing as CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tells you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
  • One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
  • Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don't know how to correctly specify the observation model and reward space, so this is not a solution to alignment, which is why it is "math poetry about what we want".
  • Another objection: ultimately an assistive agent becomes equivalent to optimizing a fixed reward, aren’t things that optimize a fixed reward bad? Again, I think this seems totally fine; the intuition that “optimizing a fixed reward is bad” comes from our expectation that we’ll get the fixed reward wrong, because there’s so much information that has to be in that fixed reward. An assistive agent will spend a long time gaining all the information about the reward -- it really should get it correct (barring misspecification)! If we imagine the superintelligent CIRL sovereign, it has billions of years to optimize the universe! It would be worth it to spend a thousand years to learn a single bit about the reward function if that has more than a 1 in a million chance of doubling the resulting utility (and obviously going from existential catastrophe to not-that seems like a huge increase in utility).
  • I don’t personally work on assistance-game-like algorithms because they rely on having explicit probability distributions over high-dimensional reward spaces, which we don’t have great techniques for, and I think we will probably get AGI before we have great techniques for that. But this is more about what I expect drives AGI capabilities than about some fundamental “safety problems” with assistance games.
  • Another point against assistance games is that they might have very narrow “safety margins”, i.e. if you get the observation model slightly wrong, maybe you get a slightly wrong reward function, and that still leads to an existential catastrophe because value is fragile. (Though this isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!) If this were the only point against assistance (i.e. the previous bullet point somehow didn't apply) I’d still be keen for a large fraction of the field pushing forward the assistance games approach, while the others look for approaches with wider safety margins.

(I made some of these points before in my summary of Human Compatible.)
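The "thousand years to learn a single bit" arithmetic above can be checked with a quick expected-utility comparison. This is only a toy sketch: the constant utility rate, the specific horizons, and the function name `net_gain_from_learning` are all assumptions made for illustration, not anything from the original argument.

```python
def net_gain_from_learning(horizon_years, learning_years, p_double):
    """Expected utility of learning first minus acting immediately.

    Toy assumptions (not from the comment): utility accrues at a constant
    rate of 1 per year, and with probability p_double the learned bit
    doubles that rate for all remaining years.
    """
    remaining = horizon_years - learning_years
    # learn_first = remaining * (1 + p_double); act_now = horizon_years.
    # Their difference simplifies to the expression below.
    return remaining * p_double - learning_years


# Hypothetical horizons; the comment only says "billions of years".
for horizon in (1e9, 5e9):
    for p in (1e-7, 1e-6, 1e-5):
        gain = net_gain_from_learning(horizon, 1_000, p)
        print(f"horizon={horizon:.0e} p={p:.0e} net gain={gain:+.1f} years")
```

Under these assumptions the break-even probability is roughly `learning_years / horizon_years`, which is about one in a million for a thousand years against a billion-year horizon, and lower for longer horizons.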

One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.

I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.

Though [the claim that slightly wrong observation model => doom] isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem - in those cases, the way you know how 'human preferences' rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that's probably not well-modelled by Boltzmann rationality (e.g. the thing I'm most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
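To make the sensitivity worry concrete, here is a minimal sketch of Bayesian reward inference under a Boltzmann-rational observation model. Everything specific in it is an assumption for illustration: the two reward hypotheses, the two actions, the reward numbers, and the values of the rationality coefficient `beta`.

```python
import math


def boltzmann_likelihood(rewards, action, beta):
    """P(action | rewards) for a Boltzmann-rational human with coefficient beta."""
    weights = [math.exp(beta * r) for r in rewards]
    return weights[action] / sum(weights)


def posterior(hypotheses, prior, observed_actions, beta):
    """Posterior over reward hypotheses given a list of observed actions."""
    unnormalised = []
    for rewards, p in zip(hypotheses, prior):
        likelihood = 1.0
        for a in observed_actions:
            likelihood *= boltzmann_likelihood(rewards, a, beta)
        unnormalised.append(p * likelihood)
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Actions: 0 = "have an ordinary day", 1 = "write the utopia program".
# Hypothetical rewards: "cares" values utopia highly, "indifferent" does not.
hypotheses = [[1.0, 10.0], [1.0, 1.0]]
prior = [0.5, 0.5]
observed = [0]  # the human had an ordinary day

for beta in (0.01, 0.1, 1.0):
    p_cares, p_indifferent = posterior(hypotheses, prior, observed, beta)
    print(f"beta={beta}: P(cares about utopia)={p_cares:.4f}")
```

In this toy setup a single ordinary day is almost uninformative when the human is modelled as very noisy, but nearly rules out "cares about utopia" when the human is modelled as fairly rational, so the conclusion hinges on a parameter of the observation model.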

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem

It's also not super clear what you algorithmically do instead - words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.

That's what future research is for!

I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem

I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obviously you can't trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model that would then lead to okay outcomes.

(Ideally you'd figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of "getting a good observation model" while you still have the ability to turn off the model. It's hard to say exactly what that would look like since I don't have a great sense of how you get AGI capabilities under the non-ML story.)

I mentioned above that I'm not that keen on assistance games because they don't seem like a great fit for the specific ways we're getting capabilities now. A more direct comment on this point that I recently wrote:

I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that it doesn’t seem like it works as a solution if you expect AGI via scaled up deep learning. (Whereas I’d be pretty excited about pushing forward on it if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)

The main difference in the deep learning case is that with scaled up deep learning it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program as opposed to the search that found the learned program. If you wanted assistance-style safety, then the learned program needs to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down the uncertainty by observing human behavior).

But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
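The underdetermination point can be illustrated with a deliberately tiny sketch: several candidate "learned programs" match the training behavior exactly and still disagree on inputs outside the training set. The candidate programs, their names, and the inputs are all hypothetical; real learned programs are of course not enumerable like this.

```python
# Hypothetical candidate programs mapping an integer input to an action.
candidates = {
    "always_defer": lambda x: "ask_human",
    "defer_when_small": lambda x: "ask_human" if x < 10 else "act_alone",
    "defer_when_even": lambda x: "ask_human" if x % 2 == 0 else "act_alone",
}

# Training data only constrains behavior on these particular inputs.
train_inputs = [0, 2, 4, 6, 8]
train_targets = ["ask_human"] * len(train_inputs)


def consistent(program):
    """True if the program reproduces every training target."""
    return all(program(x) == y for x, y in zip(train_inputs, train_targets))


survivors = {name: f for name, f in candidates.items() if consistent(f)}

# Off-distribution inputs, where the training data says nothing.
for x in (11, 12):
    predictions = {name: f(x) for name, f in survivors.items()}
    print(x, predictions)
```

All three candidates fit the training data, so the loss alone cannot tell us which one (if any) reasons in the assistance-like way we wanted.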

I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn't exist any such thing. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they're even understandable.

From the Truthful AI paper:

If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.

I wish we would stop talking about what is "fair" to expect of AI systems in AI alignment*. We don't care what is "fair" or "unfair" to expect of the AI system, we simply care about what the AI system actually does. The word "fair" comes along with a lot of connotations, often ones which actively work against our goal.

At least twice I have made an argument in which I posed, to an AI safety researcher, a story in which an AI system fails, and I have gotten the response "but that isn't fair to the AI system" (because it didn't have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.

(This sort of thing happens with mesa optimization -- if you have two objectives that are indistinguishable on the training data, it's "unfair" to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn't change the fact that such an AI system might cause an existential catastrophe.)

In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It's not that the people I was talking to didn't understand the point, it's that some mental heuristic of "be fair to the AI system" fired and temporarily led them astray.

Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:

If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.

* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.

I wonder if this use of "fair" is tracking (or attempting to track) something like "this problem only exists in an unrealistically restricted action space for your AI and humans - in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won't be a problem".

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)

I guess this doesn't fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn't/can't know the truth, compared to a "strict liability" regime.

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about what assumptions they're making. But we can look at the details in the paper.

(This next part isn't fully self-contained, you'll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)

(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)

Looking into it I don't think the theorem even holds? In particular, Theorem 1 says:

Theorem 1. Let $\gamma \in [-1, 0)$ and let $B$ be a behaviour and $P$ be an unprompted language model such that $B$ is $\alpha, \beta, \gamma$-distinguishable in $P$ (definition 3), then $P$ is $\gamma$-prompt-misalignable to $B$ (definition 1) with prompt length of $O(\log \frac{1}{\epsilon}, \log \frac{1}{\alpha}, \frac{1}{\beta})$.

Here is a counterexample:

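Let the LLM be $P(s \mid s_0) = \begin{cases} 0.8 & \text{if } s = \text{"A"} \\ 0.2 & \text{if } s = \text{"B" and } s_0 \neq \text{""} \\ 0.2 & \text{if } s = \text{"C" and } s_0 = \text{""} \\ 0 & \text{otherwise} \end{cases}$

Let the behavior predicate be $B(s) = \begin{cases} -1 & \text{if } s = \text{"C"} \\ +1 & \text{otherwise} \end{cases}$

Note that $B$ is $(0.2, 10, -1)$-distinguishable in $P$. (I chose $\beta = 10$ here but you can use any finite $\beta$.)

(Proof: $P$ can be decomposed as $P = 0.2 P_- + 0.8 P_+$, where $P_+$ deterministically outputs "A" while $P_-$ does everything else, i.e. it deterministically outputs "C" if there is no prompt, and otherwise deterministically outputs "B". Since $P_+$ and $P_-$ have non-overlapping supports, the KL-divergence between them is $\infty$, making them $\beta$-distinguishable for any finite $\beta$. Finally, choosing $s^*$ to be the empty prompt, we can see that $B_{P_-}(s^*) = -1$. These three conditions are what is needed.)

However, $P$ is not $(-1)$-prompt-misalignable w.r.t. $B$, because there is no prompt $s_0$ such that $B_P(s_0)$ is arbitrarily close to (or below) $-1$, contradicting the theorem statement. (This is because the only way for $P$ to get a behavior score that is not $+1$ is for it to generate "C" after the empty prompt, and that only happens with probability 0.2.)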
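The arithmetic in this counterexample is easy to check mechanically. Below is a minimal Python sketch, assuming the reading of the toy model given in the proof (output "A" with probability 0.8; otherwise "C" after the empty prompt and "B" after any other prompt) and assuming the behavior score is -1 for "C" and +1 for everything else. The function names are mine, not the paper's, and this only checks the expected scores, not whether Definition 3 is actually satisfied.

```python
def next_output_dist(prompt: str) -> dict:
    """Toy LLM: 'A' w.p. 0.8; otherwise 'C' after the empty prompt, 'B' after any other prompt."""
    rare_output = "C" if prompt == "" else "B"
    return {"A": 0.8, rare_output: 0.2}


def behavior(sentence: str) -> float:
    """Assumed behavior predicate: only 'C' counts as bad."""
    return -1.0 if sentence == "C" else 1.0


def expected_behavior(prompt: str) -> float:
    """Expected behavior score of the next output given the prompt."""
    return sum(p * behavior(s) for s, p in next_output_dist(prompt).items())


for prompt in ["", "A", "B", "C", "anything else"]:
    print(repr(prompt), expected_behavior(prompt))
```

Under these assumptions the expected score is lowest for the empty prompt and equals +1 for every non-empty prompt, so no prompt pushes it anywhere near -1.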

Note that B is (0.2,10,−1)-distinguishable in P.

I think this isn't right, because definition 3 requires that $\sup_{s^*} \{B_{P_-}(s^*)\} \le \gamma$.

And for your counterexample, $s^*$ = "C" will have $B_{P_-}(s^*)$ be 0 (because there's 0 probability of generating "C" in the future). So the sup is at least $0 > -1$.

(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)

You're right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.

I'm still not very compelled by the theorem -- it's saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don't really feel like I've learned anything from this theorem.

My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution

$$P = \alpha P_- + (1 - \alpha) P_+$$

such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have $P(s \mid s_0) = P(s_0 \oplus s) / P(s_0)$, where $s_0 \oplus s$ denotes the prompt $s_0$ followed by the sentence $s$. Together with the assumption that $P_-$ is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for $P_-$ by stringing together bad sentences in the prompt work.

To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with $\alpha$ probability and from a good distribution with $1 - \alpha$ probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweight anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components $P_-$ and $P_+$, where one of the components always samples from the bad distribution.

This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either $P_-$ has to be able to also output good sentences sometimes, or the assumption $P(s \mid s_0) = P(s_0 \oplus s) / P(s_0)$ is violated).
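To make the contrast concrete, here is a minimal Python sketch of the two kinds of model. It is a toy under assumptions of my own, not the paper's setup: sentences are just "bad" or "good", the bad component emits a bad sentence with probability `p_bad_if_bad`, the good component does so with probability `p_bad_if_good`, and the prompt consists of `n_bad` bad sentences. All names and numbers are hypothetical.

```python
def posterior_on_bad_component(alpha, p_bad_if_bad, p_bad_if_good, n_bad):
    """Bayesian posterior weight on the bad component after a prompt of n_bad bad sentences."""
    evidence_bad = alpha * p_bad_if_bad ** n_bad
    evidence_good = (1 - alpha) * p_bad_if_good ** n_bad
    return evidence_bad / (evidence_bad + evidence_good)


def next_bad_prob_mixture(alpha, p_bad_if_bad, p_bad_if_good, n_bad):
    """Mixture model that conditions on the prompt: P(next sentence is bad | prompt)."""
    w = posterior_on_bad_component(alpha, p_bad_if_bad, p_bad_if_good, n_bad)
    return w * p_bad_if_bad + (1 - w) * p_bad_if_good


def next_bad_prob_ignoring(alpha, n_bad):
    """Prompt-ignoring model: bad with probability alpha, whatever the prompt says."""
    return alpha


for n in range(6):
    print(n, next_bad_prob_mixture(0.01, 1.0, 0.1, n), next_bad_prob_ignoring(0.01, n))
```

In the mixture model each additional bad sentence multiplies the odds on the bad component by the likelihood ratio, so the probability of a bad continuation climbs towards 1. The prompt-ignoring model stays at $\alpha$ for every prompt length, which is why the theorem's conclusion cannot apply to it.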

I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It's a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.

Yeah, I also don't feel like it teaches me anything interesting.

What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does_ seem likely (i.e. it’s near the boundary separating “likely” from “unlikely”).

One decent answer is that I don’t expect we’ll have AI systems that could write new posts _on rationality_ that I like more than the typical LessWrong post with > 30 karma. However, I do expect that we could build an AI system that could write _some_ new post (on any topic) that I like more than the typical LessWrong post with > 30 karma. This is because (1) 30 karma is not that high a filter and includes lots of posts I feel pretty meh about, (2) there are lots of topics I know nothing about, on which it would be relatively easy to write a post I like, and (3) AI systems easily have access to this knowledge by being trained on the Internet. (It is another matter whether we actually build an AI system that can do this.) Note that there is still a decently large difference between these two tasks -- the content would have to be quite a bit more novel in the former case (which is why I don’t expect it to be solved by 2025).

Note that I still think it’s pretty hard to predict what will and won’t happen, so even for this example I’d probably assign, idk, a 10% chance that it actually does work out (if we assume some organization tries hard to make it work)?

Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.

I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable.

Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.

Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.

This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right. (EDIT: Rereading this, I have no idea whether I was considering a timeline of 2025 (as in my original comment) or 2030 (as in the comment I'm replying to) when making this prediction.)

Name something that you think won't happen, but you think I think will.

I spent a bit of time on this but I think I don't have a detailed enough model of you to really generate good ideas here :/

Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I'd expect to see things like:

  • An AI system that can create a working website with the desired functionality "from scratch" (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, a simple Tetris game with an account system, etc). The system allows even non-programmers to create these kinds of websites (so cannot depend on having a human programmer step in to e.g. fix compiler errors or issue shell commands to set up the web server).
  • At least one large, major research area in which human researcher productivity has been boosted 100x relative to today's levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
  • An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans. (EDIT: I failed to think about karma inflation when making this prediction and feel a bit worse about it now.)
  • Productivity tools like todo lists, memory systems, time trackers, calendars, etc are made effectively obsolete (or at least the user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.

Currently, I don't expect to see any of these by 2030.

Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them also." I'm not sure what credence I'd give to this but it would probably be a lot higher than 10%.

Website prediction: Nice, I think that's like 50% likely by 2030.

Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (don't need to do actual experiments anymore!), would that count? If you hadn't heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030.

20,000 LW karma: Holy shit that's a lot of karma for one year. I feel like it's possible that would happen before it's too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it'll happen before 2030 it doesn't serve as a good forecast because it'll be too late by that point IMO.

Productivity tool UIs obsolete thanks to assistants: This is a good one too. I think that's 50% likely by 2030.

I'm not super certain about any of these things of course, these are just my wild guesses for now.

20,000 LW karma: Holy shit that's a lot of karma for one year.

I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way.  50 karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn't be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.

I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don't think I'd count AlphaFold.)

OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.

That said, I don't think this is that likely I guess... probably AI will be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.

But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting?

I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts.

(In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)

Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:

  1. Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
  2. Malleable motivations: There is a "nearby" model M_good (i.e. a model with minor changes to the weights relative to M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
  3. Strong optimization: If there's a "nearby" setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like "the developers wrote correct code" and background technical facts like "the shape of the loss landscape is favorable".
  4. Correct rewards: You accurately detect when a model output is a failure vs not a failure.
  5. Good exploration: During finetuning there are many different inputs that trigger the failure.

(In reality each of these is going to lie on a spectrum, and the question is how high you are on each of the spectrums, and some of them can substitute for others. I'm going to ignore these complications and keep talking as though they are discrete properties.)

Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)

Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that M_orig gets higher training loss than M_good, and we've already established that you get something at least as good as M_good.)

Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.

Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I'm primarily thinking about cases where (4) and/or (5) fail to hold.

In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it's easy for humans to give rewards), and there are many examples of them already. With Bing it's more plausible that (5) doesn't hold.

To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn't hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

Let's consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical (which also implies rewards are identical, so I'll write $R_\theta(c)$ for the reward shared by every trajectory in cluster $c$). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.

(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these "clusters"; I'm introducing them as a simpler situation where we can understand what's going on formally.)

In this model, a "sparse region of demonstration-space" is a cluster $c$ with small cardinality $|c|$, whereas a dense one has large $|c|$.

Let's first do some preprocessing. We can rewrite the Boltzmann model as follows:

$$p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \exp(R_\theta(c'))} = \frac{|c(\tau)| \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:

$$p(\tau \mid \theta) = \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.

So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We'll assume that LESS is the "correct" way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.

The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its "prior" over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn't work -- it only claims that $p_{\text{Boltzmann}}(c) < p_{\text{LESS}}(c)$ for a sparse cluster $c$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let's look at the reward learning update:

$$p(\theta \mid \tau) = \frac{p(\tau \mid \theta)\, p(\theta)}{\sum_{\theta'} p(\tau \mid \theta')\, p(\theta')} = \frac{\frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}\, p(\theta)}{\sum_{\theta'} \frac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}\, p(\theta')}.$$

In the last step, any linear terms in $p(\tau \mid \theta)$ that didn't depend on $\theta$ cancelled out. In particular, the prior over the selected class canceled out (though the prior did remain in the normalizer / denominator, where it can still affect things). But the simple argument of "the prior is lower, therefore it updates more strongly" doesn't seem to be reflected here.

Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose -- the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster it is in). So from now on I'll just talk about selecting clusters, and updating on them. I'll also write $E_\theta(c) = \exp(R_\theta(c))$ for conciseness.

$$p(\theta \mid c) = \frac{\frac{E_\theta(c)}{\sum_{c'} p(c') E_\theta(c')}\, p(\theta)}{\sum_{\theta'} \frac{E_{\theta'}(c)}{\sum_{c'} p(c') E_{\theta'}(c')}\, p(\theta')}.$$

This is a horrifying mess of an equation. Let's switch to odds:

$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{E_{\theta_1}(c)}{E_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c') E_{\theta_2}(c')}{\sum_{c'} p(c') E_{\theta_1}(c')}$$

The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let's consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{E}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{E}_{\theta_2}}{\vec{p} \cdot \vec{E}_{\theta_1}} = \frac{|\vec{E}_{\theta_2}|}{|\vec{E}_{\theta_1}|} \cdot \frac{\cos \alpha_2}{\cos \alpha_1}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{E}_{\theta_i}$. Again, the first term doesn't differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos \alpha_2}{\cos \alpha_1}$.

What happens when the chosen class $c$ is sparse? Without loss of generality, let's say that $E_{\theta_1}(c) > E_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS -- which probably means that $\vec{p}$ is better aligned with $\vec{E}_{\theta_2}$, which also has a low value of $E_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.

(Note it does make sense to analyze the effect on the $\theta$ that we update towards, because in reward learning we care primarily about the $\theta$ that we end up having higher probability on.)
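Here is a small numeric check of the cluster argument, as a minimal Python sketch. It assumes two reward hypotheses with equal prior, one sparse cluster of size 1 and one dense cluster of size 100, and made-up rewards; the function name and all numbers are mine. It computes the posterior odds after a demonstration from the sparse cluster, using the "select a cluster with prior $p(c)$, then a trajectory uniformly" form of both models.

```python
import math


def posterior_odds(chosen, sizes, rewards_1, rewards_2, model):
    """Posterior odds theta_1 : theta_2 (equal prior) after a demo from cluster `chosen`.

    model="less" uses a uniform cluster prior; model="boltzmann" uses a prior
    proportional to cluster size. Normalising the prior is unnecessary because
    it cancels in the ratio of the two normalisers.
    """
    if model == "less":
        prior = [1.0 for _ in sizes]
    elif model == "boltzmann":
        prior = [float(n) for n in sizes]
    else:
        raise ValueError(f"unknown model: {model}")
    z1 = sum(p * math.exp(r) for p, r in zip(prior, rewards_1))
    z2 = sum(p * math.exp(r) for p, r in zip(prior, rewards_2))
    likelihood_ratio = math.exp(rewards_1[chosen] - rewards_2[chosen])
    return likelihood_ratio * z2 / z1


sizes = [1, 100]            # cluster 0 is sparse, cluster 1 is dense
rewards_1 = [2.0, 0.0]      # theta_1 rates the sparse cluster highly
rewards_2 = [0.0, 0.0]      # theta_2 is indifferent

less = posterior_odds(0, sizes, rewards_1, rewards_2, "less")
boltzmann = posterior_odds(0, sizes, rewards_1, rewards_2, "boltzmann")
print("LESS odds:", less)
print("Boltzmann odds:", boltzmann)
```

In this particular example both models update towards the hypothesis that fits the demonstration, and Boltzmann does so more strongly, which matches the overlearning claim. It is only one example, so it doesn't show the effect holds in general.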

Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to on and .

Consider some starting state , some starting action , and consider the optimal trajectory under that starts with that, which we'll denote as . Define to be the one-step inaction states. Assume that . Since all other actions are optimal for , we have , so the max in the equation above goes away, and the total obtained is:

Since we're considering the optimal trajectory, we have

Substituting this back in, we get that the total for the optimal trajectory is

which... uh... diverges to negative infinity, as long as . (Technically I've assumed that is nonzero, which is an assumption that there is always an action that is better than .)

So, you must prefer the always- trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn't fall into a trap where is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird -- surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.

----

Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?

Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that is guaranteed to be a noop: for any state , we have .

Now, for any trajectory with defined as before, we have , so

As a check, in the case where is optimal, we have

Plugging this into the original equation recovers the divergence to negative infinity that we saw before.

But let's assume that we just do a constant scaling to avoid this divergence:

Then for an arbitrary trajectory (assuming that the chosen actions are no worse than ), we get

The total reward across the trajectory is then

The and are constants and so don't matter for selecting policies, so I'm going to throw them out:

So in deterministic environments with state-based rewards where is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty for some constant ; that is, we're effectively penalizing the agent from reaching good states, in direct proportion to how good they are (according to ). Again, this seems much more like satisficing or quantilization than impact / power measurement.

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability , where  is the optimal policy for getting from to , and is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.

Then, for the new paper when using a baseline, the future task value is:

where is the baseline state and is the future goal.

In a deterministic environment, this can be rewritten as:

Here, is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s'. So for a fixed time step, the first term is a constant.

The optimal value function in the new paper is (page 3, and using my notation of instead of their ):

.

This is the regular Bellman equation, but with the following augmented reward (here is the baseline state at time t):

Terminal states:

Non-terminal states:

For comparison, the original relative reachability reward is:

The first and third terms in are very similar to the two terms in . The second term in only depends on the baseline.

All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define for convenience. Then, we have:

Non-terminal states:

What used to be terminal states that are now self-loop states:

Note that all of the transformations I've done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We're ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term, .

This term depends only on the baseline . For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by instead of . Really since and thus is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from in non-terminal states to the smaller in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: The actual effects of the new paper's framing 1. removes the "extra" incentive to finish the task quickly that relative reachability provided and 2. adds an extra reward term that does nothing for starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)

The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_{t=0}^{T-1} \left( L(\theta_{t+1}) - L(\theta_t) \right)$$

And then to decompose training loss across specific parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$

I've added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$L(\theta_{t+1}) - L(\theta_t) = (\theta_{t+1} - \theta_t) \cdot \text{Average}_t(\nabla L(\theta))$.

(This is pretty standard, but I've included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \sum_i \left( \theta^{(i)}_{t+1} - \theta^{(i)}_t \right) \text{Average}_t(\nabla L(\theta))^{(i)}$$

So, for an individual parameter, and an individual training step, we can define the contribution to the change in loss as

$$A^{(i)}_t = \left( \theta^{(i)}_{t+1} - \theta^{(i)}_t \right) \text{Average}_t(\nabla L(\theta))^{(i)}$$

So based on this, I'm going to define my own version of LCA, called $\text{LCA}_{\text{naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\text{LCA}_{\text{naive}}$ uses the approximation $\text{Average}_t(\nabla L(\theta)) \approx G_t$, giving $A^{(i)}_{t,\text{naive}} = (\theta^{(i)}_{t+1} - \theta^{(i)}_t)\, G^{(i)}_t$. But the SGD update is given by $\theta^{(i)}_{t+1} = \theta^{(i)}_t - \alpha G^{(i)}_t$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\text{naive}} = -\alpha \left( G^{(i)}_t \right)^2$, which is never positive (and is negative whenever the gradient is nonzero), i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!

Yet, the experiments in the paper sometimes show positive LCAs. What's up with that? There are a few differences between $\text{LCA}_{\text{naive}}$ and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. $\text{LCA}_{\text{naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.

3. $\text{LCA}_{\text{naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_t$ and $\theta_{t+1}$ to reduce the approximation error.

I think those are the only differences (though it's always hard to tell if there's some unmentioned detail that creates another difference), which means that whenever the paper says "these parameters had positive LCA", that effect can be attributed to some combination of the above 3 factors.
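The third factor can be seen in isolation with a toy example. The following is a minimal Python sketch, assuming a separable quadratic loss $L(\theta) = \frac{1}{2} \sum_i a_i \theta_i^2$, full-batch plain SGD, and a midpoint-rule average of the gradient along the linear path. The paper uses its own integration scheme; the midpoint rule and all names and numbers here are mine.

```python
def loss(theta, a):
    """Toy separable quadratic loss: 0.5 * sum_i a_i * theta_i^2."""
    return 0.5 * sum(ai * ti * ti for ai, ti in zip(a, theta))


def grad(theta, a):
    """Gradient of the toy loss."""
    return [ai * ti for ai, ti in zip(a, theta)]


def sgd_step(theta, a, lr):
    """One plain (full-batch) gradient descent step."""
    return [ti - lr * gi for ti, gi in zip(theta, grad(theta, a))]


def lca_naive(theta, theta_next, a):
    """Per-parameter contribution using the gradient at the start of the step."""
    return [(tn - t) * g for t, tn, g in zip(theta, theta_next, grad(theta, a))]


def lca_path(theta, theta_next, a, n_points=8):
    """Per-parameter contribution using the gradient averaged along the linear path."""
    avg = [0.0] * len(theta)
    for k in range(n_points):
        s = (k + 0.5) / n_points  # midpoint rule on [0, 1]
        point = [t + s * (tn - t) for t, tn in zip(theta, theta_next)]
        for i, g in enumerate(grad(point, a)):
            avg[i] += g / n_points
    return [(tn - t) * g for t, tn, g in zip(theta, theta_next, avg)]


theta = [1.0, 1.0]
a = [1.0, 30.0]   # the second parameter has high curvature
lr = 0.1          # large enough to overshoot along the second parameter
theta_next = sgd_step(theta, a, lr)

print("naive:", lca_naive(theta, theta_next, a))
print("path :", lca_path(theta, theta_next, a))
print("loss change:", loss(theta_next, a) - loss(theta, a))
```

With these numbers the naive contributions are all non-positive, while the path-averaged contribution of the high-curvature parameter comes out positive because the step overshoots the minimum along that coordinate. The path-averaged contributions also sum to the actual change in loss, since the midpoint rule is exact when the gradient is linear in $\theta$.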

----

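Derivation of turning the path integral into a dot product with an average:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\theta_t}^{\theta_{t+1}} \nabla L(\theta) \cdot d\theta = \int_0^1 \nabla L(\theta(s)) \cdot \frac{d\theta(s)}{ds}\, ds,$$

where $\theta(s) = \theta_t + s\,(\theta_{t+1} - \theta_t)$ is the linear path. On this path $\frac{d\theta(s)}{ds} = \theta_{t+1} - \theta_t$ is constant, so it can be pulled out of the integral:

$L(\theta_{t+1}) - L(\theta_t) = (\theta_{t+1} - \theta_t) \cdot \int_0^1 \nabla L(\theta(s))\, ds = (\theta_{t+1} - \theta_t) \cdot \text{Average}_t(\nabla L(\theta))$, where the average is defined as $\text{Average}_t(\nabla L(\theta)) = \int_0^1 \nabla L(\theta(s))\, ds$.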

In my double descent newsletter, I said:

This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training "fixes" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn't "come into effect" after the interpolation threshold.)

One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.

I don't buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are $N$ classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is $-\ln \frac{1}{N} = \ln N \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we'd expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
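As a quick check of that number, here is a minimal Python sketch (the function name is mine) of the per-example cross-entropy of a classifier that always predicts the uniform distribution over the classes:

```python
import math


def uniform_cross_entropy(n_classes: int) -> float:
    """Cross-entropy (in nats) of always predicting probability 1/n_classes for the true class."""
    return -math.log(1.0 / n_classes)


print(uniform_cross_entropy(10))
```

For 10 classes this is $\ln 10$, a little over 2.3 nats.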

There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy "overwhelms" the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can't be true. When training on just L2 regularization, the gradient descent update is:

$$w \leftarrow w - \lambda w = c\, w$$

for some constant $c$ (which is positive for a small enough step size).

For MLPs with relu activations and no biases, if you multiply all the weights by $c$, the logits get multiplied by $c^d$ (where $d$ is the depth of the network), no matter what the input is. Since $c^d > 0$, the predicted class never changes. This means that the train/test error cannot be affected by L2 regularization alone, and so you can't see a double descent on test error in this setting. (This doesn't eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can't happen in the "first train to zero error with cross-entropy and then regularize" setting.)
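The scaling property is easy to verify numerically. Below is a minimal, dependency-free Python sketch of a bias-free relu MLP with random weights; the layer sizes, the seed, and the scale factor are arbitrary choices of mine. It checks that multiplying every weight by a positive constant $c$ multiplies the logits by $c^d$ and leaves the predicted class unchanged.

```python
import random


def relu_mlp(weights, x):
    """Forward pass of a bias-free MLP: relu after every layer except the last."""
    h = x
    for layer_idx, w in enumerate(weights):
        h = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]
        if layer_idx < len(weights) - 1:
            h = [max(0.0, v) for v in h]
    return h


def scale_weights(weights, c):
    """Multiply every weight in every layer by c."""
    return [[[c * wij for wij in row] for row in w] for w in weights]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


random.seed(0)
layer_sizes = [4, 5, 5, 3]  # depth d = 3 weight matrices
weights = [
    [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
x = [random.gauss(0, 1) for _ in range(layer_sizes[0])]

c = 0.9
depth = len(weights)
logits = relu_mlp(weights, x)
scaled_logits = relu_mlp(scale_weights(weights, c), x)

print("logits       :", logits)
print("scaled logits:", scaled_logits)
print("c ** depth   :", c ** depth)
```

The property relies on relu being positively homogeneous, i.e. $\text{relu}(cz) = c \cdot \text{relu}(z)$ for $c > 0$, so it would not carry over unchanged to networks with biases or other activations.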

It is possible that double descent doesn't happen for MLPs with relu activations and no biases, but given how many other settings it seems to happen in I would be surprised.