All of Vlad Mikulik's Comments + Replies

FWIW I agree with you and wouldn't put it the way it is in Roger's post. Not sure what Roger would say in response.

The reasoning for the indistinguishability criterion is as a precaution against a model sandbagging when used as the red team attack model. If the model can tell it’s in a control eval it could pretend to be worse at attacking, making us think the control measures are better than they actually are. If the red team elicitation is robust to this, I agree this isn’t necessary. (See the note below [P2.4]).

6Buck Shlegeris
I don't think this makes any sense. How are you hoping to get the model to attack except by telling it that it's in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.

Ah, yep, I think you're right -- it should be pretty easy to add support for and in selectors then.

You're right that accounting for the softmax could be used to get around the argument in the appendix.  We'll mention this when we update the paper.

The scheme as you described it relies on an always-on dummy token, which would conflict with our implementation of default values in aggregates: we fix the BOS token attention to 0.5, so at low softmax temp it's attended to iff nothing else is; however with this scheme we'd want 0.5 to always round off to zero after softmax. This is plausibly surmountable but we put it out of scope for the pilot release si... (read more)

1Gurkenglas
1.5 does conjunction without my dummy. When at most one clause is true, 0.5 does disjunction instead.
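
One possible reading of the thresholds discussed above, as a minimal sketch rather than the paper's implementation: each position's attention score is the number of clauses it satisfies, the BOS token gets a fixed score, and a low-temperature softmax keeps only the top-scoring positions. The helper names (`softmax`, `attended`) and the `1e-3` cutoff are assumptions made for illustration.

```python
import math

def softmax(scores, temperature):
    # Numerically stable softmax with an explicit temperature.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attended(clause_a, clause_b, bos_score, temperature=0.01):
    # Index 0 is the BOS token with a fixed score (assumed setup).
    # Every other position scores 0, 1 or 2: the number of true clauses.
    scores = [bos_score] + [int(a) + int(b) for a, b in zip(clause_a, clause_b)]
    weights = softmax(scores, temperature)
    # At low temperature only the top-scoring positions keep non-negligible weight.
    return [w > 1e-3 for w in weights]

a = [True, True, False]
b = [True, False, False]

# BOS at 1.5: only a position satisfying both clauses beats BOS (conjunction).
print(attended(a, b, bos_score=1.5))
# BOS at 0.5 with at most one clause true per position: any true clause beats BOS.
print(attended([True, False, False], [False, True, False], bos_score=0.5))
```

With BOS at 0.5 and a position where both clauses hold, that position's score of 2 crowds out positions with a single true clause, which is why the disjunction reading needs the "at most one clause is true" condition.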

Nice work! 

This section in Anthropic's work on Induction heads seems highly relevant -- I would be interested in seeing an extension of your analysis that looks at what induction heads do in these tasks.

If we believe the claims in that paper, then in-context learning of any kind seems to be driven by a fairly simple mechanism not unlike kNN -- induction attention heads. Since it's pretty tractable to locate induction heads in an automated way, we could potentially take a look at the actual mechanism being used to implement these predictions and verify/fa... (read more)
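
To make "locate induction heads in an automated way" concrete, here is a minimal sketch of one commonly used heuristic, not necessarily the procedure from the paper: feed the model a token sequence repeated with period `period`, then measure how much attention each query position in the repeated part pays to the token just after the previous occurrence of the current token. The function name and the plain list-of-lists attention format are assumptions for illustration.

```python
def induction_score(attention, period):
    # attention[q][k] is the weight query position q puts on key position k,
    # for one head, on an input that repeats with the given period.
    seq_len = len(attention)
    # For a query at q >= period, the token after the previous occurrence
    # of the current token sits at position q - period + 1.
    targets = [(q, q - period + 1) for q in range(period, seq_len)]
    return sum(attention[q][k] for q, k in targets) / len(targets)

# Toy check with period 3 and sequence length 6.
ideal = [[0.0] * 6 for _ in range(6)]
for q in range(6):
    ideal[q][q - 2 if q >= 3 else 0] = 1.0  # perfect induction pattern

# Uniform causal attention: each query spreads weight evenly over keys 0..q.
uniform = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(6)] for q in range(6)]

ideal_score = induction_score(ideal, period=3)
uniform_score = induction_score(uniform, period=3)
```

A head scoring close to 1 on this measure is a candidate induction head; a head with diffuse attention scores much lower.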

0Lovre
That seems like a great idea, and induction heads do seem highly relevant! What you describe is actually one of the key reasons why I'm so excited about this whole approach. I've seen many interesting metalearning tasks, and they mostly just either work or don't, or they fail sometimes, and you can try to study their failures to perhaps glean some insight into the underlying algorithm -- but...they just don't have (m)any nontrivial "degrees of freedom" in which you can vary them. The class of numerical models, on the other hand, has a substantial number of nontrivial ways in which you can vary your input -- and even more so, you can vary it not just discretely, but also ~continuously. That makes me really optimistic about the possibility at which you hint, of reverse engineering whatever algorithm the model is running underneath, and then using interpretability tools to verify/falsify those findings. Conversely, interpretability tools could be used to make predictions about the algorithm, which can then be checked. Hence one can imagine a quite meaningful feedback loop between experimentation and interpretability!

Thanks for the link. This has been on my reading list for a little bit and your recco tipped me over.

Mostly I agree with Paul's concerns about this paper. 

However, I did find the "Transformer Feed-Forward Layers Are Key-Value Memories" paper they reference more interesting -- it's more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it's IMO stronger evidence for the hypothesis, although not conclusive by any means.

Some experiments they show:

  • Top-k activations of individual 'keys' do seem like coher
... (read more)

Thanks for a great post.

---

One nice point that this post makes (which I suppose was also prominent in the talk, but I can only guess, not being there myself) is that there's a kind of progression we can draw (simplifying a little):

- Human specifies what to do (Classical software)
- Human specifies what to achieve (RL)
- Machine infers a specification of what to achieve (IRL)
- Machine collaborates with human to infer and achieve what the human wants (Assistance games)

Towards the end, this post describes an extrapolation of this trend,

- Machine and human colla... (read more)

1Alex Flint
Thank you for the kind words. Well it would definitely be a mistake to build an AI system that extracts human intentions at some fixed point in time and treats them as fixed forever, yes? So it seems to me that it would be better to build systems predicated on that which is the underlying generator of the trajectory of human intentions. When I say "something bigger that human's intentions should be aligned with" I don't mean "physically bigger", I mean "prior to" or "the cause of". For example, the work concerning corrigibility is about building AI systems that can be modified later, yes? But why is it good to have AI systems that can be modified later? I would say that the implicit claim underlying corrigibility research is that we believe humans have the capacity to, over time, slowly and with many detours, align our own intentions with that which is actually good. So we believe that if we align AI systems with human intentions in a way that is not locked in, then we will be aligning AI systems with that which is actually good. I'm not claiming this is true, just that this is a premise of corrigibility being good. Another way of looking at it: Suppose we look at a whole universe with a single human embedded in it, and we ask: where in this system should we look in order to discover the trajectory of this human's intentions as they evolve through time? We might draw a boundary around the human's left foot and ask: can we discover the trajectory of this human's intentions by examining the configuration of this part of the world? We might draw a boundary around the human's head and ask the same question, and I think some would say in this case that the answer is yes, we can discover the human's intentions by examining the configuration of the head. But this is a remarkably strong claim: it asserts that there is no information crucial to tracing the trajectory of the human's intentions over time in any part of the system outside the head. If we draw a boundary aro

Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.

Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:

The scheme doesn't protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is... (read more)

2michaelcohen
Yes, I agree that an ensemble of models generated by a neural network may have correlated errors. I only claim to solve the inner alignment problem in theory (i.e. for idealized Bayesian reasoning).

More than a year since writing this post, I would still say it represents the key ideas in the sequence on mesa-optimisation which remain central in today's conversations on mesa-optimisation. I still largely stand by what I wrote, and recommend this post as a complement to that sequence for two reasons:

First, skipping some detail allows it to focus on the important points, making it better-suited than the full sequence for obtaining an overview of the area. 

Second, unlike the sequence, it deemphasises the mechanism of optimisation, and explicitly cas... (read more)

Not Abram, and I have only skimmed the post so far, and maybe you're pointing to something more subtle, but my understanding is this:

In Stuart's original use, 'No Indescribable Hellworlds' is the hypothesis that in any possible world in which a human's values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.

Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can poin... (read more)

Thanks for writing this. 

I wish you had included an entry for your definition of 'mesa-optimizer'. When you use the term, do you mean the definition from the paper* (an algorithm that's literally doing search using the mesa objective as the criterion), or do you speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)?

A related question is: how would you describe a policy that's a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?

*inciden... (read more)

2Evan Hubinger
Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.

Good point -- I think I wasn't thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it's doing something really impressive that isn't well-explained by interpolating on dataset examples -- I'm imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.

I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of 'learni... (read more)

I am quite confused. I wonder if we agree on the substance but not on the wording, but perhaps it’s worthwhile talking this through.

I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating π∗ within the constraints requires computing f, then any policy that approximates π∗ must compute f. (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we... (read more)

3Evan Hubinger
How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.

I interpreted your previous point to mean you only take updates off-policy, but now I see what you meant. When I said you can update after every observation, I meant that you can update once you have made an environment transition and have an (observation, action, reward, observation) tuple. I now see that you meant the RL algorithm doesn't have the ability to update on the reward before the action is taken, which I agree with. I think I still am not convinced, however.
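
To pin down the timing being discussed, here is a toy sketch of a per-step update on a two-armed bandit. It is a hypothetical illustration, not a setup from the thread: the deterministic rewards, the optimistic initial values, and the names `run_bandit`, `learning_rate`, and `initial_value` are all assumptions. The point is only the ordering inside the loop: the action at step t is chosen from the current values, and the reward from that action can influence behaviour no earlier than step t+1.

```python
def run_bandit(rewards, steps, learning_rate=0.5, initial_value=1.0):
    # Optimistic initial values make the greedy learner try every arm once.
    values = [initial_value] * len(rewards)
    log = []
    for t in range(steps):
        # 1. Act using the values as they stand before this step's reward.
        action = max(range(len(values)), key=lambda a: values[a])
        # 2. The environment responds (deterministic rewards, for clarity).
        reward = rewards[action]
        # 3. Only now can the outer algorithm update on this reward.
        values[action] += learning_rate * (reward - values[action])
        log.append((t, action, reward))
    return values, log

values, log = run_bandit(rewards=[0.2, 0.8], steps=5)
actions = [action for _, action, _ in log]
print(actions)  # [0, 1, 1, 1, 1]
```

Even with an update after every single transition, step 3 always follows step 1, so any adaptation needed within a step has to come from the policy itself.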

And can we taboo the word 'learning' for this discussion, or keep it to t... (read more)

3Evan Hubinger
I agree with all of that—I was using the term “learning” to be purposefully vague precisely because the space is so large and the point that I'm making is very general and doesn't really depend on exactly what notion of responsiveness/learning you're considering. This does in fact seem like an interesting angle from which to analyze this, though it's definitely not what I was saying—and I agree that current meta-learners probably aren't doing this. What I mean here is in fact very basic—let me try to clarify. Let π∗:S→A be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates π∗ has to perform some operation f. Then, my point is just that, for π to achieve performance comparable with π∗, it has to do f. And my argument for that is just simply because we know that you have to do f to get good performance, which means either π has to do f or the gradient descent algorithm has to—but we know the gradient descent algorithm can't be doing something crazy like running f at each step and putting the result into the model because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.
I've thought of two possible reasons so far.
Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its
... (read more)
I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode.

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a 'learning-to-learn' style policy.

So I think this is a less fundamental reason, though it is true in off-policy RL.

2Evan Hubinger
You can't update the model based on its action until it's taken that action and gotten a reward for it. It's obviously possible to throw in updates based on past data whenever you want, but that's beside the point -- the point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means if taking actions requires learning, then the model itself has to do that learning.

I had a similar confusion when I first read Evan's comment. I think the thing that obscures this discussion is the extent to which the word 'learning' is overloaded -- so I'd vote to taboo the term and use more concrete language.

You might want to look into NMF, which, unlike PCA/SVD, doesn't aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don't think it will allow you to find directly the 'larger set of almost orthogonal vectors' you're looking for.
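
For concreteness, a minimal pure-Python sketch of NMF using the standard multiplicative update rules (Lee and Seung). The toy matrix `V`, the rank, the step count, and the helper names are assumptions chosen for illustration; a real analysis would more likely use a library implementation. The property to notice is that every factor entry stays non-negative, so components can only add up and never cancel.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation.
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), elementwise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h, initial_error

# Toy non-negative data built from two additive "parts".
V = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 4.0], [0.0, 1.0, 0.0]]
W, H, err_before = nmf(V, rank=2)
err_after = frobenius_error(V, W, H)
```

Because the updates only multiply by non-negative ratios, `W` and `H` never acquire negative entries, in contrast to PCA/SVD components, which are orthogonal and generally mix signs.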

I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.

Re: inner alignment

I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.

Inner alignment gap? Inner objective gap?

I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.

As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at le

... (read more)
1Matthew Barnett
I'm not sure what's unsatisfying about the characterization I gave? If we just redefined optimizer to mean an interpretation of the agent's behavior, specifically, that it abstractly pursues goals, why is that an unsatisfying way of showing the mesa != base issue? ETA: I agree. And the relevance this plays is that in future systems that might experience malign generalization, we would want some model of how goals play a role in their architecture, because this could help us align the system. But until we have such architectures, or until we have models for how those future systems should behave, we should work abstractly.

I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.

In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.

We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.

For me, mesa-optimisation’s primary claim isn’t (call it Optimisers) that agents are well-described as optimisers, which I’m happy to drop. It is the claim (call it Mesa≠Base) that whatever the right way to describe them is, in general their intrinsic goals are distinct from the reward.

That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, then system

... (read more)
3Rohin Shah
From my perspective, there are three levels:
* Most general: The inner agent could malignly generalize in some arbitrary bad way.
* Middle: The inner agent malignly generalizes in such a way that it makes sense to call it goal-directed, and the mesa-goal (= intentional-stance-goal) is different from the base-goal.
* Most specific: The inner agent encodes an explicit search algorithm, an explicit world model, and an explicit utility function.
I worry about the middle case. It seems like upon reading the mesa optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they'd test it against the most general case, but if it doesn't work against that, which it probably won't, that isn't necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case, rather than people accidentally worrying about the most specific case. I like "inner alignment", and am not sure why you think it isn't specific enough.
1Matthew Barnett
I think it's fairly easy to point out the problem using an alternative definition. If we just change the definition of mesa optimizer to reflect that we're using the intentional stance (in other words, we're interpreting the neural network as having goals, whether it's using an internal search or not), the mesa!=base description falls right out, and all the normal risks about building mesa optimizers still apply.
2Matthew Barnett
When you don't have a deep understanding of a phenomenon, it's common to use some empirical description of what you're talking about, rather than using your current (and incorrect) model to interpret the phenomenon. The issue with using your current model is that it leads you to make incorrect inferences about why things happen because you're relying too heavily on the model being internally correct. Therefore, until we gain a deeper understanding, it's better to use the pre-theoretical description of what we're talking about. I'm assuming that's what Rohin meant by "retreat to malign generalization." This is important because if we used the definition given in the paper, then this could affect which approaches we use to address inner alignment. For instance, we could try using some interpretability technique to discover the "objective" that a neural network was maximizing. But if our model of the neural network as an optimizer is ultimately incorrect, then the neural network won't have an explicit objective, making this technique very difficult.

I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted f

... (read more)
5Rohin Shah
Even taking that as an assumption, it seems like if we accept that "mesa optimizer" doesn't work as a description of humans, then mesa optimization can't be the right view, and we should retreat to malign generalization while trying to figure out a better view.

By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.

I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.

I don’t know whether observing key-collection beha

... (read more)
5Rohin Shah
I don't see why we care about mesa-optimization in particular. The argument for risk just factors through the fact that capabilities generalize, but the objective doesn't. Why does it matter whether the agent is internally performing some kind of search?
1Matthew Barnett
I agree. That's why I think we should compare it to a hard-coded agent that pursues the optimal policy for collecting keys, and an agent that pursues the optimal policy for opening chests. If the trained agent is similar to the first hard-coded agent rather than the second, this would be striking evidence of misalignment.
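
The comparison proposed above could be scored along the following lines. This is a hypothetical sketch: the state encoding, the two reference policies, and the `agreement` helper are invented for illustration, and a real experiment would use the actual environment and optimal policies for each objective.

```python
def agreement(policy, reference, states):
    # Fraction of states in which the two policies choose the same action.
    matches = sum(1 for state in states if policy(state) == reference(state))
    return matches / len(states)

# Toy states: each records the move toward the nearest key and nearest chest.
states = [
    {"toward_key": "left", "toward_chest": "right"},
    {"toward_key": "up", "toward_chest": "up"},
    {"toward_key": "down", "toward_chest": "left"},
    {"toward_key": "right", "toward_chest": "down"},
]

def key_seeker(state):
    # Stand-in for a hard-coded agent that is optimal for collecting keys.
    return state["toward_key"]

def chest_seeker(state):
    # Stand-in for a hard-coded agent that is optimal for opening chests.
    return state["toward_chest"]

def trained_agent(state):
    # Placeholder for the learned policy; here it happens to chase keys.
    return state["toward_key"]

key_agreement = agreement(trained_agent, key_seeker, states)
chest_agreement = agreement(trained_agent, chest_seeker, states)
```

States where the two reference policies disagree are the informative ones; a large gap between the two agreement rates on those states is the kind of evidence the comment describes.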

Ah; this does seem to be an unfortunate confusion.

I didn’t intend to make ‘utility’ and ‘reward’ terminology – that’s what ‘mesa-’ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense as in your comment, so I wanted to use utility and reward as friendlier and familiar words for this intuition-building post. I am not currently inclined to rewrite the whole thing using different words because of this clash, but could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted termi

... (read more)
5Issa Rice
To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:
1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)
The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3). Depending on the agent that is considered, only some of these levels may be present:
* A "dumb" RL-trained agent that engages in reward gaming. Only level (1), and there is no mesa-optimizer.
* A "dumb" RL-trained agent that engages in reward tampering. Only level (1), and there is no mesa-optimizer.
* A paperclip maximizer built from scratch. Only level (3), and there is no mesa-optimizer.
* A relatively "dumb" mesa-optimizer trained using RL might have just (1) (the base objective) and (2) (the mesa-objective). This kind of agent would be incentivized to tamper with its pleasure circuitry (in the sense of (2)), but wouldn't be incentivized to tamper with its RL-reward circuitry. (Example: rats wirehead to give themselves MAX_PLEASURE, but don't self-modify to delude themselves into thinking they have left many descendants.)
* If the training procedure somehow coughs up a mesa-optimizer that doesn't have a "pleasure center" in its brain (I don't know how this would happen, but it seems logically possible), there would just be (1) (the base objective) and (3) (the mesa-objective). This kind of agent wouldn't try to tamper with its utility function (in the sense of (3)), nor would it try to tamper with its RL-reward/base-obj
1Rohin Shah
The reward vs. utility distinction in the grandparent has existed for a while, see for example Learning What to Value.

You’re completely right; I don’t think we meant to have ‘more formally’ there.

I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?

P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.

P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap betwee

... (read more)
4Tom Everitt
Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context). Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is us

... (read more)
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it ma... (read more)

I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.

That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.

If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.

My tentative sense is that for AI to be interpretable (and safer) we want

... (read more)

Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.

I do think this is a pretty speculative argument, even for this sequence.

The main benefit I see of hardcoding optimisation is that, assuming the system's pieces learn as intended (without any mesa-optimisation happening in addition to the hardcoded optimisation) you get more access and control as a programmer over what the learned objective actually is. You could attempt to regress the learned objective directly to a goal you want, or attempt to enforce a certain form on it, etc. When the optimisation itself is learned*, the optimiser is more opaque, and you have fewer ways to affect what goal is learned: which weights of... (read more)

1Rohin Shah
That makes sense, thanks.

The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:

For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

The idea there isn't that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because tha... (read more)

1Rohin Shah
Just to make sure I understand, this example assumes that the base objective is "predict human behavior", and doesn't apply to most base objectives, right?

to what extent are mesa-controllers with simple behavioural objectives going to be simple?

I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrenc

... (read more)
3Abram Demski
Yeah, I agree that this seems possible, but extremely unclear. If something uses a fairly complex algorithm like FFT, is it search? How "sophisticated" can we get without using search? How can we define "search" and "sophisticated" so that the answer is "not very sophisticated"?

(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here’s some initial thoughts.)

I endorse this post. The distinction explained here seems interesting and fruitful.

I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the

... (read more)

Yeah, I agree with most of what you're saying here.

  • A learned controller which isn't implementing any internal selection seems more likely to be incoherent out-of-distribution (ie lack a strong IRL interpretation of its behavior), as compared with a mesa-optimizer;
  • However, this is a low-confidence argument at present; it's very possible that coherent controllers can appear w/o necessarily having a behavioral objective which matches the original objective, in which case a version of the internal alignment problem applies. (But this might be a
... (read more)

The goal that the agent is selected to score well on is not necessarily the goal that the agent is itself pursuing. So, unless the agent’s internal goal matches the goal for which it’s selected, the agent might still seek influence because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges)”.

1Alex Turner
And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.

I agree. That’s what I meant when I wrote there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string.

Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t know whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.

2Alex Mennen
I didn't mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.

I agree that this probably happens when you set out to mess with an arbitrary particular S, i.e. try to make some S’ that shares a prefix with S as likely as S.

However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions; selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natu... (read more)

1Alex Mennen
This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.

The trigger sequence is a cool idea.

I want to add that the intended generator TM also needs to specify a start-to-read time, so there is symmetry there. Whatever method a TM needs to use to select the camera start time in the intended generator for the real world samples, it can also use in the simulated world with alien life, since for the scheme to work only the difference in complexity between the two matters.

There is additional flex in that, unlike the intended generator, the reasoner TM can sample its universe simulation at any cheaply computable interval, giving the civilisation the option of performing any amount of thinking between outputs.