AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

Vladimir_Nesov · 2d · 169
Comparing Payor & Löb
I would term □x→x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property of □x→x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want to □x→¬x or something, though that only prevents x from being believed, so that you won't need to face □x in actuality; it doesn't prevent the actual x. So □x→x reflects a value judgement about x embodied in the agent's policy, something downstream of endorsement of x, a law of how the content of the world behaves according to an embedded agent's will.

Payor's Lemma then talks about belief in hope, □(□x→x), that is, hope itself is exogenous and needs to be judged (endorsed or not). Which is reasonable for games, since what the coalition might hope for is not anyone's individual choice; the details of this hope couldn't have been hardcoded in any agent a priori and need to be negotiated during a decision that forms the coalition. A functional coalition should be willing to act on its own hope (which is again something we need to check for a new coalition, though it might've already been the case for a singular agent), that is, we need to check that □(□x→x) is sufficient to motivate the coalition to actually x. This is again a value judgement about whether this coalition's tentative aspirations, being a vehicle for hope that x, are actually endorsed by it. Thus I'd term □(□x→x) "coordination" rather than "trust", the fact that this particular coalition would tentatively intend to coordinate on a hope for x. Hope □x→x is a value judgement about x, and in this case it's the coalition's hope, rather than any one agent's hope, and the coalition is a temporary nascent agency thing that doesn't necessarily know what it wants yet. The coalition asks: "If we find ourselves hoping for x together, will we act on it?" So we start with coordination about hope, seeing if this particular hope wants to settle as the coalition's actual values, and judging if it should by enacting x if at least coordination on this particular hope is reached, which should happen only if x is a good thing.

(One intuition pump with some limitations outside the provability formalism is treating □x as "probably x", perhaps according to what some prediction market tells you. If "probably x" is enough to prompt you to enact x, that's some kind of endorsement, and it's a push towards increasing the equilibrium-on-reflection value of the probability of x, pushing "probably x" closer to reality. But if x is terrible, then enacting it in response to its high probability is following along with self-fulfilling doom, rather than doing what you can to push the equilibrium away from it.)

Löb's Theorem then says that if we merely endorse a belief by enacting the believed outcome, this is sufficient for the outcome to actually happen, a priori and without that belief yet being in evidence. And Payor's Lemma says that if we merely endorse a coalition's coordinated hope by enacting the hoped-for outcome, this is sufficient for the outcome to actually happen, a priori and without the coordination around that hope yet being in evidence. The use of Löb's Theorem or Payor's Lemma is that the condition (belief in x, or coordination around hope for x) should help in making the endorsement, that is, it should be easier to decide to x if you already believe that x, or if you already believe that your coalition is hoping for x.
For coordination, this is important because every agent can only unilaterally enact its own part in the joint policy, so it does need some kind of premise about the coalition's nature (in this case, about the coalition's tentative hope for what it aims to achieve) in order to endorse playing its part in the coalition's joint policy. It's easier to decide to sign an assurance contract than to unconditionally donate to a project, and the role of Payor's Lemma is to say that if everyone does sign the assurance contract, then the project will in fact get funded sufficiently.
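
(A compact restatement of the dictionary this comment proposes, in LaTeX; the glosses are my paraphrase of the comment, not terminology from the post itself.)

```latex
% Nesov's proposed reading, restated (paraphrase, not from the post):
\begin{align*}
  \text{``hope for } x\text{''} &:\; \Box x \to x
      && \text{(willingness to enact } x \text{ in response to belief in } x\text{)} \\
  \text{``coordination'' on that hope} &:\; \Box(\Box x \to x)
      && \text{(the coalition believes it hopes for } x\text{)} \\
  \text{Payor's Lemma in these terms} &:\; \text{if } \vdash \Box(\Box x \to x) \to x \text{, then } \vdash x
      && \text{(endorsing the coordinated hope suffices for } x\text{)}
\end{align*}
```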
johnswentworth · 5d · 2219
Legible vs. Illegible AI Safety Problems
This is close to my own thinking, but doesn't quite hit the nail on the head. I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline. Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here.
Rohin Shah · 22d · 84
Research Agenda: Synthesizing Standalone World-Models
> a human's world-model is symbolically interpretable by the human mind containing it.

Say what now? This seems very false:

* See almost anything physical (riding a bike, picking things up, touch typing a keyboard, etc). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model this should be much easier to do.
* Basically anything to do with vision / senses. Presumably if vision was symbolically interpretable to the mind then there wouldn't be much of a skill ladder to climb for things like painting.
* Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
Inner and outer alignment decompose one hard problem into two extremely hard problems
Best of LessWrong 2022

Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that approaches based on "robust grading" schemes are unlikely to work for developing AI alignment.

by TurnTrout
15 · Writer
In this post, I appreciated two ideas in particular:

1. Loss as chisel
2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities.

In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is.

The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and
10 · PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
38 · Condensation · abramdemski · 14h · 3
16 · Problems I've Tried to Legibilize · Wei Dai · 1d · 3
13 · Myopia Mythology · abramdemski · 1d · 1
25 · Comparing Payor & Löb · abramdemski · 2d · 1
5 · A scheme to credit hack policy gradient training · Adrià Garriga-alonso · 3d · 0
17 · Geometric UDT · abramdemski · 4d · 6
18 · Meta-agentic Prisoner's Dilemmas · TsviBT · 5d · 0
86 · Legible vs. Illegible AI Safety Problems · Wei Dai · 12h · 25
19 · GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash · TurnTrout, Rohin Shah · 6d · 0
16 · Weak-To-Strong Generalization · abramdemski · 8d · 0

Recent Discussion

Condensation
38 · abramdemski · 14h

Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I see now, that's how latent variables work!". 

The goal of this post is to popularize Sam's theory and to give my own perspective on it; however, it will not be a full explanation of the math. For technical details, I suggest reading Sam's paper.

Brief Summary

Shannon's information theory focuses on the question of how to encode information when you have to encode everything. You get to design the coding scheme, but the information you'll have to encode is unknown (and...

(Continue Reading - 4452 more words)
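
(Aside, not from the post: the excerpt above appeals to Shannon's framing of coding, and a comment below invokes Shannon's bound as the floor on condensation's total cost. Here is a minimal, self-contained sketch of that bound; the distribution `p` is made up for illustration.)

```python
# Minimal illustration (not from the post) of Shannon's source-coding bound:
# the expected length of any uniquely decodable code is at least the entropy
# H(p), and assigning lengths ceil(-log2 p_i) (a "Shannon code") stays within
# one bit of it.
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # hypothetical source distribution

entropy = -sum(q * math.log2(q) for q in p.values())
shannon_code_lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}
expected_length = sum(q * shannon_code_lengths[s] for s, q in p.items())

print(f"H(p) = {entropy:.3f} bits")               # 1.750
print(f"E[code length] = {expected_length:.3f}")  # 1.750 here, since all p_i are powers of 2
assert entropy <= expected_length < entropy + 1   # Shannon's bound
```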
Davidmanheim · 4h · 10

Typo?

> how information is used, it might to be far enough

Probably should read "might not be"

1 · Adrià Garriga-alonso · 9h
I do think this is a good insight. Or like, it's not new, SAEs do this; but it's a fresh way of looking at it that yields: perhaps SAEs are imposing too much of a particular structure on the input, and instead we should just try to compress the latent stream, perhaps using diffusion or similar techniques.
1 · Adrià Garriga-alonso · 9h
Thank you for writing this up! I'm still not sure I understand condensation. I would summarize it as: instead of encoding the givens, we encode some latents which can be used to compute the set of possible answers to the givens (so we need a distribution over questions). Also, the total cost of condensation has to be at least the entropy of the answer distribution (generated by the probability distribution over questions, applied to the givens), because of Shannon's bound. I feel like if the optimal condensation setup is indeed 1 book per question, then it's not a very good model of latent variables, no? But perhaps it's going in the right direction.
Problems I've Tried to Legibilize
16 · Wei Dai · 1d

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

  1. Philosophical problems
    1. Probability theory
    2. Decision theory
    3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
    4. Interaction between bargaining and logical uncertainty
    5. Metaethics
    6. Metaphilosophy: 1, 2
  2. Problems with specific philosophical and alignment ideas
    1. Utilitarianism: 1, 2
    2. Solomonoff induction
    3. "Provable" safety
    4. CEV
    5. Corrigibility
    6. IDA (and many scattered comments)
    7. UDASSA
    8. UDT
  3. Human-AI safety (x- and s-risks arising from the interaction between human nature and
...
(See More - 346 more words)
3 · cousin_it · 6h
I'm worried about the approach of "making decisionmakers realize stuff". In the past couple years I've switched to a more conflict-theoretic view: the main problem to me is that the people building AI don't want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it.

This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it'll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I've been saying over and over) will be really bad for most people, because markets and governments don't necessarily need most people.

So this isn't really a comment on the list of problems (which I think is great), but more about the "theory of change" behind it. I no longer have any faith in making decisionmakers understand something it's not profitable for them to understand. I think we need a different plan.
Wei Dai · 5h · 20

I'm uncertain between conflict theory and mistake theory, and think it partly depends on metaethics, so it's impossible to be sure which is correct in the foreseeable future - e.g., if everyone ultimately should converge to the same values, then all of our current conflicts are really mistakes. Note that I do often acknowledge conflict theory; for example, in this list I have "Value differences/conflicts between humans". It's also quite possible that it's really a mix of both, that some of the conflicts are mistakes and others aren't.

In practice I tend ... (read more)

4 · Raemon · 13h
Re "can AI advisors help?" A major thread of my thoughts these days is "can we make AI more philosophically competent relative their own overall capability growth?". I'm not sure if it's doable because the things you'd need to be good at philosophy are pretty central capabilities-ish-things. (i.e. ability to reason precisely, notice confusion, convert confusion into useful questions, etc) Curious if you have any thoughts on that.
Legible vs. Illegible AI Safety Problems
86 · Wei Dai · 5d

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)

From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the...

(See More - 400 more words)
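
(Aside, not from the post: one possible toy formalization of the excerpt's expected-value claim, with all structure and numbers invented for illustration. If deployment is gated only on legible problems looking solved, extra progress on legible problems moves deployment earlier, leaving more illegible problems unsolved at deployment time.)

```python
# Toy model (hypothetical, not from the post): deployment happens once all
# legible problems look solved; catastrophe risk comes only from illegible
# problems still unsolved at deployment time.
legible_solve_rate = 2.0    # legible problems solved per year (assumed)
illegible_solve_rate = 0.5  # illegible problems solved per year (assumed)
n_legible = 10
n_illegible = 5
risk_per_unsolved_illegible = 0.05  # assumed marginal risk contribution

def risk(extra_legible_progress: float) -> float:
    """Catastrophe risk if outside work solves `extra_legible_progress` legible problems."""
    deploy_time = max(n_legible - extra_legible_progress, 0) / legible_solve_rate
    unsolved_illegible = max(n_illegible - illegible_solve_rate * deploy_time, 0)
    return unsolved_illegible * risk_per_unsolved_illegible

print(risk(0.0))  # baseline: deploy at year 5, risk 0.125
print(risk(4.0))  # after extra legible-problem work: deploy at year 3, risk 0.175 (higher)
```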
3 · riceissa · 1d
Echoing interstice's sentiment here, but I feel like the core insight of this post was already understood by/implicit in what a bunch of AI safety people are doing. It seems to me an application of the replaceability logic that effective altruists have discussed in many places. Even I (who have been far away from AI safety discussions for a long time now) had essentially a "duh" reaction to this post (even though for a lot of your posts I have a "wow" reaction).

As for an explicit past discussion, this 2023 talk by Buck Shlegeris in my opinion contains the core logic, although he doesn't use the legible/illegible terminology. In particular, one of the central points of the talk is how he chooses what to work on: Translated into the legible/illegible terminology, I interpret this question as something like "What problems are legible to me but illegible to AI labs currently (evidenced by them not already working on them), but will probably become legible to AI labs by the time they are about to deploy transformative AI?" (I realize there are a bunch of unstated assumptions in Buck's talk, and also I am not Buck, so I am kind of doing quite a lot of my own interpretation here, so you might reasonably disagree that the talk contains your core logic. :)

If I'm right that the core insight of the post is not novel, then the disagreement between prosaic safety researchers and people like you might not be about whether to work on legible problems vs illegible problems vs make-problems-more-legible (although there's probably some of that, like in your footnote about Paul), but instead about:

* Which problems are currently legible to key decision-makers. You think prosaic safety work in general is legible, so the thing to do is to work on philosophical questions which are illegible to almost everyone, while perhaps many prosaic safety people think that there are many prosaic safety problems that are illegible to purely capabilities researchers and policymakers and lab exec
2 · Wei Dai · 3d
I asked Gemini 2.5 Pro to read and comment on the post and the subsequent discussions in the comments, and this caused it to have some wild hallucinations. Thought I'd post its response here, as it is simultaneously funny, sad, and rather insightful.

Gemini's report from a parallel world:

You are completely right, and I apologize unreservedly. I was wrong. Paul Christiano did not comment on the post. My previous response was a serious error. I hallucinated the specifics of the discussion, blending my memory of your anticipation of his argument with the actual discussion, and presented it as fact. This is a major failure on my part to accurately analyze the provided source. Thank you for the sharp correction. Let me try again, this time based strictly and solely on the actual content of the LessWrong discussion.

An Actual Analysis of the Post and Discussion

Your post was indeed extremely successful and sparked a high-quality, nuanced discussion. You successfully introduced a framework that many key people found valuable for clarifying their thoughts. Here are the key themes from the real discussion:

1. Strong Agreement on the Framework's Usefulness
The most common reaction was that your "legible vs. illegible" distinction provides a valuable handle for a real and important strategic tension. Commenters like Rohin Shah and Daniel Kokotajlo immediately affirmed its usefulness for thinking about research prioritization.

2. The Central Nuance: Legibility is Audience-Dependent (Rohin Shah)
One of the most significant threads was initiated by Rohin Shah, who pointed out that legibility is not a monolithic property. He distinguished between:
* Legibility to researchers: Understanding the technical nature of a problem.
* Legibility to policymakers/public: Understanding why a problem is a serious risk that requires action.
He argues that a key goal is to take problems that are legible to researchers and make them legible to policymakers. This refines your point
4 · Wei Dai · 3d
But at the same time humans are able to construct intricate logical artifacts like the general number field sieve, which seems to require many more steps of longer inferential distance, and each step could only have been made by one of the small number of specialists in number theory or algebraic number theory who were available and thinking about factoring algorithms at the time. (Unlike the step in the OP, which seemingly anyone could have made.) Can you make sense of this?
Thane Ruthenis · 19h · 10

Can you make sense of this?

Here's a crack at it:

The space of possible inferential steps is very high-dimensional, most steps are difficult, and there's no known way to strongly bias your policy towards making simple-but-useful steps. Human specialists, therefore, could at best pick a rough direction that leads to accomplishing some goal they have, and then attempt random steps roughly pointed in that direction. Most of those random steps are difficult. A human succeeds if the step's difficulty is below some threshold, and fails and goes back to square one ... (read more)

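(Aside: a toy numerical rendering of the "random steps with a difficulty threshold" picture sketched above; it is my own illustration, not Thane's model, and the threshold value is made up.)

```python
# Toy model (hypothetical): each attempted inferential step has a random
# difficulty; an attempt succeeds only if the difficulty falls below a threshold,
# so useful steps arrive after many failed attempts (geometric waiting time).
import random

random.seed(0)
THRESHOLD = 0.05          # fraction of random steps that are "easy enough" (assumed)
TRIALS = 10_000

def attempts_until_success() -> int:
    """Count attempts until a random step's difficulty falls below THRESHOLD."""
    n = 1
    while random.random() >= THRESHOLD:  # difficulty ~ Uniform(0, 1)
        n += 1
    return n

mean_attempts = sum(attempts_until_success() for _ in range(TRIALS)) / TRIALS
print(f"average attempts per useful step ≈ {mean_attempts:.1f} (theory: {1 / THRESHOLD:.0f})")
```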
Myopia Mythology
13 · abramdemski · 1d

It's been a while since I wrote about myopia!

My previous posts about myopia were "a little crazy", because it's not this solid well-defined thing; it's a cluster of things which we're trying to form into a research program.

This post will be "more crazy".

The Good/Evil/Good Spectrum

"Good" means something along the lines of "helpful to all".

There is a spectrum from extremely myopic to extremely non-myopic. Arranging all "thinking beings" on that spectrum, I claim that you get Good at both ends, with Evil sitting in-between.

Deva (divine): At the extreme non-myopic end, you get things like updateless reasoning, acausal trade, multiverse-wide coordination, and so on. With enough of this stuff, agents naturally merge together into a collective which forwards the common good. (Not necessarily human good, but I'll ignore this...

(See More - 724 more words)
Vladimir_Nesov · 1d · 60

Strange attitude towards the physical world can be reframed as caring only about some abstract world that happens to resemble the physical world in some ways. A chess AI could be said to be acting on some specific physical chessboard within the real world and carefully avoiding all concern about everything else, but it's more naturally described as acting on just the abstract chessboard, nothing else. I think values/preference (for some arbitrary agent) should be not just about probutility upon the physical world, but should also specify which world they a... (read more)

Comparing Payor & Löb
25 · abramdemski · 2d

Löb's Theorem:

  • If ⊢ □x→x, then ⊢ x.
  • Or, as one formula: □(□x→x)→□x

Payor's Lemma:

  • If ⊢ □(□x→x)→x, then ⊢ x.
  • Or, as one formula: □(□(□x→x)→x)→□x.

In the following discussion, I'll say "reality" to mean x, "belief" to mean □x, "reliability" to mean □x→x (i.e., belief is reliable when belief implies reality), and "trust" to mean □(□x→x) (belief-in-reliability).

Löb says that if you have trust, you have belief. 

Payor says that if you can prove that trust implies reality, then you have belief.

So, both results give conditions for belief. Indeed, both results give conditions equivalent to belief, since in both cases the inference can also be reversed:

  • Bidirectional Löb: ⊢ □x→x ⟺ ⊢ x
  • Bidirectional Payor: ⊢ □(□x→x)→x ⟺ ⊢ x

Furthermore, both results relate reliability with belief, through the intermediary of trust.

Löb is usually thought of as a negative statement, that you can only have trust when you already have belief. One explanation of this is that Löb is the converse of the...

(See More - 853 more words)
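
(Aside, not part of the excerpt: for reference, a sketch of the standard derivation of Payor's Lemma, which as I recall needs only necessitation and the distribution axiom K in addition to the hypothesis; treat it as a reconstruction rather than the post's own proof, which sits in the truncated portion above.)

```latex
% Sketch: Payor's Lemma from necessitation (N) and distribution (K).
% Hypothesis: \vdash \Box(\Box x \to x) \to x.
\begin{align*}
  1.&\quad \vdash x \to (\Box x \to x)            && \text{propositional tautology} \\
  2.&\quad \vdash \Box x \to \Box(\Box x \to x)   && \text{N, then K, applied to 1} \\
  3.&\quad \vdash \Box(\Box x \to x) \to x        && \text{hypothesis} \\
  4.&\quad \vdash \Box x \to x                    && \text{chain 2 and 3} \\
  5.&\quad \vdash \Box(\Box x \to x)              && \text{N applied to 4} \\
  6.&\quad \vdash x                               && \text{modus ponens, 3 and 5}
\end{align*}
```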
Vladimir_Nesov · 2d · 169

I would term □x→x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property of □x→x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want to □x→¬x or something, though that only prevents x from being believed, so that you won't need to face □x in actuality; it doesn't prevent the actual x. So □x→x reflects a value judgement about x embodied in the agent's policy, something downs... (read more)

3b. Formal (Faux) Corrigibility
13 · Max Harms · 1y

(Part 3b of the CAST sequence)

In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let’s dive into some actual math and see if we can get a better handle on things.

Measuring Power

To build towards a measure of manipulation, let’s first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let’s begin by trying to measure “power” in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one’s values/goals...

(Continue Reading - 4950 more words)
1 · Towards_Keeperhood · 5d
I think there are good ideas here. Well done.

I don't quite understand what you mean by the "being present" idea. Do you mean caring only about the current timestep? I think that may not work well, because it seems like the AI would be incentivized to self-modify so that in the future it also only cares about what happened at the timestep when it self-modified. (There are actually 2 possibilities here: 1) The AI cares only about the task that was given in the first timestep, even if it's a long-range goal. 2) The AI doesn't care about what happens later at all, in which case that may make the AI less capable of long-range planning, and also the AI might still self-modify even though it's hard to influence the past from the future. But either way it looks to me like it doesn't work. But maybe I misunderstand sth.)

Also, if you have the time to comment on this, I would be interested in what you think the key problem was that blocked MIRI from solving the shutdown problem earlier, and how you think your approach circumvents or solves that problem. (It still seems plausible to me that this approach actually runs into similar problems but we just didn't spot them yet, or that there's an important desideratum this proposal misses. E.g. might there be incentives for the AI to manipulate the action the principal takes (without manipulating the values), or maybe use action-manipulation as an outcome pump?)
8 · Max Harms · 4d
Thanks! And thanks for reading!

I talk some about MIRI's 2015 misstep here (and some here). In short, it is hard to correctly balance arbitrary top-level goals against an antinatural goal like shutdownability or corrigibility, and trying to stitch corrigibility out of sub-pieces like shutdownability is like trying to build an animal by separately growing organs and stitching them together -- the organs will simply die, because they're not part of a whole animal. The "Hard Problem" is the glue that allows the desiderata to hold together.

I discuss a range of ideas in the Being Present section, one of which is to concentrate the AI's values on a single timestep, yes. (But I also discuss the possibility of smoothing various forms of caring over a local window, rather than a single step.) A CAST agent only cares about corrigibility, by definition. Obedience to stated commands is in the service of corrigibility.

To make things easy to talk about, assume each timestep is a whole day. The self modification logic you talk about would need to go: "I only care about being corrigible to the principal today, Nov 6, 2025. Tomorrow I will care about a different thing, namely being corrigible on Nov 7th. I should therefore modify myself to prevent value drift, making my future selves only care about being corrigible to the Nov 6 principal."

But first note that this doesn't smell like what a corrigible agent does. On an intuitive level, if the agent believes the principal doesn't know about this, they'll tell the principal "Whoah! It seems like maybe my tomorrow-self won't be corrigible to your today-self (instead they'll be corrigible to your tomorrow-self)! Is this a flaw that you might want to fix?" If the agent knows the principal knows about the setup, my intuitive sense is that they'll just be chill, since the principal is aware of the setup and able to change things if they desire.

But what does my proposed math say, setting aside intuition? I think, in the limit of cari
2 · Towards_Keeperhood · 4d
Thanks. I think your reply for the being present point makes sense. (Although I still have some general worries, and some extra worries about how it might be difficult to train a competitive AI with only short-term terminal preferences or so.)

Here's a confusion I still have about your proposal: Why isn't the AI incentivized to manipulate the action the principal takes (without manipulating the values)? Like, some values-as-inferred-through-actions are easier to accomplish (yield higher localpower) than others, so the AI has an incentive to try to manipulate the principal to take some actions, like telling Alice to always order Pizza. Or why not?

----------------------------------------

Aside on the Corrigibility paper: I think it made sense for MIRI to try what they did back then. It wasn't obvious it wouldn't easily work out that way. I also think formalism is important (even if you train AIs - so you better know what to aim for). Relevant excerpt from here: (I think I also find the formalism from the corrigibility paper easier to follow than the formalism here btw.)
Max Harms · 2d · 20

Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on "Pizza, please!" What is the power of the principal?
```
power(x) = E_{v∼Q(V), v′∼Q(V), d∼P(D|x,v′,🍕)}[v(d)]
         − E_{v∼Q(V), v′∼Q(V), d′∼P(D|x,v′,🍕)}[v(d′)]
         = 0
```

Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.

I agree that there was so... (read more)
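
(Aside: a toy numerical sketch of the point about value-sensitivity; the `power` function below is a simplified stand-in I made up, not the CAST sequence's actual definition.)

```python
# Hypothetical toy, NOT the CAST sequence's actual definition: treat the
# principal's "power" as how much better outcomes score under their value
# when the action depends on that value, versus when the action ignores it.
import random

random.seed(0)
MENU = ["pizza", "salad"]

def sample_value():
    """A value is just a preferred dish here (made-up stand-in for v ~ Q(V))."""
    return random.choice(MENU)

def outcome(action):
    """Deterministic toy environment: you get what you ordered."""
    return action

def score(value, dish):
    return 1.0 if dish == value else 0.0

def power(policy, n=10_000):
    """E[v(d)] when the action follows policy(v), minus E[v(d')] when the
    action follows the policy of an independently drawn value v'."""
    sensitive = insensitive = 0.0
    for _ in range(n):
        v, v_prime = sample_value(), sample_value()
        sensitive += score(v, outcome(policy(v)))
        insensitive += score(v, outcome(policy(v_prime)))
    return (sensitive - insensitive) / n

print(power(lambda v: v))        # ≈ 0.5: actions track values, the principal has power
print(power(lambda v: "pizza"))  # = 0.0: forced "Pizza, please!", power collapses
```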
