Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they cut against the natural way cognition forms. He also argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.
Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I see now, that's how latent variables work!".
The goal of this post is to popularize Sam's theory and to give my own perspective on it; however, it will not be a full explanation of the math. For technical details, I suggest reading Sam's paper.
Shannon's information theory focuses on the question of how to encode information when you have to encode everything. You get to design the coding scheme, but the information you'll have to encode is unknown (and...
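To make that framing concrete, here is a minimal sketch (my own illustration, not from Sam's paper or the post): you commit to a coding scheme based on a distribution over messages before seeing them, each outcome gets a code of roughly −log2(p) bits, and the expected message length is the entropy of the distribution you committed to.

```
# Minimal illustration (my own, not from the post): commit to a coding scheme
# up front, then see how long messages are on average once outcomes arrive.
from math import log2

def code_lengths(dist):
    """Idealized Shannon code: an outcome with probability p gets ~ -log2(p) bits."""
    return {outcome: -log2(p) for outcome, p in dist.items()}

def expected_length(dist):
    """Expected bits per message = entropy of the distribution you committed to."""
    return sum(-p * log2(p) for p in dist.values())

weather = {"sun": 0.5, "rain": 0.25, "snow": 0.25}
print(code_lengths(weather))     # {'sun': 1.0, 'rain': 2.0, 'snow': 2.0}
print(expected_length(weather))  # 1.5 bits per message, on average
```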
Typo?
how information is used, it might to be far enough
Probably should read "might not be"
Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.
I'm uncertain between conflict theory and mistake theory; I think it partly depends on metaethics, and therefore it's impossible to be sure which is correct in the foreseeable future - e.g., if everyone ultimately should converge to the same values, then all of our current conflicts are really mistakes. Note that I do often acknowledge conflict theory; for example, this list includes "Value differences/conflicts between humans". It's also quite possible that it's really a mix of both: some of the conflicts are mistakes and others aren't.
In practice I tend ...
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the...
Can you make sense of this?
Here's a crack at it:
The space of possible inferential steps is very high-dimensional, most steps are difficult, and there's no known way to strongly bias your policy towards making simple-but-useful steps. Human specialists, therefore, can at best pick a rough direction that leads to accomplishing some goal they have, and then attempt random steps roughly pointed in that direction. Most of those random steps are difficult. A human succeeds if the step's difficulty is below some threshold, and otherwise fails and goes back to square one ...
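A quick toy simulation of that picture (my own, with made-up numbers rather than anything from the original argument): each attempted step draws a random difficulty, only steps below the specialist's ability threshold succeed, and a failure means starting over, so the expected number of attempts per useful step is roughly 1/threshold.

```
# Toy model (my own illustration): random inferential steps with random difficulty;
# a step succeeds only if its difficulty falls below the agent's ability threshold.
import random

def attempts_until_success(threshold, max_tries=10_000):
    """Count attempts until a randomly drawn step is easy enough to complete."""
    for attempt in range(1, max_tries + 1):
        difficulty = random.random()      # most directions yield hard steps
        if difficulty < threshold:
            return attempt                # step succeeded
    return max_tries                      # effectively never succeeded

# A specialist who can only complete the easiest 5% of steps needs many restarts.
trials = [attempts_until_success(threshold=0.05) for _ in range(1000)]
print(sum(trials) / len(trials))          # ~20 attempts per useful step, on average
```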
It's been a while since I wrote about myopia!
My previous posts about myopia were "a little crazy", because it's not this solid well-defined thing; it's a cluster of things which we're trying to form into a research program.
This post will be "more crazy".
"Good" means something along the lines of "helpful to all".
There is a spectrum from extremely myopic to extremely non-myopic. Arranging all "thinking beings" on that spectrum, I claim that you get Good at both ends, with Evil sitting in-between.
Deva (divine): At the extreme non-myopic end, you get things like updateless reasoning, acausal trade, multiverse-wide coordination, and so on. With enough of this stuff, agents naturally merge together into a collective which forwards the common good. (Not necessarily human good, but I'll ignore this...
A strange attitude towards the physical world can be reframed as caring only about some abstract world that happens to resemble the physical world in some ways. A chess AI could be said to be acting on some specific physical chessboard within the real world while carefully avoiding all concern about everything else, but it's more naturally described as acting on just the abstract chessboard, nothing else. I think values/preferences (for an arbitrary agent) should not just be about probutility over the physical world, but should also specify which world they a...
Löb's Theorem: □(□x → x) → □x
In the following discussion, I'll say "reality" to mean x, "belief" to mean □x, "reliability" to mean □x → x (i.e., belief is reliable when belief implies reality), and "trust" to mean □(□x → x) (belief-in-reliability).
Löb says that if you have trust, you have belief.
Payor says that if you can prove that trust implies reality, then you have belief.
So, both results give conditions for belief. Indeed, both results give conditions equivalent to belief, since in both cases the inference can also be reversed.
Furthermore, both results relate reliability with belief, through the intermediary of trust.
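As a compact reference, here is one way to write both results and their easy converses in standard provability-logic notation (my own summary; the original post's rendering may differ slightly):

```
% Löb's theorem and its easy converse: trust and belief are interderivable.
% Payor's lemma and its trivial converse, stated as rules about what is provable.
\begin{align*}
  &\Box(\Box x \to x) \to \Box x  &&\text{(Löb: trust gives belief)}\\
  &\Box x \to \Box(\Box x \to x)  &&\text{(converse: belief gives trust; a theorem of modal logic K)}\\
  &\vdash \Box(\Box x \to x) \to x \;\Rightarrow\; \vdash x  &&\text{(Payor: provable ``trust implies reality'' gives reality, hence belief)}\\
  &\vdash x \;\Rightarrow\; \vdash \Box(\Box x \to x) \to x  &&\text{(converse: a propositional tautology)}
\end{align*}
```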
Löb is usually thought of as a negative statement, that you can only have trust when you already have belief. One explanation of this is that Löb is the converse of the...
I would term □x → x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property of □x → x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want ¬□x or something, though that only prevents x from being believed (so that you won't need to face x in actuality); it doesn't prevent the actual x. So □x → x reflects a value judgement about x reflected in the agent's policy, something downs...
(Part 3b of the CAST sequence)
In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let’s dive into some actual math and see if we can get a better handle on things.
To build towards a measure of manipulation, let’s first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let’s begin by trying to measure “power” in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one’s values/goals...
Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on "Pizza, please!" What is the power of the principal?
```
power(x) = E_{v∼Q(V),v′∼Q(V),d∼P(D|x,v′,🍕)}[v(d)] − E_{v∼Q(V),v′∼Q(V),d′∼P(D|x,v′,🍕)}[v(d′)] = 0
```
Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
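Here is a toy numerical sketch of that idea (my own simplification, not the post's exact formalism): treat power as the expected gap between how well outcomes score under the principal's actual values versus under freshly resampled values. A policy forced to always order pizza scores roughly zero; a policy that tracks the principal's values scores positive.

```
# Toy sketch (my own simplification, not the post's exact formalism): power as the
# gap between outcome quality under the actual values and under resampled values.
import random

OUTCOMES = ["pizza", "salad"]

def sample_values():
    # A "value function" here is just a random score in [0, 1] for each outcome.
    return {outcome: random.random() for outcome in OUTCOMES}

def forced_policy(values):
    # The environment forces "Pizza, please!" no matter what the principal values.
    return "pizza"

def responsive_policy(values):
    # The principal orders whatever their values rank highest.
    return max(OUTCOMES, key=values.get)

def power(policy, n=50_000):
    # Expected value of outcomes chosen under the actual values v, minus expected
    # value (still scored by v) of outcomes chosen under resampled values v_prime.
    total = 0.0
    for _ in range(n):
        v = sample_values()
        v_prime = sample_values()
        total += v[policy(v)] - v[policy(v_prime)]
    return total / n

print(round(power(forced_policy), 3))      # ~0.0: the action ignores values entirely
print(round(power(responsive_policy), 3))  # ~0.167: actions track values, so the principal has power
```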
I agree that there was so...