AI ALIGNMENT FORUM

All of cousin_it's Comments + Replies

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”” I think this is a very strong conclusion, relative to available data.

I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if alignment of AIs to human standards is a complete success - if A... (read more)

In Defense of Open-Minded UDT

cousin_it7mo*10

Going back to the envelopes example, a nosy neighbor hypothesis would be "the left envelope contains $100, even in the world where the right envelope contains $100". Or if we have an AI that's unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be "I value paperclips, even in the world where I value staples". I'm not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?

2Abram Demski7mo

I think so, yes, but I want to note that my position is consistent with nosy-neighbor hypotheses not making sense. A big part of my point is that there's a lot of nonsense in a broad prior. I think it's hard to rule out the nonsense without learning. If someone thought nosy neighbors always 'make sense', it could be an argument against my whole position. (Because that person might be just fine with UDT, thinking that my nosy-neighbor 'problems' are just counterfactual muggings.) Here's an argument that nosy neighbors can make sense. For values, as I mentioned, a nosy-neighbors hypothesis is a value system which cares about what happens in many different universes, not just the 'actual' universe. For example, a utility function which assigns some value to statements of mathematics. For probability, a nosy-neighbor is like the Lizard World hypothesis mentioned in the post: it's a world where what happens there depends a lot on what happens in other worlds. I think what you wrote about staples vs paperclips nosy-neighbors is basically right, but maybe if we rephrase it it can 'make more sense'?: "I (actual me) value paperclips being produced in the counterfactual(-from-my-perspective) world where I (counterfactual me) don't value paperclips." Anyway, whether or not it makes intuitive sense, it's mathematically fine. The idea is that a world will contain facts that are a good lens into alternative worlds (such as facts of Peano Arithmetic), which utility hypotheses / probabilistic hypotheses can care about. So although a hypothesis is only mathematically defined as a function of worlds where it holds, it "sneakily" depends on stuff that goes on in other worlds as well.

In Defense of Open-Minded UDT

cousin_it7mo10

Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2. Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.

This seems like the central argument in the post,... (read more)

2Abram Demski7mo

Let's frame it in terms of value learning.

Naive position: UDT can't be combined with value learning, since UDT doesn't learn. If we're not sure whether puppies or rainbows are what we intrinsically value, but rainbows are easier to manufacture, then the superintelligent UDT will tile the universe with rainbows instead of puppies because that has higher expectation according to the prior, regardless of evidence it encounters suggesting that puppies are what's more valuable.

Cousin_it's reply: There's puppy-world and rainbow-world. In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility. The UDT agent gets to observe which universe it is in, although it has a 50-50 prior on the two. There are four policies:

* Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.
  * EV: 45
* Always make rainbows: 50% chance of utility 100, otherwise zero.
  * EV: 50
* Make puppies in rainbow world; make rainbows in puppy world.
  * EV: 0
* Make puppies in puppy world, make rainbows in rainbow world.
  * EV: 95

The highest EV is to do the obvious value-learning thing; so, there's no problem.

Fixing the naive position: Some hypotheses will "play nice" like the example above, and updateless value learning will work fine. However, there are some versions of "valuing puppies" and "valuing rainbows" which value puppies/rainbows regardless of which universe the puppies/rainbows are in. This only requires that there's some sort of embedding of counterfactual information into the sigma-algebra which the utility functions are predicated on. For example, if the agent has beliefs about PA, these utility functions could check for the number of puppies/rainbows in arbitrary computations. This mostly won't matter, because the agent doesn't
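The four-policy comparison in the reply above can be checked with a few lines of Python. This is only a toy sketch: the utilities (90, 100, 0) and the 50-50 prior come from the comment, while the names `PRIOR`, `UTILITY`, `POLICIES`, `expected_value` and `best_policy` are made up for illustration.

```python
# Toy check of the puppy/rainbow example. Numbers are taken from the
# comment above; all identifiers are illustrative assumptions.

PRIOR = {"puppy_world": 0.5, "rainbow_world": 0.5}

# UTILITY[world][action]: payoff of tiling the universe with `action` in `world`.
UTILITY = {
    "puppy_world": {"puppies": 90, "rainbows": 0},
    "rainbow_world": {"puppies": 0, "rainbows": 100},
}

# A policy maps the observed world to the action taken there.
POLICIES = {
    "always puppies": {"puppy_world": "puppies", "rainbow_world": "puppies"},
    "always rainbows": {"puppy_world": "rainbows", "rainbow_world": "rainbows"},
    "mismatch the world": {"puppy_world": "rainbows", "rainbow_world": "puppies"},
    "match the world": {"puppy_world": "puppies", "rainbow_world": "rainbows"},
}


def expected_value(policy):
    """Prior-weighted utility of following `policy` in every world."""
    return sum(PRIOR[world] * UTILITY[world][policy[world]] for world in PRIOR)


def best_policy():
    """Name of the policy with the highest expected value under the prior."""
    return max(POLICIES, key=lambda name: expected_value(POLICIES[name]))


if __name__ == "__main__":
    for name, policy in POLICIES.items():
        print(f"{name}: EV = {expected_value(policy)}")
    print("best:", best_policy())
```

The policy that conditions on the observation wins even though it is chosen once, from the prior, which is the "play nice" case described in the reply. The sketch does not model the nosy-neighbor utilities discussed afterwards.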

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

cousin_it9mo99

Yeah, this agrees with my thinking so far. However, I think if you could research how to align AIs specifically to human flourishing (as opposed to things like obedience/interpretability/truthfulness, which defer to the user's values), that kind of work could be more helpful than most.

Andrew Critch9mo412

I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.

In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting technologies will be used by individuals / businesses / governments / etc.

Refusal in LLMs is mediated by a single direction

cousin_it11mo53

Sorry for maybe naive question. Which other behaviors X could be defeated by this technique of "find n instructions that induce X and n that don't"? Would it work for X=unfriendliness, X=hallucination, X=wrong math answers, X=math answers that are wrong in one specific way, and so on?

Neel Nanda11mo73

There's been a fair amount of work on activation steering and similar techniques, with bearing on e.g. sycophancy and truthfulness, where you find the vector and inject it, e.g. Rimsky et al. and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references.
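To make the two operations in this exchange concrete, here is a minimal sketch of "find n inputs that induce X and n that don't, take the difference, then steer or ablate". It is an assumption-laden toy: short Python lists stand in for model activations, the direction is a plain difference of means, and every function name is hypothetical rather than taken from any of the cited work.

```python
import math

# Toy stand-ins for activations collected on prompts that do / don't show
# behavior X. Real activations would come from a model; these are made up.
ACTS_WITH_X = [[2.0, 1.0, 0.0], [2.0, -1.0, 0.0]]
ACTS_WITHOUT_X = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def behavior_direction(acts_with_x, acts_without_x):
    """Unit vector pointing from the 'without X' mean to the 'with X' mean."""
    diff = [a - b for a, b in zip(mean_vector(acts_with_x), mean_vector(acts_without_x))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def steer(activation, direction, alpha):
    """Steering: add alpha times the direction to the activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def ablate(activation, direction):
    """Ablation: remove the activation's component along the direction."""
    projection = dot(activation, direction)
    return [a - projection * d for a, d in zip(activation, direction)]


DIRECTION = behavior_direction(ACTS_WITH_X, ACTS_WITHOUT_X)

if __name__ == "__main__":
    sample = [3.0, 4.0, 5.0]
    print("direction:", DIRECTION)
    print("steered:  ", steer(sample, DIRECTION, 2.0))
    print("ablated:  ", ablate(sample, DIRECTION))
```

Whether a given behavior X is captured by a single direction like this is exactly the open question in the comment above; the sketch only shows the mechanics, not that it would work for hallucination or wrong math answers.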

cousin_it's Shortform

cousin_it11mo10

If the housing crisis is caused by low-density rich neighborhoods blocking redevelopment of themselves (as seems the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?

Security amplification

cousin_it11mo10

I don't know about others, but to me these approaches sound like "build a bureaucracy from many well-behaved agents", and it seems to me that such a bureaucracy wouldn't necessarily behave well.

Transformers Represent Belief State Geometry in their Residual Stream

cousin_it11mo*30

I have maybe a naive question. What information is needed to find the MSP image within the neural network? Do we have to know the HMM to begin with? Or could it be feasible someday to inspect a neural network, find something that looks like an MSP image, and infer the HMM from it?

Many arguments for AI x-risk are wrong

cousin_it1y*3-5

I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.

This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.

How Would an Utopia-Maximizer Look Like?

cousin_it1y*32

It's interesting that part of human value might be having our actions matter. But if you build an AI that can give you all the things, or even if you could've built such an AI but chose not to, then objectively your actions no longer matter much after that. I've no idea how even CEV could approach this problem.

Edit: I think I've figured it out. The AI shouldn't try to build the best world according to CEV, it should take the best action for an AI to take according to CEV. So if the AI notices that humans strongly prefer to be left alone with their problems, it'll just shut down. Or find some other way to ensure that humans can't rely on AIs for everything.

3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem

cousin_it1y30

Yeah. It also worries me that a lot of value change has happened in the past, and much of it has been caused by selfish powerful actors for their ends rather than ours. The most obvious example is jingoism, which is often fanned up by power-hungry leaders. A more subtle example is valuing career or professional success. The part of it that isn't explained by money or the joy of the work itself, seems to be an irrational desire installed into us by employers.

1Nora_Ammann1y

Agree! Examples abound. You can never escape your local ideological context - you can only try to find processes that have some hope of occasionally bumping into the bounds of your current ideology and pressing beyond it - no reliable recipe (just like there is no reliable recipe to make yourself notice your own blind spot) - but there is the hope for things that in expectation and intertemporally can help us with this. Which poses a new problem (or clarifies the problem we're facing): we don't get to answer the question of value change legitimacy in a theoretical vacuum -- instead we are already historically embedded in a collective value change trajectory, affecting both what we value and what we (can) know. I think that makes it sound a bit hopeless from one perspective, but on the other hand, we probably also shouldn't let hypothetical worlds we could never have reached weigh us down -- there are many hypothetical worlds we still can reach that are worth fighting for.

Meta Questions about Metaphilosophy

cousin_it2y10

Maybe one example is the idea of Dutch book. It comes originally from real world situations (sport betting and so on) and then we apply it to rationality in the abstract.

Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he'd draw an analogy with something closer to experience. For example, "Is the nature of virtue different for men and for women?" - "Well, the nature of strength isn't that much different between men and women, likewise the nature of health, so maybe virtue works the same way." Obviously this way of reasoning can easily go wrong, but I think it's also pretty indicative of how people do philosophy.

Meta Questions about Metaphilosophy

cousin_it2y20

I don't say it's not risky. The question is more, what's the difference between doing philosophy and other intellectual tasks.

Here's one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts when to apply them or not. And some luck, in ... (read more)

3Wei Dai2y

Do you have any examples that could illustrate your theory? It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"? Maybe I used a heuristic like "computer science is cool, let's try to apply it to philosophical problems" but if the heuristics are this coarse grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?

Meta Questions about Metaphilosophy

cousin_it2y41

I'm pretty much with you on this. But it's hard to find a workable attack on the problem.

One question though, do you think philosophical reasoning is very different from other intelligence tasks? If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they'll be worse at philosophy than at other tasks?

4Wei Dai2y

I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

* Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if a LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It's possible that even solving metaphilosophy doesn't help enough with this, if many people can't recognize the solution as correct, but there's at least a chance that the solution does look obviously correct to many people, especially if there's not already wrong solutions to compete with).
* What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user.
* What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy, but I'm not sure how strong it is) or maybe it's just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
* Or maybe we do figure out that AI is worse at philosophy than other tasks, after it has been built, but it's too late to do anything with that knowledge (because who is going to tell the investors that they've lost their money because we don't want to differentially decelerate philosophical progress by deploying the AI).

Shutdown-Seeking AI

cousin_it2y10

If the AI can rewrite its own code, it can replace itself with a no-op program, right? Or even if it can't, maybe it can choose/commit to do nothing. So this approach hinges on what counts as "shutdown" to the AI.

Role Architectures: Applying LLMs to consequential tasks

cousin_it2y11

I think if AIs talk to each other using human language, they'll start encoding stuff into it that isn't apparent to a human reader, and this problem will get worse with more training.

1Vojtech Kovarik2y

This made me think of "lawyer-speak", and other jargons. More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

1Eric Drexler2y

I agree that using the forms of human language does not ensure interpretability by humans, and I also see strong advantages to communication modalities that would discard words in favor of more expressive embeddings. It is reasonable to expect that systems with strong learning capacity could interpret and explain messages between other systems, whether those messages are encoded in words or in vectors. However, although this kind of interpretability seems worth pursuing, it seems unwise to rely on it. The open-agency perspective suggests that while interpretability is important for proposals, it is less important in understanding the processes that develop those proposals. There is a strong case for accepting potentially uninterpretable communications among models involved in generating proposals and testing them against predictive models — natural language is insufficient for design and analysis even among humans and their conventional software tools. Plans of action, by contrast, call for concrete actions by agents, ensuring a basic form of interpretability. Evaluation processes can and should favor proposals that are accompanied by clear explanations that stand up under scrutiny.

Simulators

cousin_it2y*33

It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser... (read more)

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

cousin_it2y90

As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.

I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.

Implied "utilities" of simulators are broad, dense, and shallow

cousin_it2y32

I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.

But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example if you can make AI imitate a conversation between humans, then maybe you can make an AI that makes real world plans as fast as a committee of 10 smart humans conve... (read more)

Cognitive Emulation: A Naive AI Safety Proposal

cousin_it2y67

We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties.

Doesn't that require understanding why humans have (or don't have) certain safety properties? That seems difficult.

Some conceptual alignment research projects

cousin_it3y49

A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario

For what it's worth, I don't think AI takeover will look like war.

The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast and it's ok to take some risk doing it. Basically, covert takeover of the internet in the first hours.

After... (read more)

cousin_it3y280

Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

Stuart Armstrong3y60

Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.

Gordon Seidoh Worley3y*100

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k... (read more)

Inferring utility functions from locally non-transitive preferences

cousin_it3y*40

There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and vector fields, for the continuous case. Basically if you have a bunch of arrows, possibly loopy, you can always decompose it into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed, the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.
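As a small illustration of the decomposition described above, here is a sketch for the simplest discrete case: a complete comparison graph where every pair of options has one net preference strength. Under that assumption (and only that one) the least-squares utility of an option is just the mean of its row, and whatever is left over has zero net flow at every node, i.e. it circulates. The function name `hodge_split` and the example numbers are invented for illustration; sparse or weighted graphs would need a Laplacian least-squares solve instead.

```python
def hodge_split(flow):
    """Split pairwise preference flows into gradient and cyclic parts.

    flow[i][j] is the net strength with which option i is preferred to
    option j, with flow[j][i] == -flow[i][j]. Assumes every pair is
    compared (complete graph) with equal weight.
    Returns (utility, gradient, cyclic), where utility sums to zero,
    gradient[i][j] == utility[i] - utility[j], and
    flow == gradient + cyclic entry by entry.
    """
    n = len(flow)
    utility = [sum(row) / n for row in flow]
    gradient = [[utility[i] - utility[j] for j in range(n)] for i in range(n)]
    cyclic = [[flow[i][j] - gradient[i][j] for j in range(n)] for i in range(n)]
    return utility, gradient, cyclic


# Hypothetical example: A beats B by 3, B beats C by 1, A beats C by only 1,
# which is not consistent with any single utility function.
EXAMPLE_FLOW = [
    [0.0, 3.0, 1.0],
    [-3.0, 0.0, 1.0],
    [-1.0, -1.0, 0.0],
]

if __name__ == "__main__":
    utility, gradient, cyclic = hodge_split(EXAMPLE_FLOW)
    print("utility:", utility)
    print("cyclic part:", cyclic)
```

In this example the cyclic remainder is one unit of preference flowing around the loop A over B over C over A, and the gradient part comes from a utility that ranks A above B and C equally. The utility is unique only up to an additive constant, which the sketch fixes by making it sum to zero.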

0Jan Hendrik Kirchner3y

Fantastic, thank you for the pointer, learned something new today! A unique and explicit representation would be very neat indeed.

Reply to Eliezer on Biological Anchors

cousin_it3y42

With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn’t bet on that either (in real life, if AGI is developed in the coming decades—not based on what’s possible in principle.)

I think this misses the point a bit. The thing to be afraid of is not... (read more)

5Matthew Barnett3y

Unless I’m mistaken, the Bio Anchors framework explicitly assumes that we will continue to get algorithmic improvements, and even tries to estimate and extrapolate the trend in algorithmic efficiency. It could of course be that progress in reality will turn out a lot faster than the median trendline in the model, but I think that’s reflected by the explicit uncertainty over the parameters in the model. In other words, Holden’s point about this framework being a testbed for thinking about timelines remains unscathed if there is merely more ordinary algorithmic progress than expected.

Considerations on interaction between AI and expected value of the future

cousin_it3y*10

To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.

Considerations on interaction between AI and expected value of the future

cousin_it3y10

These involve extinction, so they don't answer the question what's the most likely outcome conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.

4Vanessa Kosoy3y

My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
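The Bayes step in this reply can be spelled out numerically. The sketch below is only an illustration: the function name is made up, and the probabilities plugged in are hypothetical placeholders, not estimates given by anyone in the thread.

```python
def p_misaligned_given_survival(p_aligned, p_survive_if_misaligned,
                                p_survive_if_aligned=1.0):
    """Pr[misalignment | non-extinction] by Bayes' rule.

    p_aligned:               Pr[alignment]
    p_survive_if_misaligned: Pr[non-extinction | misalignment]
    p_survive_if_aligned:    Pr[non-extinction | alignment], 1 in the reply
    """
    p_misaligned = 1.0 - p_aligned
    survive_and_misaligned = p_misaligned * p_survive_if_misaligned
    survive_and_aligned = p_aligned * p_survive_if_aligned
    return survive_and_misaligned / (survive_and_misaligned + survive_and_aligned)


if __name__ == "__main__":
    # Hypothetical inputs: misaligned AI rarely leaves survivors.
    print(p_misaligned_given_survival(p_aligned=0.3, p_survive_if_misaligned=0.01))
    # Hypothetical inputs: near-miss outcomes that keep people alive are common.
    print(p_misaligned_given_survival(p_aligned=0.3, p_survive_if_misaligned=0.5))
```

With the first set of placeholder inputs the posterior comes out small, matching the reply; with the second it is above one half. So the disagreement in the thread is largely about Pr[non-extinction | misalignment], i.e. how common survivable near-misses are.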

Considerations on interaction between AI and expected value of the future

cousin_it3y20

I think alignment is finicky, and there's a "deep pit around the peak" as discussed here.

Vanessa Kosoy3y60

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the two:

* Goodharting some proxy, such as making the reward signal go on instead of satisfying the human's request in order for the human to press the reward button. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
* Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe t

cousin_it3y10

There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)

I don't think we are indifferent to these outcomes. We leave them to luck, but that'... (read more)

2Stuart Armstrong3y

Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.

Considerations on interaction between AI and expected value of the future

cousin_it3y30

I think the default non-extinction outcome is a singleton with near miss at alignment creating large amounts of suffering.

Vanessa Kosoy3y80

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

Soares, Tallinn, and Yudkowsky discuss AGI cognition

cousin_it3y30

Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".

Soares, Tallinn, and Yudkowsky discuss AGI cognition

cousin_it3y100

“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.

4Samuel Dylan Martin3y

Also Climate Change itself doesn't completely not look like this scenario, same with nuclear deterrence.

Ngo and Yudkowsky on alignment difficulty

cousin_it3y*150

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers" values. That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems suf... (read more)

Ngo and Yudkowsky on alignment difficulty

cousin_it3y*50

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still u... (read more)

johnswentworth3y130

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way ... (read more)

Ngo and Yudkowsky on alignment difficulty

cousin_it3y50

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

Koen Holtman3y50

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer bel... (read more)

Steve Byrnes3y100

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Altern... (read more)

johnswentworth3y80

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is l... (read more)

The SIA population update can be surprisingly small

cousin_it4y30

Instant strong upvote. This post changed my view as much as the risk aversion post (which was also by you!)

2Stuart Armstrong4y

Thanks!

Non-poisonous cake: anthropic updates are normal

cousin_it4y40

Where are you on the spectrum from "SSA and SIA are equally valid ways of reasoning" to "it's more and more likely that in some sense SIA is just true"? I feel like I've been at the latter position for a few years now.

2Stuart Armstrong4y

More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

An Intuitive Guide to Garrabrant Induction

cousin_it4y20

Interesting! Can you write up the WLIC, here or in a separate post?

2Abram Demski4y

I should! But I've got a lot of things to write up! It also needs a better name, as there have been several things termed "weak logical induction" over time.

An Intuitive Guide to Garrabrant Induction

cousin_it4y60

I thought Diffractor's result was pretty troubling for the logical induction criterion:

...the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!

But maybe understanding has changed since then? What's the current state?

Abram Demski4y60

First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.

Personally, I found this result puzzling but far from damning.

Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more t... (read more)

Finite Factored Sets

cousin_it4y10

Wait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this:

Check which variables look uncorrelated
Assume they are orthogonal
From that orthogonality database, prove "before" relationships

Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?

2Scott Garrabrant4y

I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are only approximate false positives.

Finite Factored Sets

cousin_it4y*10

I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, ... (read more)

2Scott Garrabrant4y

Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you'd think. If you take a general reversible cellular automaton (critters for concreteness), and have a distribution over computations in general position in which the initial-condition cells are independent, the cells may not be independent at future time steps. If all of the initial probabilities are 1/2, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells. There will be other events at later times that will be independent, but those later time events will just represent "what was the state at time 0." For a concrete example, consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y.
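The two-cell example can be checked by brute force. What follows is a minimal sketch, not anything from the original discussion: the function names and the specific probabilities 1/3 and 1/5 (standing in for "general position") are assumptions chosen for illustration. It computes the exact joint distribution of the two cells after t steps and tests whether they are independent.

```python
from fractions import Fraction
from itertools import product

def step(state):
    """One step of the two-cell automaton: keep X, replace Y with X xor Y."""
    x, y = state
    return (x, x ^ y)

def joint_at_time(p_x, p_y, t):
    """Exact joint distribution of the two cells after t steps,
    starting from independent cells with P(X=1)=p_x and P(Y=1)=p_y."""
    dist = {}
    for x, y in product((0, 1), repeat=2):
        prob = (p_x if x else 1 - p_x) * (p_y if y else 1 - p_y)
        state = (x, y)
        for _ in range(t):
            state = step(state)
        dist[state] = dist.get(state, 0) + prob
    return dist

def cells_independent(dist):
    """True if the two cells are independent under dist."""
    p_first = {v: sum(p for (x, _), p in dist.items() if x == v) for v in (0, 1)}
    p_second = {v: sum(p for (_, y), p in dist.items() if y == v) for v in (0, 1)}
    return all(
        dist.get((x, y), 0) == p_first[x] * p_second[y]
        for x, y in product((0, 1), repeat=2)
    )

# Illustrative (assumed) probabilities: one "general position" pair, one uniform pair.
general = (Fraction(1, 3), Fraction(1, 5))
uniform = (Fraction(1, 2), Fraction(1, 2))
for t in range(4):
    print(t,
          cells_independent(joint_at_time(*general, t)),
          cells_independent(joint_at_time(*uniform, t)))
```

With the general-position probabilities the cells are independent at time 0 and dependent at time 1, while the uniform distribution stays independent at every step. This toy automaton has period two, so its even time steps coincide with time 0; a larger automaton would not repeat so quickly.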

Finite Factored Sets

cousin_it4y20

Thanks for the response! Part of my confusion went away, but some still remains.

In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?

3Scott Garrabrant4y

Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of "before."

Finite Factored Sets

cousin_it4y*40

Not sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?

5Scott Garrabrant4y

I don't have a great answer, which isn't a great sign. I think the scientist can infer things like "algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn enough information to determine X, but then forget some of that information." But why should I think of that as time? I think the scientist can infer things like "If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y." The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because "Z is orthogonal to Y" can be thought of as saying that Z is useless for learning about Y. (And importantly, a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.) This doesn't seem to get at why we want to call it before. Hmm. Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order): 1. It seems like it gets a very reasonable answer in the Game of Life example 2. Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.) 3. If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don't have a thing to say about why this feels like time to me, and indeed it feels like it is in direct opposition to your "does

Finite Factored Sets

cousin_it4y*40

I feel that interpreting "strictly before" as causality is making me more confused.

For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but no... (read more)

3Scott Garrabrant4y

I partially agree, which is partially why I am saying time rather than causality. I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice's knowledge is (not) before the physical thing that is Bob's knowledge. In my ontology: 1) the information content of Alice's knowledge is before the information content of Bob's knowledge. (I am curious if this part is controversial.) and then, 2) there is in some sense no more to say about the physical thing that is e.g. Alice's knowledge beyond the information content. So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can't disentangle these statements because Alice+Bob=Bob. I am not sure what to say about the second example. I am somewhat rejecting the dynamics. "Alice travels back in time" is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The "high level" here is pointing to the fact that we are only looking at the part of Alice's brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice's entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable. I suspect most of the disagreement is in the sort of "variable nonrealism" of reducing the physical thing that is Alice's knowledge to its information content?

Finite Factored Sets

cousin_it4y*50

I think the definition of history is the most natural way to recover something like causal structure in these models.

I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:

Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.
Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.
Bob peeks at ten envelopes, then tells Alice the contents of three of them.

Under the FFS definition, Alice's knowledge in each ... (read more)

Paul Christiano4y100

Agree it's not totally right to call this a causal relationship.

That said:

The contents of 3 envelopes does seem causally upstream of the contents of 10 envelopes
If Alice's perception is imperfect (in any possible world), then "what Alice perceived" is not identical to "the contents of 3 envelopes" and so is not strictly before "what Bob perceived" (unless there is some other relationship between them).
If Alice's perception is perfect in every possible world, then there is no possible way to intervene on Alice's perception without intervening on the conten

... (read more)

Finite Factored Sets

cousin_it4y20

Can you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl's causality, or Newcomb's problem and counterfactual mugging for UDT.

Scott Garrabrant4y60

Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.

If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.

I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a varia... (read more)

Finite Factored Sets

cousin_it4y*70

Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.
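The three-variable construction can be made concrete. The sketch below is an illustration under assumptions, not the FFS inference procedure itself: it builds A and B independent with C forced by "A xor B xor C = 1", and only lists which pairs of variables are independent. The function names and the probabilities 1/3 and 1/5 are hypothetical choices standing in for "general position".

```python
from fractions import Fraction
from itertools import product

def joint(p_a, p_b):
    """Joint distribution of (A, B, C) where A and B are independent
    and C is determined by the constraint A xor B xor C = 1."""
    dist = {}
    for a, b in product((0, 1), repeat=2):
        prob = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        dist[(a, b, 1 ^ a ^ b)] = prob
    return dist

def independent(dist, i, j):
    """True if coordinates i and j of the outcome are independent under dist."""
    def marginal(idx, val):
        return sum(p for s, p in dist.items() if s[idx] == val)

    def pair(vi, vj):
        return sum(p for s, p in dist.items() if s[i] == vi and s[j] == vj)

    return all(
        pair(vi, vj) == marginal(i, vi) * marginal(j, vj)
        for vi, vj in product((0, 1), repeat=2)
    )

def independent_pairs(dist):
    """Indices (0=A, 1=B, 2=C) of every pairwise-independent pair."""
    return [(i, j) for i, j in ((0, 1), (0, 2), (1, 2)) if independent(dist, i, j)]

print(independent_pairs(joint(Fraction(1, 3), Fraction(1, 5))))  # only A, B
print(independent_pairs(joint(Fraction(1, 2), Fraction(1, 2))))  # all three pairs
```

In the general-position case only the pair (A, B) is independent, which is the asymmetry that singles out C. With both probabilities equal to 1/2 every pair is independent, so the pairwise pattern no longer distinguishes any variable.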

3Scott Garrabrant4y

Ok, makes sense. I think you are just pointing out that when I am saying "general position," that is relative to a given structure, like FFS or DAG or symmetric FFS. If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set). We have a version of the fundamental theorem for DAGs and d-separation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and we might get a version of the fundamental theorem for SFFS and whatever corresponds to conditional independence in that world. However, I claim that even if we can extend to a fundamental theorem for SFFS, I still want to think of the independences in a SFFS as having different sources. There are the independences coming from orthogonality, and there are the independences coming from symmetry (or symmetry together with orthogonality). In this world, orthogonality won't be as inferable because it will only be a subset of independence, but it will still be an important concept. This is similar to what I think will happen when we go to the infinite dimensional factored sets case.

0acgt4y

I’m confused what necessary work the Factorisation is doing in these temporal examples - in your example A and B are independent and C is related to both - the only assignment of “upstream/downstream” relations that makes sense is that C is downstream of both. Is the idea that factorisation is what carves your massive set of possible worlds up into these variables in the first place? Feel like I’m in a weird position where the math makes sense but I’m missing the motivational intuition for why we want to switch to this framework in the first place

Finite Factored Sets

cousin_it4y30

Yeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.)

But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?
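The car example can be checked numerically. This is a minimal sketch with made-up numbers: `p_car`, `p_gain` and `p_lose` are hypothetical parameters (probability of owning a car on Monday, of a car-less person gaining one overnight, and of an owner losing one overnight). Since X xor Y is the event "car status changed", X is independent of X xor Y exactly when the chance of a change does not depend on Monday's status.

```python
from fractions import Fraction

def x_independent_of_change(p_car, p_gain, p_lose):
    """X = has a car on Monday, Y = has a car on Tuesday.
    Returns True if X is independent of X xor Y ("status changed")."""
    # Total probability that the status changes overnight.
    p_change = (1 - p_car) * p_gain + p_car * p_lose
    # Probability of owning a car on Monday and then losing it.
    p_car_and_change = p_car * p_lose
    # For two binary variables, one product equality implies full independence.
    return p_car_and_change == p_car * p_change

# Gaining a car is more likely than losing one: independence fails.
print(x_independent_of_change(Fraction(3, 10), Fraction(1, 5), Fraction(1, 20)))
# Equal gain and loss rates: independence holds.
print(x_independent_of_change(Fraction(3, 10), Fraction(1, 10), Fraction(1, 10)))
```

Under these assumed numbers, independence holds only when the gain and loss rates coincide, which is the kind of coincidence among probabilities the comment calls a conspiracy.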

3Scott Garrabrant4y

I don't understand what conspiracy is required here. X being orthogonal to X XOR Y implies X is before Y, we don't get the converse.

Finite Factored Sets

cousin_it4y20

And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!

It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".
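The claim about P(X)=P(Y)=1/2 can be verified directly. In the sketch below (function names are illustrative assumptions), any joint distribution with both marginals equal to 1/2 is determined by a single number a = P(X=1, Y=1), and the check confirms that X is independent of X xor Y for every such a, however strongly X and Y depend on each other.

```python
from fractions import Fraction

def joint_with_fair_marginals(a):
    """Joint of (X, Y) with P(X=1) = P(Y=1) = 1/2 and P(X=1, Y=1) = a,
    valid for 0 <= a <= 1/2."""
    half = Fraction(1, 2)
    return {(1, 1): a, (1, 0): half - a, (0, 1): half - a, (0, 0): a}

def x_independent_of_xor(dist):
    """True if X is independent of X xor Y under dist."""
    p_x = sum(p for (x, _), p in dist.items() if x == 1)
    p_xor = sum(p for (x, y), p in dist.items() if x ^ y == 1)
    p_both = sum(p for (x, y), p in dist.items() if x == 1 and x ^ y == 1)
    # For two binary variables, one product equality implies full independence.
    return p_both == p_x * p_xor

for a in (Fraction(0), Fraction(1, 10), Fraction(1, 4), Fraction(2, 5), Fraction(1, 2)):
    print(a, x_independent_of_xor(joint_with_fair_marginals(a)))
```

Algebraically, P(X=1, X xor Y=1) = 1/2 - a and P(X xor Y=1) = 1 - 2a, so the product P(X=1) P(X xor Y=1) = (1/2)(1 - 2a) always matches.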

4Scott Garrabrant4y

So you should probably not work with probabilities equal to 1/2 in this framework, unless you are doing so for a specific reason. Just like in Pearlian causality, we are mostly talking about probabilities in general position. I have some ideas about how to deal with probability 1/2 (Have a FFS, together with a group of symmetry constraints, which could swap factors, or swap parts within a factor), but that is outside of the scope of what I am doing here. To give more detail, the uniform distribution on four elements does not satisfy the compositional semigraphoid axioms, since if we take X, Y, Z to be the three partitions into two parts of size two, X is independent with Y and X is independent with Z, but X is not independent with the common refinement of Y and Z. Thus, if we take the orthogonality database generated by this probability distribution, you will find that it is not satisfied by any models.
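The four-element counterexample is small enough to enumerate. The following sketch makes some labeling assumptions: the elements are called 0..3 and each partition is encoded as a map from element to block label. It checks that X is independent of Y and of Z under the uniform distribution, but not of their common refinement.

```python
from fractions import Fraction

OMEGA = (0, 1, 2, 3)

# The three partitions of a four-element set into two blocks of size two.
X = {0: 0, 1: 0, 2: 1, 3: 1}
Y = {0: 0, 1: 1, 2: 0, 3: 1}
Z = {0: 0, 1: 1, 2: 1, 3: 0}

def common_refinement(f, g):
    """Partition whose blocks are the nonempty intersections of blocks of f and g."""
    return {w: (f[w], g[w]) for w in OMEGA}

def prob(event):
    """Probability of a set of elements under the uniform distribution."""
    return Fraction(len(event), len(OMEGA))

def independent(f, g):
    """True if partitions f and g are independent under the uniform distribution."""
    for u in set(f.values()):
        for v in set(g.values()):
            f_block = {w for w in OMEGA if f[w] == u}
            g_block = {w for w in OMEGA if g[w] == v}
            if prob(f_block & g_block) != prob(f_block) * prob(g_block):
                return False
    return True

print(independent(X, Y))                        # X vs Y
print(independent(X, Z))                        # X vs Z
print(independent(X, common_refinement(Y, Z)))  # X vs refinement of Y and Z
```

Here the common refinement of Y and Z separates all four elements, so it determines X completely, which is why the last check fails even though the first two succeed.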

Collection of GPT-3 results

cousin_it5y10

Thank you! It looks very impressive.

Collection of GPT-3 results

cousin_it5y30

Has anyone tried to get it to talk itself out of the box yet?

1Yoav Ravid5y

Yup, I saw an attempt on the SSC subreddit