All of Alex_Altair's Comments + Replies

Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.

Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake

I wouldn't put number 9. It's not intended to "solve" most of these problems, but it is intended to help make progress on understanding the nature of the problems through... (read more)

I finally got around to reading this sequence, and I really like the ideas behind these methods. This feels like someone actually trying to figure out exactly how fragile human values are. It's especially exciting because it seems like it hooks right into an existing, normal field of academia (thus making it easier to leverage their resources toward alignment).

I do have one major issue with how the takeaway is communicated, starting with the term "catastrophic". I would only use that word when the outcome of the optimization is really bad, much worse than ... (read more)

5Thomas Kwa
We considered that "catastrophic" might have that connotation, but we couldn't think of a better name and I still feel okay about it. Our intention with "catastrophic" was to echo the standard ML term of "catastrophic forgetting", not a global catastrophe. In catastrophic forgetting, the model completely forgets how to do task A after it is trained on task B; it doesn't do A much worse than random. So we think that "catastrophic Goodhart" gives the correct idea to people who come from ML.

The natural question is then: why didn't we study circumstances in which optimizing for a proxy gives you −∞ utility in the limit? Because it isn't true under the assumptions we are making. We wanted to study regressional Goodhart, and this naturally led us to the independence assumption. Previous work like Zhuang et al and Skalse et al has already formalized the extremal Goodhart / "use the atoms for something else" argument that optimizing for one goal would be bad for another goal, and we thought the more interesting part was showing that bad outcomes are possible even when error and utility are independent.

Under the independence assumption, it isn't possible to get less than 0 utility. To get −∞ utility in the frame where proxy = error + utility, you would need to assume something about the dependence between error and utility, and we couldn't think of a simple assumption to make that didn't have too many moving parts. I think extremal Goodhart is overall more important, but it's not what we were trying to model.

Lastly, I think you're imagining the "average" outcome as a random policy, which is an agent incapable of doing significant harm. The utility of the universe is still positive because you can go about your life. But in a different frame, random is really bad. Right now we pretrain models and then apply RLHF (and hopefully soon, better alignment techniques). If our alignment techniques produce no more utility than the prior, this means the model is no more aligned than t
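To make the independence setup concrete, here's a rough simulation sketch (my own construction, not code from the paper) of the proxy = utility + error frame. With independent components, selecting hard on the proxy still gains utility when the error is light-tailed, but with heavy-tailed error the selected states are mostly just extreme error, and utility stays near the prior mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Proxy = utility + error, with utility and error INDEPENDENT
# (the regressional-Goodhart assumption discussed above).
utility = rng.normal(size=n)

for name, error in [("light-tailed (normal) error", rng.normal(size=n)),
                    ("heavy-tailed (t, df=2) error", rng.standard_t(2, size=n))]:
    proxy = utility + error
    top = proxy >= np.quantile(proxy, 0.9999)   # optimize the proxy hard
    print(f"{name}: mean utility of top-proxy states = {utility[top].mean():+.2f}")
```

Note that in this setup utility can't drop below the prior mean, matching the point that independence rules out −∞ outcomes; genuinely negative utility would require some dependence between error and utility.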

I'll also note that I think what you're calling "Vingean agency" is a notable sub-type of optimization process that you've done a good job at analyzing here. But it's definitely not the definition of optimization or agency to me. For example, in the post you say

We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.

This doesn't feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.

3Abram Demski
Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition. First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.

But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.

I have some comments on the arbitrariness of the "baseline" measure in Yudkowsky's measure of optimization.

Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there's an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I'll immediately realize that there's no way this is random and instead there's an optimization process that I wasn't previously modelling. In cases like this, I think... (read more)
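For reference, a minimal sketch of the baseline-dependence being pointed at (my own toy construction, treating optimization power in Yudkowsky's style as bits of improbability under a baseline distribution):

```python
import numpy as np

def optimization_power_bits(observed, baseline_scores):
    """Yudkowsky-style measure: bits of improbability of doing at least
    this well under the baseline distribution you were modelling."""
    p = np.mean(baseline_scores >= observed)
    return -np.log2(p) if p > 0 else np.inf

rng = np.random.default_rng(0)
observed = 4.0  # "orderliness" of the scene actually seen from the hilltop

wilderness = rng.normal(0, 1, size=1_000_000)  # baseline: natural forest
suburbia   = rng.normal(7, 1, size=1_000_000)  # baseline: outskirts of a city

print(optimization_power_bits(observed, wilderness))  # ~15 bits: "optimized!"
print(optimization_power_bits(observed, suburbia))    # ~0 bits: unsurprising
```

The same observation scores wildly differently depending purely on which baseline you were modelling before you crested the hill, which is the arbitrariness at issue.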

3Abram Demski
I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.

Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin. In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent.

Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:

* Flint's definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint's definition doesn't work well for a goal like "become President in 2030" -- it works better for continual goals, like "be president".
* Flint's notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we'll never see in practice really be relevant and useful for reasoning about alignment?
* Flint's notion is based more on the physical system, whereas mine is more about how we subjectively view that system.
* I feel that "endorsement" comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is tru

I feel like there's a key concept that you're aiming for that isn't quite spelled out in the math.

I remember reading somewhere that there's a typically unmentioned distinction between "Bayes' theorem" and "Bayesian inference". Bayes' theorem is the statement that P(A|B) = P(B|A)P(A)/P(B), which is true from the axioms of probability theory for any A and B whatsoever. Notably, it has nothing to do with time, and it's still true even after you learn B. On the other hand, Bayesian inference is the premise that your beliefs should change in accordance... (read more)
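A toy sketch of the distinction (my own construction, not from the post): the theorem is a static identity among your current beliefs, while inference is the further rule that, upon observing B, you replace P(A) with your old P(A|B):

```python
# Bayes' THEOREM: a timeless identity among current beliefs.
def conditional(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B); holds for any A, B with P(B) > 0."""
    return p_b_given_a * p_a / p_b

# Bayesian INFERENCE: the further premise that on observing B, your new
# unconditional P(A) should become your old P(A|B).
prior = {"rain": 0.3, "no_rain": 0.7}
p_wet_given = {"rain": 0.9, "no_rain": 0.2}
p_wet = sum(p_wet_given[a] * p for a, p in prior.items())   # = 0.41

# The theorem was already true before seeing the wet ground; the *update*
# (replacing the prior with the posterior) is the extra, time-indexed step.
posterior = {a: conditional(p_wet_given[a], p, p_wet) for a, p in prior.items()}
print(posterior)   # {'rain': 0.658..., 'no_rain': 0.341...}
```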

3Abram Demski
There are several compromises I made for the sake of getting the idea across as simply as I could.

* I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
* I totally ignored the difference between P(A|B) (probability conditional on B) and P_B(A) (probability after learning B).
* I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.

I think I do prefer the version I wrote, which uses P(A|B) rather than P_B(A), but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is P_B(A). Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.

To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them).

* You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don't equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself.
* Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would mak

You might be interested in some of my open drafts about optimization.

One distinction that I pretty strongly hold as carving nature at its joints is (what I call) optimization vs agents. Optimization has no concept of a utility function; it's just about the state going up an ordering. Agents are the thing that has a utility function, which they need for picking actions with probabilistic outcomes.
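A minimal sketch of the distinction (hypothetical names, just to illustrate): an optimizer only needs to compare states, while an agent needs cardinal utilities, because expected values over probabilistic outcomes aren't defined for a bare ordering:

```python
# An optimizer, in this sense, needs only an ORDERING on states: it can say
# which of two states is higher, with no notion of "how much" higher.
def hill_climb(state, neighbors, better_than, steps=100):
    for _ in range(steps):
        for s in neighbors(state):
            if better_than(s, state):
                state = s
                break
    return state

# An agent needs CARDINAL utilities: choosing among actions with probabilistic
# outcomes means taking expectations, which a bare ordering can't support.
def choose_action(actions, outcome_dist, utility):
    def expected_utility(a):
        return sum(p * utility(o) for o, p in outcome_dist(a).items())
    return max(actions, key=expected_utility)

# Toy demo: a 50% shot at outcome 10 vs a sure 4 -- only expected utility
# (not an ordering over outcomes alone) can adjudicate this.
dist = lambda a: {10: 0.5, 0: 0.5} if a == "gamble" else {4: 1.0}
print(choose_action(["gamble", "sure"], dist, utility=lambda o: o))  # gamble
```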

1Matt MacDermott
Nice, I'd read the first but didn't realise there were more. I'll digest later.

I think agents vs optimisation is definitely reality-carving, but I'm not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards states, but an agent tries to move the world towards certain states, i.e. chooses actions based on how much they move the world towards certain states, so it makes sense to quantify how much of a weighting each state gets in its decision-making.

But it's not obvious to me that there's not a meaningful way to assign weightings to states for an optimisation process too -- for example, if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?
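For what it's worth, one off-the-shelf way to do the quantification gestured at here is a Boltzmann-style revealed utility, u(s) proportional to the log of the empirical frequency (my own sketch, and it runs straight into the circularity noted above):

```python
import math

# Ball-and-holes frequencies from the example: large twice as often as
# medium, ten times as often as small (ratios 10 : 5 : 1).
freq = {"large_hole": 10/16, "medium_hole": 5/16, "small_hole": 1/16}

# Boltzmann-style revealed utility: u(s) = log p(s), so a state the process
# ends in k times as often gets log(k) more "utility".
utility = {s: math.log(p) for s, p in freq.items()}
print(utility)
```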

I feel very on-board with this research aesthetic.

Here are just some nit-picks/notational confusions I had while reading this:

  • The sequence s = (x, f(x), f(f(x)), ...), i.e., s_n = f^n(x), is the computation seeded at x (or a "trajectory" in dynamical systems terminology).

...

  • A property P is achieved by a computation s if there exists some number of steps n such that ...

It took me a second to figure out what s referred to, partly because the first s was not rendered in LaTeX, partly because it was n... (read more)
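A minimal executable reading of the quoted definitions (function and variable names are my own guesses at the intended notation):

```python
def computation(f, x, n_steps):
    """The computation seeded at x: the trajectory x, f(x), f(f(x)), ..."""
    traj = [x]
    for _ in range(n_steps):
        traj.append(f(traj[-1]))
    return traj

def achieves(P, f, x, max_steps=1000):
    """P is achieved by the computation seeded at x if P holds at some step n."""
    return any(P(s) for s in computation(f, x, max_steps))

# Toy check with the Collatz step: the computation seeded at 6 achieves "s == 1".
step = lambda s: s // 2 if s % 2 == 0 else 3 * s + 1
print(computation(step, 6, 8))              # [6, 3, 10, 5, 16, 8, 4, 2, 1]
print(achieves(lambda s: s == 1, step, 6))  # True
```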

1Victoria Krakovna
Thanks Alex for the detailed feedback! I have updated the post to fix these errors.  Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work. 

I would especially love it if it popped out a .tex file that I could edit, since I'm very likely to be using different language on LW than I would in a fancy academic paper.

2David Manheim
Seconding the .tex export, since it's much more useful than just getting a pdf!