Continuing the experiment from August, let's try another open thread for AI Alignment discussion. The goal is to be a place where researchers and upcoming researchers can ask small questions they are confused about, share early-stage ideas, and have lower-key discussions.


Question for those (such as Paul Christiano) who both are optimistic about corrigibility as a central method for AI alignment, and think that large corporations or other large organizations (such as Google) will build the first AGIs. A corrigible AI built by Google will likely be forced to share Google’s ideological commitments, in other words, to assign zero or near-zero probability to beliefs that are politically unacceptable within Google and to maintain that probability against whatever evidence exists in the world. Is this something that you have thought about, and if so, what’s a good reason to be optimistic about this situation? For example, do you foresee that the human-AI system will become less biased over time and converge to something like Bayesian rationality, and if so how?

More generally, ideology and other forms of loyalty/virtue signaling seem like something that deserves more attention from a human-AI safety perspective, or even just a purely AI safety perspective. For example, can multi-agent systems develop loyalty/virtue signaling even without human involvement, and if so is there anything we can do to ensure that doesn’t spiral into disaster?

More generally, ideology and other forms of loyalty/virtue signaling seem like something that deserves more attention from a human-AI safety perspective

For people looking into this in the future, here are a couple of academic resources:

Studying recent cultural changes in the US and the ideas of virtue signaling and preference falsification more generally has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection). I used to think that if we could figure out how to achieve strong global coordination on AI, or build a stable world government, then we'd be able to take our time, centuries or millennia if needed, to figure out how to build an aligned superintelligent AI. But it seems that human cultural/moral evolution often happens through some poorly understood but apparently quite unstable dynamics, rather than by philosophers gradually making progress on moral philosophy and ultimately converging to moral truth as I may have imagined or implicitly assumed. (I did pay lip service to concerns about "value drift" back then but I guess it just wasn't that salient to me.)

Especially worrying is that no country or culture seems immune to these unpredictable dynamics. My father used to tell me to look out for the next Cultural Revolution (having lived through one himself), and I always thought that it was crazy to worry about something like that happening in the West. Well I don't anymore.

Part of why I'm skeptical of these concerns is that it seems like a lot of moral behavior is predictable as society gets richer, and we can model the social dynamics to predict that some outcomes will be good.

As evidence for the predictability, consider that rich societies are more open to LGBT rights; they have explicit policies against racism, war, slavery, and torture; and they seem to be moving in the direction of government control over many aspects of life, such as education and healthcare. Is this just a quirk of our timeline, or a natural feature of civilizations of humans as they get richer?

I am inclined to think much of it is the latter.

That's not to say that I think the current path we're on is a good one. I just think it's more predictable than you seem to think. Given its predictability, I feel somewhat confident in the following statements: eventually, when aging is cured, people will adopt policies that give people the choice to die. Eventually, when artificial meat is very cheap and tasty, people will ban animal-based meat.

I'm not predicting these outcomes because I am confusing what I hope for and what I think will happen. I just genuinely think that human virtue signaling dynamics will be favorable to those outcomes.

I'm less confident, leaning pessimistic about these questions: I don't think humans will inevitably care about wild animal suffering. I don't think humans will inevitably create a post-human utopia where people can modify their minds into any sort of blissful existence they imagine, and I don't think humans will inevitably care about subroutine suffering. It's these questions that make me uneasy about the future.

By unpredictable I mean that nobody really predicted:

(Edit: 1-3 removed to keep a safer distance from object-level politics, especially on AF)

4. Russia and China adopted communism even though they were extremely poor. (They were ahead of the US in gender equality and income equality for a time due to that, even though they were much poorer.)

None of these seem well-explained by your "rich society" model. My current model is that social media and a decrease in the perception of external threats relative to internal threats both favor more virtue signaling, which starts spiraling out of control after some threshold is crossed. But the actual virtue(s) that end up being signaled/reinforced (often at the expense of other virtues) are historically contingent and hard to predict.

I could be wrong here, but the stuff you mentioned as counterexamples to my model appears either ephemeral or too particular. The "last few years" of political correctness is hardly enough time to judge world-trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seems likely to stick and stay with us for decades, if not centuries.

We can explain this after the fact by saying that the Left is being forced by impersonal social dynamics, e.g., runaway virtue signaling, to over-correct, but did anyone predict this ahead of time?

When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen.

The main difference is that it's now been amplified as recent political events have increased polarization, the people with older values are dying of old age or losing their power, and we have social media that makes us more aware of what is happening. But in hindsight I think this scenario isn't that surprising.

Russia and China adopted communism even though they were extremely poor

Of course, you can point to a few examples of where my model fails. I'm talking about the general trends rather than the specific cases. If we think in terms of world history, I would say that Russia in the early 20th century was "rich" in the sense that it was much richer than countries in previous centuries and this enabled it to implement communism in the first place. Government power waxes and wanes, but over time I think its power has definitely gone up as the world has gotten richer, and I think this could have been predicted.

When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen.

I think what's surprising is that although academia has been left-leaning for decades, the situation had been relatively stable until the last few years, when things suddenly progressed very quickly, to the extent that even professors who firmly belong on the Left are being silenced or driven out of academia for disagreeing with an ever-changing party line. (It used to be that universities at least paid lip service to open inquiry, overt political correctness was confined to non-STEM fields, and there was relatively open discussion among people who managed to get into academia in the first place. At least that's my impression.) Here are a couple of links for you if you haven't been following the latest developments:

A quote from the second link:

Afterward, several faculty who had attended the gathering told me they were afraid to speak in my defense. One, a full professor and past chair, told me that what had happened was very wrong but he was scared to talk.

Another faculty member, who was originally from China and lived through the Cultural Revolution, told me it was exactly like the shaming sessions of Maoist China, with young Red Guards criticizing and shaming elders they wanted to embarrass and remove.

(BTW I came across this without specifically searching for "cultural revolution".) Note that the author is in favor of carbon taxes in general and supported past attempts to pass carbon taxes, and was punished for disagreeing with a specific proposal that he found issue with. How many people (if any) predicted that things like this would be happening on a regular basis at this point?

I could be wrong here, but the stuff you mentioned appears either ephemeral or too particular. The “last few years” of political correctness is hardly enough time to judge world-trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seems likely to stick and stay with us for decades, if not centuries.

It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?

In the context of AI strategy though (specifically something like the Long Reflection), I would be worried that a world in the grips of another Cultural Revolution would be very tempted to abandon (or find it impossible to refrain from abandoning) the plan to delay AGI, and would instead build and lock their values into a superintelligent AI ASAP, even if that involves more safety risk. Predictability of longer term moral trends (even if true) doesn't seem to help with this concern.

It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?

That's pretty fair.

I think it's likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy. However, the deviations from long-term trends are very hard to predict, as you point out, and we should know about the specifics more as we get further along. In the absence of concrete details, I find it far more helpful to use information from long-term trends rather than worrying about specific scenarios.

I think it’s likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy.

This seems to be ignoring the part of my comment at the top of this sub-thread, where I said "[...] has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection)." In other words, I'm envisioning a long period of time in which humanity has the technical ability to create an AGI but is deliberately holding off to better figure out our values or otherwise perfect safety/alignment. I'm worried about something like the Cultural Revolution happening in this period, and you don't seem to be engaging with that concern?

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there's no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

Now I see that your worry is more narrow, in that the cultural revolution might happen during this period, and people would act unwisely by creating the AGI in its wake. I guess this seems quite plausible, and is an important concern, though I personally am skeptical that anything like the long reflection will ever happen.

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there’s no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

I guess I was also expressing a more general update towards more pessimism, where even if nothing happens during the Long Reflection that causes it to prematurely build an AGI, other new technologies that will be available/deployed during the Long Reflection could also invalidate the historical tendency for "Cultural Revolutions" to dissipate over time and for moral evolution to continue along longer-term trends.

though I personally am skeptical that anything like the long reflection will ever happen.

Sure, I'm skeptical of that too, but given my pessimism about more direct routes to building an aligned AGI, I thought it might be worth pushing for it anyway.

I feel like there's currently a wave of optimism among some AI safety researchers around transparency/interpretability, and to me it looks like another case of "optimism by default + not thinking things through", analogous to how many people, such as Eliezer, were initially very optimistic about AGI being beneficial when they first thought of the idea. I find myself asking the same skeptical questions to different people who are optimistic about transparency/interpretability and not really getting good answers. Anyone want to try to convince me that I'm wrong about this?

If you expect discontinuous takeoff, or you want a proof that your AGI is safe, then I agree transparency / interpretability is unlikely to give you what you want.

If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try. (Red teaming would be included in this.)

However, I suspect Chris Olah, Evan Hubinger, Daniel Filan, and Matthew Barnett would all not justify interpretability / transparency on these grounds. I don't know about Paul Christiano.

If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try.

I support work on interpretability/transparency, in part because I'm uncertain about discontinuous vs gradual takeoff, and in part because I'm not very optimistic about any other AI safety approach either and think we probably just need to try a whole bunch of different approaches that each have low probability of success in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse compiling a neural network into human readable code and then using that to generate human feedback on the model’s decision-making process) to be very questionable.

Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.

(If a single failure meant that we lose, then I wouldn't say this; so perhaps we also need to add in another claim that the first failure does not mean automatic loss. Regular engineering practices get you to high degrees of reliability, not perfect reliability.)

Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.

What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from "single failure meant that we lose", the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case WRT interpretability I was complaining that having humans look at reverse compiled neural networks and give "feedback on process" as part of ML training seems impractically expensive.

What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?

Two responses:

First, this is more of a social coordination problem -- I'm claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.

Second, is there a consensus that recommendation algorithms are net negative? Within this community, that's probably the consensus, but I don't think it's a consensus more broadly. If we can't solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.

(Part of the social coordination problem is building consensus that something is wrong.)

the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.

For many ways of how they push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I'd predict that most people optimistic about transparency / interpretability would agree with at least that example.

First, this is more of a social coordination problem—I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.

Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn't aligned. (I didn't state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about "solving", not just "noticing".)

As I've argued before, I think a large part of solving social coordination is making sure that strategists and policy makers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.

I was one of those people who you asked the skeptical question to, and I feel like I have a better reply now than I did at the time. In particular, your objection was

To generalize my question, what if something goes wrong, we peek inside and find out that it's one of the 10-15% of times when the model doesn't agree with the known-algorithm which is used to generate the penalty term?

I agree this is an issue, but at worst it puts a bound on how well we can inspect the neural network's behavior. In other words, it means something like, "Our model of what this neural network is doing is wrong X% of the time." This sounds bad, but X can also be quite low. Perhaps more importantly though, we shouldn't expect by default that in the X% of times where our guess is bad, the neural network is adversarially optimizing against us.

The errors that we make are potentially neutral errors, meaning that the AI could be doing something either bad or good in those intervals, but probably nothing purposely catastrophic. We can strengthen this condition by using adversarial training to purposely search for interpretations that would prioritize exposing catastrophic planning.

ETA: This is essentially why engineers don't need to employ quantum mechanics to argue that their designs are safe. The normal models that are less computationally demanding might be less accurate, but by default engineers don't think that their bridge is going to adversarially optimize for the (small) X% where predictions disagree. There is of course a lot of stuff to be said about when this assumption does not apply to AI designs.
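To make the "wrong X% of the time" framing a bit more concrete, here is a minimal sketch, with entirely hypothetical model_fn / proxy_fn stand-ins (this is not anyone's actual proposal or a real interpretability API): estimate how often an interpretable proxy disagrees with the model it is supposed to explain, and do a crude random search for inputs that expose disagreements, which is the simplest possible version of the adversarial search mentioned above.

```python
# Minimal sketch; model_fn and proxy_fn are hypothetical stand-ins, not a real API.
import numpy as np

def disagreement_rate(model_fn, proxy_fn, inputs):
    """Estimate X: the fraction of inputs where the interpretable proxy's
    account of the model disagrees with the model's actual output."""
    return float(np.mean([model_fn(x) != proxy_fn(x) for x in inputs]))

def find_disagreements(model_fn, proxy_fn, sample_input, n_samples=10_000):
    """Crude random search for inputs exposing model/proxy disagreement.
    A serious version would use a trained adversary rather than random sampling."""
    candidates = (sample_input() for _ in range(n_samples))
    return [x for x in candidates if model_fn(x) != proxy_fn(x)]
```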

Perhaps more importantly though, we shouldn’t expect by default that in the X% of times where our guess is bad, the neural network is adversarially optimizing against us.

I'm confused because the post you made one day later from this comment seems to argue the opposite of this. Did you change your mind in between, or am I missing something?

I thought about your objection longer and realized that there are circumstances where we can expect the model to adversarially optimize against us. I think I've less changed my mind, and more clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.

My new research direction for an "end-to-end" alignment scheme.

See also this clarifying comment.

I'm posting this in the Open Thread because, for technical reasons, shortforms don't appear in the feed on the main page of alignmentforum, and I am a little worried people missed it entirely (I discussed it with Oliver).

Human values can change a lot over a short amount of time, to the extent that maybe the commonly used "value drift" is not a good term to describe it. After reading Geoffrey Miller, my current model is that a big chunk of our apparent values comes from the need to do virtue signaling. In other words, we have certain values because it's a lot easier to signal having those values when you really do have them. But the optimal virtues/values to signal can change quickly due to positive and negative feedback loops in the social dynamics around virtue signaling and for other reasons (which I don't fully understand), which in turn causes many people's values to quickly change in response, and moreover causes the next generation (whose values are more malleable) to adopt values different from their parents'.

I don't yet know the implications of this for AI alignment, but it seems like an important insight to share before I forget.

[Question about factored cognition]

Suppose that at some point in the future, for the first time in history, someone trains an ML model that takes any question as input, and outputs an answer that an ~average human might have given after thinking about it for 10 minutes. Suppose that model is trained without any safety-motivated interventions.

Suppose also that the architecture of that model is such that '10 minutes' is just a parameter, t, that the operator can choose per inference, and there's no upper bound on it; and the inference runtime increases linearly with t. So, for example, the model could be used to get an answer that a human would have come up with after thinking for 1000 days.

In this scenario, would it make sense to use the model for factored cognition? Or should we consider running this model with t = 1,000 days to be no more dangerous than running it many times with t = 10 minutes?
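For concreteness, a hypothetical sketch of the two options being compared (the `model` interface and the decomposition scheme below are assumptions of the thought experiment, not an existing system): one call with a very large t, versus a tree of many 10-minute calls in a factored-cognition style.

```python
# Hypothetical interface for the thought experiment; nothing here is a real system.
def model(question: str, t_minutes: float) -> str:
    """Returns roughly what an ~average human would answer after thinking for
    t_minutes; inference cost is assumed to scale linearly with t_minutes."""
    raise NotImplementedError  # stand-in for the hypothetical trained model

def factored_answer(question: str, t_minutes: float = 10.0, depth: int = 3) -> str:
    """Answer via many short runs (factored cognition) instead of one long run."""
    if depth == 0:
        return model(question, t_minutes)
    subquestions = model(f"List subquestions that would help answer: {question}",
                         t_minutes).splitlines()
    subanswers = [factored_answer(q, t_minutes, depth - 1) for q in subquestions]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subquestions, subanswers))
    return model(f"Given:\n{context}\nAnswer: {question}", t_minutes)

# The question being asked above: is factored_answer(q) meaningfully safer than
# model(q, t_minutes=1_440 * 1_000), i.e. a single run thinking for 1,000 days?
```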

I think normally the time restriction is used as a cost-saving measure instead of as a safety-enhancing measure. That is, if we need a billion examples to train the system on, it's easier to get a billion examples of people thinking for 10 minutes than it is to get a billion examples of them thinking for 1000 days. It just needs to be long enough that meaningful cognitive work gets done (and so you aren't spending all of your time loading in or loading out context).

The bit of this that is a safety enhancing measure is that you have an honesty criterion, where the training procedure should be only trying to answer the question posed, and not trying to do anything else, and not trying to pass coded messages in the answer, and so on. This is more important than a runtime limit, since those could be circumvented by passing coded messages / whatever other thing that lets you use more than one cell.

I think also this counts as the 'Mechanical Turk' case from this post, which I don't think people are optimistic about (from an alignment perspective).

---

I think this thought experiment asks us to condition on how thinking works in a way that makes it a little weirder than you're expecting. If humans naturally use something like factored cognition, then it doesn't make much difference whether I use my 1000 days of runtime to ask my big question to a 'single run', or ask my big question to a network of 'small runs' using a factored cognition scheme. But if humans normally do something very different from factored cognition, then it makes a pretty big difference (suppose the costs for serializing and deserializing state are high, but doing it regularly gives us useful transparency / not-spawning-subsystems guarantees); the 'humans thinking for 10 minutes chained together' might have very different properties from 'one human thinking for 1000 days'. But given that those have different properties, it means it might be hard to train the 'one human thinking for 1000 days' system relative to the 'thinking for 10 minutes' system, and the fact that one easily extends to the other is evidence that this isn't how thinking works.

Thank you! I think I now have a better model of how people think about factored cognition.

the 'humans thinking for 10 minutes chained together' might have very different properties from 'one human thinking for 1000 days'. But given that those have different properties, it means it might be hard to train the 'one human thinking for 1000 days' system relative to the 'thinking for 10 minutes' system, and the fact that one easily extends to the other is evidence that this isn't how thinking works.

In the above scenario I didn't assume that humans—or the model—use factored cognition when the 'thinking duration' is long. Suppose instead that the model is running a simulation of a system that is similar (at some level of abstraction) to a human brain. For example, suppose some part of the model represents a configuration of a human brain, and during inference some iterative process repeatedly advances that configuration by a single "step". Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days); and the runtime is linear in the number of steps.

Generally, one way to make predictions about the final state of complicated physical processes is to simulate them. Solutions that do not involve simulations (or equivalent) may not even exist, or may be less likely to be found by the training algorithm.
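(A back-of-the-envelope check of the step counts above, taking the 100,000-steps-per-10-minutes rate as the example's assumption rather than a claim about real brains:)

```python
# Back-of-the-envelope check of the step counts in the comment above.
steps_per_10_minutes = 100_000                    # assumed rate from the example
steps_per_second = steps_per_10_minutes / 600     # ~167 steps per second
seconds_in_1000_days = 1_000 * 24 * 3_600         # 86,400,000 seconds
steps_for_1000_days = steps_per_second * seconds_in_1000_days
print(f"{steps_for_1000_days:.3g}")               # ~1.44e10, i.e. roughly 14 billion steps
```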

Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days); and the runtime is linear in the number of steps.

I think you get distributional shift in the mental configurations that you have access to when you run for more steps, and this means that for the ML to line up with the ground truth you either need training data from those regions of configuration-space or you need well-characterized dynamics that you could correctly identify by training on 100,000 steps.

Arithmetic has these well-characterized dynamics, for example; if you have the right architecture and train on small multiplication problems, you can also perform well on big multiplication problems, because the underlying steps are the same, just repeated more times. This isn't true of piecewise linear approximations to complicated functions, as your approximations will only be good in regions where you had lots of training data. (Imagine trying to fit x^3 with a random forest.) If there are different 'modes of thought' that humans can employ, you need either complete coverage of those modes of thought or 'functional coverage', in that the response to any strange new mental configurations you enter can be easily predicted from normal mental configurations you saw in training.
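To illustrate the x^3 point with a minimal, self-contained sketch (assuming scikit-learn is available; the ranges and sizes are arbitrary choices): a random forest fit on a bounded interval extrapolates poorly, because its predictions flatten outside the training data.

```python
# Random forests fit x^3 fine inside the training range but flatten outside it.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=(2000, 1))
y_train = x_train[:, 0] ** 3

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

x_test = np.array([[1.0], [2.0], [4.0], [8.0]])
print(forest.predict(x_test))  # roughly right in-range; stuck near 8 for x = 4 and 8
print(x_test[:, 0] ** 3)       # true values: 1, 8, 64, 512
```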

Like, consider moving in the opposite direction; if I have a model that I train on a single step from questions, then I probably just have a model that's able to 'read' questions (or even just the starts of questions). Once I want to extend this to doing 100,000 steps, I need to not just be able to read inputs, but also do something interesting with them, which probably requires not just 'more' training data from the same distribution, but data from a different, more general distribution.

Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?

Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?

I agree. We can frame this empirical uncertainty more generally by asking: What is the smallest t such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for t minutes vs. 1,000 days?

Or rather: What is the smallest t such that 'learning to generate answers that humans may give after thinking for t minutes' is not easier than 'learning to generate answers that humans may give after thinking for 1,000 days'?

I should note that, conditioned on the above scenario, I expect that labeled 10-minute-thinking training examples would be at most a tiny fraction of all the training data (when considering all the learning that had a role in building the model, including learning that produced pre-trained weights etcetera). I expect that most of the learning would be either 'supervised with automatic labeling' or unsupervised (e.g. 'predict the next token') and that a huge amount of text (and code) that humans wrote will be used; some of which would be the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).

I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and others’ attention writing up. Some of them probably are decent, but I’m not sure which ones, and the user base is probably as good as any for feedback.

So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.

(8)

In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presumably) had to be developed to make the finished product. I get that this will be hard, but I think this can be feasibly done for some of the (mostly easier) concepts, and if done really well, it could even be a better way for people to learn those concepts than actually reading about them.

I think this would be an extremely useful exercise for multiple independent reasons:

  • it's directly attempting to teach skills which I do not currently know any reproducible way to teach/learn
  • it involves looking at how breakthroughs happened historically, which is an independently useful meta-strategy
  • it directly involves investigating the intuitions behind foundational ideas relevant to the theory of agency, and could easily expose alternative views/interpretations which are more useful (in some contexts) than the usual presentations

*begins drafting longer proposal*

Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there's potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it).

In any case, I appreciate the feedback, Mr. Entworth.

Oh no, not you too. It was bad enough with just Bena.

I think we can change your username to have capital letters if you want. ;)

(5)

A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]

(6)

An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?

Re: easy-to-measure vs. hard-to-measure axis: That seems like the most obvious axis on which AI is likely to be different from humans, and it clearly does lead to bad outcomes?

(4)

A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it:

1) “adversarial” seems too broad to be that useful as a category

2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad

3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)

But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.

(7)

A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.

(3)

“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.

Concerns about mesa-optimizers are mostly concerns that "capabilities" will be robust to distributional shift while "objectives" will not be robust.

(2)

[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why” it did so (i.e., because it’s optimizing for what we want vs. optimizing for human approval for instrumental reasons). I doubt we’ll formalize this “why” anytime soon (see e.g. section 5 of this), but I think semi-formal things can be said about it upon some effort. [I thought of this independently from (1), but I think every level of the “transparency hierarchy” could have its own kind of game theory, much like the “open-source” level clearly does]

(1)

A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).

Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.

By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.

Meta: When should I use this rather than the shortform? Do we really need both?

I'm not sure what we'll end up settling on for "regular Open Threads" vs shortform. Open Threads predate shortform, but didn't create the particular feeling of a person-space like shortform does, so it seemed useful to add shortform. I'm not sure if Open Threads still provide a particular service that shortform doesn't provide.

In _this_ case, however, I think the Alignment Open Thread serves a bit of a different purpose – it's a place to spark low-key conversation between AF members. (Non-AF members can comment on LessWrong, but I think it's valuable that AF members who prefer AF to LessWrong can show up on alignmentforum.org and see some conversation that's easy to jump into.)

For your own comments: the subtle difference is that open threads are more like a market square where you can show up and start talking to strangers, and shortform is more like a conversation in your living room. If you have a preference for one of those subtle distinctions, do that I guess, and if not... dunno, flip a coin I guess. :P

Actually, now I'm confused. I just posted a shortform, but I don't see where it appears on the main page? There is "AI Alignment Posts" which only includes the "longforms" and there is "recent discussion" which only includes the comments. Does it mean nobody sees the shortform unless they open my profile?

...the subtle difference is that open threads are more like a market square where you can show up and start talking to strangers, and shortform is more like a conversation in your living room.

Hmm, this seems like an informal cultural difference that isn't really enforced by the format. Technically, people can comment on the shortform as easily as on open thread comments. So, I am not entirely sure whether everyone perceives it this way (and will continue to perceive it this way).