Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.
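To make the "complete feedback" idea concrete, here is a minimal toy sketch -- my own illustration, not the actual logical induction algorithm; all names and the price-update rule are made up for exposition. The point is just that the trainer's interface is "insert any trader you like" rather than "supply a reward signal":

```python
# Toy sketch of a "complete" feedback interface (illustrative only; this is
# NOT the actual logical induction algorithm): the trainer can insert any
# trader into a market over propositions, rather than just supplying rewards.

from typing import Callable, Dict, List

Beliefs = Dict[str, float]                       # proposition -> market price in [0, 1]
Trader = Callable[[Beliefs], Dict[str, float]]   # desired trades: proposition -> net stake

class ToyMarket:
    def __init__(self, propositions: List[str]):
        self.prices: Beliefs = {p: 0.5 for p in propositions}
        self.traders: List[Trader] = []

    def insert_trader(self, trader: Trader) -> None:
        """Complete feedback: the trainer may add *any* trader to the market."""
        self.traders.append(trader)

    def step(self, learning_rate: float = 0.01) -> None:
        """Nudge each price in the direction of the net stake placed on it."""
        for trader in self.traders:
            for prop, stake in trader(self.prices).items():
                new_price = self.prices[prop] + learning_rate * stake
                self.prices[prop] = min(1.0, max(0.0, new_price))

# Example: the trainer injects a trader that pushes one proposition toward "true".
market = ToyMarket(["shutdown_button_respected", "sky_is_blue"])
market.insert_trader(lambda prices: {"shutdown_button_respected": 1.0 - prices["shutdown_button_respected"]})
for _ in range(100):
    market.step()
```

Under "partial" feedback such as RL, by contrast, the trainer could only supply rewards for observed outcomes, and the latent propositions would only ever be shaped indirectly through whatever helps predict those observations.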

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more...

2Steve Byrnes
This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires. If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility. Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans, it’s perfectly possible for a plan to seem desirable but not plausible, or for a plan to seem plausible but not desirable. I think there are very good reasons that our brains are set up that way.
4Abram Demski
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing. As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.

  • Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
  • In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them), which means we should have agentic concerns about beliefs.
  • Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
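To illustrate the last bullet -- that agentlike things fall out of belieflike things almost for free -- here is a hedged toy sketch of my own (not anything from the LI paper): a purely question-answering oracle becomes action-guiding as soon as someone asks it which action best accomplishes a goal and then executes the answer.

```python
# Toy illustration (my own sketch, not from the LI paper): a belief-like system
# that only answers probability queries gets turned into an agent by a thin
# wrapper that asks which action best accomplishes a goal and then acts on it.

from typing import List

class EpistemicOracle:
    """Stand-in for any belief-like system that estimates probabilities."""
    def probability(self, proposition: str) -> float:
        return 0.5  # hypothetical stub; a real system returns a learned estimate

def agent_from_oracle(oracle: EpistemicOracle, goal: str, actions: List[str]) -> str:
    """Pick the action the oracle thinks is most likely to achieve the goal."""
    return max(actions, key=lambda a: oracle.probability(f"if I do {a!r}, then {goal}"))

# A single user query like this already makes the oracle's answers action-guiding,
# which is why answer policies matter even for "purely epistemic" systems.
chosen = agent_from_oracle(EpistemicOracle(), "the deployment finishes on time",
                           ["do nothing", "page the on-call engineer"])
```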

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

[I learned the term teleosemantics from you!  :) ]

The original LI paper was in that category, IIUC. The updates (i.e., which traders end up with more vs less money) are derived from mathematical propositions being true vs false.
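As a toy contrast (my own gloss on the distinction, not code from either framework): a "purely epistemic" update redistributes weight according to which hypotheses predicted what turned out to be true, whereas a reward-driven update redistributes weight according to a goodness signal attached to whatever was done.

```python
# Toy contrast (my gloss, not actual LI or RL code): truth-driven updates vs
# reward-driven updates over a set of weighted hypotheses/policies.

from typing import Dict

def epistemic_update(weights: Dict[str, float],
                     predictions: Dict[str, Dict[str, float]],
                     truth: Dict[str, bool],
                     lr: float = 0.1) -> Dict[str, float]:
    """Pay each hypothesis according to how well it predicted what was TRUE."""
    return {h: w + lr * sum(1.0 - abs(predictions[h][p] - float(truth[p])) for p in truth)
            for h, w in weights.items()}

def reward_update(weights: Dict[str, float], chosen: str, reward: float,
                  lr: float = 0.1) -> Dict[str, float]:
    """Pay the policy that acted, according to how GOOD the outcome was."""
    return {h: w + (lr * reward if h == chosen else 0.0) for h, w in weights.items()}
```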

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they do... (read more)

2Abram Demski
Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.
This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but never pulled together into a single “comprehensive worldview” doc. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

  • If you do agree with us, please point out where you
...
1Vojtech Kovarik
After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this". That seems fine to me, given that (as far as I am concerned) no detailed convincing arguments for AI X-risk exist. But at the moment, the summary of the document gives me the impression that maybe some such argument will appear. So I suggest updating the summary (or some other part of the doc) to make it explicit that no detailed argument for AI X-risk will be given.

Thanks for the comment!

We have indeed gotten feedback from multiple people that this part didn't feel detailed enough (although much more from very technical readers than from non-technical ones), and we are working on improving the arguments.

1Vojtech Kovarik
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):

  • Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).
  • "Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work better.
  • Typo, extra "is".
  • Perhaps rewrite this for less antagonistic language? I know it is a quote and all, but still. (This can be interpreted as "the people building AI are evil and trying to cause harm on purpose". That seems false. And including this in the writing is likely to give the reader the impression that you don't understand the situation with AI, and stop reading.)
  • Perhaps (1) make it apparent that the first thing is a quote and (2) change the second sentence to "If you do nothing, our story gets a bad ending, and that's it.". Or just rewrite the whole thing.
2Adam Shimi
Thanks for the comment! We'll correct the typo in the next patch/bug fix. As for the more direct adversarial tone of the prologue, it is an explicit choice (and is contrasted by the rest of the document). For the moment, we're waiting to get more feedback on the doc to see if it really turns people off or not.

Sometimes people talk about two AIs being “the same” or “different” AIs. We think the intuitive binary of “same vs. different” conflates several concepts which are often better to disambiguate. In this post, we spell out some of these distinctions. We don’t think anything here is particularly novel; we wrote this post because we think it’s probably mildly helpful for people to think this through, and because Buck fairly often wants to reference it.

Some example distinctions:

  • In a debate protocol[1], people often refer to the judge and debaters as three “different” AIs because their tokens are subject to different training incentives. But you could also say that the two debaters are “the same AI” because their prompts and context look similar. Or you could say that all three
...

All of these operationalizations are about exact notions from the training setup.

Another important notion is revealed identity:

  • Does the AI think of the other AI as "itself" to the extent that it thinks about stuff like this at all?
  • Do the AIs cooperate in a way which is reasonably similar to cooperating with yourself?
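As a compact way to keep these distinctions separate (my own sketch, not a framework from the post), one could record each dimension of "sameness" explicitly instead of collapsing them into a single same-vs-different binary; the field names and example values below are purely illustrative.

```python
# Sketch (my own, not the post's framework): track each dimension of "sameness"
# separately rather than as a single same/different binary. Field names and the
# example values are purely illustrative.

from dataclasses import dataclass

@dataclass
class AISameness:
    same_weights: bool               # same parameters at deployment time?
    same_training_incentives: bool   # tokens trained against the same objective?
    same_prompt_distribution: bool   # do their prompts/contexts look alike?
    treats_other_as_self: bool       # revealed identity: does it think of the other as "itself"?

# Hypothetical reading of the debate example: two debaters vs. debater-and-judge.
debater_vs_debater = AISameness(same_weights=True, same_training_incentives=True,
                                same_prompt_distribution=True, treats_other_as_self=True)
debater_vs_judge = AISameness(same_weights=True, same_training_incentives=False,
                              same_prompt_distribution=False, treats_other_as_self=False)
```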

I've been accepted as a mentor for the next AI Safety Camp. You can apply to work with me on the tiling problem. The goal will be to develop reflectively consistent UDT-inspired decision theories, and try to prove tiling theorems for them.

The deadline for applicants is November 17.

The program will run from January 11 to April 27. It asks for a 10 hour/week commitment.

I am not being funded for this.[1] You can support my work on Patreon.

My project description follows:

Summary

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those...
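The preview cuts off mid-definition, so for readers new to the term, here is the usual shape of the tiling condition -- my paraphrase of the standard notion, not a quote from the project description:

```latex
% My paraphrase of the standard tiling condition (not a quote from the post).
% Read "A \rightsquigarrow A'" as "predecessor A deliberately constructs, or
% self-modifies into, successor A'".
\[
  P \text{ tiles} \quad\iff\quad
  \forall A, A' :\; \bigl( P(A) \,\wedge\, A \rightsquigarrow A' \bigr) \;\Rightarrow\; P(A')
\]
```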

3Abram Demski
I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

Hm, interesting. I mean, I'd imagine that if we get good heuristic guarantees for a system, that would basically mean that all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they won't be able to engage in RSI. But maybe I misunderstand your point? (Like, maybe you have a specific reason to believe that it would be very hard to reliably predict that a subsystem is contained enough not to engage in RSI?)

(I think inner alignment is very hard and humans are currently not (nearly?) competent enough to figure out how to set... (read more)

1Towards_Keeperhood
How bottlenecked is your agenda by philosophy skills (like being good at thought experiments for deriving stuff like UDT, or like being good at figuring out the right ontology for thinking about systems or problems) vs math skill vs other stuff?
3Abram Demski
I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.

This post is a follow-up to "why assume AGIs will optimize for fixed goals?".  I'll assume you've read that one first.

I ended the earlier post by saying:

[A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be.  An AGI "like me" might be morally uncertain like I am, persuadable through dialogue like I am, etc.

It's very important to know what kind of AIs would or would not have the wrapper structure, because this makes the difference between "inevitable world-ending nightmare" and "we're not the dominant species anymore."  The latter would be pretty bad for us too, but there's a difference!

In other words, we should try very hard to avoid creating new superintelligent agents that have the "wrapper structure."
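For concreteness, here is my own hypothetical rendering of a "wrapper-structured" agent, based on my reading of the earlier post rather than anything spelled out there: a single fixed goal sits at the top, and every other faculty is purely instrumental to it.

```python
# Hypothetical rendering (my own reading, not the post's definition) of a
# "wrapper-structured" agent: one fixed goal wraps everything else, and only
# the instrumental parts (beliefs, plans) ever get revised.

class WrapperAgent:
    def __init__(self, fixed_utility, world_model, planner):
        self.fixed_utility = fixed_utility  # never updated, never up for debate
        self.world_model = world_model      # instrumental: exists only to serve the goal
        self.planner = planner              # instrumental: exists only to serve the goal

    def act(self, observation):
        self.world_model.update(observation)            # beliefs can change...
        plans = self.planner.propose(self.world_model)

        def score(plan):
            # ...but the criterion they are scored against never does.
            return self.fixed_utility(self.world_model.predict(plan))

        return max(plans, key=score)
```

An agent "like me", by contrast, would also be able to revise `fixed_utility` itself through reflection or dialogue.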

What about...

I continue to think this is a great post. Part of why I think that is that I haven't forgotten it; it keeps circling back into my mind.

Recently this happened and I made a fun connection: What you call wrapper-minds seem similar to what Plato (in The Republic) calls people-with-tyrannical-souls, i.e. people whose minds are organized the way a tyrannical city is organized, with a single desire/individual (or maybe a tiny junta) in total control, and everything else subservient.

I think the concepts aren't exactly the same though -- Plato would have put more e... (read more)

Subhash and Josh are co-first authors. Work done as part of the two-week research sprint in Neel Nanda’s MATS stream.

TLDR

  • We show that dense probes trained on SAE encodings are competitive with traditional activation probing over 60 diverse binary classification datasets
  • Specifically, we find that SAE probes have advantages with:
    • Low data regimes (~ < 100 training examples)
    • Corrupted data (i.e. our training set has some incorrect labels, while our test set is clean)
    • Settings where we worry about the generalization of our probes due to spurious correlations in our dataset or possible mislabels (dataset interpretability) or if we want to understand SAE features better (SAE interpretability).
  • We find null results when comparing SAE probes to activation probes with OOD data in other settings or with imbalanced classes.
  • We find that higher width and L0 are determining
...
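A minimal sketch of the comparison described in the TLDR above, under assumed interfaces: the SAE encoder weights, the data shapes, and all function names are placeholders rather than the authors' actual code.

```python
# Minimal sketch of the probing comparison described above. The SAE encoder
# weights, data shapes, and function names are placeholders, not the authors' code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sae_encode(activations: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Standard SAE encoder form, ReLU(x @ W_enc + b_enc); weights come from a pre-trained SAE."""
    return np.maximum(activations @ W_enc + b_enc, 0.0)

def train_probe(features: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Dense linear probe on whatever representation it is given."""
    return LogisticRegression(max_iter=1000).fit(features, labels)

def compare_probes(activations: np.ndarray, labels: np.ndarray,
                   W_enc: np.ndarray, b_enc: np.ndarray, n_train: int = 100) -> dict:
    """Low-data regime: fit both probes on n_train examples and score on the rest."""
    sae_features = sae_encode(activations, W_enc, b_enc)
    act_probe = train_probe(activations[:n_train], labels[:n_train])
    sae_probe = train_probe(sae_features[:n_train], labels[:n_train])
    return {"activation_probe_acc": act_probe.score(activations[n_train:], labels[n_train:]),
            "sae_probe_acc": sae_probe.score(sae_features[n_train:], labels[n_train:])}
```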