All of Towards_Keeperhood's Comments + Replies

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

Because it is much easier to speed up AI capabilities while being sloppy than it is to produce actually good alignment ideas.

If you really thi... (read more)

Steve Byrnes
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much more easily than they can produce actually good alignment ideas.” …And then if you follow through the “logic” of this OP, the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smart AIs right now, so that they can solve the alignment problem.” See the error? I don’t think that. See the bottom part of the comment you’re replying to (the part after “Here’s what I would say instead:”).

Thanks for providing a concrete example!

Belief propagation seems like too core a piece of AI capability to me. I'd rather place my hope on GPT7 not yet being all that good at accelerating AI research, and on us having significantly more time.

I also think the "drowned out in the noise" scenario isn't that realistic. You ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It's plausible to me we won't see signs that make us ... (read more)

Abram Demski
Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we're assuming scheming is not the failure mode to worry about in the shorter term.) So, improved rationality for AIs seems similarly good. The claim I'm considering is that even improving the rationality of AIs by a lot could be good, if we could do it. An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!

This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)

Can you link me to what you mean by John's model more precisely?

If you mean John's slop-instead-of-scheming post, I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.

I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on.

So like I think it's valuab... (read more)

Abram Demski
Concrete (if extreme) story:

World A: Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts. Keep the information secret in order to avoid pushing capabilities forward. Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point, if you try to publish "belief propagation", it gets drowned out in the noise anyway. Some relatively short time later, there are no humans.

World B: Invent LLM "belief propagation" and publish it. It is good enough (by assumption) to be the new paradigm for reasoning models, supplanting current reinforcement-centric approaches. Two years later, GPT7 is assessing its safety proposals realistically instead of convincingly arguing for them. Belief propagation allows AI to facilitate a highly functional "marketplace of ideas" where the actually-good arguments tend to win out far more often than the bad arguments. AI progress is overall faster, but significantly safer.

(This story of course assumes that "belief propagation" is an unrealistically amazing insight; still, it points in the direction I'm getting at.)
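(An aside for readers unfamiliar with the term: the name is borrowed from classical belief propagation on probabilistic graphical models. A minimal sketch of the classical sum-product message passing — purely to pin down the borrowed metaphor, not the hypothesized LLM technique:)

```python
import numpy as np

# Sum-product belief propagation on a tiny two-variable chain:
# p(a, b) proportional to phi_a(a) * psi(a, b) * phi_b(b), both variables binary.
# All potentials here are made-up illustrative numbers.
phi_a = np.array([0.6, 0.4])           # unary potential on A
phi_b = np.array([0.3, 0.7])           # unary potential on B
psi = np.array([[0.9, 0.1],
                [0.2, 0.8]])           # pairwise potential psi[a, b]

# Message from B to A: sum over b of psi(a, b) * phi_b(b).
msg_b_to_a = psi @ phi_b

# Belief (unnormalized marginal) at A, then normalize.
belief_a = phi_a * msg_b_to_a
belief_a /= belief_a.sum()

# Brute-force marginal for comparison: on a tree, BP is exact.
joint = phi_a[:, None] * psi * phi_b[None, :]
marginal_a = joint.sum(axis=1) / joint.sum()

assert np.allclose(belief_a, marginal_a)
print(belief_a)
```

On tree-structured graphs this message passing recovers exact marginals; the metaphor in the comment is about an analogous mechanism making an LLM's knowledge consistent across contexts.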

Thanks.

True, I think your characterization of tiling agents is better. But my impression was sort of that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "the AI solving alignment" is sort of confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?

Hm I mean yeah if the current bottleneck is math instead of conceptualizing what math has to be done then ... (read more)

Abram Demski
I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.

What kind of alignment research do you hope to speed up anyway?

For advanced-philosophy-like work (e.g. finding good formal representations for world models, or inventing logical induction), current AIs don't seem anywhere remotely close to being useful.

My guess would be that the same holds for tiling agents theory, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there and whether it could be combined with LLMs.)

Or what is your alignment hope in more concrete detail?

Abram Demski
Yeah, my sense is that modern AI could be useful for tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse than useless, due to the way that they try to create good-looking answers at the expense of correctness.

I disagree with the statement "to some extent the goal of tiling-agents-like work was to have an AI solve its own alignment problem" -- the central thing is to understand conditions under which one agent can justifiably trust another (with "trust" operationalized as whether one agent wants to modify the decision procedure of the other). If AI can't justifiably trust itself, then it has a potential motive to modify itself in ways that remove safety guarantees (so in this sense, tiling is a precondition for lots of safety arguments). Perhaps more importantly, if we can understand conditions under which humans can justifiably trust AI, then we have a formal target for alignment.

This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.

Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.

My current guess:

I wouldn't expect much useful research to come from publishing the ideas. They'd mostly just be used for capabilities, and it seems like a bad idea to publish such stuff.

Sure you can work on it and be infosec cautious and keep it secret. Maybe share it with a few very trusted people who m... (read more)

Abram Demski
Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?

Hm interesting. I mean I'd imagine that if we get good heuristic guarantees for a system it would basically mean that all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they won't be able to engage in RSI. But maybe I misunderstand your point? (Like maybe you have specific reason to believe that it would be very hard to predict reliably that a subsystem is contained enough to not engage in RSI or so?)

(I think inner alignment is very hard and humans are currently not (nearly?) competent enough to figure out how to set... (read more)

How bottlenecked is your agenda by philosophy skill (like being good at thought experiments for deriving things like UDT, or at figuring out the right ontology for thinking about systems or problems) vs. math skill vs. other things?

Abram Demski
I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.

Idk, that could be part of finding heuristic arguments for desirable properties of what a UANFSI converges to. Possibly it's easier to provide probabilistic convergence guarantees for systems that don't do FSI, so this would already give some implicit evidence. But we could also just say that it's fine if FSI happens as long as we have heuristic convergence arguments -- like, UANFSI just allows for a broader class of algorithms, which might make things easier -- though I mostly don't expect we'd get FSI alignment through this indirect alignment path from UANFSI, but rather that we'd get an NFSI AI if we get some probabilistic convergence guarantees.

(Also I didn't think much about it at all. As said I'm trying KANSI for now.)

Abram Demski
I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

Thanks for writing up some of the theory of change for the tiling agents agenda!

I'd be curious about your take on the importance of the Löbian obstacle: I feel like it's important to do this research for aligning full-blown RSI-to-superintelligence, but at the same time it introduces quite a bit of extra difficulty, and I'd be more excited about research (which ultimately aims for pivotal-act-level alignment) where we're fine assuming some "fixed meta level" in the learning algorithm, but one general enough that the object-level AI can get very powerful. It seems to m... (read more)

Abram Demski
How would you become confident that a UANFSI approach was NFSI?
Abram Demski
I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough, and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true).

I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.

I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.
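(For readers who want the contrast spelled out, reading $\Box P$ as "$P$ is provable": Löb's Theorem requires a stronger self-referential hypothesis than Payor's Lemma does.)

```latex
% Löb's Theorem: if the system proves "provability of P implies P",
% then it proves P outright.
\text{If } \vdash \Box P \rightarrow P, \text{ then } \vdash P.

% Payor's Lemma: the hypothesis only needs to hold under a box,
% which avoids the self-referential sentence used in Löb's proof.
\text{If } \vdash \Box(\Box P \rightarrow P) \rightarrow P, \text{ then } \vdash P.
```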

Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)

From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. For example, to achieve something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsk... (read more)

Koen Holtman
I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many of the things you want to use AI for. But I guess you have to read the paper to find out the exact, subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia: I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one's conduct by its consequences in what world, exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.

Hi Koen, thank you very much for writing this list!

I must say I'm skeptical that the technical problem of corrigibility, as I see it, is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet; I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem as I see it within the next month.)

So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other... (read more)

Koen Holtman
Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom: I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions. I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentialist.

I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.

If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability.

As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to, say, brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility. Other parts of AGI alignment may be very expensive, e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment ca