Thanks for providing a concrete example!
Belief propagation seems too close to the core of AI capabilities to me. I'd rather place my hope on GPT-7 not yet being all that good at accelerating AI research, and on us having significantly more time.
I also think the "drowned out in the noise" concern isn't that realistic. You ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It's plausible to me we won't see signs that make us ...
Can you link me to what you mean by John's model more precisely?
If you mean John's slop-instead-of-scheming post, I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what his concrete suggestions for what to work on are. Will do so tomorrow.
I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, aside perhaps from some math; or at least not for the parts of the problem we are actually bottlenecked on.
So like I think it's valuab...
Thanks.
True, I think your characterization of tiling agents is better. But my impression was that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "the AI solving alignment" is sorta confused, though maybe there's something in this direction, because the AI still does the search for how to preserve the alignment properties?
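To gesture at the self-trust condition I have in mind (just my rough sketch of the Vingean-reflection setup from the tiling agents literature, with Safe(a) as a placeholder for "action a preserves the alignment properties", not the official formulation): the agent would like to trust the reasoning of a successor that uses the same proof system T, i.e. something like

∀a: □_T(Safe(a)) → Safe(a)

But Löb's theorem says that whenever T ⊢ □_T(A) → A, already T ⊢ A, so a theory that trusts its own proofs in this unrestricted way would "prove" every Safe(a) outright; that's the Löbian obstacle to naive self-trust as I understand it.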
Hm, I mean, yeah, if the current bottleneck is math rather than conceptualizing what math has to be done, then ...
What kind of alignment research do you hope to speed up anyway?
For advanced-philosophy-like work (e.g. finding good formal representations for world models, or inventing logical induction), they don't seem anywhere remotely close to being useful.
My guess would be that the same holds for tiling agents theory, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there, or whether it could be combined with LLMs.)
Or what is your alignment hope in more concrete detail?
This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.
Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
My current guess:
I wouldn't expect much useful research to come from having published the ideas. They're mostly just going to be used for capabilities, and it seems like a bad idea to publish this stuff.
Sure, you can work on it, be infosec-cautious, and keep it secret. Maybe share it with a few very trusted people who m...
Hm, interesting. I'd imagine that if we get good heuristic guarantees for a system, that would basically mean that all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they won't be able to engage in RSI. But maybe I misunderstand your point? (Like, maybe you have a specific reason to believe it would be very hard to reliably predict that a subsystem is contained enough not to engage in RSI?)
(I think inner alignment is very hard and humans are currently not (nearly?) competent enough to figure out how to set...
How bottlenecked is your agenda on philosophy skill (like being good at thought experiments for deriving things like UDT, or at figuring out the right ontology for thinking about systems or problems) vs. math skill vs. other things?
Idk, that could be part of finding heuristic arguments for desirable properties of whatever an UANFSI converges to. Possibly it's easier to provide probabilistic convergence guarantees for systems that don't do FSI, so that would already give some implicit evidence. But we could also just say it's fine if FSI happens as long as we have heuristic convergence arguments; UANFSI just allows a broader class of algorithms, which might make things easier. Though I mostly don't expect we'd get FSI alignment through this indirect alignment path from UANFSI, but rather that we'd get an NFSI AI if we get some probabilistic convergence guarantees.
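(To gesture at the shape of guarantee I mean here, very roughly and with made-up notation: with G a placeholder for the desirable property and π_t the system's state after t steps of learning/self-modification, something like

Pr( π_t satisfies G for all sufficiently large t ) ≥ 1 − ε,

where the bound is backed by a heuristic argument rather than a full proof. The FSI question is then whether such an argument only goes through when the trajectory stays in the non-FSI regime, or whether it can also cover trajectories that do FSI.)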
(Also, I haven't thought about this much at all. As said, I'm trying KANSI for now.)
Thanks for writing up some of the theory of change for the tiling agents agenda!
I'd be curious about your take on the importance of the Löbian obstacle: I feel like it's important to do this research for aligning full-blown RSI-to-superintelligence, but at the same time it introduces quite a bit of extra difficulty. I'd be more excited about research (which ultimately aims for pivotal-act-level alignment) where we're fine assuming some "fixed meta level" in the learning algorithm, as long as it's general enough that the object-level AI can get very powerful. It seems to m...
Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)
From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, to achieve something that humans cannot yet do, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsk...
Hi Koen, thank you very much for writing this list!
I must say I'm skeptical that the technical problem of corrigibility, as I see it, is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet; I'm still deconfusing myself about it, and I'll likely publish a post clarifying how I see the problem within the next month.)
So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other...
Because it's much easier to speed up AI capabilities while being sloppy than it is to produce actually good alignment ideas.
If you really thi...