Tor Barstad — AI Alignment Forum

This feels to me like very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.

Prior to plausible AGI/FOOM I would box it in really carefully, and I only interact with it in ways where it's expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems for the "council".

I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it’s possible to set up things in such a way that it would be very hard for the system to "pretend" like it’s doing what I want it to do without actually doing it. There are several techniques I would use to achieve this, but one of them is that I often would ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and that have architectures that make it easy to understand and verify behavior). So when for example getting it to generate simulations, I would get from it a simulation-generator (or simulation-generator-generator), and I would test its accuracy against real-world-data.

Here is a draft for a text where I try to explain myself in more detail, but it's not finished yet: https://docs.google.com/document/d/1INu33PIiRZbOBYjRul6zsCwF98z0l25V3pzMRJsYC_I/edit

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments