All of Logan Zoellner's Comments + Replies

Absolutely.  I don't think it's impossible to build such a system.  In fact, I think a transformer is probably about 90% there.  You'd need to add trial and error, some kind of long-term memory/fine-tuning, and a handful of default heuristics.  Scale will help too, but no amount of scale alone will get us there.

1AnthonyC
I agree that filling a context window with worked sudoku examples wouldn't help for solving hidouku. But there is a common element here to the games. Both look like math, but aren't about numbers except that there's an ordered sequence. The sequence of items could just as easily be an alphabetically ordered set of words. Both are much more about geometry, or topology, or graph theory, for how a set of points is connected. I would not be surprised to learn that there is a set of tokens, containing no examples of either game, combined with a checker (like your link has) that points out when a mistake has been made, that enables solving a wide range of similar games.

I think one of the things humans do better than current LLMs is that, as we learn a new task, we vary what counts as a token and how we nest tokens. How do we chunk things? In sudoku, each box is a chunk, each row and column are a chunk, the board is a chunk, "sudoku" is a chunk, "checking an answer" is a chunk, "playing a game" is a chunk, and there are probably lots of others I'm ignoring.

I don't think just prompting an LLM with the full text of "How to solve it" in its context window would get us to a solution, but at some level I do think it's possible to make explicit, in words and diagrams, what it is humans do to solve things, in a way legible to it. I think it largely resembles repeatedly telescoping in and out, to lower and higher abstractions, applying different concepts and contexts, locally sanity checking ourselves, correcting locally obvious insanity, and continuing until we hit some sort of reflective consistency. Different humans have different limits on what contexts they can successfully do this in.

In the technical sense that you can implement arbitrary programs by prompting an LLM (they are Turing complete), sure.

In a practical sense, no.

GPT-4 can't even play tic-tac-toe.  Manifold spent a year trying to get GPT-4 to implement (much less discover) the algorithm for Sudoku and failed.

Now imagine trying to implement a serious backtracking algorithm.  Stockfish checks millions of positions per turn of play.  The attention window for your "backtracking transformer" is going to have to be at least {size of chess board state}*{number of position... (read more)

1Matt Goldenberg
The question is - how far can we get with in-context learning.  If we filled Gemini's 10 million tokens with Sudoku rules and examples, showing where it went wrong each time, would it generalize? I'm not sure but I think it's possible

Obvious bait is obvious bait, but here goes.

Transformers are not AGI because they will never be able to "figure something out" the way humans can.

If a human is given the rules for Sudoku, they first try filling in the square randomly.  After a while, they notice that certain things work and certain things don't work.  They begin to define heuristics for things that work (for example, if all but one number appears in the same row or column as a box, that number goes in the box).  Eventually they work out a complete algorithm for solving Sudok... (read more)
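For concreteness, the kind of "complete algorithm" a human eventually converges on is roughly constraint checking plus backtracking. A minimal sketch in Python (illustrative only; not anything GPT-4 or the Manifold market produced):

```python
# Minimal backtracking Sudoku solver: the sort of algorithm a human eventually
# writes down after the heuristics stop being enough.
def candidates(grid, r, c):
    """Digits that don't conflict with row r, column c, or the 3x3 box."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid):
    """Fill zeros in-place; return True if a full solution is found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0   # undo and backtrack
                return False          # no digit fits: dead end
    return True                       # no empty cells left
```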

4Abram Demski
Yeah, I didn't do a very good job in this respect. I am not intending to talk about a transformer by itself. I am intending to talk about transformers with the sorts of bells and whistles that they are currently being wrapped with. So not just transformers, but also not some totally speculative wrapper.
1Matt Goldenberg
It seems likely to me that you could create a prompt that would have a transformer do this.

"reality is large" is a bad objection.

It's possible in principle to build a simulation that is literally indistinguishable from reality.  Say we only run the AI in simulation for 100 million years, and there's a simulation overhead of 10x.  That should cost about (100e6 ly)^3 * (100e6 years) * 10 of our future lightcone.  This is a minuscule fraction of our actual future lightcone, roughly (9.4e10 ly)^3 * (10^15 y).
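Back-of-the-envelope check, reading both figures as (spatial radius)^3 * (time span) in ly^3 * years (my assumption about how the numbers above are meant):

```python
# Rough ratio of simulation cost to future lightcone, using the figures above.
sim_cost  = (100e6 ** 3) * 100e6 * 10    # 100M-ly box, 100M years, 10x overhead
lightcone = (9.4e10 ** 3) * 1e15         # ~9.4e10 ly, ~1e15 years
print(sim_cost / lightcone)              # ~1e-15: a minuscule fraction
```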

A few better objections:

Simulating a universe with a paperclip maximizer in it means simulating billions of people being murdered and turned into pa... (read more)

Let's take a concrete example.  

Assume you have an AI that could get 100% on every Putnam test.  Do you think it would be reasonable or not to assume such an AI would also display superhuman performance at solving the Yang-Mills Mass Gap?

1Donald Hobson
Producing machine-verifiable formal proofs is an activity somewhat amenable to self-play. To the extent that some parts of physics are reducible to ZFC oracle queries, maybe AI can solve those. To do something other than produce ZFC proofs, the AI must be learning what real, in-practice maths looks like. To do this, it needs large amounts of human-generated mathematical content. It is plausible that the translation from formal maths to human maths is fairly simple, and that there are enough maths papers available for the AI to roughly learn it.

The Putnam archive consists of 12 questions × 20 years = 240 questions, spread over many fields of maths. This is not big data. You can't train a neural net to do much with just 240 examples. If aliens gave us a billion similar questions (with answers), I don't doubt we could make an AI that scores 100% on Putnam. Still, it is plausible that enough maths could be scraped together to roughly learn the relation from ZFC to human maths. And such an AI could be fine-tuned on some dataset similar to Putnam, and then do well on Putnam (especially if the examiner is forgiving of strange formulaic phrasings).

The Putnam problems are unwooly. I suspect such an AI couldn't take in the web page you linked and produce a publishable paper solving the Yang-Mills mass gap. Given a physicist who understood the question, and was also prepared to dive into ZFC (or Lean or some other formal system) formulae, then I suspect such an AI could be useful. If the physicist doesn't look at the ZFC, but is doing a fair bit of hand holding, they probably succeed. I am assuming the AI is just magic at ZFC; that's self-play. The thing I think is hard to learn is the link from the woolly gesturing to the ZFC.

So with a physicist there to be more unambiguous about the question, and to cherrypick and paste together the answers, and generally polish a mishmash of theorems into a more flowing narrative, that would work. I'm not sure how much hand holding w
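For a sense of what "machine verifiable formal proof" means here, a trivial example in Lean (illustrative only; the comment talks about ZFC, but the point is the same: the proof checker, not a human, decides correctness, which is what makes self-play possible):

```lean
-- A kernel-checked proof: correctness is decided mechanically, with no human
-- judgment involved, which is what makes this domain amenable to self-play.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```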

This doesn't include working out advances in fundamental physics, or designing a fusion reactor, or making breakthroughs in AI research.

Why don't all of these fall into the self-play category?  Physics, software and fusion reactors can all be simulated.  

I would be mildly surprised if a sufficiently large language model couldn't solve all of Project Euler + Putnam + the MATH dataset.

1Donald Hobson
Physics can be simulated, sure. When a human does a simulation, they are trying to find out useful information. When a neural net is set the same task, it is trying to game the system. The human is actively optimizing for regions where the simulation is accurate, and if needed, will adjust the parameters of the simulation to improve accuracy. The AI is actively trying to find a design that breaks your simulation. Designing a simulation broad enough to contain the range of systems a human engineer might consider, and accurate enough that a solution in the simulation is likely to be a solution in reality, and efficient enough that the AI can blindly thrash towards a solution with millions of trials, that's hard.

Yes, software can be simulated. Software is a discrete domain. One small modification from highly functioning code usually doesn't work at all. Training a state-of-the-art AI takes a lot of compute. Evolution has been in a position where it was optimizing for intelligence many times. Sure, sometimes it produces genuine intelligence; often it produces a pile of hard-coded special-case hacks that kind of work. Telling if you have an AI breakthrough is hard. Performance on any particular benchmark can be gamed with a Heath Robinson contraption of special cases.

Existing quantum field theory can kind of be simulated, on one proton, at huge computational cost, and using a bunch of computational speed-up tricks specialized to those particular equations. Suppose the AI proposes an equation of its new multistring harmonic theory. It would take a team of humans years to figure out a computationally tractable simulation. But ignore that and magically simulate it anyway. You now have a simulation of multistring harmonic theory. You set it up with a random starting position and simulate. Let's say you get a proton. How do you recognise that the complicated combination of knots is indeed a proton? You can't measure its mass, mass isn't fundamental in multistring harmonic theory. Mass is just
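A toy illustration of the "AI breaks your simulation" point (my own example, not Hobson's): a simulator built on the small-angle approximation sin(x) ≈ x looks best to a blind optimizer exactly where it is least accurate:

```python
import math

# Toy "simulator" that uses the small-angle approximation sin(x) ~= x,
# which is only accurate for small x.
def simulated_performance(x):
    return x

def real_performance(x):
    return math.sin(x)

# A blind optimizer sweeping candidate "designs" picks whatever scores best
# in the simulator...
designs = [i * 0.1 for i in range(1, 100)]
x_star = max(designs, key=simulated_performance)

print(x_star)                          # ~9.9, far outside the valid regime
print(simulated_performance(x_star))   # ~9.9 according to the simulator
print(real_performance(x_star))        # ~-0.46 in "reality"
# ...i.e. the optimizer's favourite design is precisely where the simulator breaks.
```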

I strongly doubt we live in a data-limited AGI timeline

  1. Humans are trained using much less data than Chinchilla (rough numbers sketched below)
  2. We haven't even begun to exploit forms of media other than text (YouTube alone is >2 OOM bigger)
  3. Self-play allows for literally limitless amounts of data
  4. Regularization methods mean data constraints aren't nearly as important as claimed
  5. In the domains where we have exhausted available data, ML models are already weakly superhuman
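On point 1, a rough back-of-the-envelope comparison (the Chinchilla figure is the published ~1.4T training tokens; the human figure is a loose order-of-magnitude guess on my part):

```python
# Order-of-magnitude comparison only; the human number is a rough guess.
chinchilla_tokens = 1.4e12   # ~1.4 trillion training tokens (Hoffmann et al., 2022)
human_words       = 5e8      # very roughly, a few hundred million words heard/read
                             # over a couple of decades
print(chinchilla_tokens / human_words)   # ~3000x: Chinchilla sees far more text
```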
2Donald Hobson
AIXI, trained on all of Wikipedia, would be vastly superhuman and terrifying. I don't think we are anywhere close to fundamental data limits. I think we might be closer to the limits of current neural network technology.

Sure, video files are bigger than text files. Yes, self-play allows for limitless amounts of data, which is why AI can absolutely be crazy good at Go.

My model has AIs that are pretty good, potentially superhuman, at every task where we can give the AI a huge pile of relevant data. This does include generating short clickbait videos. This doesn't include working out advances in fundamental physics, or designing a fusion reactor, or making breakthroughs in AI research. I think AIXI trained on Wikipedia would be able to do all those things. But I don't think the next neural networks will be able to.
3George Wang
I think it's more fair to say humans were "trained" over millions of years of transfer learning, and an individual human is fine tuned using much less data than Chinchilla.

I’m not quite sure what you mean here.

 

In the standard picture of a reinforcement learner, suppose you get to specify the reward function and I get to specify the "agent".  No matter what reward function you choose, I claim I can make an agent that both: 1) gets a huge reward compared to some baseline implementation and 2) destroys the world.  In fact, I think most "superintelligent" systems have this property for any reward function you could specify using current ML techniques.

Now switch the order: I design the agent first and ask you for an a... (read more)

2Steve Byrnes
I think it’s impossible to try to reason about what an RL agent would do solely on the basis of knowing its reward function, without knowing anything else about how the RL agent works, e.g. whether it’s model-based vs model-free, etc. (RL is a problem statement, not an algorithm. Not only that, but RL is “(almost) the most general problem statement possible”!) I think we’re in agreement on that point. But that point doesn’t seem to be too relevant in this context. After all, I specified “neocortex / hippocampus / striatum / etc.-like learning algorithms”. My previous reply linked an extensive discussion of what I think that actually means. So I’m not sure how we wound up on this point. Oh well.

In your second paragraph:

* If I interpret “useful” in the normal sense (“not completely useless”), then your claim seems true and trivial. Just make it a really weak agent (but not so weak that it’s 100% useless).
* If I interpret “useful” to mean “sufficiently powerful as to reach AGI”, then you would seem to be claiming a complete solution to AGI safety, and I would reply that I’m skeptical, and interested to see details.

What loss function(s), when sent into a future AI’s brain-like configuration of neocortex / hippocampus / striatum / etc.-like learning algorithms, will result in an AGI that is definitely not trying to literally exterminate humanity?

 

Specifying a correct loss function is not the right way to think about the Alignment Problem.  A system's architecture matters much more than its loss function for determining whether or not it is dangerous.  In fact, there probably isn't even a well-defined loss function that would remain aligned un... (read more)

1Steve Byrnes
Where we probably agree:

* I enthusiastically endorse keeping in mind the possibility that the correct answer to the question you excerpted is “Haha trick question, there is no such loss function.”
* I enthusiastically endorse having an open mind to any good ideas that we can think of to steer our future AGIs in a good direction, including things unrelated to loss functions, and including things that are radically different from anything in the human brain. For example in this post I talk about lots of things that are not “choosing the right loss function”.

As for your link, I disagree that “specifying the right loss function” is equivalent to “writing down the correct utility function”. I’m not sure it makes sense to say that humans have a utility function at all, and if they do, it would be full of learned abstract concepts like “my children will have rich fulfilling lives”. But we definitely have loss functions in our brain, and they have to be specified by genetically-hardcoded circuitry that (I claim) cannot straightforwardly refer to complicated learned abstract concepts like that.

I’m not quite sure what you mean here. If “architecture” means 96 transformer layers versus 112 transformer layers, then I don’t care at all. I claim that the loss function is much more important than that for whether the system is dangerous. Or if “architecture” means “There’s a world-model updated by self-supervised learning, and then there’s actor-critic reinforcement learning, blah blah blah”, then yes this is very important, but it’s not unrelated to loss functions—the world-model’s loss function would be sensory prediction error, the critic’s loss function would be reward prediction error, etc. Right?

I think I would say “maybe” where you say “probably”. I think it’s an important open question. I would be very interested to know one way or the other. I think humans are an interesting case study. Almost all humans do not want to literally exterminate humanity. If a human were

I think you're confounding two questions:

  1. Does AIHHAI accelerate AI?
  2. If I observe AIHHAI does this update my priors towards Fast/Slow Takeoff?

 

I think it's pretty clear that AIHHAI accelerates AI development (without Copilot, I would have to write all those lines myself).

 

However, I think that observing AIHHAI should actually update your priors towards Slow Takeoff (or at least Moderate Takeoff).  One reason is that humans are inherently slower than machines, and, as Amdahl reminds us, if something is composed of a slow thing and a fast thing... (read more)
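To make the Amdahl point concrete (a sketch with illustrative numbers, not a claim about actual AI-research workflows):

```python
# Amdahl's law: overall speedup when a fraction p of the work is sped up by factor s.
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# If the AI handles 80% of the work and becomes 1000x faster, the remaining
# human-paced 20% caps the whole pipeline at ~5x.
print(amdahl_speedup(p=0.8, s=1000))   # ~4.98
print(1 / (1 - 0.8))                   # the hard ceiling as s -> infinity: 5.0
```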

1Michaël Trazzi
Well, I agree that if the two worlds I had in mind were 1) foom without real AI progress beforehand and 2) continuous progress, then seeing more continuous progress from increased investments should indeed update me towards 2).

The key parameter here is substitutability between capital and labor: in what sense is human labor the bottleneck, and in what sense is capital the bottleneck? From the different substitutability equations you can infer different growth trajectories. (For a paper / video on this see the last paragraph here.)

The world in which DALL-E 2 happens and people start using GitHub Copilot looks to me like a world where human labour is substitutable by AI labour, which right now essentially means being part of the GitHub Copilot open beta, but in the future might look like capital (paying for the product or investing in building the technology yourself). My intuition right now is that big companies are more bottlenecked by ML talent than by capital (cf. the "are we in ai overhang" post explaining how much more capital Google could invest in AI).