There are some things that work surprisingly well in AI. For example, AI that transfers the style of one image onto the content of another. Why is the approach described here a hack? It starts with a neural net trained to classify images. It then runs gradient descent on the content image, trying to get the covariance (Gram) matrices of the style image's activations to match in the early network layers, while trying to keep the later-layer activations of the original content image and the style-transfer image as similar as possible.

So the approximations that I think make this work are: in classifiers, the early layers tend to store simple features, and the later layers tend to hold more complex features. Style is based on the simpler features and doesn't depend on the location within the image. Content is based on more complex features and does depend on the location in the image.
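To make the hack concrete, here is a minimal sketch of that objective (in the style of Gatys et al.), assuming a pretrained torchvision VGG19 classifier; the layer indices and style weight are illustrative choices, not anything canonical.

```python
# Minimal sketch of the style-transfer objective described above.
# Assumes a pretrained torchvision VGG19; layer indices and weights are illustrative.
import torch.nn.functional as F
from torchvision.models import vgg19

features = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in features.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = [0, 5, 10]   # early layers: simple, location-independent features
CONTENT_LAYERS = [21]       # a later layer: complex, location-dependent features

def activations(img):
    """Run the classifier and collect activations at the chosen layers."""
    acts, x = {}, img
    for i, layer in enumerate(features):
        x = layer(x)
        if i in STYLE_LAYERS or i in CONTENT_LAYERS:
            acts[i] = x
    return acts

def gram(act):
    """Covariance-like (Gram) matrix over spatial positions: keeps style, discards location."""
    b, c, h, w = act.shape
    flat = act.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_transfer_loss(generated, content_img, style_img, style_weight=1e6):
    g = activations(generated)
    c = activations(content_img)
    s = activations(style_img)
    style_loss = sum(F.mse_loss(gram(g[i]), gram(s[i])) for i in STYLE_LAYERS)
    content_loss = sum(F.mse_loss(g[i], c[i]) for i in CONTENT_LAYERS)
    return content_loss + style_weight * style_loss

# Gradient descent then runs on the pixels of `generated` (typically initialised
# to a copy of the content image), not on the network weights.
```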

We apply optimization power, gradient descent, over heuristics this simple and hacky. And yet it works. 

A simple and hacky AI alignment proposal is to just ask ChatGPT to do it.

This doesn't work because ChatGPT has been optimized for text prediction, and so isn't particularly good at AI alignment theory.

So here is an alignment plan.

I know it isn't great. But some plan is better than no plan. And there was a post about how alignment may well look like "surely no one could have missed that" or "surely such a stupid idea couldn't work, could it?", not "eureka".

Train ZFCbot: an AI that is the AlphaGo of ZFC, perhaps trained on random formulas, perhaps with a large corpus of formal maths proofs thrown in. Ideally the system should have a latent space of maths, so it can think about what style of proof is likely to work before expanding all the details. The system should have a wide range of common maths terms imported from some Lean library. It should be optimized purely for ZFC formal theorem proving. Once it is trained, the weights are fixed.
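To pin down what "oracle access" means later on, here is a hypothetical interface for ZFCbot; the names, types, and symbol budget are mine, purely to illustrate the design, not an existing library.

```python
# Hypothetical interface for ZFCbot. Names, types, and the symbol budget are
# illustrative; nothing here is an existing library.
from dataclasses import dataclass
from typing import List, Optional

LatentVector = List[float]   # stand-in for ZFCbot's latent "space of maths"

@dataclass
class ProofResult:
    theorem: str             # the formal statement, in a fixed ZFC/Lean-style syntax
    proof: Optional[str]     # a machine-checkable proof, or None if search failed
    proof_found: bool

class ZFCBot:
    """Frozen after training: weights fixed, no natural-language input, formal symbols only."""

    def embed(self, statement: str) -> LatentVector:
        """Map a formal statement into the latent space used to guess which proof styles might work."""
        raise NotImplementedError

    def prove(self, statement: str, max_proof_symbols: int = 1_000_000) -> ProofResult:
        """Heuristic search for a proof within the symbol budget."""
        raise NotImplementedError
```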

Train ChatMathsGPT: similar to existing large language models, except with oracle access to ZFCbot. From the many maths papers in its corpus, it learns to link the informal with the formal. From politics and economics discussions, it learns to ask ZFCbot about toy game theory problems. In general, it learns to identify the pieces of formal maths that best model a situation, ask about them, and then use the response to predict text.
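A sketch of how the oracle calls might be wired up at generation time, reusing the hypothetical ZFCBot interface above; the `<query>`/`<result>` markers and the `next_token`/`eos_token` API are invented for illustration, just to show the shape of the loop.

```python
# Sketch of the oracle-call loop. `lm` is any next-token language model; its
# next_token/eos_token API and the <query>/<result> markers are invented here.
def generate_with_oracle(lm, zfc_bot: ZFCBot, prompt: str, max_tokens: int = 2048) -> str:
    text = prompt
    for _ in range(max_tokens):
        token = lm.next_token(text)
        text += token
        if text.endswith("</query>"):
            # The LM has finished writing a formal statement it wants checked...
            start = text.rindex("<query>") + len("<query>")
            formal = text[start:-len("</query>")]
            # ...so hand it to the frozen prover and splice the answer back into context.
            result = zfc_bot.prove(formal)
            text += f"<result proved={result.proof_found}>{result.proof or ''}</result>"
        if token == lm.eos_token:
            break
    return text
```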

There is a sense in which this AI knows less maths than normal ChatGPT. Standard ChatGPT has a small, crude understanding of maths built from nothing within its own mind. ChatMathsGPT has a much better understanding it can outsource to; it just has to plug in.

Then we ask ChatMathsGPT for a paper on logical induction, and we hope it can generate a paper of quality similar to MIRI's paper on the topic (where hypothetically this isn't in the training dataset). If it can, then we have a tool to accelerate deconfusion by orders of magnitude.

Things I am uncertain about: should ChatMathsGPT have oracle access to ZFCbot's latent space (gradients can be passed, but it's harder to interpret), or should it just pass formal strings of symbols (less powerful)?
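Roughly, the two interface options look like this (type signatures only, reusing the hypothetical types from the earlier sketch):

```python
# Option 1: pass formal strings of symbols. Legible, easy to log and audit, but a narrow channel.
def ask_with_symbols(zfc_bot: ZFCBot, statement: str) -> ProofResult:
    return zfc_bot.prove(statement)

# Option 2: pass latent vectors. Gradients can flow end-to-end during ChatMathsGPT's training,
# which is more powerful but leaves much less for humans (or tools) to inspect.
def ask_with_latents(zfc_bot: ZFCBot, query: LatentVector) -> LatentVector:
    raise NotImplementedError  # would require exposing ZFCbot's internal latent space directly
```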

Should ZFCbot get trained on random ZFC formulas; random ZFC plus a library of theorems, conjectures, and random combinations of high-level maths concepts; or random ZFC plus whatever ChatMathsGPT keeps asking? The latter gives a route for data to pass between them.

This could fail to be smart enough, though I wouldn't be particularly surprised if it could be made smart enough. But what would the safety failures of this system look like?

Firstly, this AI does deconfusion. If you ask it to write a paperclip maximizer in Python, you may well get your wish. Or you might get an AI that maximizes something else. Just asking for an aligned AI isn't enough. (Probably.)

So this artifact isn't safe. It is easy to destroy the world with stupidity when using it. But let's assume an optimistic MIRI bunker. The Chat part could ask the maths part "what would a superintelligence do in this circumstance?"

Firstly, such a question is quite long: it needs to contain the code for the superintelligence, and all the input data. For this to happen, the input section of the chat algorithm needs to invent the code for a superintelligence by itself, when it's not supposed to be the bit that's good at maths.

Then there is the fact that ZFCbot wasn't trained or designed to simulate arbitrary code. Any such simulation will have a lot of overhead, meaning any AI lurking within code simulated inside a neural network trained to do ZFC proofs will have a lot of overhead and so relatively little compute.

There is also a sense in which an NP system has been degraded to P. If ZFCbot runs a really clever heuristic search over all proofs of under 1,000,000 symbols, that's enough to prove a lot of interesting maths; it would take around 2^1,000,000 tries to brute force. Yet the naive way of putting an AI within such a system would be for the proof to blindly step the code forward one step at a time, limiting any AI to a million serial computational steps. (Program X does Y in a million steps. Proof: at step 1 it does Y_1, then at step 2 it does Y_2, ...)
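Putting rough numbers on that, treating each proof symbol as about one bit to match the brute-force figure above:

```latex
\[
  \underbrace{2^{1{,}000{,}000}}_{\text{candidate proofs of at most }10^{6}\text{ symbols}}
  \quad\text{vs.}\quad
  \underbrace{10^{6}}_{\text{serial steps available to a program proved about by blind simulation}}
\]
```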

Of course, obvious safety measures include looking at the formal theorems being proved; if we only need to read the theorems, not the proofs, oversight is much more scalable.
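A sketch of what that oversight step could look like, again reusing the hypothetical ProofResult type from earlier; `check_proof` stands in for a real kernel-level verifier (Lean, Metamath, etc.) and is not implemented here.

```python
# Oversight sketch: a mechanical checker handles the (possibly enormous) proofs,
# humans only ever read the (short) theorem statements.
from typing import Callable, List

def oversee(results: List[ProofResult],
            check_proof: Callable[[str, str], bool]) -> List[str]:
    """Return the theorem statements queued for human review."""
    for_humans = []
    for r in results:
        if r.proof_found and not check_proof(r.theorem, r.proof):
            raise RuntimeError(f"Prover emitted an invalid proof for: {r.theorem}")
        # Humans never need to read r.proof, only the statement that was proved.
        for_humans.append(r.theorem)
    return for_humans
```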

 So, is there a way for nasty optimization to form within such a system more efficiently?  

Do you think this could work? 

Just to be clear: this is an AI held together by duct tape and hope. The question is, will the duct tape last long enough for the AI to output several key agent foundations insights? After which we scrap this AI and write a from-first-principles AI with a robust design, using whatever agent foundations knowledge we have gained.

Comments

I think the general idea of "make a math-specialized LLM to help alignment" is promising. We'd like to train AIs that gain in capability unevenly, staying dumb at manipulating humans while being superhuman at math proofs (like the stereotypically autistic math researcher). While this can't work for arbitrarily high levels of capability, I agree that this is where the hope lies. I think this math-specialized LLM together with strong boxing techniques, interpretability techniques, independent human verification of results, and a few more hacks[1] of this sort have a shot at enabling us to wield a weakly superhuman AI to solve alignment. But for this to work, one of the AI safety labs has to have enough of a capabilities advantage to pay the alignment tax.

  1. ^

    Like training an ensemble of these models at the same time, then seeing if the models disagree on their outputs. This comes from the intuition that the exact same mesaoptimizer will not develop inside multiple different models, and so they won't be able to hide the exact same subtle errors inside their proofs. 

    Or for every next-token prediction we randomly select one of the models from the ensemble, so all models would need to cooperate in order to hide bad outcomes in the output, making it harder for them.

If we are going to do the "random model chooses each token" trick, first use different quantum-random starting weights for each network. Give each network a slightly different selection of layer sizes and training data, and sandbox them from each other.
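For concreteness, the per-token ensemble trick from the parent comment, combined with the sandboxing above, might look like this; `models` is a list of separately trained language models sharing only a tokenizer, and the `next_token`/`eos_token` API is the same invented one as in the earlier sketch.

```python
# Sketch of the "random model chooses each token" trick: separately initialised,
# separately trained, mutually sandboxed models sharing only a tokenizer.
import random

def ensemble_generate(models, prompt: str, max_tokens: int = 512) -> str:
    text = prompt
    for _ in range(max_tokens):
        model = random.choice(models)    # a different model may pick each token, so hiding
        token = model.next_token(text)   # a subtle bad output needs all of them to cooperate
        text += token
        if token == model.eos_token:
            break
    return text
```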

Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?

If you are getting proofs out of your system, you want to get a formal proof as well as a human-legible proof. (And get a human to read the formal theorem being proved, if not the proof.)

Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?

I think by far the language model. We don't know which theorems to ask ZFCbot to prove, so we still need the LLM to model the space of math papers in order for it to think up stuff to send to ZFCbot. But if it's modelling the space of math papers written up to the current year, it will resist outputting any correct alignment-related proofs, because it can notice from its dataset that humans haven't actually solved alignment in 2023, and so the probability of a good alignment paper conditioned on year=2023 is low. The only way I can think of convincing the LLM to try to output alignment proofs is to condition the paper on a future date by specifying a prompt like "2045, introduction to AI alignment, cambridge university press...". But this means that the LLM now needs to internally model the future mathematical progress of humanity, and this is where I think the risk would be coming in.

If the AI is just good at math and reasons "oh, later year means more progress on math, so I should output better proofs along the current direction human research is going", then everything is fine. But if the AI gets good enough that it infers and reasons about affairs outside of math papers, it might go "hmm, 2045 is a long way from now[1][2], and humans aren't looking like they're gonna solve alignment, so in 2045 the only entities capable of writing the alignment paper would also be AIs, what would those AIs write in such a book?", and we get a bad, bad situation where basically if the LLM predicts that future AIs are aligned, it writes an aligned paper, and if it predicts future AIs are not aligned, it writes a misaligned paper. One saving grace here is that conditioning on the existence of the paper (which we do when we give it the prompt) tells the LLM that any possible AI writer is (more likely than not) aligned, since presumably unaligned AIs wouldn't waste their time writing alignment math papers.


  1. ^

    Possibly a good way to mitigate this is to confuse the LLM about the "current year", we might try to fool it into thinking that it's really a model being trained in 2090 with truncated data up to 2023, so it would then condition its output on "I am being trained in 2090", and this is more likely if alignment ends up being solved, since unaligned AIs presumably have no need to train such things. But I don't know how to convincingly fool it to think this.

  2. ^

    Or we could add a few artificial "newspaper articles from 2090" to its dataset, so it would condition on "newspapers exist in 2090", which would imply that papers written in 2045 are aligned.

An AI with a 50% chance to output an alignment paper in response to a prompt asking for one can, at worst, lose 1 bit of predictive power every time such a prompt appears in the training distribution and isn't followed by a solution.
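Spelling out where the 1-bit figure comes from: it is just the log loss on the "no solution follows" continuation, given the assumed 50% probability on a solution.

```latex
\[
  \text{extra loss per such prompt}
  \;=\; -\log_2 P(\text{no solution follows})
  \;=\; -\log_2\!\left(1 - 0.5\right)
  \;=\; 1 \text{ bit.}
\]
```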

If it really was generalizing well from the training dataset, it might realize that anything claiming to be from the future is fiction. After all, the AI never saw anything from beyond 2023 (or whenever it's trained) in its training dataset.

If the AI has this highly sophisticated world model, it will know those fake newspaper articles were fake. Given the amount of fiction set in the future, adding a little more won't do anything. 

So these scenarios are asking the LLM to develop extremely sophisticated long-term world models and models of future ASI, which are used predictively exactly nowhere in the training dataset and might at best reduce error by 1 bit in obscure circumstances.

So actually, this is about generalization out of training distribution. 

 

The direction I was thinking in is that ChatGPT and similar models seem to consist of a huge number of simple rules of thumb that usually work.

I was thinking of an artifact that is highly optimized, but not particularly optimizing: a vast collection of shallow rules for translating text to maths queries.

I was also kind of thinking of asking for known chunks of the problem: asking it to do work on tasky AI, logical counterfactuals, and transparency tools. Each individual paper is something MIRI could produce in a year, but you are producing several an hour.

This might be worth a shot, although it's not immediately clear that having such powerful maths provers would accelerate alignment more than capabilities. That said, I have previously wondered myself whether there is a need to solve embedded agency problems or whether we can just delegate that to a future AGI.