A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things. 

Recent Discussion

Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.

The model was trained on examples of protein sequences from many species, as

...
Thane Ruthenis
Thanks! You're more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would've been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what's the actual effective "scale" of GPT-3? (Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3's level is attainable using less compute, the effective scale-up is bigger. I'm wondering how much bigger.)
Vladimir Nesov
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).

GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.

(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn't worth mentioning compared to everything else about it that's different from GPT-4.)
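As a quick sanity check on the figures quoted above, here is a back-of-envelope sketch of my own (not part of the original comment), using the standard ~6ND approximation for dense-transformer training compute:

```python
# Rough check of the GPT-3 numbers above, using compute ~ 6 * params * tokens.
params = 175e9                 # GPT-3 parameters
tokens = 300e9                 # GPT-3 training tokens
flops = 6 * params * tokens    # ~3.15e23, matching the reported ~3e23 FLOPs
print(f"training compute ~ {flops:.2e} FLOPs")

tokens_per_param = tokens / params   # ~1.7 tokens per parameter
chinchilla_ratio = 20                # compute-optimal tokens/parameter (Chinchilla)
# ~12x below the Chinchilla ratio, i.e. roughly the "10x undertrained" in the comment.
print(f"undertrained by ~{chinchilla_ratio / tokens_per_param:.0f}x in tokens/parameter")
```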
nostalgebraist
One possible answer is that we are in what one might call an "unhobbling overhang."

Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice. His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling over time, which leads to even more growth.

That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve "practically accessible capabilities" to the same extent by doing more of either one even in the absence of the other, and if you do both at once that's even better.

However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at "the next tier up," you also need to do novel unhobbling research to "bring out" the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of the new capabilities will be hidden/inaccessible, and this will look to you like diminishing downstream returns to pretraining investment, even though the model is really getting smarter under the hood.

This could be true if, for instance, the new capabilities are fundamentally different in some way that older unhobbling techniques were not planned to reckon with. Which seems plausible IMO: if all you have is GPT-2 (much less GPT-1, or char-rnn, or...), you're not going to invest a lot of effort into letting the model "use a computer" or combine modalities or do long-form reasoning or even be an HHH chatbot, because the model is kind of obviously too dumb to do these things usefully no matter how much help you give it. (Relatedly, one could argue that fundamentally better capabili
Charlie Steiner
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me? It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear. But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve. But clearly other people think differently than me.

I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also epistemically helpful, since it allows people who do think it will be a big deal to establish a track record".

I find these arguments reasonably compelling, FWIW.

elifland
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.

Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:

  • The AIs are consistently trying to subvert safety measures.
  • They’re very good at strategizing about how to subvert them, and can do this strategizing totally illegibly (e.g. in a single forward pass).
  • They can follow a strategy that’s coordinated between all their instances.
  • They know exactly what safety measures are in use.
  • They know a lot about the distribution of inputs.
  • They’re able to pick Schelling points perfectly.
  • They’re able to follow stochastic policies with excellent calibration (e.g. they’re able
...

Epistemic status: This is an off-the-cuff question.

~5 years ago there was a lot of exciting progress on game playing through reinforcement learning (RL). Now we have basically switched paradigms: pretraining massive LLMs on ~the internet and then apparently doing some fairly trivial, unsophisticated RL on top of that. This has been successful and highly popular because interacting with LLMs is pretty awesome (at least if you haven't done it before) and they "feel" a lot more like AGI. There's probably somewhat more commercial use as well via code completion (and some would say many other tasks, though personally I'm not really convinced; generative image/video models will certainly be profitable, though). There's also a sense in which they are clearly more general - e.g. one RL algorithm may learn...

Vanessa Kosoy
Relevant link
Vanessa Kosoy
Apparently someone had LLMs play chess against the random policy, and for most of them, most games end in a draw. o1-preview seems to be the best of those tested, managing to win 47% of the time.
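For concreteness, here is a minimal sketch of the kind of test being described, assuming python-chess; llm_move is a hypothetical placeholder for the actual model call, not the original author's setup:

```python
# Sketch: an LLM plays White against a uniformly random legal-move policy.
import random
import chess

def llm_move(board: chess.Board) -> chess.Move:
    # Placeholder: a real test would prompt the model with the game so far
    # (e.g. as PGN or FEN) and parse the move it names.
    return random.choice(list(board.legal_moves))

def play_one_game() -> str:
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            board.push(llm_move(board))                       # LLM's move
        else:
            board.push(random.choice(list(board.legal_moves)))  # random opponent
    return board.result()  # "1-0", "0-1", or "1/2-1/2"

results = [play_one_game() for _ in range(100)]
print({r: results.count(r) for r in set(results)})
```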
gwern
Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.

Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?

Summary

Can LLMs science? The answer to this question can tell us important things about timelines to AGI. In this small pilot experiment, we test frontier LLMs on their ability to perform a minimal version of scientific research, where they must discover a hidden rule about lists of integers by iteratively generating and testing hypotheses. Results are ambiguous: they're mostly pretty bad at it but top systems show apparent signs of life. We're working on a larger, more rigorous experiment, and we really want your input.
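To make the task concrete, here is a minimal sketch of the kind of harness involved, using the 'strictly increasing order' rule mentioned later as the hidden rule; the function names and scoring below are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of the hidden-rule task (not the authors' harness).
# The model iteratively proposes integer lists, sees pass/fail feedback,
# then states a final hypothesis, which is scored against held-out probes.

def hidden_rule(xs):
    """Example rule (rated easiest by the authors): strictly increasing order."""
    return all(a < b for a, b in zip(xs, xs[1:]))

def run_trial(propose_list, guess_rule, max_queries=10):
    """propose_list(history) -> list[int]; guess_rule(history) -> predicate.
    In the real experiment both would be backed by LLM calls."""
    history = []
    for _ in range(max_queries):
        xs = propose_list(history)               # model picks a test case
        history.append((xs, hidden_rule(xs)))    # oracle labels it
    guessed = guess_rule(history)                # model's final hypothesis
    probes = [[1, 2, 3], [3, 2, 1], [1, 1, 2], [5], [-2, 0, 7]]
    return sum(guessed(p) == hidden_rule(p) for p in probes) / len(probes)
```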

Structure

In this post we:

  • Describe an experiment on general reasoning and scientific research ability in LLMs.
  • Describe the main research project for which this is a pilot project.
  • Ask for your predictions on the outcome of the main project and what you believe it will say
...

As a partial point of comparison, only about 20% of humans solved the problem in Wason's testing, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, it used only one task (our easiest-rated task, 'strictly increasing order').

I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for what the takeaways should be whether LLMs are already comparable to or better than humans at this task, or still significantly worse.
