It seems to me that there are two unstated perspectives behind this post that inform a lot of it.
First, that you specifically care about upper-bounding capabilities, which in turn implies being able to make statements like "there does not exist a setup X where model M does Y". This is a very particular and often hard-to-reach standard, and you don't really motivate why you focus on it. A much simpler standard is "here is a setup X where model M did Y". I think evidence of the latter type can drive lots of the policy outcomes you want: "GPT-6 replicated itself on the internet and designed a bioweapon, look!". Ideally, we want to eventually be able to say "model M will never do Y", but on the current margin, it seems we mainly want to reach a state where, given an actually dangerous AI, we can realise this quickly and then do something about the danger. Scary demos work for this. Now you might say "but then we don't have safety guarantees". One response is: then get really good at finding the scary demos quickly.
Also, very few existing safety standards have a "there does not exist an X where..." form. Airplanes aren't safe because we have an upper bound on how explosive they can be; they're safe because we know the environments in which we need them to operate safely, design them for those environments, and only operate them within those. By analogy, this weakly suggests controlling AI operating environments and developing strong empirical evidence of safety in those specific operating environments. A central problem with this analogy, though, is that airplane operating environments are much lower-dimensional. A tuple of (temperature, humidity, pressure, speed, number of armed terrorists onboard) probably captures most of the variation you need to care about, whereas LLMs are deployed in environments that vary on very many axes.
Second, you focus on the field, in the sense of its structure and standards and its ability to inform policy, rather than in the sense of the body of knowledge. The former is downstream of the latter. I'm sure biologists would love to have as many upper bounds as physicists, but the things they work on are messier and less amenable to strict bounds (but note that policy still (eventually) gets made when they start talking about novel coronaviruses).
If you focus on evals as a science, rather than a scientific field, this suggests a high-level goal that I feel is partly implicit but also a bit of a missing mood in this post. The guiding light of science is prediction. A lot of the core problem in our understanding of LLMs is that we can't predict things about them - whether they can do something, which methods hurt or help their performance, when a capability emerges, etc. It might be that many questions in this space - and I'd guess upper-bounding capabilities is one - are just hard. But if you gradually accumulate cases where you can predict something from something else - even if it's not the type of thing you'd like to eventually predict - the history of science shows you can get surprisingly far. I don't think it's what you intend or think, but I think it's easy to read this post and come away with a feeling of more "we need to find standardised numbers to measure so we can talk to serious people" and less "let's try to solve that thing where we can't reliably predict much about our AIs".
Also, nitpick: FLOPS are a unit of compute, not of optimisation power (which, if it makes sense to quantify at all, should maybe be measured in bits).