Development and interpretation of evals is complicated
Proper elicitation is an unsolved research question
... and yet...
Closing the evals gap is possible
Why are you sure that effective "evals" can exist even in principle?
I think I'm seeing a "we really want this, therefore it must be possible" leap here.
"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."
Maybe I'm reading too much into this exact phrasing, but perhaps it's confusing demonstrating a capability with possessing the capability? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?
I could see the model being led into that by common academic phrasing like "model X demonstrates the capability to..." being used to mean "we determined that model X can...". That sort of "thinking" also has the feel of where you'd end up if you'd internalized too many of the corporate weasel-worded responses that get pounded into these models during their "safety" training.
What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?
Um, bad?
Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world. Not unless you deliberately modify them to the point where the word "human" becomes unreasonable.
The upside of AI depends on restricting human agency just as much as the downside does.
You seem to be relying on the idea that someday nobody will need to protect that child from a piece of glass, because the child's agency will have been perfected. Someday the adult will be able to take off all the restraints, stop trying to restrict the child's actions at all, and treat the child as what we might call "sovereign".
... but the example of the child is inapt. A child will grow up. The average child will become as capable of making good decisions as the average adult. In time, any particular child will probably get better than any particular adult, because the adult will be the first to age to the point of real impairment.
The idea that a child will grow up is not a hope or a wish; it's a factual prediction based on a great deal of experience. There's a well-supported model of why a child is the way a child is and what will happen next.
On the other hand, the idea that adult humans can be made "better agents", whether in the minimum, the maximum, or the mean, is a lot more like a wish. There's just no reason to believe it. Humans have been talking about the need to get wiser for as long as there have been records, and have little to show for it. What changes there have been in individual human action are arguably more due to better material conditions than to any improved ability to act correctly.
Humans may have improved their collective action. You might have a case to claim that governments, institutions, and "societies" take better actions than they did in the past. I'm not saying they actually do, but maybe you could make an argument for it. It still wouldn't matter. Governments, institutions and "societies" are not humans. They're instrumental constructs, just like you might hope an AI would be. A government has no more personality or value than a machine.
Actual, individual humans still have not improved. And even if they can improve, there's no reason to think that they could ever improve so much that an AI, or even an institution, could properly take all restraints off of them. At least not if you take radical mind surgery off the table as a path to "improvement".
Adult humans aren't truly sovereign right now. You have comparatively wide freedom of action as an adult, but there are things that you won't be allowed to do. There are even processes for deciding that you're defective in your ability to exercise your agency properly, and for taking you back to childlike status.
The collective institutions spend a huge amount of time actively reducing and restricting the agency of real humans, and a bunch more time trying to modify the motivations and decision processes underlying that agency. They've always done that, and they don't show any signs of stopping. In fact, they seem to be doing it more than they did in the past.
Institutions may have fine-tuned how they restrict individual agency. They may have managed to do it more when it helps and less when it hurts. But they haven't given it up. Institutions don't make individual adults sovereign, not even over themselves and definitely not in any matter that affects others.
It doesn't seem plausible that institutions could keep improving outcomes if they did make individuals completely sovereign. So if you've seen any collective gains in the past, those gains have relied on constructed, non-human entities taking agency away from actual humans.
In fact, if your actions look threatening enough, even other individuals will try to restrain you, regardless of the institutions. None of us is willing to tolerate just anything that another human might decide to do, especially not if the effects extend beyond that person.
If you change the agent with the "upper hand" from an institution to an AI, there's no clear reason to think that the basic rules change. An AI might have enough information, or enough raw power, to make it safe to allow humans more individual leeway than they have under existing institutions... but an AI can't get away with making you totally sovereign any more than an institution can, or any more than another individual can. Not unless "making you sovereign" is itself the AI's absolute, overriding goal... in which case it shouldn't be waiting around to "improve" you before doing so.
There's no point at which an AI with a practical goal system can tell anything recognizably human, "OK, you've grown up, so I won't interfere if you want to destroy the world, make life miserable for your peers, or whatever".
As for giving control to humans collectively, I don't think it's believable that institutions could improve to the point where a really powerful and intelligent AI could believe that those institutions would achieve better outcomes for actual humans than the AI could achieve itself. Not on any metric, including the amount of personal agency that could be granted to each individual. The AI is likely to expect to outperform the institutions, because the AI likely would outperform the institutions. Ceding control to humans collectively would just mean humans individually losing more agency... and more of other good stuff, too.
So if you're the AI, and you want to do right by humans, then I think you're going to have to stay in the saddle. Maybe you can back way, way off if some human self-modifies to become your peer, or your superior... but I don't think that critter you back off from is going to be "human" any more.
By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that.