Development and interpretation of evals is complicated
Proper elicitation is an unsolved research question
... and yet...
Closing the evals gap is possible
Why are you sure that effective "evals" can exist even in principle?
I think I'm seeing a "we really want this, therefore it must be possible" shift here.
Relatedly, the point that is least clear to me is what exactly it would mean to solve the "proper elicitation problem," and what exactly the "requirements" laid out by the blue line on the graph are. I think I'd need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods being proposed (i.e., better design and coverage of black-box evaluations).
As a non-example, possessing the kind of foun...
"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."
Maybe I'm reading too much into this exact phrasing, but perhaps the model is confusing demonstrating a capability with possessing it? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?
I could see it being led into that by common academic phrasing like "model X demonstrates the cap...
What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?
Um, bad?
Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world. Not unless you deliberately modify them to the point where the word "human" becomes unreasonable.
The upside of AI depends on restricting human agency just as much as the downside does.
You seem to be relying on the idea that someday nobody will need to protect that child from a piece of gla...
Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.
I see this argument pop up every so often. I don't find it persuasive because, in my view, it presents a false choice.
Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).
If you think the alignment problem is hard, you probably ...
By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that.