Wait, where? I think the objection to "Doing that is quite hard" is not an objection to "it's not obviously true that such algorithms are actually 'achievable' for SGD"; it's an objection to the conclusion that the model would try hard enough to justify the arguments about deception from the weak statement that the loss decreases during training.
"Safe" as in "safe enough for it to be on net better to run it" or "safe enough it wouldn't definitely kill everyone". It's not that I don't have popular intuition that GPT wouldn't kill anyone. It's just that I don't think it's a good habit to run progressively more capable systems while relying on informal intuitions about their safety. And then maybe I will see an explanation for why future safety tools would outpace capability progress, when now we are already at the point where current safety tools are not applicable to current AI systems.
So, what's the technical reason (the kind that ends with "therefore the probability of disaster is < 1e-9") why training InstructGPT was safe?
I think you might be conflating two different scenarios?
I'm presenting an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why that matters?
Is your claim that that’s basically nothing?
Yes, or at least it approaches relatively nothing as the optimizer gets more competent.
If it’s always super easy to identify the base objective and then optimize it, then it shouldn’t need to pay the c(get_base_obj) penalty for storing the base objective, since it could also near-instantly derive it.
If it doesn't store it, it isn't optimizing it: it would be able to derive that humans want the base objective, but it wouldn't want to optimize it itself.
I think speed and complexity trade off against the model's precision? Like, for a simple objective you can hardcode it, but for a complex one you wouldn't have space for the base objective that you couldn't instead spend on a table of chemical elements or something. So I would expect t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes the check but computes the base objective to be only slightly greater than t(mesaoptimize_env) for one that has the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmmm, except maybe if you somehow specifically penalize the difference in speed between instrumental and terminal objectives? But then the mesaoptimizer would just find other instrumental ways.
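To spell out the comparison (my own restatement using the same timing terms; the approximation is my addition, not something from the post):

$$t(\text{is\_deployed\_in\_world}) + t(\text{find}(\text{get\_base\_obj}(\text{env}))) + t(\text{mesaoptimize\_env}) \approx t(\text{mesaoptimize\_env}),$$

on the assumption that checking deployment is cheap and that nearly all of $t(\text{find}(\text{get\_base\_obj}(\text{env})))$ is work the model does anyway while modelling the environment well enough to mesaoptimize it. Under that assumption, the variant that computes the base objective at runtime pays only a small speed penalty relative to the one that hardcodes it.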
It was all very interesting, but what was the goal of these discussions? I mean, I had the impression that pretty much everyone assigned >5% probability to "if we scale, we all die", which is already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "by the time they think their solution has reduced the probability of failure from 5% to 0.1%, it would actually still be much higher"? That seems possible only if people don't understand the arguments about inner optimizers or whatnot, as opposed to disagreeing with them.
To clarify: sufficient observations would still falsify all "simulate simple physics, start reading from a simple location" programs and eventually promote "simulate true physics, start reading from the camera location"?
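To make sure I understand the mechanism, here is a toy sketch of the update I have in mind (entirely my own illustration; the stand-in functions, the hardcoded description lengths, and the 2^-length prior are assumptions for the example, not anything from the post):

```python
# Toy illustration of a crude Solomonoff-style update. Hypothetical "programs"
# are (description_length, predict) pairs; observations falsify any program
# whose predictions disagree, and the shortest surviving program dominates
# the 2^-length posterior.
from typing import Callable, Dict, List, Tuple


def predict_simple_physics(t: int) -> int:
    """Stand-in for 'simulate simple physics, read from a simple location'."""
    return 0


def predict_true_physics(t: int) -> int:
    """Stand-in for 'simulate true physics, read from the camera location'."""
    return t % 7  # pretend the camera feed has this structure


programs: List[Tuple[int, Callable[[int], int]]] = [
    (10, predict_simple_physics),   # short program, wrong once data comes in
    (100, predict_true_physics),    # much longer program, actually correct
]

# Observations generated by the "true physics" process.
observations = [(t, t % 7) for t in range(20)]


def posterior(programs, observations) -> Dict[str, float]:
    # Keep only programs consistent with every observation (falsification),
    # then weight survivors by 2^-description_length and normalize.
    surviving = [
        (length, f)
        for length, f in programs
        if all(f(t) == obs for t, obs in observations)
    ]
    weights = {f.__name__: 2.0 ** -length for length, f in surviving}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


print(posterior(programs, observations))
# After enough observations only predict_true_physics survives, so it gets
# all the posterior mass despite its larger description length.
```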
Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?