Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.
The basic case against Effective-FLOP.
Agree with both aogara's and Eli's comments.
One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.
For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.
There's a lot of work that could be relevant for x-risk but is not motivated by it. Some of it is more relevant than work that is motivated by it. An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.
my current best guess is that gradient descent is going to want to make our models deceptive
Can you quantify your credence in this claim?
Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, ... optimization pressure against deception?
If you don't like these or want a more specific operationalization of this question, I'm happy with whatever you think is likely or filling out more details.
I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don't take these too seriously:
Thank you for this thoughtful response, I didn't know about most of these projects. I've linked this comment in the DeepMind section, and made some modifications, both for clarity and to include a bit more.
I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".
This is useful to know.
Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.
I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more".
Some corrections for your overall description of the DM alignment team:
There is also the ontology identification problem. The two biggest things are: we don't know how to specify exactly what a diamond is because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world.
I personally don't expect coding a diamond-maximizing AGI to be hard, because I think that "diamond" is a sufficiently natural concept that normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more bas...
Thanks for your response! I'm not sure I communicated what I meant well, so let me be a bit more concrete. Suppose our loss is parabolic, $L(x) = x_1^2 + x_2^2$, where $x \in \mathbb{R}^3$. This is like a 2D parabola (but its convex hull / volume below a certain threshold is 3D). In 4D space, which is where the graph of this function lives and hence where I believe we are talking about basin volume, this has 0 volume. The Hessian is the matrix:
$$H = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
This is conveniently already diagonal, and the 0 eigenvalue comes from the component $x_3$, which...
I am a bit confused about how you deal with the problem of 0 eigenvalues in the Hessian. It seems like the reason these 0 eigenvalues exist is that the basin volume is 0 as a subset of parameter space. My understanding right now of your fix is that you are adding a term along the diagonal to make the matrix full rank (and this quantity comes from the regularization plus a small quantity). Geometrically, this seems like drawing a narrow ellipse around the subspace whose volume we are trying to estimate.
But this doesn't seem na...
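To make the zero-eigenvalue concern concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than the method under discussion: the loss `loss(x) = x[0]**2 + x[1]**2` over three parameters, the helper names `numerical_hessian` and `volume_proxy`, and the use of a Laplace-style proxy in which volume scales as the product of `(eigenvalue + eps) ** -0.5`. It only shows that once a term `eps` is added along the diagonal, the estimate is driven by `eps` in the flat direction.

```python
import math

def loss(x):
    # Assumed toy loss: depends on x[0] and x[1] only, so x[2] is a flat direction.
    return x[0] ** 2 + x[1] ** 2

def numerical_hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x (hypothetical helper)."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            hess[i][j] = (
                shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)
            ) / (4 * h * h)
    return hess

def volume_proxy(eigenvalues, eps):
    """Laplace-style proxy: product of (lambda + eps) ** -0.5 over eigenvalues."""
    prod = 1.0
    for lam in eigenvalues:
        prod *= lam + eps
    return math.inf if prod <= 0 else prod ** -0.5

hess = numerical_hessian(loss, [0.0, 0.0, 0.0])
# This Hessian is diagonal, so its eigenvalues can be read off the diagonal.
# That shortcut would not be valid for a general Hessian.
eigenvalues = [hess[i][i] for i in range(3)]

for eps in (0.0, 1e-2, 1e-4):
    print(eps, volume_proxy(eigenvalues, eps))
```

In this toy case the proxy is infinite with no added term, and shrinking `eps` by a factor of 100 grows the proxy by roughly a factor of 10, so the number reported depends mostly on the added quantity rather than on the curvature of the loss.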
Thank you so much for your detailed reply. I'm still thinking this through, but this is awesome. A couple things:
Half baked confusion:
How does Parfit's Hitchhiker fit into the Infra-Bayes formalism? I was hoping that the disutility the agent receives from getting stuck in the desert would be easily representable as negative off-branch utility. I am stuck trying to reconcile that with the actual update rule:
Here, I interpret as our utility function. Thus: gives us the expected utility tracked from the offbranch event. The probability and the expectation are just a scale and shift. This update is appli...
Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.
If there's a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks.
Another downside that FLOPs / E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOPs model will have. And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x FLOP scaling you get high capabilities that had not appeared at all in the smaller model.
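A rough sketch of the "same FLOP, different capability" point, in Python. It assumes one common reading of effective FLOP, namely physical training FLOP multiplied by an algorithmic-efficiency factor relative to some reference recipe. That definition, the function names, the 4x multiplier, and the use of 1e26 as the threshold are all illustrative assumptions, not figures from any specific policy.

```python
# Illustrative threshold; 1e26 is used only because it appears in the discussion above.
THRESHOLD = 1e26

def effective_flop(physical_flop, efficiency_multiplier):
    """Assumed definition: physical FLOP scaled by algorithmic efficiency."""
    return physical_flop * efficiency_multiplier

def crosses_threshold(flop, threshold=THRESHOLD):
    return flop >= threshold

physical = 5e25                             # same physical compute for both models
baseline = effective_flop(physical, 1.0)    # reference training recipe
improved = effective_flop(physical, 4.0)    # hypothetical 4x algorithmic gain

print(crosses_threshold(physical))   # a physical-FLOP threshold treats both models alike
print(crosses_threshold(baseline))
print(crosses_threshold(improved))   # an effective-FLOP threshold separates them
```

Under these assumptions a physical-FLOP threshold cannot distinguish the two models, while an effective-FLOP threshold flags only the one trained with the better algorithm. Neither measure says which capabilities actually appear at a given scale.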