Great question, thanks. TL;DR: it depends on what you mean by established, but the obstacle to establishing such a thing is probably lower than you think.
To clarify the two types of phase transitions involved here, in the terminology of Chen et al: there are phase transitions in the Bayesian posterior as the number of samples varies, and dynamical phase transitions over training time.
The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.
However, Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition (say, from a critical point $w_A$ to a critical point $w_B$ over training) has "standing behind it" a Bayesian phase transition, where at some critical sample size the posterior shifts from being concentrated near $w_A$ to being concentrated near $w_B$.
It turns out that, at least for sufficiently large $n$, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near $w_B$ should be higher than near $w_A$. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients, compare them, and thereby provide evidence that a Bayesian phase transition exists.
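To spell out the reasoning behind that claim (suppressing constant-order terms, and writing $L_n$ for the empirical loss): asymptotically the free energy of a neighbourhood $W$ of a critical point $w^*$ behaves like
$$F_n(W) \approx n\,L_n(w^*) + \lambda(w^*)\log n,$$
so with $L_n(w_A) > L_n(w_B)$ the phase around $w_A$ dominates at smaller $n$ and hands over to the phase around $w_B$ at some critical sample size exactly when
$$n\,\big(L_n(w_A) - L_n(w_B)\big) < \big(\lambda(w_B) - \lambda(w_A)\big)\log n$$
holds over some range of $n$, which requires $\lambda(w_B) > \lambda(w_A)$.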
This has been done for the Toy Model of Superposition in Chen et al, and we're in the process of looking at a range of larger systems, including induction heads. We're not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob's nice post on modular addition, which I would say provides evidence for a Bayesian phase transition in that setting.
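In case it helps to make "estimate the local learning coefficient" concrete, here is a minimal sketch of the kind of estimator involved: sample from a tempered posterior localized at $w^*$ (here with SGLD) and set $\hat\lambda = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))$ with $\beta = 1/\log n$. The toy model, step size, and chain length below are placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a toy singular model: true function is zero, model is y = (w1*w2)*x.
n = 1000
X = rng.normal(size=n)
Y = 0.1 * rng.normal(size=n)

def empirical_loss(w):
    """Mean squared error L_n(w) for the model y = w1*w2*x."""
    return float(np.mean((Y - w[0] * w[1] * X) ** 2))

def estimate_llc(w_star, beta, gamma=100.0, eps=1e-5, steps=5000, burn_in=1000):
    """Estimate the local learning coefficient at w_star.

    Runs SGLD targeting the tempered posterior localized at w_star,
        p(w) ∝ exp(-n*beta*L_n(w) - gamma/2 * ||w - w_star||^2),
    then returns  lambda_hat = n*beta*(E[L_n(w)] - L_n(w_star)).
    """
    w_star = np.array(w_star, dtype=float)
    w = w_star.copy()
    losses = []
    for t in range(steps):
        # finite-difference gradient of L_n (fine for a 2-parameter toy model)
        grad = np.zeros_like(w)
        for i in range(len(w)):
            dw = np.zeros_like(w)
            dw[i] = 1e-6
            grad[i] = (empirical_loss(w + dw) - empirical_loss(w - dw)) / 2e-6
        w = (w - 0.5 * eps * (n * beta * grad + gamma * (w - w_star))
               + np.sqrt(eps) * rng.normal(size=w.shape))
        if t >= burn_in:
            losses.append(empirical_loss(w))
    return n * beta * (np.mean(losses) - empirical_loss(w_star))

beta = 1.0 / np.log(n)                      # inverse temperature suggested by the theory
print("lambda_hat:", estimate_llc([0.0, 0.0], beta))
```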
There are some caveats and details that I can go into if you're interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.
> 4. Goals misgeneralize out of distribution.
>
> See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
>
> OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
Based on the Bold Plan post and this one, my main point of concern is that I don't believe in the feasibility of the model checking, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC, which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then try to apply formal verification methods to that. I imagine that you are actually thinking of learned coarse-grainings of both S and A, to which one applies something like formal verification.
Assuming that's correct, there's an inevitable lack of precision on the inputs to the formal verification step. You either run the COOL-MC-like process until you hit your time and compute budget and accept that you're missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of the coarse-graining. If you're doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?
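For concreteness, the pipeline I have in mind when I say this looks roughly as follows; this is a minimal sketch rather than COOL-MC itself, and the tiny state space, the stand-in policy, and the bounded reachability property are all placeholder choices on my part:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "world": 20 coarse states, 3 actions, known transition kernel P[s, a, s'].
S, A = 20, 3
P = rng.dirichlet(np.ones(S), size=(S, A))          # environment dynamics
BAD = {S - 1}                                        # states we want to avoid

def policy(state):
    """Stand-in for querying the trained NN: returns a distribution over actions."""
    logits = np.sin(np.arange(A) + state)            # placeholder for a forward pass
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Step 1 (the COOL-MC-like step): assemble the induced discrete-time Markov chain
# by querying the policy at every (coarse) state.
M = np.zeros((S, S))
for s in range(S):
    pi = policy(s)
    for a in range(A):
        M[s] += pi[a] * P[s, a]

# Step 2: stochastic model checking of a bounded reachability property,
# "probability of hitting a BAD state within T steps", by backward iteration.
T = 50
prob_bad = np.zeros(S)
prob_bad[list(BAD)] = 1.0
for _ in range(T):
    new = M @ prob_bad
    new[list(BAD)] = 1.0                             # bad states are absorbing
    prob_bad = new

print("worst-case P(reach BAD within T steps):", prob_bad.max())
```

The step I don't see scaling is assembling M: for a realistic world model you either cannot enumerate the states within budget, or you replace them with a learned coarse-graining and inherit a dependence on its quality, which is the tradeoff described above.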
I know there's literature on learned coarse-grainings of S and A in the deep RL setting, but I haven't seen it combined with formal verification. Is there a literature? It seems important.
I'm guessing that this passage in the Bold Plan post contains your answer:
> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state
In which case I see where you're going, but this seems like the hard part?
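For what it's worth, here is how I currently picture the infra-Bayesian epistemic state in a finite toy case: a credal set of subprobability vectors, with plans evaluated by worst-case expectation. This is a minimal sketch that ignores down-closure and all of the multi-scale ontology machinery, and the outcome space, generators, and utility are invented for illustration:

```python
import numpy as np

# A finitely generated credal set over 3 outcomes: the convex hull of a few
# subprobability vectors (entries >= 0, total mass <= 1).
generators = np.array([
    [0.5, 0.3, 0.1],   # total mass 0.9: some probability is deliberately "missing"
    [0.2, 0.2, 0.5],
    [0.4, 0.4, 0.2],
])
assert (generators >= 0).all() and (generators.sum(axis=1) <= 1 + 1e-9).all()

def lower_expectation(utility):
    """Worst-case expected utility over the credal set.

    Expectation is linear, so the infimum over the convex hull of the
    generators is attained at one of the generators themselves.
    """
    return float((generators @ utility).min())

# Evaluate a candidate plan by its worst-case expected utility over the set.
u = np.array([1.0, 0.0, -1.0])   # utility of the three outcomes under that plan
print("lower expectation:", lower_expectation(u))
```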
In your example there are many values of the parameters that encode the zero function (e.g. $\theta_1=\theta_2=\theta_4=\theta_7=\theta_{11}=0$ and all other parameters free) in addition to there being many parameters that encode the function $x^4$ (e.g. $\theta_1=\theta_2=\theta_4=\theta_7=0$, variables $\theta_3,\theta_5,\theta_6,\theta_8,\theta_9,\theta_{10}$ free and $\theta_{11}\theta_{12}\theta_{13}\theta_{14}\theta_{15}=1$). Without thinking about it more I'm not sure which of these actually has the lower local learning coefficient (RLCT) and therefore counts as "more simple" from an SLT perspective.
However, if I understand correctly it's not this specific example that you care about. We can agree that there is some way of coming up with a simple model which (a) can represent both the functions $x \mapsto 0$ and $x \mapsto x^2$ and (b) has parameters $w^*_0$ and $w^*_{\text{square}}$ respectively representing these functions, with local learning coefficients $\lambda(w^*_0) > \lambda(w^*_{\text{square}})$. That is, according to the local learning coefficient as a measure of model complexity, the neighbourhood of the parameter $w^*_0$ is more complex than that of $w^*_{\text{square}}$. I believe your observation is that this contradicts an a priori notion of complexity that you hold about these functions.
Is that a fair characterisation of the argument you want to make?
Assuming it is, my response is as follows. I'm guessing you think $x \mapsto 0$ is simpler than $x \mapsto x^2$ because the former function can be encoded by a shorter code on a UTM than the latter. But this isn't the kind of complexity that SLT talks about: the local learning coefficient $\lambda(w^*)$ that appears in the main theorems represents the complexity of representing a given probability distribution $p(x|w^*)$ using parameters from the model, and is not some intrinsic model-free complexity of the distribution itself.
One way of saying it is that Kolmogorov complexity is the entropy cost of specifying a machine on the description tape of a UTM (a kind of absolute measure) whereas the local learning coefficient is the entropy cost per sample of incrementally refining an almost true parameter in the neural network parameter space (a kind of relative measure). I believe they're related but not the same notion, as the latter refers fundamentally to a search process that is missing in the former.
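One way to make that contrast concrete: for a function $f$, the Kolmogorov complexity is
$$K(f) = \min\{\, |p| : U(p) \text{ computes } f \,\},$$
a property of $f$ and the UTM $U$ alone, whereas the local learning coefficient is (up to logarithmic corrections) the volume scaling exponent
$$\mathrm{Vol}\{\, w \in W : L(w) - L(w^*) < \epsilon \,\} \sim c\,\epsilon^{\lambda(w^*)} \quad \text{as } \epsilon \to 0,$$
which depends on how the surrounding parameter space $W$ parametrizes functions near $w^*$, i.e. on how easy it is for the refinement process to find almost true parameters nearby.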
We can certainly imagine a learning machine set up in such a way that it is prohibitively expensive to refine an almost true parameter nearby a solution that looks like $x \mapsto 0$ and very cheap to refine an almost true parameter near a solution like $x \mapsto x^2$, despite that being against our natural inclination to think of the former as simpler. It's about the nature of the refinement / search process, not directly about the intrinsic complexity of the functions.
So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I'll just post this as-is and make sure I'm not confused about your views up to this point.