Limitations on Formal Verification for AI Safety

Andrew Dickson

In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world.

Practical limitations like these on formal verification have been well-understood for decades to engineers and applied mathematicians building real-world software systems, which makes it puzzling that they have mostly been dismissed by leading researchers advocating for the use of formal verification in AI safety so far. This paper will focus-in on several such limitations and use them to argue that we should be extremely skeptical of claims that formal verification-based approaches will provide strong guarantees against major AI threats in the near-term.

What do we Mean by Formal Verification for AI Safety?

Some examples of the kinds of threats researchers hope formal verification will help with come from the paper “Provably Safe Systems: The Only Path to Controllable AGI” [1] by Max Tegmark and Steve Omohundro (emphasis mine):

Several groups are working to identify the greatest human existential risks from AGI. For example, the Center for AI Safety recently published ‘An Overview of Catastrophic AI Risks’ which discusses a wide range of risks including bioterrorism, automated warfare, rogue power seeking AI, etc. Provably safe systems could counteract each of the risks they describe.

These authors describe a concrete bioterrorism scenario in section 2.4: a terrorist group wants to use AGI to release a deadly virus over a highly populated area. They use an AGI to design the DNA and shell of a pathogenic virus and the steps to manufacture it. They hire a chemistry lab to synthesize the DNA and integrate it into the protein shell. They use AGI controlled drones to disperse the virus and social media AGIs to spread their message after the attack. Today, groups are working on mechanisms to prevent the synthesis of dangerous DNA. But provably safe infrastructure could stop this kind of attack at every stage: biochemical design AI would not synthesize designs unless they were provably safe for humans, data center GPUs would not execute AI programs unless they were certified safe, chip manufacturing plants would not sell GPUs without provable safety checks, DNA synthesis machines would not operate without a proof of safety, drone control systems would not allow drones to fly without proofs of safety, and armies of persuasive bots would not be able to manipulate media without proof of humanness. [1]

The above quote contains a number of very strong claims about the possibility of formally or mathematically provable guarantees around software systems deployed in the physical world – for example, the claim that we could have safety proofs about the real-world good behavior of DNA synthesis machines, or drones. From a practical standpoint, our default stance towards such claims should be skepticism, since we do not have proofs of this sort for any of the technologies we interact with in the real-world today.

For example, DNA synthesis machines exist today and do not come with formal guarantees that they cannot be used to synthesize smallpox [2]. And today’s drone systems do not come with proofs that they cannot crash, or cause harm. Because such proofs would have tremendous financial and practical value, even apart from any new concerns raised by AI, we should expect that if they were remotely practical, then several real-world examples of proofs like this would immediately come to mind.

Before continuing with this line of questioning however, let’s take a more detailed look at how researchers believe formal verification could work for AI systems. The following quote, from “Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems” [3] by Dalrymple et al. sketches the high-level approach:

In this position paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world in a way that appropriately handles both Bayesian and Knightian uncertainty), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). [3]

To evaluate this approach, it is helpful to compare it with formal modeling solutions used in leading industrial applications today. For example, GS approaches should remind us of methods like discrete element analysis which are used by structural engineers to model physical forces and estimate tolerances when designing real-world structures like buildings or bridges. However, while there are similarities, the GS approach proposes to go far beyond such methods, by adding in the idea of a formal safety specification, from which researchers hope that very strong proofs and formal guarantees may be derived.

On one hand, while we should recognize that modeling techniques like discrete element analysis can produce quantitative estimates of real-world behavior – for example, how likely a drone is to crash, or how likely a bridge is to fail – we should not lose sight of the fact that such estimates are invariably just estimates and not guarantees. Additionally, from a practical standpoint, estimates of this sort for real-world systems most-often tend to be based on empirical studies around past results rather than prospective modeling. And to the extent that estimates are ever given prospectively about real-world systems, they are invariably considered to be estimates, not guarantees.

If safety researchers were only arguing only for the use of formal models to obtain very rough estimates of tolerances or failure rates for AI systems, the claims would be much more reasonable. At the same time, obtaining an estimate that a DNA synthesis machine will only produce a dangerous pathogen 30% (or even 1%) of the time is much less interesting than a guarantee that it will not do so at all. But in any case, advocates of GS approaches are not, for the most part, talking about estimates, but instead believe we can obtain strong proofs that can effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world, as we can see from the following quote (emphasis mine again):

Proof-carrying AGI running on PCH appears to be the only hope for a guaranteed solution to the control problem: no matter how superintelligent an AI is, it can’t do what’s provably impossible. So, if a person or organization wants to be sure that their AGI never lies, never escapes and never invents bioweapons, they need to impose those requirements and never run versions that don’t provably obey them.

Proof-carrying AGI and PCH can also eliminate misuse. No malicious user can coax an AGI controlled via an API to do something harmful that it provably cannot do. And malicious users can’t use an open-sourced AGI to do something harmful that violates the PCH specifications of the hardware it must run on. There must be global industry standards that check proofs to constrain what code powerful hardware and operating systems will run. [1]

No examples of proofs or formal guarantees around physically-deployed systems of this kind exist for any software system or indeed any engineered system in the world today. So once again, our default stance as engineers and computer scientists should be skepticism.

Still, there are some cases of real-world systems where formal verification is being used to produce proofs and guarantees that have, at least, some similarities to what researchers propose. Let’s review a few of these to help refine our intuitions about the kinds of guarantees that may be possible for AI systems, as well as what might not. One useful reference point comes from the computer system that helped land the Perseverance Rover in the Jezero Crater on Mars in 2021 [4]:

Because even the most well written constrained-random simulation testbenches cannot traverse every part of a design’s state space, the JPL team used the Questa PropCheck tool to complement their digital simulations of the TRN design. Formal analysis with property checking explores the whole state space in a breadth-first manner, versus the depth-first approach used in simulation. Property checking is, therefore, able to exhaustively discover any design errors that can occur, without needing specific stimulus to detect the bugs. This ensures that the verified design is bug-free in all legal input scenarios. [4]

Simply put, the engineers developing the Perseverance Lander’s software systems at the Jet Propulsion Laboratory (JPL) used a formal verification-based software tool called PropCheck to obtain a guarantee that the software systems running onboard the Perseverance Lander would not malfunction or fail based on bugs or flaws internal to the software itself. Given this anecdote, we might wonder, since the mission was a success and the Lander was able to land on the surface of Mars, could formal verification solutions similar to PropCheck be used to produce strong safety guarantees against major AI threats?

The answer is “no”, and we can see this by looking in more detail at the kinds of guarantees that PropCheck gives, versus the kinds it does not, and then comparing these with the guarantees that GS researchers hope to obtain for AI systems. In particular, in the case of PropCheck, the guarantee for the Perseverance Lander was only that the onboard computer software would run correctly in terms of its internal programmatic design and not fail due to bugs in the code. There was no proof or guarantee at all about the physical process of landing on Mars, or the success of the mission as a whole, or even that the software program wouldn’t still fail in any case due to a short circuit, or a defective transistor inside the Lander’s computer.

And in fact, Mars missions have failed historically about 50% of the time, with “Failure in Computer Programming” (FCP) and “Failure of Software Program” (FSP) just two types of risk out of many, accounting for approximately 20% of the total failure risk according to one analysis [5]. This means that even if formal verification can completely remove the 20% of risk related to software, 80% of the total mission risk would still remain. Formal verification tools like PropCheck make no attempt to model or verify or provide any guarantees about this remaining 80% of risks at all.

In this important respect, while today’s real-world formal verification solutions like PropCheck may seem superficially similar to what GS researchers propose in [1] and [3], we can see that there are major differences in both the types of guarantees (i.e. guarantees about symbolic computer programs versus outcomes of physically-deployed systems) as well as the strength of guarantees (i.e. strong proofs of success versus rough estimates of success) hoped-for with GS solutions. We explore these differences in much more detail in the remainder of the paper.

Challenges and Limitations

In this section we’ll dive deeper into a number of specific challenges to formal verification-based approaches and also argue that for each challenge there exists a corresponding hard limitation that AI safety researchers should expect to face in practice. We also argue that these challenges and limitations will ultimately be show-stoppers for attempts to use formal verification to produce strong proofs or guarantees against major AI threats in the near-term, including essentially all of the AI threats discussed by researchers in [1] and [3]. In short, we believe that researchers advocating for GS solutions are either overlooking or severely underestimating these challenges with respect to the AI threats that matter the most.

Challenge 1 – Mathematical proofs work on symbol systems not on the world.

As discussed in the previous section, perhaps the most critical issue with research proposals like [1] and [3] is that based on current capabilities of physics and other hard sciences, proofs and formal guarantees can only be obtained for mathematical rule sets and symbol systems and not for engineered systems deployed in the physical world. To see the distinction, consider this quote describing the strength of the formal results GS researchers are hoping for:

We argue that mathematical proof is humanity’s most powerful tool for controlling AGIs. Regardless of how intelligent a system becomes, it cannot prove a mathematical falsehood or do what is provably impossible… The behavior of physical, digital, and social systems can be precisely modeled as formal systems and precise “guardrails” can be defined that constrain what actions can occur. [1]

Despite the fact that “physical, digital and social systems…precisely modeled as formal systems” are not at all possible today, researchers hope that by creating such models we will be able to, effectively, bring all of the relevant parts of the world entirely into the symbol system. Which, when combined with formal representations of code for the AI implementation, would yield a new type of mathematical object, based-on which reliable formal proofs and guarantees about physical outcomes in the world could be produced.

How realistic is this in the near-term? One way to evaluate this question is to examine the current state of formal verification as it is used in safety for autonomous vehicles (AVs). Given the facts that AVs heavily use AI and machine learning, real human lives are at risk and millions of dollars have been invested into developing solutions, AVs represent an excellent reference case for evaluating current capabilities.

If we look at the state-of-the-art in AV systems, we find that – much like with the software on the Perseverance Lander – while formal verification is indeed a useful tool for verifying important properties of onboard models, it is not an approach that provides anything like strong formal guarantees about real-world outcomes. As Wongpiromsarn et. al. put it in their excellent survey, “Formal Methods for Autonomous Systems (2023)” [6]:

Autonomous systems operate in uncertain, dynamic environments and involve many subcomponents such as perception, localization, planning, and control. The interaction between all of these components involves uncertainty. The sensors cannot entirely capture the environment around the autonomous system and are inherently noisy. Perception and localization techniques often rely on machine learning, and the outputs of these techniques involve uncertainty. Overall, the autonomous system needs to plan its decisions based on the uncertain output from perception and localization, which leads to uncertain outcomes. [6]

In short, the world models used by AVs today, as good as they are, are little more than rough approximations and there are many layers of uncertainty between them and the physical world. As Seshia et al. discuss in their excellent paper, “Toward Verified Artificial Intelligence”, layers of uncertainty like this introduce many barriers to obtaining strong guarantees about the world itself:

First, consider modeling the environment of a semiautonomous vehicle, where there can be considerable uncertainty even about how many and which agents are in the environment (human and nonhuman), let alone about their attributes and behaviors. Second, the perceptual tasks which use AI/ML can be hard, if not impossible, to formally specify. Third, components such as DNNs can be complex, high-dimensional objects that operate on complex, high-dimensional input spaces. Thus, even generating the [model and specification for] the formal verification process in a form that makes verification tractable is challenging. [7]

Based on the barriers that exist even for the relatively simple case of AVs (compared with other more complex AI threats) , it should be clear that whatever proofs and guarantees we can produce will not be guarantees about the world itself, but will at best pertain only to severely simplified, approximate and uncertain representations of it. Just as today’s autonomous vehicles do not ship with anything like formal guarantees that they will not crash, in general we should expect the following limitation for formal verification-based solutions in AI safety:

Limitation 1 – We will not obtain strong proofs or formal guarantees about the behavior of AI systems in the physical world. At best we may obtain guarantees about rough approximations of such behavior, over short periods of time.

For many readers with real-world experience working in applied math, the above limitation may seem so obvious they may wonder whether it is worth stating at all. The reasons why it is are twofold. First, researchers advocating for GS methods appear to be specifically arguing for the likelihood of near-term solutions that could somehow overcome Limitation 1. And second, as the same researchers point out, the degree of adversarial activity that may be encountered in the AI safety context may in fact require that this limitation is overcome in order for formal verification-based approaches to succeed. Quoting again from Dalrymple et al.,

Moreover, it is also important to note that AI systems often will be deployed in adversarial settings, where human actors (or other AIs) actively try to break their safety measures. In such settings empirical evaluations are likely to be inadequate; there is always a risk that an adversary could be more competent at finding dangerous inputs, unless you have a strong guarantee to the contrary…. This makes it challenging for an empirical approach to rule out instances of deceptive alignment, where a system is acting to subvert the evaluation procedure by detecting features of the input distribution that are unique to the test environment (Hubinger et al., 2021). [3]

On one hand, researchers advocating for GS approaches are effectively saying here that they believe that we will need strong guarantees about the physical world – not just empirical estimates of failure rates, or rough approximations of success, in order for these techniques to succeed. On the other hand, a quick look at state-of-the-art formal verification solutions in real-world software systems like Mars landers and AVs should very much cause us to doubt the likelihood of formal verification producing guarantees of this sort any time soon.

Challenge 2 – Most of the AI threats of greatest concern have too much complexity to physically model.

Setting aside for a moment the question of whether we can develop precise rules-based models of physics, GS-based approaches to safety would still need to determine how to formally model the specific AI threats of interest as well. For example, consider the problem of determining whether a given RNA or DNA sequence could cause harm to individuals or to the human species. This is a well-known area of concern in synthetic biology, where experts expect that risks, especially around the synthesis of novel viruses, will dramatically increase as more end-users gain access to powerful AI systems. This threat is specifically discussed in [3] as an area in which the authors believe that formal verification-based approaches can help:

Example 4.7 from "Towards Guaranteed Safe AI" [3]

Here, the authors propose that we can use GS methods to obtain proofs that, for a given RNA/DNA synthesis machine and any specific RNA or DNA sequence, the “risk of harmful applications remains below a conservatively specified bound”. For this to work, clearly we must be able to fully specify what counts as “harm” and also have high-confidence that both the model and specification are precise enough that we can trust any proofs we derive.

A quick survey of the current state of research in computational biology shows that creating a predictive model of this sort is far beyond anything that has been attempted by researchers today. One reason that this problem is so difficult is that there are many ways a given sequence could cause harm. For example, it might cause harm by encoding the genome for a novel virus like Covid-19, or alternatively, it might do so by being inserted directly into the DNA of a human cell using CRISPR, to give just two examples.

However even if we focus on a single subclass of the problem, like predicting whether a virus with a given DNA genome is harmful to humans, we find that this is still extremely difficult. The reason is that the only way to precisely predict how a new virus will affect a human body is to run a complete physical simulation of the virus and the human body that accurately simulates the results over a number of years. How far are we from being able to precisely model biological systems like this? Very far.

To get a sense of just how far, let’s review the relative capability of the current science. In 2022, researchers from the University of Illinois at Urbana-Champaign developed software to simulate a single 2-billion-atom cell [8]. This was perhaps the most advanced cellular-level simulation ever done at the time and researchers chose to simulate a “minimal cell”, rather than a human cell, to make the problem more tractable. Additionally, to help limit the required compute, the simulation only ran for 20 minutes of simulated time.

Considering the fact that roughly modeling a single human cell for 20 minutes was considered to be the limit of leading models in 2022, what is the realistic likelihood that we will be able to create on-demand physical simulations of entire human bodies (with their estimated 36 trillion cells [9]), along with the interactions between the cells themselves and the external world and then run those simulations for years? We should assume that simulations of this sort are not likely in the near-term.

While fully exploring the modeling-related difficulties for all of the most urgent AI threats is beyond the scope of this paper, a quick glance at another example should convince us that this challenge likely applies to most, or all, of the other threats of greatest interest to safety researchers as well. The reason for this is that all of the most serious AI threats we face eventually involve the AI system interacting with complex real-world environments, including human beings. Based on this, it is hard to see how we can precisely model the results of such interactions in a way that could give strong, provable guarantees, without also precisely modeling those complex environments, including human bodies and minds.

For instance, consider the problem of creating a world model that is precise enough to determine whether an AI chatbot response can be used to create misinformation or disinformation. This threat – which is already a credible threat to major democracies today [10] – appears, at first glance, to be even harder to physically model than the threat of harmful DNA. The reason is that the creators of such a model would have no way of knowing, a-priori, how a specific chat response may be used by the recipient. For instance, it could be posted on social media, included in a political speech, or emailed to a random person anywhere in the world. Putting ourselves in the shoes of the model developer, it is not even clear which parts of the world, or which mind(s) we would need to model, to check if they are being misinformed.

Another problem is that what counts as misinformation may not be possible to easily define. For example, true statements or authentic media artifacts like “real” photos or videos can often still be extremely misleading when presented out of context [11]. Should media like this, when presented out of context, be considered misinformation, or fact?

Given difficulties like the above that are associated with precisely modeling real world AI threats, we should therefore expect:

Limitation 2 – We will not obtain strong proofs or formal guarantees for AI threats that are difficult to formally model. This includes most or all of the AI threats of greatest concern.

The next challenge – which is an extension of the previous one – relates to the difficulty of obtaining detailed-enough data about the state of the world to successfully initialize accurate physical models, even if we are able to build them.

Challenge 3 – Measuring complete data about initial conditions for precise physical modeling is unrealistic for most AI threats.

In order to translate rule sets into proofs and formal guarantees about the world, a GS system would need to obtain, in addition to the formal model itself, sufficiently detailed and complete initial conditions data about the relevant parts of the world. This concern is noted by Dalrymple et. al:

It is also worth acknowledging that even a perfect model of physical dynamics is insufficient for safety, since safety-critical queries (e.g. whether a given molecule is toxic to humans) presumably will depend on facts about the initial conditions (e.g. of human cells) that are not deducible from physics alone. This must be addressed by inference about initial conditions and boundary conditions from data and observations, tempered by appropriately conservative epistemic frameworks incorporating Bayesian and Knightian uncertainty. [3]

While the above quote appears in a footnote to [3], it is unfortunately not discussed any further in the paper and no solutions are proposed in terms of how it might be overcome.

Getting high-quality, complete, initial conditions data is generally not a footnote in the process of building formal models for real-world systems. Rather, it can often be the most difficult and expensive part of the effort. To return to our example from the previous section – modeling the effects of a novel virus on a human body – we might ask, how difficult would it be to obtain and store granular-enough information about a human body to serve as input for a complete physical model of a human being?

Here again, we can sharpen our intuitions with a quick look at the leading research. In 2024, scientists at Google finished dissecting, scanning and mapping a tiny fragment of a human brain,

The 3D map covers a volume of about one cubic millimetre, one-millionth of a whole brain, and contains roughly 57,000 cells and 150 million synapses — the connections between neurons. It incorporates a colossal 1.4 petabytes of data. ‘It’s a little bit humbling,’ says Viren Jain, a neuroscientist at Google in Mountain View, California, and a co-author of the paper. ‘How are we ever going to really come to terms with all this complexity?’ [12]

While this effort certainly represents significant progress towards the goal of mapping a complete brain, the fact that a multi-year research effort at Google led to mapping just one-millionth of a brain and resulted in 1.4 petabytes of data should lead us to doubt the feasibility of mapping an entire human brain, much less an entire body, in the near-term. There is also the additional issue of data fidelity, since mapping technology is still immature and requires manual “proofreading” by humans to correct for errors,

Hundreds of cells have been proofread, but that’s obviously a few percent of the 50,000 cells in there,’ says Jain. He hopes that others will help to proofread parts of the map they are interested in. The team plans to produce similar maps of brain samples from other people — but a map of the entire brain is unlikely in the next few decades, he says. [12]

If one of the world’s leading experts in brain mapping believes that a complete brain map is unlikely in the next few decades, then we should not count on having the input data required for detailed physical simulations of humans any time soon. Based on the current state of brain-mapping science as well as the fact that building precise models for the most significant AI threats would also require building precise models of human brains and/or bodies, we should instead expect that:

Limitation 3 – The high-quality initial conditions data required for producing strong real-world guarantees using formal verification will not be available for most AI threats in the near-term.

If the process of building models and obtaining the data for modeling AI threats is currently too hard, one question we might ask is: could some disruptive innovation, such as gaining access to new AI systems, like an artificial general intelligence (AGI) help us overcome these obstacles. This brings us to our next challenge.

Challenge 4: AI advances, including AGI, are not likely to be disruptively helpful for improving formal verification-based models until it’s too late.

Some researchers, like Tegmark and Omohundro, believe that the use of AI itself may be the solution to overcoming the kinds of challenges discussed above:

In summary, we still lack fully automated code-verifying AI powerful enough for our provably safe AI vision, but given the rate AI is advancing, we are hopeful that it will soon be available. Indeed, just as several other AI fields dominated by GOFAI (‘Good Old-Fashioned AI’) techniques were ripe for transformation by machine learning around 2014, we suspect that automated theorem proving is in that pre-revolution stage right now. [1]

While there is good evidence that improvements in AI are likely to be helpful with the theorem proving, or verification, part of formal verification, unfortunately, as we have discussed, many of the greatest challenges arise earlier in the process, around modeling and specification. Here there is much less real-world evidence that AI can be disruptively helpful in the near-term. For example, today’s AI systems have shown nothing so far that suggests that they will soon create complete, computationally-tractable rules-based models of physics, or to help gather detailed-enough data about initial conditions to fully model human biology. Tegmark and Omohundro acknowledge this:

Since we humans are the only species that can do [formal modeling] fairly well, it may unfortunately be the case that the level of intelligence needed to be able to convert all of one’s own black-box knowledge into code has to be at least at AGI-level. This raises the concern that we can only count on this “introspective” AGI-safety strategy working after we’ve built AGI, when according to some researchers, it will already be too late. [1]

Indeed, any safety strategy that strictly depends on access to an AGI as a precondition should not inspire a high-level of confidence. After all, we expect that many significant AI threats, like misinformation from generative AI and escalated risk of new biological weapons, are either present already or will appear very soon. Moreover, it’s not clear at all that AGI on its own will be sufficient to overcome the challenges discussed earlier in this paper. The reason is that the rough consensus definition of AGI is something like, “expert human-level capability at any task”. In other words, we can think of an AGI as something like a computer version of an expert-level human, but not super-human, at least initially.

With this in mind, it’s worth observing that the modeling challenges described earlier in this paper have all had on the order of millions of person-hours invested by expert-level humans over the past several decades (e.g. from basic physics research, to biological modeling, brain scanning, proof systems and so on) and still remains far from what would be required for formal verification of the sort described in [1] and [3]. Based on this, it’s not at all obvious that adding several orders of magnitude of additional expert human-level AGIs will lead to disruptive improvements in modeling and specification in the near-term. Instead, we believe that the following is a more sensible baseline to assume:

Limitation 4 – Major AI support for formal verification-based solutions to significant AI threats will come with artificial superintelligence (ASI) if it comes at all.

If access to an ASI is a precondition for formal verification based approaches to work, then this will indeed be “too late”. We now move on to our final challenge, which relates to practical issues with verification of guarantees and proofs produced by GS-based techniques about AI systems deployed in the real world.

Challenge 5 – Proofs and guarantees about AI systems will not be portable and straightforward to verify in the same way that ordinary mathematical proofs are.

Normally when we think about proofs, particularly in math, one of their major benefits is that they typically require minimal trust assumptions and can be easily verified by anyone. Dalrymple et al. argue that proofs produced using GS approaches would work like this as well:

A further socio-technical benefit of GS AI is its potential for facilitating multi-stakeholder coordination. This is because GS AI is able to produce proof certificates verifying that a given solution conforms to auditable-to-all-parties specifications, in a way that requires minimal trust among said parties. [3]

This sounds amazing in theory, but breaks down as soon as we try to imagine how it would work in practice. For example, let’s consider once again the example from [1] of a DNA synthesis machine that comes with a proof that it physically cannot produce harmful DNA. While it’s not clear exactly what such a proof would look like, it would presumably somehow entail the results of years of physical simulations of viruses interacting with human bodies, as well as some guarantees about the physical design of the specific machine in question as well.

Now something we must ask is: even if we could computationally verify the deductive correctness of this proof, (i.e. that it is correct in guaranteeing the good behavior of some machine) how do we know that the machine described by the proof is this machine – i.e. the one sitting in front of us? The fact is, there is no straightforward way for us to know that the machine in front of us is the one described in the proof; that the details of the transistors and wires on the physical circuit board precisely match those in the proof, that there are no secret backdoors, no manufacturing defects, etc. Even doing a physical inspection to verify these details to 100% accuracy may be impossible from a practical standpoint. Depending on how the machine is constructed, such an inspection would likely require the verifier to have access to detailed specifications about the hardware and software design of the machine, as well as the time and capability to disassemble the machine down to its smallest parts. Even if all of this were possible, we would then have entirely new concerns about the security vulnerabilities that access of this kind would open up.

At a minimum, what this example highlights is that any proofs we might obtain about physical systems will be far from “auditable-to-all-parties” and that, on the contrary, they will require much more than “minimal trust” to verify. Difficulties with verification become even worse when we consider AI systems like chatbots and other API-based products. Suppose you are given a proof that a chatbot or AI product you are interacting with over the Internet will not harm you. From a practical standpoint, there is simply no way to verify the correctness of the proof, because there is no way to know for certain that the proof you were given matches the system you are interacting with on the other end of the wire.

Based on this, it seems that verification of GS-style proofs in the context of AI systems would need to be coupled with comprehensive and continuous physical inspections, including access to the full hardware, toolchain, source code and design specs for the systems. But this is not something that frontier AI labs are likely to agree to, and which, even if they did, would open up many new concerns about security and safety. By considering cases like these we can see that, far from expecting “auditable-to-all-parties” proofs and guarantees, we should instead expect the following limitation related to GS-style guarantees about real-world AI systems:

Limitation 5 – Any “proofs” or “guarantees” about physically-deployed systems will need to be coupled with intensive, continuous physical inspections in order to verify them.

This brings us to the end of our list of challenges and limitations on the use of formal verification in AI safety. Based on this list, it should at a minimum be clear that the kinds of real-world safety results hoped-for by the authors of [1] and [3] are far too ambitious and ignore many challenges and limitations that make them intractable from a practical perspective in the near-term.

What Can Be Hoped-For?

Formal verification is a well-established field that has worked for several decades to improve the safety and reliability of mission-critical, real-world systems, like Mars landers and autonomous cars. Given this, we would be wise to use existing solutions (as well as their limitations) to guide our intuitions about the kinds of results we might reasonably expect in the domain of AI safety. In particular, a practical analysis of such challenges should convince us that, in the near-term, formal verification based approaches to AI safety will face a number of specific limitations that make them highly unlikely to produce strong safety guarantees for the AI threats we worry about the most.

In stark contrast to what is argued by papers like [1] and [3], we should not expect provably-safe DNA synthesis machines, or strong guarantees that drones will never harm people, or chatbots that come with proofs that they will never misinform. And given that we cannot expect guarantees around such straightforward threats, we should be even more skeptical that formal verification will result in provable guarantees against more challenging and complex threats like Loss-of-Control [13].

We have also argued against the idea that such limitations to formal verification-based approaches will be overcome simply through access to advanced AI systems, even including AGI. Given that many millions of dollars and person-hours have been spent on developing existing models and formal verification solutions over decades and given that an AGI is, by definition, equivalent to an expert-level human we should not expect existing limitations to be overcome merely by introducing several more orders of magnitude in AI-based human-expert-equivalents to the mix. And in any case, given that serious AI threats, like widespread misinformation and escalated risks of bioterrorism, are either with us already or will be very soon [14], we should be wary of putting much faith into approaches that require access to advanced AI as a hard prerequisite for success.

So what can be hoped-for, with respect to formal verification in AI safety? While a detailed discussion is beyond the scope of this paper, the short version is that we should expect results that look much more like today’s results, compared with what researchers propose. For example, we should expect that safety for AVs and drones will continue to improve and that fewer humans will be harmed by these systems. And we should expect that formal verification will have a role to play in these improvements, by generating and verifying real-time onboard proofs about these systems’ approximated world-models. Such improvements will be important, because they will save thousands of human lives from being lost in car accidents each year in the near-term. But as with today’s systems, we should not expect proofs or formal guarantees that AVs will not crash, or that drones can never cause harm.

And, for our most powerful AI systems, we should expect that it will be worthwhile – as it was with the Perseverance Lander – to continue to develop code-verification systems like PropCheck, to give us greater confidence that they will not fail unexpectedly due to inherent flaws and bugs and in their code. Such guarantees will be important, since AI systems may soon run much of the day-to-day infrastructure of the world, like power plants and water treatment facilities that humans depend on to survive. At the same time, we must be clear that avoiding internal bugs and crashes in a computer program is very different from guaranteeing the good behavior of physically-deployed AI systems in a messy and complex world.

Based on the practical, real-world limitations of formal verification in AI safety, if our goal is to find workable solutions to the most significant AI threats we face – and it should be – we must continue to search elsewhere, beyond formal verification, for such solutions.

References

[1] Provably Safe Systems: The Only Path to Controllable AGI

[2] Construction of an infectious horsepox virus vaccine from chemically synthesized DNA fragments

[3] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

[4] Formal Verification Ensures The Perseverance Rover Lands Safely On Mars

[5] A Study on Mars Probe Failures

[6] Formal Methods for Autonomous Systems

[7] Toward Verified Artificial Intelligence

[8] Fundamental behaviors emerge from simulations of a living minimal cell

[9] We now know how many cells there are in the human body

[10] How AI Threatens Democracy

[11] Out-of-context photos are a powerful low-tech form of misinformation

[12] A Cubic Millimeter of a Human Brain Has Been Mapped in Spectacular Detail

[13] Artificial Intelligence and the Problem of Control

[14] Anthropic Responsible Scaling Policy

[-]David Manheim7mo23

I have a lot more to say about this, and think it's worth responding to in much greater detail, but I think that overall, the post criticizes Omhundro and Tegmark's more extreme claims somewhat reasonably, though very uncharitably, and then assumes that other proposals which seem to be related, especially Dalyrymple et al. approach, are essentially the same, and doesn't engage with the specific proposal at all.

To be very specific about how I think the post in unreasonable, there are a number of places where a seeming steel-man version of the proposals are presented, and then this steel-manned version, rather than the initial proposal for formal verification, is attacked. But this amounts to a straw-man criticism of the actual proposals being discussed!

For example, this post suggests that arbitrary DNA could be proved safe by essentially impossible modeling ("on-demand physical simulations of entire human bodies (with their estimated 36 trillion cells [9]), along with the interactions between the cells themselves and the external world and then run those simulations for years"). This is true, that would work - but the proposal ostensibly being criticized was to check narrower questions about whether DNA synthesis is being used to produce something harmful. And Dalyrmple et al explained explicitly what they might have included elsewhere in the paper ("Examples include machines provably impossible to login to without correct credentials, DNA synthesizers that provably cannot synthesize certain pathogens, and AI hardware that is provably geofenced, time-limited (“mortal”) or equipped with a remote-operated throttle or kill-switch. Provably compliant sensors can be specified to ensure “zeroization”, in which tampering with PCH is guaranteed to cause detection and erasure of private keys.")

AI ALIGNMENT FORUM
AF

49