Your Tests Pass. This Bug Still Slips Through — AI Has the Fix

A permission check passed every test, then handed one user near-total access. Formal proofs would have made that bug impossible. AI just made them cheap.

Daniel Carter

Founder & Lead Technician

June 30, 2026 at 5:18 AM IST | 5 min read

Quick answer

Formal verification mathematically proves software meets its specification in every reachable state, unlike testing, which only samples inputs. Long too costly for mainstream use, it is now within reach because AI can draft the specifications and proofs that once demanded rare, expensive expertise.

A permission check passed every test in its suite. Then it handed one user near-total access.

The system was a secrets-management platform. The rule sounded airtight: when you create a custom role, its permissions must be a subset of your own. You cannot grant access you do not hold. The boundary check covered every combination of operators, and every test was green.

Then someone scoped to a single environment, call it QA, created a role scoped to "not development." That phrase matches every environment except one: user acceptance, sandbox, canary, production, all of it. A single-environment user had just escalated to near-universal reach. The check approved it. The tests never tried it.

Here is why that should worry you: it was not a coding mistake. It was a gap in what testing can ever see.

The gap between "did not fail" and "cannot fail"

Tests sample. You throw a million inputs at your code, catch some ugly bugs, and ship with confidence. The system runs fine until a user stumbles onto the 1,000,001st sequence, the one your random generator never produced. In the permission example, every test happened to use overlapping values, so the excluded environment always sat inside the parent set. The bug only fires when the sets do not overlap. No test thought to try that.
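
To make that concrete, here is a minimal Python sketch of how such a check could go wrong. It is not the platform's actual code: the environment names, the `("in" | "not_in", values)` scope format, and the specific bug (reusing a disjointness rule for the wrong operator pair) are all assumptions chosen to reproduce the behavior described above.

```python
# Hypothetical environment names, for illustration only.
ENVIRONMENTS = frozenset(
    {"development", "qa", "uat", "sandbox", "canary", "production"}
)


def matches(scope):
    """Return the set of environments a scope really matches."""
    op, values = scope
    values = frozenset(values)
    return values if op == "in" else ENVIRONMENTS - values


def is_really_subset(child, parent):
    """Ground truth: compare what the two scopes actually match."""
    return matches(child) <= matches(parent)


def flawed_is_subset(child, parent):
    """Illustrative buggy check that compares operands, not matched sets."""
    (c_op, c_vals), (p_op, p_vals) = child, parent
    c_vals, p_vals = frozenset(c_vals), frozenset(p_vals)
    if c_op == "in" and p_op == "in":
        return c_vals <= p_vals
    if c_op == "not_in" and p_op == "not_in":
        return p_vals <= c_vals
    # Bug: disjointness is the right rule for child "in" / parent "not_in",
    # but it is wrong for child "not_in" / parent "in".
    return c_vals.isdisjoint(p_vals)


# Hand-picked tests. Every mixed-operator case uses overlapping values,
# so the flawed check happens to agree with the ground truth.
SAMPLED_TESTS = [
    (("in", {"qa"}), ("in", {"qa", "uat"})),
    (("not_in", {"qa"}), ("in", {"qa", "development"})),
    (("in", {"qa"}), ("not_in", {"qa"})),
    (("not_in", {"development"}), ("not_in", {"development"})),
]
for child, parent in SAMPLED_TESTS:
    assert flawed_is_subset(child, parent) == is_really_subset(child, parent)

# The case no test tried: a QA-only user creates a "not development" role.
parent = ("in", {"qa"})
child = ("not_in", {"development"})
approved = flawed_is_subset(child, parent)    # the flawed check says yes
escalated = matches(child) - matches(parent)  # environments the role gains
print(approved, sorted(escalated))
```

Every sampled test is green, yet the last call approves a role that gains four environments the parent never held. Nothing in the suite was wrong; it just never left the overlapping region.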

Formal verification attacks the problem from the other end. Instead of checking inputs one by one, it proves a property holds across every reachable state at once. The claim is not "we did not find a bug." It is "there is no bug of this class to find."

That permission system has one true invariant: a derived permission must always match a subset of the environments the granting permission matches. Prove that, and the escalation bug becomes impossible by construction. Not patched, not caught in review. Impossible.
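
As a rough illustration of treating that invariant as the specification, the sketch below writes the spec as plain set inclusion and then compares a case-by-case implementation against it for every scope pair in a tiny universe. The three-environment universe and the scope format are assumptions, and this is bounded exhaustive checking in Python, not a proof: a verifier would establish the same agreement for any universe, of any size.

```python
from itertools import combinations

# Hypothetical universe, kept tiny so every case can be enumerated.
UNIVERSE = frozenset({"dev", "qa", "prod"})
OPS = ("in", "not_in")


def all_subsets(items):
    """Every subset of items, as frozensets."""
    items = sorted(items)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def matches(scope):
    """Specification side: the environments a scope matches."""
    op, values = scope
    return values if op == "in" else UNIVERSE - values


def rule_based_is_subset(child, parent):
    """Implementation side: one rule per operator pair."""
    (c_op, c), (p_op, p) = child, parent
    if c_op == "in" and p_op == "in":
        return c <= p
    if c_op == "in" and p_op == "not_in":
        return c.isdisjoint(p)
    if c_op == "not_in" and p_op == "not_in":
        return p <= c
    # child "not_in", parent "in": everything outside c must sit inside p
    return (UNIVERSE - c) <= p


def counterexamples(check):
    """Scope pairs where the check disagrees with set inclusion."""
    scopes = [(op, vals) for op in OPS for vals in all_subsets(UNIVERSE)]
    return [
        (child, parent)
        for child in scopes
        for parent in scopes
        if check(child, parent) != (matches(child) <= matches(parent))
    ]


print(len(counterexamples(rule_based_is_subset)))
```

With 16 possible scopes there are 256 pairs, and the rule-based check agrees with the specification on all of them. Swap in a check that gets one operator pair wrong and the list stops being empty.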

How a proof actually works

You start with properties you want your code to keep. For a shopping cart: the balance never goes negative, every item appears in the total, only one coupon applies per order. A verifier cannot read English, so you write those rules in a verification-aware language, one where specifications and proofs are first-class citizens beside the code.

Several exist, each with a different strength:

Language	Built for	How it checks
Dafny	Everyday imperative code	Automates much of the proof via SMT solvers
Lean	Math and verified software	Interactive theorem proving; fast-growing adoption
Rocq (formerly Coq) and Isabelle	High-assurance research systems	Interactive proof assistants
F*	Verified systems programming	Extracts to C and OCaml
TLA+	Distributed protocols	Specification and model checking

The core mechanics carry across these tools, though the degree of automation differs. In a tool like Dafny you attach preconditions, what must be true before a function runs, and postconditions, what it guarantees on exit. The verifier never executes the code. It reasons about the structure and hands the resulting conditions to an automated solver, which tries to show that the postcondition holds for every input that satisfies the preconditions. If it cannot, whether because some reachable state breaks the property or because the solver needs more hints, verification fails and the tool rejects the code.
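
For a feel of what preconditions and postconditions look like, here is a small Python sketch built on the shopping-cart rule that a balance never goes negative. The function name and the cents-based amounts are assumptions. The important caveat: Python's `assert` only checks the inputs that actually run, whereas a verifier such as Dafny, where the same clauses are spelled `requires` and `ensures`, would try to discharge them for every input without running anything.

```python
def apply_coupon(total_cents: int, discount_cents: int) -> int:
    """Apply a discount without letting the balance go negative."""
    # Precondition: what must be true before the function runs.
    # (Dafny would state this with `requires`.)
    assert total_cents >= 0, "total must be non-negative"
    assert discount_cents >= 0, "discount must be non-negative"

    result = max(total_cents - discount_cents, 0)

    # Postcondition: what the function guarantees on exit.
    # (Dafny would state this with `ensures`.)
    assert 0 <= result <= total_cents
    return result


print(apply_coupon(1000, 300))
print(apply_coupon(500, 900))
```

The runtime version catches a violation only when an offending call happens. The statically verified version refuses to accept the function until no offending call can exist.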

The guarantee is only as strong as the specification. But when the spec is right and the verifier is sound, the guarantee covers every input the spec describes, not just the ones you thought to try.

So why have you never used it?

Because the proofs cost more than the code. You would state an obvious property, then spend days coaxing a proof assistant into accepting it. The tools were slow, the errors cryptic, and the work demanded PhD-level skill. So formal verification stayed locked inside the places that could justify it: avionics, chip design, nuclear systems, cryptographic protocols. Everywhere else, the math said "tests are good enough."

The verifier was never the real bottleneck, since many tools check proofs automatically. The bottleneck was writing the proof: translating a human requirement into precise logic, then wrestling the solver for hours when it could not confirm what you already knew was true.

Why AI changes the equation

That bottleneck is exactly the kind of work large language models have gotten good at. According to the source, since Opus 4.5 most frontier models can draft formal specifications from plain-language requirements, propose proof strategies, and iterate fast on failing lemmas in a tight loop with the verifier.

The structure matters. AI proposes a candidate; a deterministic, mechanical verifier checks it. If the proof is wrong, the verifier rejects it and the model tries again. You are not trusting the AI to be correct. You are trusting an external authority to reject it when it is not. The only thing you place real faith in is the specification itself.
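
The shape of that loop can be sketched in a few lines of Python. Everything here is a stand-in: the proposals are hand-written lambdas rather than output from a real model, `meets_spec` is a bounded checker rather than a real verifier, and the coupon spec is an assumption made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

Candidate = Callable[[int, int], int]


def meets_spec(candidate: Candidate, bound: int = 40) -> bool:
    """Deterministic checker standing in for a real verifier.

    Assumed spec: for non-negative total and discount, the result is
    total - discount when that is non-negative, otherwise 0.
    A real verifier would prove this for all integers, not a bounded range.
    """
    for total in range(bound):
        for discount in range(bound):
            expected = total - discount if discount <= total else 0
            try:
                if candidate(total, discount) != expected:
                    return False
            except Exception:
                return False
    return True


def first_verified(
    proposals: Iterable[Candidate],
) -> Tuple[Optional[Candidate], int]:
    """Return the first proposal the checker accepts, plus attempts used."""
    attempts = 0
    for candidate in proposals:
        attempts += 1
        if meets_spec(candidate):
            return candidate, attempts
    return None, attempts


# Hand-written stand-ins for model output; no real model is called here.
proposals = [
    lambda total, discount: total - discount,          # can go negative
    lambda total, discount: 0,                         # safe but useless
    lambda total, discount: max(total - discount, 0),  # meets the spec
]
accepted, attempts = first_verified(proposals)
print(attempts)
```

Note what rejects the second proposal. A spec that only said "never negative" would have accepted a function that always returns zero. The checker is only as demanding as the property you hand it.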

That splits the labor. The machine grinds out a verifiably correct implementation; the human keeps the judgment calls: which properties are worth guaranteeing, and how the system is shaped.

None of this is theoretical. The seL4 microkernel ships with a machine-checked proof of its functional correctness, and large teams have used TLA+ for years to catch distributed-system design flaws before a line of code is written. What is new is the price: the expertise that used to gate all of this is the part AI is now absorbing.

Where to point this first

You do not need to verify your whole codebase, and trying would burn months for little gain. The return concentrates where a wrong answer is expensive and the rules are tangled:

Permissions and access control — exactly the bug above. Subset and isolation rules are easy to state and brutal to test exhaustively.
Money movement — balances, refunds, coupon stacking, ledger rules that must never drift.
State machines — order lifecycles, payment flows, anything where an illegal transition causes real damage.
Concurrency — race conditions that surface in production and never reproduce in a test.

A sane way to start: pick the one invariant your system cannot violate. Write it as a property in a tool like Dafny. Let an AI assistant draft the specification and the proof, and let the verifier be the judge. Your job is to make sure the property you wrote is the property you actually meant.
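
Before reaching for a verifier, it can help to see what "write it as a property" means in ordinary code. The Python sketch below takes a made-up order lifecycle, states one invariant (an order is never shipped unless it is paid), and checks it against every reachable state by breadth-first search, which is roughly what a model checker does at small scale. The states, transitions, and the deliberately illegal transition are all hypothetical.

```python
from collections import deque

# Hypothetical order lifecycle. A state is (status, paid).
INITIAL = ("created", False)


def next_states(state, allow_ship_after_refund=False):
    """Legal transitions, plus one optional illegal one for demonstration."""
    status, paid = state
    if status == "created":
        return [("paid", True), ("cancelled", False)]
    if status == "paid":
        return [("shipped", True), ("refunded", False)]
    if status == "refunded" and allow_ship_after_refund:
        return [("shipped", False)]  # the transition that should not exist
    return []


def invariant(state):
    """The one property that must hold: shipped implies paid."""
    status, paid = state
    return status != "shipped" or paid


def violations(step):
    """Explore every reachable state and collect invariant violations."""
    seen, queue, bad = {INITIAL}, deque([INITIAL]), []
    while queue:
        state = queue.popleft()
        if not invariant(state):
            bad.append(state)
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return bad


print(violations(next_states))
print(violations(lambda s: next_states(s, allow_ship_after_refund=True)))
```

The first call reports no violations. The second finds the unpaid shipment as soon as the illegal transition exists. The invariant fits on one line, which is the point: it is short enough to read and decide whether it says what you meant.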

That last part is the new risk. Verification moves the danger from "is the code correct?" to "is the spec correct?" A flawless proof of the wrong property guarantees nothing useful. The skill that matters now is not coaxing a proof assistant; it is knowing what to guarantee.

Here is the uncomfortable takeaway for anyone who treats a green test suite as proof of safety. Passing tests means your code survived the inputs you imagined. It says nothing about the one you did not. For the rules your business cannot afford to get wrong, that gap is no longer something you have to live with.

Source: Hacker News

Frequently asked questions

What is formal verification in software?

It is a method of mathematically proving that code satisfies a specification for every possible input and state, rather than checking a sample of cases the way testing does. You state the properties your code must uphold, express them in a verification-aware language, and a tool proves the code can never violate them. If a property can break in any reachable state, verification fails and the tool rejects the code.

How is formal verification different from testing?

Testing samples inputs, and even a million random cases can miss the one combination that triggers a bug. Verification rules out an entire class of failures at once by proving a property holds across every reachable state. Testing tells you the code did not fail on what you tried; verification tells you it cannot fail, assuming the specification is correct.

How is AI making formal verification more accessible?

The hard part was never the verifier, which checks proofs automatically. It was writing the proofs, which demanded rare expertise and long, tedious effort. According to the source, since Opus 4.5 frontier models can draft specifications from plain-language requirements and iterate on proofs in a tight loop with the verifier. The verifier stays the authority, so the AI only has to be checkable, not trusted.

#formalverification #AIproofs #softwarecorrectness #Dafny

Daniel founded Ask Technicians to cut through bad tech advice. He writes hands-on troubleshooting guides drawn from years of real-world repair and support work.