Your Tests Pass. This Bug Still Slips Through — AI Has the Fix
A permission check passed every test, then handed one user near-total access. Formal proofs would have made that bug impossible. AI just made them cheap.
Founder & Lead Technician

Quick answer
Formal verification mathematically proves software meets its specification in every reachable state, unlike testing, which only samples inputs. Long too costly for mainstream use, it is now within reach because AI can draft the specifications and proofs that once demanded rare, expensive expertise.
A permission check passed every test in its suite. Then it handed one user near-total access.
The system was a secrets-management platform. The rule sounded airtight: when you create a custom role, its permissions must be a subset of your own. You cannot grant access you do not hold. The boundary check covered every combination of operators, and every test was green.
Then someone scoped to a single environment, call it QA, created a role scoped to "not development". That phrase matches every environment except one: user acceptance, sandbox, canary, production, all of them. A single-environment user had just escalated to near-universal reach. The check approved it. The tests never tried it.
Here is why that should worry you: it was not a coding mistake. It was a gap in what testing can ever see.
The gap between "did not fail" and "cannot fail"
Tests sample. You throw a million inputs at your code, catch some ugly bugs, and ship with confidence. The system runs fine until a user stumbles onto the 1,000,001st sequence, the one your random generator never produced. In the permission example, every test happened to use overlapping values, so the excluded environment always sat inside the parent set. The bug only fires when the sets do not overlap. No test thought to try that.
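To make the sampling gap concrete, here is a minimal Python sketch. The source does not show the platform's real check, so `flawed_is_allowed`, the `(operator, values)` scope encoding, and the environment list are all invented stand-ins. The point is only that a suite built from overlapping values can stay green while a disjoint pair escalates.

```python
from typing import FrozenSet, Tuple

# Assumed environment universe; the names echo the story, the list is illustrative.
ALL_ENVS: FrozenSet[str] = frozenset(
    {"development", "qa", "uat", "sandbox", "canary", "production"}
)

# Hypothetical encoding: a scope is (operator, values), operator "in" or "not_in".
Scope = Tuple[str, FrozenSet[str]]


def scope(op: str, *values: str) -> Scope:
    return (op, frozenset(values))


def matches(s: Scope) -> FrozenSet[str]:
    """Environments the scope really matches."""
    op, values = s
    return values & ALL_ENVS if op == "in" else ALL_ENVS - values


def flawed_is_allowed(parent: Scope, child: Scope) -> bool:
    """Invented buggy check, written for a parent that uses "in"."""
    _, parent_values = parent
    child_op, child_values = child
    if child_op == "in":
        return child_values <= parent_values
    # Bug: a negated child is denied only when it excludes something the
    # parent holds. Disjoint value sets slip straight through.
    return not (child_values & parent_values)


# A suite where the excluded value always sits inside the parent set.
overlapping_cases = [
    (scope("in", "qa", "development"), scope("in", "qa"), True),
    (scope("in", "qa", "development"), scope("not_in", "development"), False),
    (scope("in", "qa", "production"), scope("not_in", "qa"), False),
]
suite_is_green = all(
    flawed_is_allowed(parent, child) == expected
    for parent, child, expected in overlapping_cases
)

# The case nobody sampled: the value sets do not overlap.
qa_only = scope("in", "qa")
not_dev = scope("not_in", "development")
approved = flawed_is_allowed(qa_only, not_dev)
escalated = not matches(not_dev) <= matches(qa_only)
```

In this sketch every sampled case gets the right answer for the wrong reason, which is why the suite offers no warning about the disjoint pair.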
Formal verification attacks the problem from the other end. Instead of checking inputs one by one, it proves a property holds across every reachable state at once. The claim is not "we did not find a bug." It is "there is no bug of this class to find."
That permission system has one true invariant: a derived permission must always match a subset of the environments the granting permission matches. Prove that, and the escalation bug becomes impossible by construction. Not patched, not caught in review. Impossible.
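One way to picture "every state at once" is to shrink the world until it can be enumerated. The sketch below is an assumption-laden toy, not a proof in the Dafny sense: it uses a four-environment universe, a hypothetical `(operator, values)` scope encoding, and a hypothetical `rule_based_is_allowed` check, then compares that check against the invariant for every parent and child pair.

```python
from itertools import chain, combinations

# Small assumed universe so every scope can be enumerated.
ALL_ENVS = frozenset({"development", "qa", "uat", "production"})


def matches(scope):
    """Environments a (operator, values) scope matches."""
    op, values = scope
    return values & ALL_ENVS if op == "in" else ALL_ENVS - values


def rule_based_is_allowed(parent, child):
    """Hypothetical check that reasons on operators, not on the full universe."""
    (p_op, p_vals), (c_op, c_vals) = parent, child
    if p_op == "in" and c_op == "in":
        return c_vals <= p_vals
    if p_op == "not_in" and c_op == "not_in":
        return p_vals <= c_vals  # child must exclude at least as much
    if p_op == "not_in" and c_op == "in":
        return not (c_vals & p_vals)
    return False  # positive parent, negated child: deny conservatively


def buggy_is_allowed(parent, child):
    """Same rules with one invented flaw in the mixed-operator branch."""
    if parent[0] == "in" and child[0] == "not_in":
        return not (child[1] & parent[1])
    return rule_based_is_allowed(parent, child)


def all_scopes():
    envs = sorted(ALL_ENVS)
    subsets = chain.from_iterable(
        combinations(envs, k) for k in range(len(envs) + 1)
    )
    return [(op, frozenset(s)) for s in subsets for op in ("in", "not_in")]


def find_escalation(check):
    """Return an approved (parent, child) pair that widens access, else None."""
    scopes = all_scopes()
    for parent in scopes:
        for child in scopes:
            if check(parent, child) and not matches(child) <= matches(parent):
                return parent, child
    return None


sound_result = find_escalation(rule_based_is_allowed)
counterexample = find_escalation(buggy_is_allowed)
```

Here `sound_result` comes back as `None` and `counterexample` holds a concrete escalating pair. A real verifier would establish the same invariant symbolically, without a bounded universe; the conservative `return False` branch also shows the usual trade: the check may refuse some legitimate grants in exchange for never approving an illegitimate one.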
How a proof actually works
You start with properties you want your code to keep. For a shopping cart: the balance never goes negative, every item appears in the total, only one coupon applies per order. A verifier cannot read English, so you write those rules in a verification-aware language, one where specifications and proofs are first-class citizens beside the code.
Several exist, each with a different strength:
| Language | Built for | How it checks |
|---|---|---|
| Dafny | Everyday imperative code | Automates much of the proof via SMT solvers |
| Lean | Math and verified software | Interactive theorem proving; fast-growing adoption |
| Rocq (formerly Coq) and Isabelle | High-assurance research systems | Interactive proof assistants |
| F* | Verified systems programming | Dependent types with SMT automation; extracts to C and OCaml |
| TLA+ | Distributed protocols | Specification and model checking |
The mechanics are broadly similar across the program verifiers. In a tool like Dafny you attach preconditions (what must be true before a function runs) and postconditions (what it guarantees on exit). The verifier never executes the code. It reasons about the structure and hands the resulting conditions to an automated solver, which tries to show that the postcondition holds for every input that satisfies the preconditions. If the solver cannot show that, because some reachable state breaks the property or because the proof needs more hints, verification fails and the code is rejected before it ever runs.
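As a rough illustration of preconditions and postconditions, the Python sketch below writes them as assertions on a hypothetical `withdraw` function. This is only an analogy: Python checks these at runtime, one call at a time, whereas a verifier such as Dafny would discharge the same conditions statically for all integers. The bounded loop stands in for that universal claim.

```python
def withdraw(balance: int, amount: int) -> int:
    """Hypothetical example function with contract-style assertions."""
    # Precondition (what Dafny would call `requires`).
    assert 0 <= amount <= balance, "precondition violated"

    new_balance = balance - amount

    # Postcondition (what Dafny would call `ensures`).
    assert 0 <= new_balance <= balance, "postcondition violated"
    return new_balance


def postcondition_holds_up_to(limit: int) -> bool:
    """Bounded stand-in for 'every input that satisfies the precondition'."""
    for balance in range(limit):
        for amount in range(balance + 1):  # only inputs meeting the precondition
            try:
                withdraw(balance, amount)
            except AssertionError:
                return False
    return True


checked = postcondition_holds_up_to(50)
```

The loop covers a finite range; the value of a verifier is that it makes the same statement without any such bound.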
The guarantee is only as strong as the specification. But when the spec is right, the guarantee is absolute.
So why have you never used it?
Because the proofs cost more than the code. You would state an obvious property, then spend days coaxing a proof assistant into accepting it. The tools were slow, the errors cryptic, and the work demanded PhD-level skill. So formal verification stayed locked inside the places that could justify it: avionics, chip design, nuclear systems, cryptographic protocols. Everywhere else, the math said tests are good enough.
The verifier was never the real bottleneck, since many tools check proofs automatically. The bottleneck was writing the proof: translating a human requirement into precise logic, then wrestling the solver for hours when it could not confirm what you already knew was true.
Why AI changes the equation
That bottleneck is exactly the kind of work large language models have gotten good at. According to the source, since Opus 4.5 most frontier models can draft formal specifications from plain-language requirements, propose proof strategies, and iterate fast on failing lemmas in a tight loop with the verifier.
The structure matters. AI proposes a candidate; a deterministic, mechanical verifier checks it. If the proof is wrong, the verifier rejects it and the model tries again. You are not trusting the AI to be correct. You are trusting an external authority to reject it when it is not. The only thing you place real faith in is the specification itself.
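The propose-then-check structure can be sketched in a few lines of Python. Everything here is a stand-in: the hard-coded `proposals` list plays the role of model output, and `verifier` is a bounded brute-force check rather than a real proof checker. What carries over is the shape: nothing is accepted on the proposer's word.

```python
from typing import Callable, Iterable, Optional

Candidate = Callable[[int, int], int]


def verifier(candidate: Candidate) -> bool:
    """Deterministic check of the spec: the result is one of the inputs
    and is not smaller than either of them (checked on a bounded domain)."""
    for a in range(-20, 21):
        for b in range(-20, 21):
            result = candidate(a, b)
            if result < a or result < b or result not in (a, b):
                return False
    return True


def first_accepted(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Keep asking for candidates until the verifier accepts one."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


# Stand-ins for successive model proposals (hypothetical).
proposals = [
    lambda a, b: a,                   # wrong whenever b > a
    lambda a, b: a + b,               # wrong: result is not one of the inputs
    lambda a, b: a if a >= b else b,  # satisfies the spec
]

accepted = first_accepted(proposals)
```

The first two proposals are rejected without anyone having to notice why they are wrong; only the spec inside `verifier` had to be right.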
That splits the labor. The machine grinds out a verifiably correct implementation; the human keeps the judgment calls: which properties are worth guaranteeing, and how the system is shaped.
None of this is theoretical. The seL4 microkernel ships with a machine-checked proof of its functional correctness, and large teams have used TLA+ for years to catch distributed-system design flaws before a line of code is written. What is new is the price: the expertise that used to gate all of this is the part AI is now absorbing.
Where to point this first
You do not need to verify your whole codebase, and trying would burn months for little gain. The return concentrates where a wrong answer is expensive and the rules are tangled:
- Permissions and access control — exactly the bug above. Subset and isolation rules are easy to state and brutal to test exhaustively.
- Money movement — balances, refunds, coupon stacking, ledger rules that must never drift.
- State machines — order lifecycles, payment flows, anything where an illegal transition causes real damage.
- Concurrency — race conditions that surface in production and never reproduce in a test.
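For the state-machine case, the idea of checking every reachable state can be shown with a toy explorer in Python. The order lifecycle, the `captured` flag, and the `allow_cancel_after_pay` switch are all hypothetical; the search is a miniature of what a model checker does, not a substitute for one.

```python
from collections import deque
from typing import Iterator, Optional, Tuple

# State is (status, captured): captured means the shop is holding the payment.
State = Tuple[str, bool]


def transitions(state: State, allow_cancel_after_pay: bool) -> Iterator[State]:
    status, captured = state
    if status == "created":
        yield ("paid", True)
        yield ("cancelled", captured)
    elif status == "paid":
        yield ("shipped", captured)
        yield ("refunded", False)
        if allow_cancel_after_pay:
            # Illegal shortcut: cancels the order but keeps the money.
            yield ("cancelled", captured)


def invariant(state: State) -> bool:
    """A cancelled order must never hold captured money."""
    status, captured = state
    return not (status == "cancelled" and captured)


def find_violation(allow_cancel_after_pay: bool) -> Optional[State]:
    """Breadth-first search over every state reachable from a new order."""
    start: State = ("created", False)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in transitions(state, allow_cancel_after_pay):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


safe_design = find_violation(allow_cancel_after_pay=False)
unsafe_design = find_violation(allow_cancel_after_pay=True)
```

With the shortcut disabled no reachable state breaks the invariant; with it enabled the search returns the offending state, without anyone having to guess the sequence of events that leads there.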
A sane way to start: pick the one invariant your system cannot violate. Write it as a property in a tool like Dafny. Let an AI assistant draft the specification and the proof, and let the verifier be the judge. Your job is to make sure the property you wrote is the property you actually meant.
That last part is the new risk. Verification moves the danger from "is the code correct?" to "is the spec correct?" A flawless proof of the wrong property guarantees nothing useful. The skill that matters now is not coaxing a proof assistant; it is knowing what to guarantee.
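A classic illustration of a wrong spec, sketched in Python with hypothetical names: a "sort" that returns an empty list satisfies the property "the output is sorted" for every input. Only the fuller property, which also requires the output to contain the same items as the input, rules it out.

```python
from collections import Counter
from itertools import product
from typing import Callable, List


def is_sorted(xs: List[int]) -> bool:
    return all(a <= b for a, b in zip(xs, xs[1:]))


def weak_spec(inp: List[int], out: List[int]) -> bool:
    """Too weak: only says the output is in order."""
    return is_sorted(out)


def full_spec(inp: List[int], out: List[int]) -> bool:
    """Adds the missing half: the output is a rearrangement of the input."""
    return is_sorted(out) and Counter(inp) == Counter(out)


def lazy_sort(xs: List[int]) -> List[int]:
    return []  # "sorts" by throwing everything away


def holds_for_all_small_lists(spec: Callable, sort_fn: Callable) -> bool:
    """Check the spec on every list of length 0 to 3 over the values 0, 1, 2."""
    for length in range(4):
        for items in product(range(3), repeat=length):
            xs = list(items)
            if not spec(xs, sort_fn(list(xs))):
                return False
    return True


lazy_meets_weak = holds_for_all_small_lists(weak_spec, lazy_sort)
lazy_meets_full = holds_for_all_small_lists(full_spec, lazy_sort)
real_meets_full = holds_for_all_small_lists(full_spec, sorted)
```

The useless implementation passes the weak property on every input checked and fails the full one, which is the whole risk in miniature: the check is only as meaningful as the property it was given.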
Here is the uncomfortable takeaway for anyone who treats a green test suite as proof of safety. Passing tests means your code survived the inputs you imagined. It says nothing about the one you did not. For the rules your business cannot afford to get wrong, that gap is no longer something you have to live with.
Source: Hacker News
Frequently asked questions
What is formal verification in software?
It is a method of mathematically proving that code satisfies a specification for every possible input and state, rather than checking a sample of cases the way testing does. You state the properties your code must uphold, express them in a verification-aware language, and a tool proves the code can never violate them. If a property can break in any reachable state, verification fails and the tool rejects the code.
How is formal verification different from testing?
Testing samples inputs, and even a million random cases can miss the one combination that triggers a bug. Verification rules out an entire class of failures at once by proving a property holds across every reachable state. Testing tells you the code did not fail on what you tried; verification tells you it cannot fail, assuming the specification is correct.
How is AI making formal verification more accessible?
The hard part was never the verifier, which checks proofs automatically. It was writing the proofs, which demanded rare expertise and long, tedious effort. According to the source, since Opus 4.5 frontier models can draft specifications from plain-language requirements and iterate on proofs in a tight loop with the verifier. The verifier stays the authority, so the AI only has to be checkable, not trusted.
Daniel founded Ask Technicians to cut through bad tech advice. He writes hands-on troubleshooting guides drawn from years of real-world repair and support work.