> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.
Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.
It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.
[1] https://lean-lang.org/doc/reference/latest/Elaboration-and-C...
If you have a spec that isn’t correct, you can certainly write code that conforms to that spec and write proofs to support it. It just means you have verified a program that does something other than what you intended. This is one of the harder parts of verification: clearly expressing your intention as a human. As programs get more complex these get harder to write, which means it isn’t uncommon to have lean or rocq proofs for everything only to later find “nope, it has a bug that ultimately traces back to a subtle specification defect.” Once you’ve gone through this a few times you quickly realize that tools like lean and rocq are tricky to use effectively.
I kinda worry that the “proof assistants will fix ai correctness” will lead to a false sense of assurance if the specs that capture human intention don’t get scrutinized closely. Otherwise we’ll likely have lots of proofs for code that isn’t the code the humans actually intended due to spec flaws.
LLM's are capable of producing code that passes formal verification.
The writing is on the wall: in the future more and more software on the abstract or platonic side of our computing base will be hermetically sealed against bugs and exploits. This quenching of bugs in the assured side will shift the mean location of bugs closer to the hardware side: at some point bugs and exploits will rely more and more on hardware quirks, and simply unspecified hardware.
Afterwards we can expect a long exponential decay of preventable safety violations: people mistakenly or surreptitiously disengaging the formal verification steps and shipping malicious or unverified code. Each such event will be its own big or small scandal, at some point there will be no deniability left: something must be on purpouse, either a malicious vulnerability or intentional disengagement of safety measures.
As the attack surface recedes towards the lower level hardware stack, it will open the debate that the community needs proper formal hardware descriptions (at least at the interface initially, not necessarily how the hardware has implemented it). As interface bugs get formalized 3 things can happen:
either vulnerabilities go extinct, and full formal hardware descriptions are not released
or vulnerabilities remain in each new generation of hardware, and malicious intent or negligence on behalf of the manufacturer can only be presumed, this will set up the community against manufacturers, as they demand full hardware descriptions (verilog, VHDL,...).
or vulnerabilities are tactically applied (vulnerabilities appear extinct to the bulk of the population, but only because manufactured vulnerabilities are sparingly exploited by the manufacturing block)
It is hard to predict what is more valuable: embedding HW vulnerabilities for the status quo and being able to exploit it for while before the public demands full hardware IP descriptions (verilog, VHDL) etc. or facing the end of vulnerabilities a little sooner but keeping hardware IP private (if the bugs stop with full interface descriptions).
As an aside, why can't people just write factually? This isn't a news site gamed for ad revenue. It's also less effort. I felt this post was mostly an insulting waste of time. I come to HN to read interesting stuff.
What I'd actually want from the tooling is a machine-checkable statement of the envelope itself, propagated as a runtime guard rather than a compile-time comment. Then "proof holds" and "we are still inside the proof's domain" are two separate, observable properties, and the unverified-parser / unverified-runtime cases stop being invisible.
Update: Actually, I guess this may have been her point: "The two bugs that were found both sat outside the boundary of what the proofs cover." So then I guess the title might be a bit click baity.
Are we baiting people with headlines now?
What seems to have happened here is that the storage allocator underneath is unverified. That, too, has a relatively simple spec - all buffers disjoint, no lost buffers, no crashes.
Type checking allows you to (outside of type casting such as in languages like C/C++ and casting to object for generic containers in Java) verify that an object is of a given type. That allows you to be sure that a well-formed program isn't doing things like putting a random object in a list.
Languages like C#, Scala, and Kotlin improve Java generics by making the generic type of a container or other interface/type part of the type system. This allows generic types of a generic type to preserve the inner type. This makes it possible to implement things like monads and mapping functions to preserve the generic type.
A similar thing is possible with union types, sealed interfaces/traits, etc. that allow you to check and verify the return type instead of defaulting it to a generic object/any type.
Likewise with other features like nullable/non-null annotations (or corresponding nullable type annotations like in Kotlin and recent C# versions).
All of these can be abused/circumvented, but if you keep your code within that framework the compiler will stop that code compiling. Likewise, these solve a limited set of bugs. For example, nullable types can't verify memory management and related bugs.
They used an AI agent sending ideas to a fuzzer and discovered a heap buffer overflow in Lean. This is big.
Before everyone starts blabbing about formal verification, etc., consider this: how do you know that you didn't make a mistake in your formal verification? IOW, how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
You can get asymptotically close to zero-bug proof, but you can never get there 100% of the way.
"This is genuinely one of the most memory-safe codebases I've analyzed."
That being said, using a coding agent to direct fuzzying and find bugs in the Lean kernel implementation is the big news here. (After all the kernel's implementation is not proved.)
The moral of the story is to push for more verified code not less and try AI bug hunting.
I'm genuinely excited about AI agents and formal verification languages. To me it's obviously the way forward instead of moonshots trying to make agents that program in their own AI blackbox binary, or agents that code in current programming languages.
If we are heading in the direction of "huge codebases that nobody has written", or, "code is an artifact for the machine", I don't see a way out without making it proved.
If humans can review and edit the spec, then verify that the implementation matches it, suddenly leaving the implementation be an artifact for the machines seems okay
The downside of provers also being that they are a massive pain in the ass that very few want to use, this is also a complete win.
Regarding the DoS in the lean-zip application itself: I think this is a really neat example of the difficult problem of spec completeness, which is a subcase of the general problem (mentioned by porcoda in a sibling comment) of being sure that the spec is checking the right things. For a compression program, the natural, and I would also say satisfyingly beautiful thing to prove is that decomp(comp(x)) = x for all possible inputs x. It's tempting to think that at that point, "It's proven!" But of course the real world can call decomp() on something that has never been through comp() at all, and this simple, beautiful spec is completely silent on what will happen then.
Tsk, tsk.
> Every Lean proof assumes the runtime is correct.
No. Every valid Lean proof assumes that if the runtime/mathlib etc is correct, then it too is correct.
Tangentially also, most lean proofs are not dependent on whether or not the runtime has things like buffer overflows or denial of service against lean itself at all, because if I prove some result in Lean (without attacking the runtime) then a bug in the runtime doesn’t affect the validity of the result in general. It does mean however that it’s not ok to blindly trust a proof just because it only relies on standard axioms and has no “sorry”s. You also need to check that the proof doesn’t exploit lean itself.
"Lean proves other program correct but not itself"
https://news.ycombinator.com/item?id=12761986 (being this link more than 10yrs old is not surprising)
I've come across issues in the past which weren't actually bugs. For example, the software was behaving exactly as intended but it looked like a bug to a user who didn't understand the nuances and complexities of what was happening.
I also cannot count how many times a non-technical person asked me to implement conflicting functionality or functionality which would have made the UX incredibly confusing for the user.
Wikipedia: [1] Turing proved no algorithm exists that always correctly decides whether, for a given arbitrary program and input, the program halts when run with that input. The essence of Turing's proof is that any such algorithm can be made to produce contradictory output and therefore cannot be correct.