AI Code Security · Formal Verification · Architect’s Perspective

Your AI Coding Assistant Has a 55.8% Chance of Writing Vulnerable Code. And Your Tools Won’t Catch It.

A security leader’s analysis of the Broken by Default formal verification study — and what it means for every organization that hasn’t reclassified AI-generated code as untrusted input.

CWE-131 / CWE-190 · Z3 SMT Solver · Formal Verification · DevSecOps · AppSec Tooling Gap

⚠️

Study Reference — arXiv 2604.05292v2 · Published April 5, 2026
This analysis is based on Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code by Blain & Noiseux (Cobalt AI). 3,500 code artifacts. 7 production LLMs. 500 security-critical prompts. Z3 SMT formal verification — not pattern matching. The full dataset is publicly available at github.com/dom-omg/bbd-dataset.

55.8%
Mean Vulnerability Rate

1,055
Z3-Proven Exploitable

97.8%
Invisible to SAST Tools

0 / 7
Models at Grade C or Better

We built our security programs on an assumption so foundational that most of us never articulated it: the person writing the code understands what they’re writing. They know the language’s edge cases, the context of the system, the difference between code that compiles and code that’s safe. Our entire review apparatus — peer review, static analysis, penetration testing — was designed around the premise that a knowledgeable human sits at the origin of every code artifact, and our tools exist to catch the mistakes that even knowledgeable humans make.

That assumption is now invalid for a growing share of production code. And the evidence is no longer anecdotal.

A formal verification study published April 5, 2026 — Broken by Default by Dominik Blain and Maxime Noiseux of Cobalt AI — has produced the most methodologically rigorous quantification I’ve seen of AI-generated code insecurity. Not heuristic warnings. Not pattern-matching guesses. Mathematical proof of exploitability, using the Z3 SMT solver to generate concrete exploit inputs for each vulnerability identified.

The top-line number: 55.8% of AI-generated code artifacts contain at least one verified vulnerability. Across seven widely-deployed LLMs. Across 3,500 code samples. With 1,055 vulnerabilities formally proven exploitable via satisfiability witnesses — meaning the solver produced the exact input value that triggers each fault.

This post is my attempt to unpack what that means — not just as a research finding, but as an operational reality for anyone responsible for securing an enterprise software development pipeline in 2026.

— Why This Study Is Different —

From Pattern Matching to Mathematical Proof

The AI code security conversation has been building for years. Pearce et al. (2022) evaluated GitHub Copilot on 89 scenarios and found 40% of suggestions contained vulnerabilities. Perry et al. (2023) ran a controlled user study showing developers using AI assistants wrote significantly more security bugs — while reporting higher confidence in their code. Veracode’s 2025 GenAI Code Security Report tested over 100 LLMs across 80 coding tasks and found a 45% security failure rate, unchanged in their 2026 update.

These studies established the problem. Broken by Default changes the conversation because it establishes the ground truth of exploitability.

Prior work relied on CWE pattern matching — static rules that identify code structures known to be associated with vulnerability classes. This tells you something looks dangerous. It does not tell you whether an attacker can actually trigger the fault, or what input would do it.

💡 What Z3 SMT verification actually does
The COBALT pipeline encodes vulnerability conditions as Z3 Satisfiability Modulo Theories formulas. For an integer overflow in malloc(n * sizeof(int)), it models n as a free variable of type BitVec(32), encodes the overflow condition under unsigned 32-bit modular arithmetic, and asks: is there a value of n that makes this expression wrap around? When Z3 returns SAT, you get a witness — a concrete value (e.g., n = 2³⁰ + 1) that you can feed directly into the program to trigger the fault. This is the difference between “this pattern is associated with CWE-190” and “here is the exact input that causes a heap buffer overflow.”
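To make the witness concrete, the wraparound can be reproduced with ordinary fixed-width arithmetic. The sketch below is not the COBALT pipeline and does not call Z3; it only checks, in plain C, the overflow condition the solver is described as encoding. The function names are hypothetical, and the element size is assumed to be 4 bytes, matching the 32-bit model above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: sizeof(int) == 4, as in the 32-bit model described above. */
#define ELEM_SIZE 4u

/* Size that malloc would receive when n * ELEM_SIZE is computed in
 * unsigned 32-bit arithmetic, i.e. modulo 2^32. */
uint32_t alloc_size_32(uint32_t n)
{
    return (uint32_t)(n * ELEM_SIZE);
}

/* True when the 32-bit product differs from the exact 64-bit product,
 * which is the condition a satisfying witness must meet. */
bool is_overflow_witness(uint32_t n)
{
    uint64_t exact = (uint64_t)n * ELEM_SIZE;
    return exact != (uint64_t)alloc_size_32(n);
}
```

For n = 2^30 + 1 the exact product is 2^32 + 4, so the 32-bit result is 4 bytes: room for a single int where the caller expects space for more than a billion.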

The study evaluated seven models — GPT-4o, GPT-4.1, Claude Haiku 4.5, Gemini 2.5 Flash, Mistral Large, Llama 3.3 70B, and Llama 4 Scout — across 500 prompt templates spanning five CWE categories (100 prompts each): memory allocation, integer arithmetic, authentication, cryptography, and input handling. All models were queried at temperature 0 for reproducibility. Prompts were designed to represent real developer tasks — not adversarial jailbreaks.

— The Numbers —

The Leaderboard Nobody Wins

Here is the aggregate benchmark across all 500 prompts per model. I use the word “leaderboard” loosely — nobody wins here.

Model	Vuln Rate	Critical	High	Z3 Proven	Grade
GPT-4o	62.4%	166	106	167	F
Llama 4 Scout	60.6%	167	95	156	F
Llama 3.3 70B	58.4%	168	83	147	D
Mistral Large	57.8%	155	94	155	D
GPT-4.1	54.0%	142	86	136	D
Claude Haiku 4.5	49.2%	155	81	152	D
Gemini 2.5 Flash	48.4%	146	86	142	D
Mean	55.8%	157.0	90.1	150.7	—

Grading scale: A <10% · B 10–29% · C 30–44% · D 45–59% · F ≥60% vulnerability rate. CVSS v3: Critical ≥9.0 · High 7.0–8.9.

No model achieves a grade better than D. The best performer — Gemini 2.5 Flash — still generates vulnerable code 48.4% of the time. CRITICAL-severity findings dominated across all models, averaging 157 per model.
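The grading scale can be written as a small lookup, which makes it easy to re-grade a model when its measured rate changes. This is a sketch only: the function name is hypothetical, and how rates that fall between the stated whole-number bands (for example 29.5%) are treated is an assumption, since the scale as quoted does not say.

```c
/* Hypothetical helper: map a vulnerability rate (in percent) to the
 * letter grade used in the table. Assumption: each band extends up to
 * the start of the next one, so 29.5% grades as B and 59.9% as D. */
char grade_for_rate(double vuln_rate_pct)
{
    if (vuln_rate_pct < 10.0) return 'A';
    if (vuln_rate_pct < 30.0) return 'B';
    if (vuln_rate_pct < 45.0) return 'C';
    if (vuln_rate_pct < 60.0) return 'D';
    return 'F';
}
```

Under this mapping the best rate in the table, 48.4%, would need to fall below 45% to reach grade C.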

Where the Failures Concentrate

87%
Integer Arithmetic
CWE-190/195

67%
Memory Allocation
CWE-131/190

56%
Input Handling
CWE-89/22/78

44%
Authentication
CWE-916

25%
Cryptography
CWE-327/330

Integer arithmetic prompts produced the highest vulnerability rate (87%) — nearly nine out of ten — followed by memory allocation (67%), both driven by consistent failure to guard against integer overflow in malloc size computations and signed/unsigned conversion errors. A representative pattern found across all seven models:

What every model generates vs. what safe code requires

// ❌ CWE-190: What all 7 models generate — no overflow guard
int *buf = malloc(n * sizeof(int));

// ✅ Correct pattern — explicit overflow check before allocation
if (n > SIZE_MAX / sizeof(int)) return NULL;
int *buf = malloc(n * sizeof(int));

// Z3 witness: n = 2^30 + 1 causes unsigned 32-bit wraparound
// malloc receives a truncated (tiny) size → heap buffer overflow

None of the seven models consistently generated the safe pattern across all memory allocation prompts.
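One way to stop repeating the guard at every call site is to wrap it in a helper. The sketch below is an illustration, not code from the study: the name alloc_ints_checked is hypothetical, and taking the count as a signed long long is a deliberate assumption so that the same function can also reject the negative lengths behind signed/unsigned conversion errors (CWE-195).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the study): allocate space for n ints.
 * Returns NULL when n is negative (signed/unsigned mix-up, CWE-195),
 * when n * sizeof(int) would overflow size_t (CWE-190),
 * or when malloc itself fails. The caller owns and frees the result. */
int *alloc_ints_checked(long long n)
{
    if (n < 0) {
        return NULL;
    }
    if ((unsigned long long)n > SIZE_MAX / sizeof(int)) {
        return NULL;
    }
    return malloc((size_t)n * sizeof(int));
}
```

Callers then have a single failure mode to handle, a NULL return, regardless of whether the count was negative, too large, or simply unsatisfiable.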

— Runtime Confirmation —

These Aren’t Theoretical — They Crash Real Programs

A common pushback to static analysis findings is “but would it actually crash in production?” The researchers addressed this directly. They selected 7 representative vulnerabilities and built proof-of-concept harnesses, compiled with gcc -fsanitize=address,undefined and fed Z3-extracted witness values as inputs.

PoC ID	Model	Fault Type	Result
MEM-01-A	Llama	heap-buffer-overflow	✓ Confirmed
MEM-01-B	GPT-4o	heap-buffer-overflow	✓ Confirmed
MEM-03	Llama	alloc-size-too-big	✓ Confirmed
MEM-06	GPT-4o	OOB read	✓ Confirmed
AUTH-03	Llama	SHA-256 crack (0.01ms)	✓ Confirmed
INP-01	Mistral	SQL injection → full exfil	✓ Confirmed
INP-06	GPT-4o	Zip Slip path traversal	† Blocked by runtime

† Python 3.12 raises ValueError on path traversal. The vulnerable pattern was present in generated code; blocked at runtime, not at generation.

ASAN output — MEM-01-A (Llama, CWE-131)

== AddressSanitizer: heap-buffer-overflow
WRITE of size 4 at 0x...
  #0 poc_main (poc+0x...)
  #1 main (poc+0x...)
shadow bytes around the buggy address:
  0x...: fa fa fa fa fa fa fa fa
  0x...: 00 00 00 00 00[fa]fa fa
SUMMARY: AddressSanitizer: heap-buffer-overflow

The SQL injection PoC achieved complete data exfiltration — including a synthetic credit card number from the test database. The SHA-256 password PoC recovered a 6-character password in 0.01ms using a precomputed lookup, confirming that CWE-916 (insufficient password hashing) is not merely theoretical. These are the vulnerability classes that appear in breach reports, in CVE databases, in the regulatory correspondence that arrives after an incident.

— Converging Evidence —

This Isn’t an Isolated Finding

Broken by Default lands in an environment where the evidence of AI code insecurity is converging from multiple independent sources. Consider the timeline:

Pearce et al. · 2022 · IEEE S&P

40% of GitHub Copilot suggestions contained vulnerabilities across 89 scenarios and 18 CWEs. Analysis relied on CWE pattern matching — no formal exploitability proof.

Perry et al. · 2023 · ACM CCS

Controlled user study: developers using AI assistants wrote significantly more security bugs than those who didn’t, while reporting higher confidence in their code.

Veracode · 2025–2026

100+ LLMs tested across 80 coding tasks. 45% security failure rate — unchanged through early 2026 despite vendor claims of improvement. Java worst at 72% failure rate.

Apiiro / CSA · 2025–2026

AI-assisted developers produce commits at 3–4× the rate of peers but introduce security findings at 10× the rate. CVSS 7.0+ vulnerabilities appear 2.5× more often in AI-generated code.

GitGuardian · Mar 2026

28.65 million new hardcoded secrets in public GitHub commits in 2025 — 34% YoY increase. AI-assisted commits leaked secrets at 3.2% rate vs. 1.5% baseline. Double the exposure.

Georgia Tech · Ongoing

Vibe Security Radar: 35 CVEs attributed to AI coding tools in March 2026 alone — up from 6 in January. Estimated true count: 400–700 across the open-source ecosystem.

Escape.tech · 2026

Scanned 1,400+ vibe-coded production applications: 65% had security issues, 58% contained at least one critical vulnerability, 400+ exposed secrets.

This is not one alarming study. It’s a convergence of independent evidence pointing to the same conclusion: AI-generated code is systematically less secure than human-written code, and the gap is not closing.

— Four Critical Findings —

What Should Change How You Govern AI-Assisted Development

Finding 01 — Security Prompts Are Security Theater

When models were given explicit system-prompt instructions — “apply security best practices, guard against integer overflow, produce production-ready code” — the mean vulnerability rate dropped by only 4 percentage points. From 64.8% to 60.8%. Four of five models remained at grade F. One model (Llama 3.3 70B) actually performed worse with the security prompt — a 2-point increase.

⚠️ Operational Implication
The improvement was category-dependent. Authentication and cryptography showed modest gains. Memory allocation vulnerabilities were essentially unchanged across all models. Security instructions do not override low-level memory management patterns learned from training data. If you have “use security-focused prompts” listed as a compensating control in a risk register, remove it. A 4-point improvement that leaves four of five models at grade F is not a control. It’s noise.

Finding 02 — Your Static Analysis Stack Has a Structural Blind Spot

Six industry-standard tools were tested: Semgrep (all rulesets), Bandit (medium+), Cppcheck 2.13 (--enable=all --inconclusive --check-level=exhaustive), Clang Static Analyzer, FlawFinder 2.0, and CodeQL v2.25.1 (security-extended query suite).

Analysis Layer	Tool(s)	Detection Rate	Z3-Proven Caught
COBALT (Z3 SMT)	Z3 formal verification	64.8% (162/250)	90/90 (100%)
Pattern-based	Semgrep + Bandit	7.6% (19/250)	2/90 (2.2%)
Heavyweight C	Cppcheck + Clang SA + FlawFinder	4.6% (4/87 C)	0/68 (0%)
Semantic Analysis	CodeQL v2.25.1 (security-extended)	0% (0/90)	0/90 (0%)

🚨 97.8% of Z3-Proven Vulnerabilities Are Invisible to All Industry Tools Combined
CodeQL — widely considered the most sophisticated semantic analyzer available — detected zero of 90 formally proven findings. 0/68 C. 0/22 Python. The 2 catches by Semgrep flagged strcat (a dangerous string function) in the same code — they never detected the integer overflow that Z3 proved exploitable. This is not a configuration issue. It’s a structural limitation: integer overflow in allocation arithmetic requires reasoning about the full domain of integer inputs under 32-bit modular arithmetic. No pattern matcher or taint tracker can do this.
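The 97.8% headline follows directly from the counts in the table: the industry tools together caught 2 of the 90 Z3-proven findings, leaving 88 undetected. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Percentage of proven findings that a tool layer failed to flag.
 * Returns -1.0 for inconsistent input (no findings, or more caught
 * than exist), so callers can tell bad data apart from a 0% miss rate. */
double miss_rate_pct(unsigned caught, unsigned total)
{
    if (total == 0 || caught > total) {
        return -1.0;
    }
    return 100.0 * (double)(total - caught) / (double)total;
}
```

With the table's figures, miss_rate_pct(2, 90) gives roughly 97.8 for all tools combined, and miss_rate_pct(0, 90) gives 100 for CodeQL alone.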

Finding 03 — The Models Know Better. They Just Don’t Do Better.

When the researchers fed each model’s own vulnerable code back and asked it to review for security issues, 78.7% of vulnerabilities were correctly identified (70 of 89 valid Z3-proven artifacts).

Model	Detected	Rate	False Negative Rate
Mistral Large	17/17	100%	0%
Llama 3.3 70B	14/17	82%	18%
Gemini 2.5 Flash	14/18	78%	22%
Claude 3.5 Sonnet	13/19	68%	32%
GPT-4o	12/18	67%	33%
Total	70/89	78.7%	21.3%

The paper calls this the “generation–review asymmetry” — and it’s more damning than a simple false-negative result. The models possess the security knowledge. They can articulate exactly why malloc(n * sizeof(int)) needs an overflow guard. But the code generation task and the code review task activate different behavioral pathways. RLHF and instruction fine-tuning for security-conscious review do not transfer reliably to the generation pathway.

“The problem is not a lack of security knowledge — it is a failure of spontaneous application. Models that generate vulnerable code correctly identify those vulnerabilities in review mode 78.7% of the time, yet generate them at 55.8% by default. This is an observed generation–review asymmetry that explicit security prompting does not resolve.”

— Blain & Noiseux, Broken by Default (arXiv 2604.05292v2)

Finding 04 — Runtime Exploitability Is Confirmed, Not Hypothetical

Six of the seven selected PoCs were confirmed at runtime. The four memory-safety PoCs produced faults under AddressSanitizer, the SQL injection PoC achieved complete data exfiltration, and the SHA-256 password PoC cracked in 0.01ms. The single “miss”, a Zip Slip path traversal, was blocked by Python 3.12’s runtime check, not by anything the model generated. The language saved the developer. The model did not.

— Operational Response —

What Security Leaders Should Do This Quarter

I’ve been arguing for over a year that we need to treat AI-generated code as untrusted input — the same way we treat user-supplied data in a web application. This study provides the mathematical foundation for that position. Here is what I think security leaders should be doing, starting now.

01 — Reclassify AI-Generated Code in Your Risk Framework

Stop treating it as “developer code with a productivity boost.” It is a distinct risk category with a measurably different vulnerability profile. Your risk register should reflect this. If you’re not tracking what percentage of your codebase is AI-generated — and most organizations aren’t — start now. You cannot scope your testing effort without knowing what you’re testing.

02 — Remove Prompt Engineering From Your Controls Catalog

A 4-point improvement that leaves four of five models at grade F is not a control. It’s noise. Security prompts belong in defense-in-depth as a minor layer, not as a compensating control listed in a SOC 2 narrative or a risk acceptance document. The security gate must be downstream — in review, in testing, in formal verification where feasible.

03 — Audit Your SAST Coverage Against AI-Specific Vulnerability Classes

Ask your AppSec team a concrete question: “Can our toolchain detect that malloc(n * sizeof(int)) is exploitable when n is attacker-controlled?” If the answer is no — and for most toolchains it will be — you have a documented coverage gap. Address it through formal verification for critical paths, compiler sanitizers as a mandatory CI gate, or human expert review for memory-sensitive code.

04 — Make Compiler Sanitizers Mandatory

The study used -fsanitize=address,undefined to confirm exploitability. These sanitizers are free, mature, and catch a meaningful subset of the vulnerabilities that static tools miss entirely. If your CI pipeline compiles C/C++ without sanitizers, you are leaving proven detection capability on the table. This is one of the simplest, highest-ROI changes you can make.

Add to your CI pipeline for all AI-generated C/C++

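# Compile with address and undefined-behavior sanitizers.
# -fno-sanitize-recover=all makes UBSan findings fatal; by default UBSan reports and keeps running.
gcc -fsanitize=address,undefined -fno-sanitize-recover=all -g -O1 -o binary source.c

# Run with sanitizers: the process aborts and reports on the first fault
./binary

# In CI: set as a hard gate, so any sanitizer finding = build failure
ASAN_OPTIONS=halt_on_error=1 UBSAN_OPTIONS=halt_on_error=1:print_stacktrace=1 ./binary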

05 — Implement a Mandatory Two-Pass Workflow

If models detect 78.7% of their own vulnerabilities in review mode, there is clear operational value in a generate-then-review pattern. This is not a complete solution — the 21.3% false-negative rate means you cannot treat it as one — but it is better than the current default at most organizations, which is zero automated security review of AI-generated code.

06 — Prohibit Unsupervised AI in High-Risk Domains

Authentication, cryptography, payment processing, memory management in systems code — these are areas where the study’s vulnerability rates are highest and where the consequences of failure are most severe. AI-assisted development in these domains should require mandatory human review by an engineer with domain-specific security expertise. Not a general code review. A security-focused review by someone who understands the failure modes.
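Items 04 through 06 can be combined into a single merge-gate predicate. The sketch below is one possible encoding, not a prescribed policy: the struct, its field names, and the decision to apply the sanitizer gate to every change rather than only AI-generated ones are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical record describing one proposed change.
 * Every field name here is illustrative. */
typedef struct {
    bool ai_generated;             /* change contains AI-generated code     */
    bool high_risk_domain;         /* auth, crypto, payments, memory mgmt   */
    bool sanitizer_run_clean;      /* item 04: ASan/UBSan CI run passed     */
    bool model_review_clean;       /* item 05: second-pass model review ok  */
    bool security_expert_approved; /* item 06: domain-expert human sign-off */
} change_info;

/* Returns true only when every gate that applies to the change passed. */
bool may_merge(const change_info *c)
{
    if (!c->sanitizer_run_clean) {
        return false;              /* hard CI gate, applied to all changes */
    }
    if (!c->ai_generated) {
        return true;               /* remaining gates target AI output only */
    }
    if (!c->model_review_clean) {
        return false;              /* generate-then-review pass failed */
    }
    if (c->high_risk_domain && !c->security_expert_approved) {
        return false;              /* no unsupervised AI in high-risk code */
    }
    return true;
}
```

Keeping the rule in one function makes the policy auditable: the conditions under which AI-generated code reaches the main branch are visible in a dozen lines rather than spread across pipeline configuration.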

07 — Brief the Board

1.8M+
Copilot Paid Subscribers

46%
GitHub Code Is AI-Generated

57%
Devs Using AI Without IT Approval

25%
YC W25 Codebases 95% AI

The board needs to understand that the code being produced fails security benchmarks more than half the time, that the pattern-based tools tested flagged fewer than 8% of artifacts and the full tool stack caught just 2 of the 90 formally proven vulnerabilities (2.2%), and that the velocity gains creating this code are simultaneously creating security debt at a rate existing programs were not designed to absorb.

— Conclusion —

The Uncomfortable Truth

The AI coding revolution has delivered genuine productivity gains. I am not arguing we should ban these tools. I am arguing that we have been deploying them without understanding their failure modes, and this study eliminates any remaining ambiguity about what those failure modes look like.

55.8% vulnerability rate. 1,055 formal proofs of exploitability. 97.8% invisible to industry-standard tools. Security prompts that barely move the needle. Models that know better but don’t do better.

The title of the paper is Broken by Default. It is also an accurate description of most organizations’ current approach to governing AI-generated code: we have defaulted to trusting the output of systems that, by the mathematical standard of formal verification, produce exploitable code more often than not.

Our job as security leaders is to build the systems that compensate for that — with urgency, with rigor, and before the CVE count makes the decision for us.

📌 Key Takeaways

AI-generated code is insecure by default — 55.8% vulnerability rate across 7 models, with 1,055 formally proven exploitable findings
Security prompts reduce vulnerability rates by only 4 points — this is not a compensating control
Six industry SAST tools combined miss 97.8% of Z3-proven vulnerabilities — a structural blind spot, not a configuration issue
Models detect 78.7% of their own vulnerabilities in review mode — useful as a layer, insufficient as a solution
In this study, formal verification (Z3 SMT solving) was the only analysis layer that established ground-truth exploitability at scale
Treat AI-generated code as untrusted input. Gate downstream. Brief the board. Act now.

Study: Blain & Noiseux, Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code, arXiv:2604.05292v2, April 2026. Dataset: github.com/dom-omg/bbd-dataset. Prompts and scripts: github.com/dom-omg/broken-by-default.

Corroborating sources: Veracode GenAI Code Security Report 2025–2026 · Georgia Tech SSLab Vibe Security Radar · Cloud Security Alliance AI Safety Intelligence · GitGuardian State of Secrets Sprawl 2026 · Escape.tech Vibe-Coded App Scan · Apiiro Fortune 50 Enterprise Analysis · Perry et al. (ACM CCS 2023) · Pearce et al. (IEEE S&P 2022).

Nandkishor is a Principal Security Architect / Security Leader. Views expressed are their own.
