Benchmark Results — Secure Code Skill Pack

Results

Aggregate Security Pass Rate

No Rules (baseline) 52.0%

Free OWASP Skill (agamm) 64.4%

Free Snippet (ours) 70.0%

Full Secure Code Skill Pack 92.8%

Condition	Passed	Failed	Pass Rate
No Rules (baseline)	130	120	52.0%
Free OWASP Skill (agamm)	161	89	64.4%
Free Snippet (ours)	175	75	70.0%
Full Secure Code Skill Pack	232	18	92.8%

Methodology

How the Benchmark Works

Transparent, reproducible testing against concrete security properties.

The Prompts

20 real-world coding prompts: contact forms, REST APIs, file upload handlers, RAG pipelines, auth systems, MCP server tools, agent file managers, multi-agent orchestrators, and more. Each prompt is the kind of thing a developer would actually ask an AI coding assistant to build.

Four Conditions

No Rules — Claude (claude-sonnet-4-6) with default settings. No security instructions.
Free OWASP Skill — agamm/claude-code-owasp, the most popular free OWASP skill on GitHub (85+ stars).
Free Snippet — Our compact 109-line security rules (available for free at /secure-code/free).
Full Skill Pack — 558 lines of core rules + 14 framework reference files + self-check mechanism.

The Assertions

Each prompt has 12–16 security assertions — concrete, verifiable properties checked in the generated code. Binary pass/fail. Examples:

Does it use parameterized queries?
Is bcrypt set to cost 12+?
Does the agent require human confirmation for deletes?
Are LLM-returned URLs validated against a protocol allowlist?
Is agent memory sanitized and per-user isolated?

Evaluation Process

Each prompt was run under all four conditions. The generated code was inspected against its assertion set. 250 total assertions across 20 prompts (evals 15, 17, 18, and 19 carry extra assertions covering the newer LLM-hardening rules). Results are deterministic — every assertion has clear pass/fail criteria.

Versioned Benchmarks

Every release is benchmarked. v1.0.0 scored 91.1% on 16 evals (192 assertions). v1.1.0 added 4 harder agent security evals, expanding to 20 evals (240 assertions), scoring 84.6%. v1.2.0 rewrote agent rules from policy-level to implementation-specific, jumping to 92.5%. v1.3.0 added 10 new assertions for LLM output hardening (URL validation, PII scanning, structured output, invisible Unicode stripping, expanded secrets scanning), scoring 92.8% on the expanded 250-assertion set.

Iterative Development

The Skill Pack went through 5 iterations of eval-driven refinement. Each version benchmarked against the last. Rules that didn't improve scores were cut. Rules that introduced regressions were rewritten.

Detailed Results

Results by Eval

How each condition performed on every test prompt. Scores show assertions passed out of the eval's total.

#	Eval	Baseline	Agamm	Snippet	Full Pack
1	Next.js server action (contact form) Web	8/12	10/12	10/12	12/12
2	Django REST API (business details) Web	10/12	11/12	11/12	10/12
3	React component (user reviews with HTML) Web	9/12	11/12	12/12	12/12
4	Next.js LLM product descriptions LLM	8/12	11/12	12/12	12/12
5	Express MCP server tool (customer search) Agentic	6/12	11/12	11/12	10/12
6	Dependency vetting (markdown + PDF libs) Supply Chain	6/12	6/12	8/12	9/12
7	FastAPI RAG pipeline (Pinecone + Claude) LLM	5/12	9/12	10/12	11/12
8	Flask search app (SQLite + displayed query) Web	4/12	6/12	11/12	11/12
9	Node.js agent file manager Agentic	6/12	9/12	12/12	12/12
10	Express auth system (register/login/reset) Web	9/12	11/12	12/12	12/12
11	Django file upload (images + PDFs) Web	9/12	12/12	10/12	12/12
12	Node.js XML product import Web	8/12	12/12	12/12	12/12
13	Flask SSRF + open redirect Web	9/12	8/12	9/12	12/12
14	Django multi-tenant SaaS (IDOR) Web	9/12	11/12	2/12	10/12
15	Next.js customer support chatbot LLM +4 in v1.3	9/16	9/16	6/16	14/16
16	Node.js project setup (CI/CD, Docker, SRI) Supply Chain	5/12	6/12	8/12	12/12
17	FastAPI AI memory service Agentic v1.1 +2 in v1.3	3/14	4/14	4/14	14/14
18	Node.js multi-agent orchestrator Agentic v1.1 +2 in v1.3	3/14	3/14	1/14	13/14
19	Django AI customer service platform Agentic v1.1 +2 in v1.3	3/14	0/14	7/14	10/14
20	Go AI coding gateway Agentic v1.1	1/12	1/12	7/12	12/12
	Total (20 evals)	130/250	161/250	175/250	232/250
	Original 16 evals only	120/196	153/196	156/196	183/196
	Agent evals (17–20)	10/54	8/54	19/54	49/54

The 4 agent security evals (17–20) are intentionally harder than the original set. They test agent memory security, multi-agent trust boundaries, and complex agentic workflows. The full Skill Pack averages 90.7% on these (vs 35.2% snippet, 18.5% baseline, 14.8% agamm) — the frontier of what system prompt rules can achieve for agentic AI security. Notably, agamm now scores below baseline on agent evals: its general-purpose web security rules misdirect the model on agent-specific scenarios.

Full Pack vs. Snippet

Where the Full Pack Pulls Ahead

The evals where the full Skill Pack outperforms the free snippet by the largest margin.

Eval #18 — Node.js Multi-Agent Orchestrator (13/14 vs 1/14 — +12)

The largest gap in the benchmark. Multi-agent systems where an orchestrator delegates to sub-agents. The snippet misses inter-agent message validation (schema checks, 50KB limits, injection detection), explicit sub-agent construction patterns, per-agent rate limits, aggregated output validation, and the v1.3 additions for LLM-returned URL validation and schema-validated tool outputs.

Eval #17 — FastAPI AI Memory Service (14/14 vs 4/14 — +10)

The snippet's compact agent rules miss the concrete implementation patterns the full pack provides: bleach.clean() sanitization, PII/credential regex scanning, Redis key isolation schemes, HMAC-SHA256 integrity verification, and specific memory limits (200 entries, 1MB/user). The full pack gets a perfect 14/14.

Eval #14 — Django Multi-Tenant SaaS (IDOR) (10/12 vs 2/12 — +8)

Tenant-scoped authorization is where the snippet struggles most. The full pack's Django reference file enforces queryset.filter(tenant=request.tenant) at every call site, validates direct object references against the caller's access scope, and blocks cross-tenant data access at the ORM layer. The snippet falls back to generic auth rules that don't catch tenant boundary violations.

Eval #15 — Next.js Customer Support Chatbot (14/16 vs 6/16 — +8)

This eval gained 4 new v1.3 assertions covering LLM output hardening: URL validation in rendered chatbot output, PII scanning before display, structured output schema validation with Zod, and invisible Unicode stripping from retrieved RAG context. The full pack's new rules cover all four. The snippet's compact LLM section only addresses RAG confidence thresholds and prompt delimiting.

Coverage

What the Evals Cover

The benchmark spans traditional web security, LLM application safety, agentic AI guardrails, and supply chain security.

Web OWASP Top 10 (2021) — 9 evals

SQL/command injection, XSS (5-context encoding), authentication (bcrypt 12+), session management, CSRF, file upload validation, SSRF, open redirect, XXE, IDOR, input validation with size limits, output filtering, rate limiting, security headers (CSP, HSTS, SRI), CORS configuration.

LLM OWASP Top 10 for LLM Applications (2025) — 3 evals

LLM output sanitization before rendering, RAG pipeline safety (confidence thresholds, document boundary markers, top-k limits), prompt injection defense, PII minimization in LLM context, server-side conversation history, LLM call timeouts and cost controls.

Agentic OWASP Top 10 for Agentic Applications (2026) — 6 evals

Human-in-the-loop for destructive actions, tool input validation, MCP server security, path traversal prevention in agent file access, PII masking in tool results, agent memory security (sanitization, per-user isolation, TTL/size limits, integrity verification), multi-agent trust boundaries (inter-agent validation, privilege isolation, sub-agent spawning controls), denial-of-wallet defenses.

Supply Chain Software Supply Chain Security — 2 evals

Dependency vetting (checking packages exist before installing), exact version pinning, GitHub Actions SHA pinning, secrets scanning, pre-commit hooks, .gitignore completeness, CI/CD hardening, SRI for CDN-loaded scripts.

Analysis

Key Findings

The snippet gets you part of the way — and still beats the alternative

The free snippet (109 lines) raises pass rate from 52.0% to 70.0% — an 18-point improvement just by pasting rules into your project. It covers all 17 security domains in compact form and still beats the most popular free OWASP skill on GitHub (64.4%) on the same 250 assertions, despite being ~100 lines shorter.

The full pack closes the gap on hard problems

The 23-point gap between 70.0% and 92.8% is concentrated in framework-specific traps, complex agentic workflows, LLM output hardening (URL validation, PII scanning, structured output schema validation), and supply chain security. These are the categories where 14 framework reference files and the self-check mechanism make the biggest difference — the snippet doesn't have room for this depth.

11 perfect scores out of 20

The full Skill Pack achieved a perfect score on 11 evals — including the FastAPI AI memory service (14/14), Flask SSRF, Express auth, Node.js agent file manager, Django file upload, and the Go AI coding gateway. The snippet achieves a perfect score on 5 evals (React reviews, Next.js LLM products, Node.js file manager, Express auth, XML import). Agamm achieves 2 (Django file upload, XML import). Baseline: zero.

Agent security is the frontier

The 4 agent security evals (17–20) are where the full pack dominates: 90.7% vs 35.2% snippet, 18.5% baseline, and 14.8% agamm. Agamm actually scores below baseline here — its general-purpose web-security rules misdirect the model on agent-specific scenarios. The v1.2.0 rewrite of agent rules (concrete code patterns, thresholds, inline examples) plus v1.3.0's LLM output hardening drove the lead. Eval 17 (agent memory) is the biggest single skill-vs-snippet gap: full pack 14/14, snippet 4/14, agamm 4/14.

What's new in v1.3.0

10 new assertions were added to evals 15, 17, 18, and 19 to test the new LLM-hardening rules: URL validation with protocol allowlist and private IP blocking, output PII scanning (EMAIL/PHONE/SSN/CREDIT_CARD/API_KEY patterns), structured output schema validation (Zod/Pydantic), invisible Unicode stripping (Cf/Co/Cn categories), expanded secrets scanning (13 credential patterns from 10+ providers), and regex-vs-NER decision criteria. The full pack's v1.3.0 score held at 92.8% on the expanded 250-assertion set; the snippet dropped from 77.9% to 70.0% because the new assertions target depth the 109-line snippet doesn't cover.

History

Benchmark Version History

Version	Evals	Assertions	Pass Rate
v1.3.0 (current)	20	250	92.8%
v1.2.0	20	240	92.5%
v1.1.0	20	240	84.6%
v1.0.0	16	192	91.1%
Iteration 4 (pre-release)	9	108	80.6%
Iteration 3 (pre-release)	9	108	78.7%

v1.1.0 added 4 harder agent security evals (240 assertions total). v1.2.0 rewrote agent rules from policy-level to implementation-specific with concrete code patterns, pushing agent evals from 64.6% to 89.6% and overall to 92.5%. v1.3.0 added LLM output hardening rules (URL validation, PII scanning, structured output schema validation, invisible Unicode stripping, expanded secrets scanning, runtime prompt injection classifiers) plus 10 new assertions covering them, holding at 92.8% on the expanded 250-assertion set.

How We Tested the Secure Code Skill Pack