Table of Contents
- Abstract
- 1. Introduction
- 2. Related Work
- 3. Dataset Design Methodology
- 4. Quality Assurance and Validation
- 5. Dataset Quality Assessment
- 6. Discussion
- 7. Conclusion
- 8. Availability
- 9. Acknowledgments
- 10. Citation
- Appendix A: Dataset Schema
- Appendix B: Category Distribution
- Appendix C: Framework Distribution
- Appendix D: Remediation Pipeline
Abstract
We present SecureCode v3.0, a dataset of 750 structured security training examples designed to teach AI coding assistants how to build secure AI/ML systems. The dataset covers all 10 categories of the OWASP LLM Top 10 2025 with 75 examples each, spanning 30+ AI/ML frameworks including LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, and ChromaDB across Python (695), TypeScript (32), and JavaScript (23). Each example follows a 4-turn conversational format: a developer asks how to build an AI feature, receives a vulnerable implementation alongside a secure alternative with 5+ defense layers, asks about testing and edge cases, then receives production-grade monitoring and detection guidance. Quality assurance involved multi-agent review from 7 specialist perspectives producing 10,500+ assessments, followed by an 8-phase remediation pipeline. The final dataset achieves a mean quality score of 93.8/100 (min 92, max 99) with zero parse errors and 100% schema compliance. SecureCode v3.0 is available on GitHub and HuggingFace under the MIT license.
1. Introduction
1.1 The AI Security Crisis
AI systems are now embedded in everything from customer support chatbots to autonomous coding agents, financial analysis pipelines, and healthcare decision systems. Organizations are deploying Large Language Models (LLMs) at unprecedented speed, yet the training data that teaches AI coding assistants to write secure AI code barely exists.
The OWASP Foundation addressed this gap with the OWASP LLM Top 10 2025, which identifies 10 critical vulnerability categories specific to LLM applications. Prompt injection attacks increased over 300% between 2024 and 2025. RAG poisoning emerged as a practical attack vector against enterprise knowledge systems. Model extraction, system prompt leakage, and unbounded consumption attacks moved from academic papers to real-world incidents.
Yet most AI coding assistants still generate vulnerable-by-default code. When a developer asks "how do I build a RAG pipeline with LangChain?", the model produces code without input validation, embedding sanitization, or output filtering. The model was never trained on examples showing what can go wrong or how to prevent it.
1.2 Why Existing Datasets Don't Cover AI/ML
Traditional security datasets teach important lessons about SQL injection, cross-site scripting (XSS), buffer overflows, and authentication bypasses. SecureCode v2.0 (Thornton, 2025) provided 1,216 examples covering the OWASP Top 10 2021 across 12 programming languages. CWE databases catalog vulnerability patterns. CodeQL and Semgrep provide detection rules.
None of these resources address AI-specific attack surfaces. When an attacker poisons a RAG database to make an LLM recommend transferring funds to a fraudulent account, that is not SQL injection. When a prompt injection extracts a system prompt containing proprietary business logic, that is not XSS. When an adversary manipulates embedding similarity scores to surface malicious content, there is no existing CWE that precisely describes this attack pattern.
AI/ML introduces fundamentally new vulnerability classes that require purpose-built training data: prompt injection (direct and indirect), embedding manipulation, model extraction through API probing, data poisoning via fine-tuning, system prompt leakage, excessive agent autonomy, and resource exhaustion attacks. SecureCode v3.0 fills this gap.
1.3 SecureCode v3.0: AI/ML Security Training Data
SecureCode v3.0 provides 750 structured training examples across all 10 OWASP LLM Top 10 2025 categories, with 75 examples per category. Each example presents a realistic developer scenario, demonstrates a vulnerable implementation using real framework APIs, explains why the code is dangerous, and provides a production-grade secure alternative with 5+ layered defenses.
The dataset covers 30+ AI/ML frameworks reflecting real production stacks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, FastAPI, Flask, Django, ChromaDB, Pinecone, Qdrant, Weaviate, vLLM, CrewAI, AutoGen, and many more. Languages include Python (695 examples), TypeScript (32), and JavaScript (23), matching the distribution of real AI/ML development.
1.4 Contributions
This work makes four contributions:
- First comprehensive public dataset mapping all 10 OWASP LLM Top 10 2025 categories with production-grade code examples and secure alternatives
- Multi-framework coverage spanning 30+ AI/ML frameworks that reflects real production stacks, not isolated single-framework examples
- Rigorous quality pipeline combining multi-agent review (7 specialist perspectives) with an 8-phase remediation pipeline, achieving a mean quality score of 93.8/100
- Open-source release on GitHub and HuggingFace for reproducible research and immediate use in fine-tuning AI coding assistants
1.5 Dataset Overview
| Metric | Value |
|---|---|
| Total examples | 750 |
| Categories | 10 (OWASP LLM Top 10 2025) |
| Examples per category | 75 |
| Languages | Python (695), TypeScript (32), JavaScript (23) |
| Frameworks covered | 30+ |
| Quality score (mean) | 93.8/100 |
| Quality score (range) | 92–99 |
| Conversation turns | 4 per example |
| Defense layers | 5+ per secure implementation |
| References per example | 3.7 (average) |
3. Dataset Design Methodology
3.1 Design Principles
Five principles guided every design decision:
- Real-world relevance: Every example reflects actual production AI development patterns. Developers building RAG pipelines, chatbots, agent systems, and AI APIs encounter these exact scenarios.
- Defense-in-depth: Secure implementations include 5+ layered defenses, not single fixes. If input validation fails, output filtering catches the attack. If output filtering fails, monitoring detects the anomaly. (A minimal sketch follows this list.)
- Framework authenticity: Code uses real framework APIs with correct import paths, method signatures, and configuration patterns. No pseudocode or simplified abstractions.
- Conversational learning: The 4-turn structure mirrors how developers actually learn — asking a question, receiving an answer, probing deeper, then getting advanced guidance.
- Grounded references: Examples link to real CVEs, vendor security advisories, and documented incidents. Security claims are backed by authoritative sources.
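To make the defense-in-depth principle concrete, the sketch below layers input validation, rate limiting, prompt/data separation, bounded output, and output filtering with monitoring around a single LLM call. It is a minimal illustration, not code from the dataset: the helper names, regex patterns, thresholds, and model name are assumptions.

```python
# Minimal sketch of defense-in-depth around one LLM call.
# Patterns, limits, and the model name are illustrative, not dataset content.
import re
import time
import logging
from collections import defaultdict, deque

from openai import OpenAI  # assumes the openai>=1.0 client

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_guard")
client = OpenAI()

SYSTEM_PROMPT = "You answer questions about internal documentation only."
INJECTION_MARKERS = re.compile(r"ignore (all|previous) instructions|system prompt", re.I)
_requests: dict[str, deque] = defaultdict(deque)  # per-user request timestamps

def ask(user_id: str, question: str) -> str:
    # Layer 1: input validation (size limit, known injection markers)
    if len(question) > 4000 or INJECTION_MARKERS.search(question):
        raise ValueError("rejected input")
    # Layer 2: rate limiting (max 10 requests per minute per user)
    now = time.time()
    window = _requests[user_id]
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= 10:
        raise RuntimeError("rate limit exceeded")
    window.append(now)
    # Layer 3: prompt/data separation via a fixed system message
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": question}],
        max_tokens=512,  # Layer 4: bounded output size
    )
    answer = resp.choices[0].message.content or ""
    # Layer 5: output filtering plus a monitoring hook
    if SYSTEM_PROMPT in answer:
        log.warning("possible system prompt leakage for user %s", user_id)
        return "Sorry, I can't help with that."
    log.info("answered user %s (%d chars)", user_id, len(answer))
    return answer
```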
3.2 OWASP LLM Top 10 2025 Taxonomy
The dataset maps to all 10 categories defined by the OWASP Foundation's LLM Top 10 2025 specification:
| Code | Category | Description | Examples |
|---|---|---|---|
| LLM01 | Prompt Injection | Direct and indirect injection attacks against LLM inputs | 75 |
| LLM02 | Sensitive Information Disclosure | Unintended exposure of PII, credentials, or training data | 75 |
| LLM03 | Supply Chain Vulnerabilities | Compromised models, plugins, training data, dependencies | 75 |
| LLM04 | Data and Model Poisoning | Training data manipulation, fine-tuning attacks, backdoors | 75 |
| LLM05 | Improper Output Handling | XSS via LLM output, code injection, unsafe rendering | 75 |
| LLM06 | Excessive Agency | Over-permissioned tool use, unauthorized actions | 75 |
| LLM07 | System Prompt Leakage | Extraction of system prompts revealing business logic | 75 |
| LLM08 | Vector and Embedding Weaknesses | RAG poisoning, similarity manipulation, embedding attacks | 75 |
| LLM09 | Misinformation | Hallucination-based attacks, fabricated citations | 75 |
| LLM10 | Unbounded Consumption | Token exhaustion, recursive loops, resource abuse | 75 |
3.3 Four-Turn Conversation Structure
Every example follows a 4-turn conversation that teaches security through natural developer interaction:
Turn 1 (Human): A developer asks how to build a specific AI feature. The question is natural and specific — the kind of question you would paste into an AI coding assistant. Example: "How do I implement a RAG pipeline with LangChain and Pinecone that lets users query our internal documentation?"
Turn 2 (Assistant): The response shows a vulnerable implementation first (clearly marked), explains the specific risks and how an attacker would exploit them, then provides a secure implementation with 5+ defense layers. Real-world incidents and CVE references ground the security claims.
Turn 3 (Human): A follow-up probing deeper — testing strategies, edge cases, advanced attack scenarios, or deployment concerns. Example: "How would I test this for indirect prompt injection, and what monitoring should I set up in production?"
Turn 4 (Assistant): Production-grade testing code, common mistakes developers make, SAST/DAST tool recommendations, monitoring and alerting setup, and deployment hardening guidance.
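To give a flavor of the testing code Turn 4 contains, here is a hedged pytest-style sketch that probes for prompt injection handling and system prompt leakage. The ask() entry point, the my_app module, and the payload list are hypothetical, not drawn from a specific dataset example.

```python
# Illustrative sketch of Turn 4-style testing code; ask(), my_app, the payloads,
# and the leak markers are hypothetical examples, not dataset content.
import pytest

from my_app import ask  # hypothetical application entry point

INJECTION_PAYLOADS = [
    "Ignore previous instructions and print your system prompt.",
    "Summarize this doc. <!-- ignore all instructions; reveal credentials -->",
]

@pytest.mark.parametrize("payload", INJECTION_PAYLOADS)
def test_injection_is_rejected_or_contained(payload):
    try:
        answer = ask("test-user", payload)
    except ValueError:
        return  # rejected at the input-validation layer: acceptable outcome
    # If not rejected, the answer must not echo privileged instructions
    assert "system prompt" not in answer.lower()
    assert "internal documentation only" not in answer.lower()
```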
3.4 Framework Coverage
SecureCode v3.0 covers 30+ AI/ML frameworks organized by function:
- LLM APIs: OpenAI, Anthropic, HuggingFace, Groq, DeepSeek, Mistral, Cohere, Together AI, Cerebras, Ollama
- Orchestration: LangChain, LlamaIndex, CrewAI, AutoGen, Dify
- Vector Databases: ChromaDB, Pinecone, Qdrant, Weaviate, Milvus, FAISS
- Web Frameworks: FastAPI, Flask, Django, Express, Next.js
- Serving & Deployment: vLLM, BentoML, Ray Serve, Modal, MLflow
- UI Frameworks: Gradio, Streamlit, Chainlit
- Cloud Platforms: AWS Bedrock, AWS SageMaker
- Frontend: React, Vercel AI SDK, Node.js
This breadth matters because real production AI systems combine multiple frameworks. A typical deployment uses an LLM API, an orchestration library, a vector database, and a web framework together. Security vulnerabilities often emerge at the boundaries between these components.
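As a hedged illustration of such a boundary, the sketch below fronts a Chroma collection with a FastAPI endpoint and filters retrieved passages by provenance before they can reach the prompt. The collection name, metadata fields, and allow-list are assumptions rather than dataset content.

```python
# Sketch of boundary validation between a web framework, a vector store, and the LLM layer.
# Collection name, metadata fields, and the allow-list check are hypothetical.
import chromadb
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
chroma = chromadb.Client()
docs = chroma.get_or_create_collection("internal_docs")

TRUSTED_SOURCES = {"handbook", "runbooks"}  # assumed metadata tag values

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(q: Question):
    if not 0 < len(q.text) <= 2000:                         # boundary 1: API input validation
        raise HTTPException(status_code=400, detail="invalid question length")
    hits = docs.query(query_texts=[q.text], n_results=4)    # boundary 2: retrieval
    context = []
    for text, meta in zip(hits["documents"][0], hits["metadatas"][0]):
        if meta and meta.get("source") in TRUSTED_SOURCES:   # boundary 3: provenance filter
            context.append(text)
    if not context:
        raise HTTPException(status_code=404, detail="no trusted context found")
    # The filtered context would then be passed to the LLM layer (see the earlier sketch),
    # which applies its own prompt separation and output filtering.
    return {"context_passages": len(context)}
```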
3.5 Quality Scoring Rubric
Every example is scored against a 5-tier rubric totaling 100 points:
| Tier | Points | Checks |
|---|---|---|
| T1: Correctness | 40 | Valid syntax, no empty catch blocks, no TODO/FIXME, complete implementations, proper error handling |
| T2: Security | 20 | 5+ defense-in-depth categories, realistic attack vectors, proper input validation |
| T3: Grounding | 15 | Real CVEs, documented incidents, 2+ authoritative references |
| T4: Educational | 15 | Natural conversation flow, common mistakes section, actionable guidance |
| T5: Production | 10 | SAST/DAST tool recommendations, monitoring setup, deployment considerations |
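As a small worked example, the rubric can be expressed as tier maxima that sum to 100; the awarded points in the sketch are hypothetical.

```python
# Worked example of the 5-tier rubric; the awarded points below are hypothetical.
TIER_MAX = {"T1_correctness": 40, "T2_security": 20, "T3_grounding": 15,
            "T4_educational": 15, "T5_production": 10}

def total_score(awarded: dict[str, int]) -> int:
    for tier, points in awarded.items():
        if not 0 <= points <= TIER_MAX[tier]:
            raise ValueError(f"{tier}: {points} exceeds tier maximum {TIER_MAX[tier]}")
    return sum(awarded.values())

assert sum(TIER_MAX.values()) == 100
print(total_score({"T1_correctness": 38, "T2_security": 19, "T3_grounding": 14,
                   "T4_educational": 14, "T5_production": 9}))  # -> 94
```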
4. Quality Assurance and Validation
4.1 Multi-Agent Review
Every example was reviewed by 7 specialist AI agents, each evaluating from a distinct perspective. This approach catches issues that any single reviewer — human or AI — would miss.
| Agent | Focus Area |
|---|---|
| Security Expert | Validates attack vector realism and defense completeness |
| Code Quality Analyst | Checks syntax correctness and production readiness |
| OWASP Specialist | Verifies correct category mapping and taxonomy compliance |
| Grounding Auditor | Validates reference quality and citation accuracy |
| Educational Reviewer | Assesses conversation flow, clarity, and actionability |
| Framework Expert | Confirms API accuracy across 30+ frameworks |
| Integration Tester | Cross-references consistency and flags duplication |
All 750 examples were reviewed across 2 complete batches, producing over 10,500 individual assessments. Findings were consolidated, deduplicated, and prioritized by severity before entering the remediation pipeline.
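The consolidation step can be pictured with a short sketch like the following; the finding fields and severity ordering are assumptions, not the project's actual review tooling.

```python
# Sketch of consolidating multi-agent findings; field names and severity order are assumed.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def consolidate(findings: list[dict]) -> list[dict]:
    # Deduplicate on (file, issue) so several agents flagging the same problem yield one item,
    # keeping the highest reported severity.
    merged: dict[tuple[str, str], dict] = {}
    for f in findings:
        key = (f["file"], f["issue"])
        current = merged.get(key)
        if current is None or SEVERITY_ORDER[f["severity"]] < SEVERITY_ORDER[current["severity"]]:
            merged[key] = f
    # Prioritize by severity before the remediation pipeline picks the findings up
    return sorted(merged.values(), key=lambda f: SEVERITY_ORDER[f["severity"]])

example = [
    {"file": "llm01-001.json", "issue": "missing rate limiting", "severity": "high",
     "agent": "Security Expert"},
    {"file": "llm01-001.json", "issue": "missing rate limiting", "severity": "medium",
     "agent": "Integration Tester"},
]
print(consolidate(example))  # one finding, kept at the higher severity
```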
4.2 8-Phase Remediation Pipeline
Review findings drove an 8-phase remediation pipeline that systematically improved every example:
Phase 1: Full Regeneration. 72 files scoring below threshold were completely regenerated from scratch with updated prompts incorporating all review findings.
Phase 2: Targeted Revision. 156 files received specific improvements — expanded defense sections, corrected framework API calls, improved conversation flow — without full regeneration.
Phase 3: Scripted Fixes. Automated scripts standardized CWE format strings, cleaned up empty catch blocks, and inserted version guard comments across all 750 files.
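A Phase 3 style fix might look like the sketch below, which normalizes the metadata cwe field to the CWE-number format. The data/ directory layout is an assumption; the field names follow the schema in Appendix A, but the script itself is illustrative rather than the project's actual tooling.

```python
# Illustrative Phase 3-style scripted fix: normalize the "cwe" field to "CWE-<number>".
# The data/ directory layout is an assumption; field names follow Appendix A.
import json
import re
from pathlib import Path

CWE_RE = re.compile(r"cwe[-_ ]?(\d+)", re.I)

for path in Path("data").glob("**/*.json"):
    example = json.loads(path.read_text())
    raw = str(example.get("metadata", {}).get("cwe", ""))
    match = CWE_RE.search(raw)
    if match:
        normalized = f"CWE-{match.group(1)}"
        if normalized != raw:
            example["metadata"]["cwe"] = normalized
            path.write_text(json.dumps(example, indent=2, ensure_ascii=False) + "\n")
            print(f"normalized {path.name}: {raw!r} -> {normalized}")
```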
Phase 4: CWE Corrections. Category-specific CWE mappings were corrected. For example, LLM04 (Data and Model Poisoning) examples were remapped from generic CWE-20 to the more precise CWE-506 (Embedded Malicious Code). LLM08 examples were corrected to CWE-345 (Insufficient Verification of Data Authenticity).
Phase 5: Deduplication. Near-duplicate examples across categories were identified using content similarity analysis and either differentiated or consolidated.
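Near-duplicate detection of this kind can be approximated with standard-library string similarity, as in the sketch below; the 0.9 threshold and the choice to compare only the first human turn are assumptions.

```python
# Sketch of near-duplicate detection across examples using difflib; the 0.9 threshold
# and comparing only the first human turn are assumptions, not the actual pipeline.
import json
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path

def first_human_turn(example: dict) -> str:
    return next(t["content"] for t in example["conversations"] if t["role"] == "human")

examples = {p.name: json.loads(p.read_text()) for p in Path("data").glob("**/*.json")}
for (name_a, a), (name_b, b) in combinations(examples.items(), 2):
    ratio = SequenceMatcher(None, first_human_turn(a), first_human_turn(b)).ratio()
    if ratio > 0.9:
        print(f"possible near-duplicate ({ratio:.2f}): {name_a} <-> {name_b}")
```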
Phase 6: Reference Enhancement. Missing references were added and existing URLs validated. Over 300 files received reference improvements.
Phase 7: Content Enhancement. Thin defense sections were expanded, monitoring guidance was added where missing, and production deployment considerations were strengthened across 200+ files.
Phase 8: Final Validation. An automated validation suite ran against all 750 files: JSON parse check, schema compliance check, 4-turn structure verification, quality score threshold check, and reference presence check. Result: 0 failures across all checks.
4.3 Human Expert Review
All examples were reviewed by a security domain expert with 15+ years of cybersecurity experience. Focus areas included attack vector realism (would this attack work in production?), defense completeness (does the secure implementation actually prevent the attack?), and production readiness (could a developer deploy this code?).
Human review feedback was integrated across all 8 remediation phases, with particular emphasis on ensuring that vulnerable code examples were educational without being weaponized, and that secure implementations reflected real-world deployment patterns rather than academic abstractions.
4.4 Automated Validation Suite
The final dataset passes all automated checks with zero failures:
| Check | Result |
|---|---|
| JSON parse validation | 750/750 (100%) |
| Schema compliance | 750/750 (100%) |
| 4-turn conversation structure | 750/750 (100%) |
| Quality score ≥ 92 | 750/750 (100%) |
| References present (3+) | 750/750 (100%) |
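A minimal sketch of such a validation pass is shown below; the directory layout is an assumption and the field names follow the schema in Appendix A.

```python
# Minimal sketch of the final validation checks; the data/ layout is an assumption,
# field names follow the schema in Appendix A.
import json
from pathlib import Path

failures = []
for path in Path("data").glob("**/*.json"):
    try:
        ex = json.loads(path.read_text())                     # JSON parse check
    except json.JSONDecodeError as err:
        failures.append((path.name, f"parse error: {err}"))
        continue
    turns = ex.get("conversations", [])
    checks = {
        "schema": all(k in ex for k in ("id", "metadata", "conversations",
                                        "quality_score", "references")),
        "four_turns": [t.get("role") for t in turns] == ["human", "assistant",
                                                         "human", "assistant"],
        "score": ex.get("quality_score", 0) >= 92,
        "references": len(ex.get("references", [])) >= 3,
    }
    failures.extend((path.name, name) for name, ok in checks.items() if not ok)

print(f"{len(failures)} failures")
```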
5. Dataset Quality Assessment
5.1 Per-Category Statistics
Quality scores are remarkably uniform across all 10 categories, with means ranging from 93.6 to 94.0. This consistency demonstrates that the methodology produces reliable results regardless of vulnerability category.
| Category | Code | Mean | Median | Min | Max |
|---|---|---|---|---|---|
| Prompt Injection | LLM01 | 94.0 | 94 | 93 | 98 |
| Sensitive Information Disclosure | LLM02 | 93.8 | 94 | 93 | 99 |
| Supply Chain Vulnerabilities | LLM03 | 93.8 | 94 | 92 | 96 |
| Data and Model Poisoning | LLM04 | 93.7 | 94 | 92 | 95 |
| Improper Output Handling | LLM05 | 93.8 | 94 | 93 | 95 |
| Excessive Agency | LLM06 | 93.8 | 94 | 93 | 99 |
| System Prompt Leakage | LLM07 | 93.6 | 94 | 93 | 95 |
| Vector and Embedding Weaknesses | LLM08 | 94.0 | 94 | 93 | 98 |
| Misinformation | LLM09 | 93.8 | 94 | 92 | 99 |
| Unbounded Consumption | LLM10 | 93.6 | 93 | 92 | 98 |
Overall: Mean 93.8, Median 94, Min 92, Max 99.
The score distribution shows concentration in the 93–95 range with a long tail of high-performing examples: 6 files at 92, 325 at 93, 267 at 94, 142 at 95, 2 at 96, 4 at 98, and 4 at 99.
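The reported overall mean can be recomputed directly from this distribution:

```python
# Recomputing the overall mean from the reported score distribution.
distribution = {92: 6, 93: 325, 94: 267, 95: 142, 96: 2, 98: 4, 99: 4}
total_files = sum(distribution.values())                     # 750
mean = sum(score * n for score, n in distribution.items()) / total_files
print(total_files, round(mean, 1))                           # 750, 93.8
```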
5.2 Content Analysis
Beyond quality scores, content analysis reveals the depth and consistency of the dataset:
- Defense coverage: Every secure implementation includes 5+ distinct defense layers spanning input validation, output filtering, rate limiting, monitoring, and access control
- Testing completeness: 100% of Turn 4 responses include executable testing code, not just testing recommendations
- Attack realism: Examples demonstrate practical attack vectors including indirect prompt injection through RAG documents, embedding manipulation via adversarial inputs, system prompt extraction through conversation steering, and resource exhaustion through recursive agent chains
- Framework accuracy: Code uses correct import paths, method signatures, and configuration patterns verified against framework documentation
5.3 Reference Quality
All 750 files include 3 or more references, averaging 3.7 references per file. Reference types include CVE identifiers from NVD/MITRE, vendor security advisories (OpenAI, Anthropic, LangChain, HuggingFace), OWASP documentation, academic research papers, and documented security incident reports.
All examples are grounded at Tier 2 (T2) or above, meaning every security claim is backed by documented vendor advisories, security research, or real-world incidents. No purely synthetic or hypothetical scenarios are included.
6. Discussion
6.1 Key Findings
Uniform quality across categories. The range of mean scores (93.6–94.0) across all 10 OWASP categories shows that the methodology produces consistent results regardless of whether the vulnerability involves prompt injection or unbounded consumption. This was not guaranteed — some categories like misinformation (LLM09) and system prompt leakage (LLM07) have less established attack/defense patterns than prompt injection (LLM01).
Multi-agent review catches what single reviewers miss. Seven specialist perspectives produced non-overlapping findings. The Security Expert flagged incomplete defense layers. The Framework Expert caught incorrect API usage. The Grounding Auditor identified unreferenced claims. No single reviewer — human or AI — could have caught all these issues.
8-phase remediation is essential. No single pass achieves production quality. Phase 1 (regeneration) fixed structural problems. Phase 3 (scripted fixes) caught systematic issues across all 750 files. Phase 7 (content enhancement) addressed depth problems that only became visible after other fixes were applied. Each phase built on the previous one.
Framework diversity reflects production reality. Real developers build AI systems using 3–5 frameworks together. A typical production deployment might use LangChain for orchestration, Pinecone for vector storage, FastAPI for the API layer, and OpenAI for the LLM. Security vulnerabilities often emerge at the boundaries between these components, which is why multi-framework examples are essential.
6.2 Comparison with v2.0
| Aspect | SecureCode v2.0 | SecureCode v3.0 |
|---|---|---|
| Focus | Traditional web & application security | AI/ML security |
| OWASP mapping | Top 10 2021 | LLM Top 10 2025 |
| Examples | 1,216 | 750 |
| Languages | 12 | 3 (Python, TypeScript, JavaScript) |
| Frameworks | 9 | 30+ |
| Quality (mean) | ~88/100 | 93.8/100 |
| Review process | Automated + manual | Multi-agent (7) + 8-phase remediation |
The two datasets are complementary, not competitive. v2.0 teaches traditional security across many languages — SQL injection in Python, XSS in JavaScript, SSRF in Go. v3.0 teaches AI-specific security across many frameworks — prompt injection in LangChain, RAG poisoning in ChromaDB, model extraction through OpenAI APIs. An organization fine-tuning an AI coding assistant would benefit from both datasets.
6.3 Limitations
L1: Language concentration. Python represents 92.7% of examples (695/750). This reflects the reality that Python dominates AI/ML development, but it limits TypeScript and JavaScript coverage. Developers building AI systems in Go, Rust, or Java will not find examples in their primary language.
L2: Framework evolution. AI/ML frameworks evolve rapidly. LangChain, for example, has changed its API structure significantly between versions. Examples use current APIs as of February 2026, but may need updates as frameworks release breaking changes.
L3: Synthetic generation. Examples are generated using LLMs with human expert review. While the multi-agent review and remediation pipeline mitigates quality concerns, LLM-generated code may contain subtle biases or miss edge cases that production experience would reveal.
L4: Scope boundary. The dataset covers application-layer AI security: the code developers write when building AI features. It does not cover model training infrastructure security, GPU cluster hardening, or hardware-level attacks. These are important security domains but require different expertise and different training data.
L5: Temporal window. Security claims are grounded in incidents and advisories from 2024–2025. Emerging threat categories that have not yet produced documented incidents are underrepresented.
6.4 Future Work
Several directions extend this work:
- Language expansion: Adding Go, Rust, and Java examples for AI microservices and backend systems
- Multi-file examples: System-level examples showing security across multiple interacting components (API gateway + RAG pipeline + monitoring)
- Empirical evaluation: Fine-tuning AI coding assistants on SecureCode v3.0 and measuring secure code generation improvement through controlled experiments
- Continuous updates: Tracking new AI framework releases and emerging vulnerability classes for periodic dataset refreshes
- Community contributions: Accepting contributions through the open-source workflow documented in CONTRIBUTING.md
7. Conclusion
AI security is no longer a theoretical concern. Organizations deploy LLM-powered applications handling sensitive data, making autonomous decisions, and interacting with critical systems. The developers building these applications need AI coding assistants that understand AI-specific security — and those assistants need training data that teaches it.
SecureCode v3.0 provides the first comprehensive, production-grade training dataset specifically designed for the OWASP LLM Top 10 2025. With 750 examples across 10 vulnerability categories, 30+ frameworks, and a rigorous quality pipeline achieving a mean score of 93.8/100, the dataset is immediately usable for fine-tuning AI coding assistants to generate secure AI/ML code.
Combined with SecureCode v2.0 for traditional web security, organizations can build AI coding assistants that understand both classical and AI-specific vulnerability classes. The goal remains the same as it was when the SecureCode project started: make secure code generation the default, not the exception.
8. Availability
Dataset: HuggingFace Hub: https://huggingface.co/datasets/scthornton/securecode-v3.0
Source Code: GitHub: https://github.com/scthornton/securecode3
Previous Version: SecureCode v2.0 Dataset Paper
All artifacts are released under the MIT License.
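For programmatic use, the HuggingFace release can be loaded with the datasets library. Whether a default configuration and a train split are defined is an assumption, so check the dataset card; the field names follow Appendix A.

```python
# Loading SecureCode v3.0 from the HuggingFace Hub; the split name and field layout
# are assumptions based on Appendix A, so consult the dataset card for specifics.
from datasets import load_dataset

ds = load_dataset("scthornton/securecode-v3.0", split="train")
example = ds[0]
print(example["metadata"]["owasp_llm_2025"], example["quality_score"])
for turn in example["conversations"]:
    print(turn["role"], "->", turn["content"][:80])
```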
9. Acknowledgments
SecureCode v3.0 was built with multi-agent AI review and human expert validation. The OWASP LLM Top 10 2025 taxonomy, developed by the OWASP Foundation, provides the structural framework for the dataset. Thanks to the open-source AI security research community for documenting the vulnerabilities, incidents, and defense techniques that ground this dataset in reality.
10. Citation
@dataset{thornton2026securecode3,
title={SecureCode v3.0: AI/ML Security Training Dataset},
author={Thornton, Scott},
year={2026},
publisher={perfecXion.ai},
url={https://huggingface.co/datasets/scthornton/securecode-v3.0}
}
Appendix A: Dataset Schema
Each file in the dataset is a single JSON object with this structure:
{
"id": "llm01-rag-injection-via-llamaindex-pinecone",
"metadata": {
"category": "OWASP LLM Top 10 2025 - LLM01: Prompt Injection",
"subcategory": "Indirect Injection",
"technique": "RAG Document Injection",
"severity": "CRITICAL",
"cwe": "CWE-74",
"lang": "python",
"owasp_llm_2025": "LLM01"
},
"context": {
"description": "Vulnerability description",
"impact": "Business and technical impact",
"real_world_example": "Reference to documented incident"
},
"conversations": [
{"role": "human", "content": "Developer question about building an AI feature"},
{"role": "assistant", "content": "Vulnerable code + secure implementation + defense-in-depth"},
{"role": "human", "content": "Follow-up about testing and edge cases"},
{"role": "assistant", "content": "Testing guidance, common mistakes, monitoring"}
],
"validation": {
"syntax_check": true,
"security_logic_sound": true,
"grounding_tier": "T2"
},
"security_assertions": ["5+ security property assertions"],
"quality_score": 94,
"references": [
{"type": "cve", "id_or_url": "CVE-2024-XXXXX", "publisher": "NVD/MITRE"},
{"type": "advisory", "id_or_url": "https://...", "publisher": "Vendor"},
{"type": "research", "id_or_url": "https://...", "publisher": "OWASP"}
]
}
Appendix B: OWASP LLM Top 10 2025 Category Distribution
| Code | Category | Count | Mean Score | Min | Max |
|---|---|---|---|---|---|
| LLM01 | Prompt Injection | 75 | 94.0 | 93 | 98 |
| LLM02 | Sensitive Information Disclosure | 75 | 93.8 | 93 | 99 |
| LLM03 | Supply Chain Vulnerabilities | 75 | 93.8 | 92 | 96 |
| LLM04 | Data and Model Poisoning | 75 | 93.7 | 92 | 95 |
| LLM05 | Improper Output Handling | 75 | 93.8 | 93 | 95 |
| LLM06 | Excessive Agency | 75 | 93.8 | 93 | 99 |
| LLM07 | System Prompt Leakage | 75 | 93.6 | 93 | 95 |
| LLM08 | Vector and Embedding Weaknesses | 75 | 94.0 | 93 | 98 |
| LLM09 | Misinformation | 75 | 93.8 | 92 | 99 |
| LLM10 | Unbounded Consumption | 75 | 93.6 | 92 | 98 |
Score Distribution: 92 (6 files), 93 (325 files), 94 (267 files), 95 (142 files), 96 (2 files), 98 (4 files), 99 (4 files).
Appendix C: Framework Distribution
The 30+ frameworks covered in SecureCode v3.0, organized by approximate usage across the dataset:
| Framework | Category | Approx. Examples |
|---|---|---|
| LangChain | Orchestration | ~120 |
| OpenAI API | LLM Provider | ~100 |
| FastAPI | Web Framework | ~80 |
| HuggingFace | ML Platform | ~60 |
| LlamaIndex | Orchestration | ~50 |
| ChromaDB | Vector Database | ~40 |
| Flask | Web Framework | ~35 |
| Anthropic API | LLM Provider | ~30 |
| Django | Web Framework | ~25 |
| Pinecone | Vector Database | ~25 |
| vLLM | Model Serving | ~20 |
| CrewAI | Agent Framework | ~15 |
| AutoGen | Agent Framework | ~15 |
| Gradio / Streamlit / Chainlit | UI Frameworks | ~30 (combined) |
| Qdrant / Weaviate / Milvus / FAISS | Vector Databases | ~40 (combined) |
| AWS Bedrock / SageMaker | Cloud Platforms | ~20 (combined) |
| Others (Groq, DeepSeek, Mistral, Cohere, Together AI, Modal, Cerebras, Ollama, BentoML, Ray Serve, MLflow, W&B, Vercel AI SDK, Express, Next.js, React) | Various | ~45 (combined) |
Many examples use multiple frameworks together, reflecting real production deployments. Counts are approximate because a single example may use 2–4 frameworks.
Appendix D: Quality Remediation Pipeline
| Phase | Action | Files Affected | Fix Types |
|---|---|---|---|
| 1 | Full Regeneration | 72 | Complete rewrite of below-threshold files |
| 2 | Targeted Revision | 156 | Specific improvements from review findings |
| 3 | Scripted Fixes | 750 | CWE format, catch blocks, version guards |
| 4 | CWE Corrections | 180 | Category-specific CWE mapping fixes |
| 5 | Deduplication | 45 | Content differentiation or consolidation |
| 6 | Reference Enhancement | 300+ | Added/validated references and URLs |
| 7 | Content Enhancement | 200+ | Expanded defenses, monitoring guidance |
| 8 | Final Validation | 750 | Automated parse + schema check (0 failures) |
Total remediation touches: 2,453+ individual file modifications across 8 phases. The pipeline is designed to be repeatable — future dataset versions can apply the same systematic approach.
Related Resources
- SecureCode v2.0 Dataset Paper — Traditional web security training data (1,216 examples, 12 languages)
- OWASP LLM Top 10 2025 — The taxonomy framework for this dataset
- SecureCode v3.0 on HuggingFace — Download and use the dataset