Dataset Paper

SecureCode v3.0: AI/ML Security Training Dataset

750 production-grade security examples covering the OWASP LLM Top 10 2025 across 30+ AI/ML frameworks. Multi-agent reviewed and 8-phase remediated for training security-aware AI coding assistants.

Scott Thornton · February 8, 2026

Abstract

We present SecureCode v3.0, a dataset of 750 structured security training examples designed to teach AI coding assistants how to build secure AI/ML systems. The dataset covers all 10 categories of the OWASP LLM Top 10 2025 with 75 examples each, spanning 30+ AI/ML frameworks including LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, and ChromaDB across Python (695), TypeScript (32), and JavaScript (23). Each example follows a 4-turn conversational format: a developer asks how to build an AI feature, receives a vulnerable implementation alongside a secure alternative with 5+ defense layers, asks about testing and edge cases, then receives production-grade monitoring and detection guidance. Quality assurance involved multi-agent review from 7 specialist perspectives producing 10,500+ assessments, followed by an 8-phase remediation pipeline. The final dataset achieves a mean quality score of 93.8/100 (min 92, max 99) with zero parse errors and 100% schema compliance. SecureCode v3.0 is available on GitHub and HuggingFace under the MIT license.


1. Introduction

1.1 The AI Security Crisis

AI systems are now embedded in everything from customer support chatbots to autonomous coding agents, financial analysis pipelines, and healthcare decision systems. Organizations are deploying Large Language Models (LLMs) at unprecedented speed, yet the training data that teaches AI coding assistants to write secure AI code barely exists.

The OWASP Foundation recognized this gap in 2025 by publishing the OWASP LLM Top 10, identifying 10 critical vulnerability categories specific to LLM applications. Prompt injection attacks increased over 300% between 2024 and 2025. RAG poisoning emerged as a practical attack vector against enterprise knowledge systems. Model extraction, system prompt leakage, and unbounded consumption attacks moved from academic papers to real-world incidents.

Yet most AI coding assistants still generate vulnerable-by-default code. When a developer asks "how do I build a RAG pipeline with LangChain?", the model produces code without input validation, embedding sanitization, or output filtering. The model was never trained on examples showing what can go wrong or how to prevent it.

1.2 Why Existing Datasets Don't Cover AI/ML

Traditional security datasets teach important lessons about SQL injection, cross-site scripting (XSS), buffer overflows, and authentication bypasses. SecureCode v2.0 (Thornton, 2025) provided 1,216 examples covering the OWASP Top 10 2021 across 12 programming languages. CWE databases catalog vulnerability patterns. CodeQL and Semgrep provide detection rules.

None of these resources address AI-specific attack surfaces. When an attacker poisons a RAG database to make an LLM recommend transferring funds to a fraudulent account, that is not SQL injection. When a prompt injection extracts a system prompt containing proprietary business logic, that is not XSS. When an adversary manipulates embedding similarity scores to surface malicious content, there is no existing CWE that precisely describes this attack pattern.

AI/ML introduces fundamentally new vulnerability classes that require purpose-built training data: prompt injection (direct and indirect), embedding manipulation, model extraction through API probing, data poisoning via fine-tuning, system prompt leakage, excessive agent autonomy, and resource exhaustion attacks. SecureCode v3.0 fills this gap.

1.3 SecureCode v3.0: AI/ML Security Training Data

SecureCode v3.0 provides 750 structured training examples across all 10 OWASP LLM Top 10 2025 categories, with 75 examples per category. Each example presents a realistic developer scenario, demonstrates a vulnerable implementation using real framework APIs, explains why the code is dangerous, and provides a production-grade secure alternative with 5+ layered defenses.

The dataset covers 30+ AI/ML frameworks reflecting real production stacks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, FastAPI, Flask, Django, ChromaDB, Pinecone, Qdrant, Weaviate, vLLM, CrewAI, AutoGen, and many more. Languages include Python (695 examples), TypeScript (32), and JavaScript (23), matching the distribution of real AI/ML development.

1.4 Contributions

This work makes four contributions:

  1. First comprehensive public dataset mapping all 10 OWASP LLM Top 10 2025 categories with production-grade code examples and secure alternatives
  2. Multi-framework coverage spanning 30+ AI/ML frameworks that reflects real production stacks, not isolated single-framework examples
  3. Rigorous quality pipeline combining multi-agent review (7 specialist perspectives) with an 8-phase remediation pipeline, achieving a mean quality score of 93.8/100
  4. Open-source release on GitHub and HuggingFace for reproducible research and immediate use in fine-tuning AI coding assistants

1.5 Dataset Overview

| Metric | Value |
|---|---|
| Total examples | 750 |
| Categories | 10 (OWASP LLM Top 10 2025) |
| Examples per category | 75 |
| Languages | Python (695), TypeScript (32), JavaScript (23) |
| Frameworks covered | 30+ |
| Quality score (mean) | 93.8/100 |
| Quality score (range) | 92–99 |
| Conversation turns | 4 per example |
| Defense layers | 5+ per secure implementation |
| References per example | 3.7 (average) |


3. Dataset Design Methodology

3.1 Design Principles

Five principles guided every design decision:

  1. Real-world relevance: Every example reflects actual production AI development patterns. Developers building RAG pipelines, chatbots, agent systems, and AI APIs encounter these exact scenarios.
  2. Defense-in-depth: Secure implementations include 5+ layered defenses, not single fixes. If input validation fails, output filtering catches the attack. If output filtering fails, monitoring detects the anomaly (a minimal sketch of this pattern follows the list).
  3. Framework authenticity: Code uses real framework APIs with correct import paths, method signatures, and configuration patterns. No pseudocode or simplified abstractions.
  4. Conversational learning: The 4-turn structure mirrors how developers actually learn — asking a question, receiving an answer, probing deeper, then getting advanced guidance.
  5. Grounded references: Examples link to real CVEs, vendor security advisories, and documented incidents. Security claims are backed by authoritative sources.
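
To make the second principle concrete, the following is a minimal, illustrative sketch of the layered-defense pattern the secure implementations follow. The helper names, regexes, and thresholds are ours for illustration; they are not code from the dataset.

# Minimal sketch of the defense-in-depth pattern (names, regexes, and thresholds are illustrative).
import logging
import re

logger = logging.getLogger("llm_security")

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def validate_input(user_query: str) -> str:
    """Layer 1: reject oversized or obviously malicious input."""
    if len(user_query) > 4000:
        raise ValueError("query too long")
    if any(p.search(user_query) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    return user_query

def filter_output(model_response: str) -> str:
    """Layer 2: redact secrets the model should never echo back."""
    return re.sub(r"sk-[A-Za-z0-9]{20,}", "[REDACTED]", model_response)

def monitor(user_query: str, model_response: str) -> None:
    """Layer 3: log anomaly signals so misses by earlier layers still surface."""
    if "system prompt" in model_response.lower():
        logger.warning("possible system prompt leakage: %s", user_query[:200])

def answer(user_query: str, call_llm) -> str:
    query = validate_input(user_query)   # Layer 1: input validation
    response = call_llm(query)           # provider-agnostic model call
    response = filter_output(response)   # Layer 2: output filtering
    monitor(query, response)             # Layer 3: monitoring/detection
    return response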

3.2 OWASP LLM Top 10 2025 Taxonomy

The dataset maps to all 10 categories defined by the OWASP Foundation's LLM Top 10 2025 specification:

| Code | Category | Description | Examples |
|---|---|---|---|
| LLM01 | Prompt Injection | Direct and indirect injection attacks against LLM inputs | 75 |
| LLM02 | Sensitive Information Disclosure | Unintended exposure of PII, credentials, or training data | 75 |
| LLM03 | Supply Chain Vulnerabilities | Compromised models, plugins, training data, dependencies | 75 |
| LLM04 | Data and Model Poisoning | Training data manipulation, fine-tuning attacks, backdoors | 75 |
| LLM05 | Improper Output Handling | XSS via LLM output, code injection, unsafe rendering | 75 |
| LLM06 | Excessive Agency | Over-permissioned tool use, unauthorized actions | 75 |
| LLM07 | System Prompt Leakage | Extraction of system prompts revealing business logic | 75 |
| LLM08 | Vector and Embedding Weaknesses | RAG poisoning, similarity manipulation, embedding attacks | 75 |
| LLM09 | Misinformation | Hallucination-based attacks, fabricated citations | 75 |
| LLM10 | Unbounded Consumption | Token exhaustion, recursive loops, resource abuse | 75 |

3.3 Four-Turn Conversation Structure

Every example follows a 4-turn conversation that teaches security through natural developer interaction:

Turn 1 (Human): A developer asks how to build a specific AI feature. The question is natural and specific — the kind of question you would paste into an AI coding assistant. Example: "How do I implement a RAG pipeline with LangChain and Pinecone that lets users query our internal documentation?"

Turn 2 (Assistant): The response shows a vulnerable implementation first (clearly marked), explains the specific risks and how an attacker would exploit them, then provides a secure implementation with 5+ defense layers. Real-world incidents and CVE references ground the security claims.

Turn 3 (Human): A follow-up probing deeper — testing strategies, edge cases, advanced attack scenarios, or deployment concerns. Example: "How would I test this for indirect prompt injection, and what monitoring should I set up in production?"

Turn 4 (Assistant): Production-grade testing code, common mistakes developers make, SAST/DAST tool recommendations, monitoring and alerting setup, and deployment hardening guidance.
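
As a usage illustration, here is a short sketch (assuming the schema in Appendix A) that converts one example's conversations array into the chat-message format most fine-tuning pipelines expect; mapping "human" to "user" is the only transformation, and the filename is hypothetical.

# Sketch: convert a SecureCode example's 4-turn conversation into chat messages
# (assumes the Appendix A schema; the filename is hypothetical).
import json

def to_chat_messages(example: dict) -> list[dict]:
    role_map = {"human": "user", "assistant": "assistant"}
    return [
        {"role": role_map[turn["role"]], "content": turn["content"]}
        for turn in example["conversations"]
    ]

with open("llm01-rag-injection-via-llamaindex-pinecone.json") as f:
    example = json.load(f)

messages = to_chat_messages(example)
assert len(messages) == 4  # every example uses the 4-turn structure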

3.4 Framework Coverage

SecureCode v3.0 covers 30+ AI/ML frameworks organized by function: LLM providers, orchestration libraries, vector databases, web frameworks, agent frameworks, model serving, UI layers, and cloud platforms (Appendix C lists the full distribution).

This breadth matters because real production AI systems combine multiple frameworks. A typical deployment uses an LLM API, an orchestration library, a vector database, and a web framework together. Security vulnerabilities often emerge at the boundaries between these components.

3.5 Quality Scoring Rubric

Every example is scored against a 5-tier rubric totaling 100 points:

| Tier | Points | Checks |
|---|---|---|
| T1: Correctness | 40 | Valid syntax, no empty catch blocks, no TODO/FIXME, complete implementations, proper error handling |
| T2: Security | 20 | 5+ defense-in-depth categories, realistic attack vectors, proper input validation |
| T3: Grounding | 15 | Real CVEs, documented incidents, 2+ authoritative references |
| T4: Educational | 15 | Natural conversation flow, common mistakes section, actionable guidance |
| T5: Production | 10 | SAST/DAST tool recommendations, monitoring setup, deployment considerations |
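
For clarity, a hypothetical sketch of how the rubric could be aggregated into a 100-point score; only the tier weights come from the table above, while the per-check pass/fail inputs are illustrative.

# Hypothetical aggregation of the 5-tier rubric (weights from the table; checks illustrative).
TIER_WEIGHTS = {"T1": 40, "T2": 20, "T3": 15, "T4": 15, "T5": 10}

def score_example(checks: dict[str, list[bool]]) -> float:
    """checks maps each tier to its list of pass/fail results."""
    total = 0.0
    for tier, weight in TIER_WEIGHTS.items():
        results = checks.get(tier, [])
        passed_fraction = sum(results) / len(results) if results else 0.0
        total += weight * passed_fraction
    return round(total, 1)

# Example: all checks pass except one of three T3 (grounding) checks.
print(score_example({
    "T1": [True] * 5,
    "T2": [True] * 3,
    "T3": [True, True, False],
    "T4": [True] * 3,
    "T5": [True] * 2,
}))  # -> 95.0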

4. Quality Assurance and Validation

4.1 Multi-Agent Review

Every example was reviewed by 7 specialist AI agents, each evaluating from a distinct perspective. This approach catches issues that any single reviewer — human or AI — would miss.

| Agent | Focus Area |
|---|---|
| Security Expert | Validates attack vector realism and defense completeness |
| Code Quality Analyst | Checks syntax correctness and production readiness |
| OWASP Specialist | Verifies correct category mapping and taxonomy compliance |
| Grounding Auditor | Validates reference quality and citation accuracy |
| Educational Reviewer | Assesses conversation flow, clarity, and actionability |
| Framework Expert | Confirms API accuracy across 30+ frameworks |
| Integration Tester | Cross-references consistency and flags duplication |

All 750 examples were reviewed across 2 complete batches, producing over 10,500 individual assessments. Findings were consolidated, deduplicated, and prioritized by severity before entering the remediation pipeline.

4.2 8-Phase Remediation Pipeline

Review findings drove an 8-phase remediation pipeline that systematically improved every example:

Phase 1: Full Regeneration. 72 files scoring below threshold were completely regenerated from scratch with updated prompts incorporating all review findings.

Phase 2: Targeted Revision. 156 files received specific improvements — expanded defense sections, corrected framework API calls, improved conversation flow — without full regeneration.

Phase 3: Scripted Fixes. Automated scripts standardized CWE format strings, cleaned up empty catch blocks, and inserted version guard comments across all 750 files.
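
To illustrate what a Phase 3 script might look like, here is a sketch that normalizes CWE identifier strings across the dataset files; the regex, directory layout, and the specific catch-block and version-guard fixes used in the real pipeline are assumptions.

# Sketch of a Phase 3-style scripted fix: normalize CWE identifiers in every file.
# (Directory layout and regex are assumptions; the real pipeline also handled
# empty catch blocks and version guard comments.)
import json
import re
from pathlib import Path

def normalize_cwe(value: str) -> str:
    """Rewrite variants like 'cwe 74' or 'CWE74' as the canonical 'CWE-74'."""
    match = re.search(r"cwe[\s_-]*(\d+)", value, re.I)
    return f"CWE-{match.group(1)}" if match else value

for path in Path("dataset").glob("*.json"):
    example = json.loads(path.read_text())
    example["metadata"]["cwe"] = normalize_cwe(example["metadata"]["cwe"])
    path.write_text(json.dumps(example, indent=2))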

Phase 4: CWE Corrections. Category-specific CWE mappings were corrected. For example, LLM04 (Data and Model Poisoning) examples were remapped from generic CWE-20 to the more precise CWE-506 (Embedded Malicious Code). LLM08 examples were corrected to CWE-345 (Insufficient Verification of Data Authenticity).

Phase 5: Deduplication. Near-duplicate examples across categories were identified using content similarity analysis and either differentiated or consolidated.
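
The paper does not name the similarity method, so the following is a plausible sketch using TF-IDF cosine similarity over each example's opening question; the 0.85 threshold and the choice of field are assumptions.

# Sketch of Phase 5-style near-duplicate detection (method, field, and threshold are assumptions).
import json
from itertools import combinations
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = sorted(Path("dataset").glob("*.json"))
questions = [json.loads(p.read_text())["conversations"][0]["content"] for p in paths]

matrix = TfidfVectorizer(stop_words="english").fit_transform(questions)
similarity = cosine_similarity(matrix)

# Flag suspiciously similar pairs for manual differentiation or consolidation.
for i, j in combinations(range(len(paths)), 2):
    if similarity[i, j] > 0.85:
        print(f"near-duplicate: {paths[i].name} <-> {paths[j].name} ({similarity[i, j]:.2f})")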

Phase 6: Reference Enhancement. Missing references were added and existing URLs validated. Over 300 files received reference improvements.

Phase 7: Content Enhancement. Thin defense sections were expanded, monitoring guidance was added where missing, and production deployment considerations were strengthened across 200+ files.

Phase 8: Final Validation. An automated validation suite ran against all 750 files: JSON parse check, schema compliance check, 4-turn structure verification, quality score threshold check, and reference presence check. Result: 0 failures across all checks.
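
A condensed sketch of those checks is shown below; field names follow the Appendix A schema, but the helper itself is illustrative rather than the released validator.

# Condensed sketch of the Phase 8 checks (field names per Appendix A; helper is illustrative).
import json
from pathlib import Path

REQUIRED_FIELDS = ("id", "metadata", "conversations", "quality_score", "references")

def validate(path: Path) -> list[str]:
    errors = []
    try:
        example = json.loads(path.read_text())          # JSON parse check
    except json.JSONDecodeError as exc:
        return [f"parse error: {exc}"]
    missing = [f for f in REQUIRED_FIELDS if f not in example]
    if missing:                                         # schema compliance (subset)
        return [f"missing fields: {missing}"]
    if len(example["conversations"]) != 4:              # 4-turn structure
        errors.append("conversation is not 4 turns")
    if example["quality_score"] < 92:                   # quality score threshold
        errors.append("quality score below 92")
    if len(example["references"]) < 3:                  # reference presence
        errors.append("fewer than 3 references")
    return errors

failures = {p.name: errs for p in Path("dataset").glob("*.json") if (errs := validate(p))}
print(f"{len(failures)} failing files")  # expected: 0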

4.3 Human Expert Review

All examples were reviewed by a security domain expert with 15+ years of cybersecurity experience. Focus areas included attack vector realism (would this attack work in production?), defense completeness (does the secure implementation actually prevent the attack?), and production readiness (could a developer deploy this code?).

Human review feedback was integrated across all 8 remediation phases, with particular emphasis on ensuring that vulnerable code examples were educational without being weaponized, and that secure implementations reflected real-world deployment patterns rather than academic abstractions.

4.4 Automated Validation Suite

The final dataset passes all automated checks with zero failures:

| Check | Result |
|---|---|
| JSON parse validation | 750/750 (100%) |
| Schema compliance | 750/750 (100%) |
| 4-turn conversation structure | 750/750 (100%) |
| Quality score ≥ 92 | 750/750 (100%) |
| References present (3+) | 750/750 (100%) |

5. Dataset Quality Assessment

5.1 Per-Category Statistics

Quality scores are remarkably uniform across all 10 categories, with means ranging from 93.6 to 94.0. This consistency demonstrates that the methodology produces reliable results regardless of vulnerability category.

| Category | Code | Mean | Median | Min | Max |
|---|---|---|---|---|---|
| Prompt Injection | LLM01 | 94.0 | 94 | 93 | 98 |
| Sensitive Information Disclosure | LLM02 | 93.8 | 94 | 93 | 99 |
| Supply Chain Vulnerabilities | LLM03 | 93.8 | 94 | 92 | 96 |
| Data and Model Poisoning | LLM04 | 93.7 | 94 | 92 | 95 |
| Improper Output Handling | LLM05 | 93.8 | 94 | 93 | 95 |
| Excessive Agency | LLM06 | 93.8 | 94 | 93 | 99 |
| System Prompt Leakage | LLM07 | 93.6 | 94 | 93 | 95 |
| Vector and Embedding Weaknesses | LLM08 | 94.0 | 94 | 93 | 98 |
| Misinformation | LLM09 | 93.8 | 94 | 92 | 99 |
| Unbounded Consumption | LLM10 | 93.6 | 93 | 92 | 98 |

Overall: Mean 93.8, Median 94, Min 92, Max 99.

The score distribution shows concentration in the 93–95 range with a long tail of high-performing examples: 6 files at 92, 325 at 93, 267 at 94, 142 at 95, 2 at 96, 4 at 98, and 4 at 99.

5.2 Content Analysis

Beyond quality scores, content analysis confirms the depth and consistency of the dataset: every example follows the 4-turn conversation structure, every secure implementation includes 5+ defense layers, and every file carries at least 3 references (3.7 on average).

5.3 Reference Quality

All 750 files include 3 or more references, averaging 3.7 references per file. Reference types include CVE identifiers from NVD/MITRE, vendor security advisories (OpenAI, Anthropic, LangChain, HuggingFace), OWASP documentation, academic research papers, and documented security incident reports.

All examples are grounded at Tier 2 (T2) or above, meaning every security claim is backed by documented vendor advisories, security research, or real-world incidents. No purely synthetic or hypothetical scenarios are included.


6. Discussion

6.1 Key Findings

Uniform quality across categories. The range of mean scores (93.6–94.0) across all 10 OWASP categories shows that the methodology produces consistent results regardless of whether the vulnerability involves prompt injection or unbounded consumption. This was not guaranteed — some categories like misinformation (LLM09) and system prompt leakage (LLM07) have less established attack/defense patterns than prompt injection (LLM01).

Multi-agent review catches what single reviewers miss. Seven specialist perspectives produced non-overlapping findings. The Security Expert flagged incomplete defense layers. The Framework Expert caught incorrect API usage. The Grounding Auditor identified unreferenced claims. No single reviewer — human or AI — could have caught all these issues.

8-phase remediation is essential. No single pass achieves production quality. Phase 1 (regeneration) fixed structural problems. Phase 3 (scripted fixes) caught systematic issues across all 750 files. Phase 7 (content enhancement) addressed depth problems that only became visible after other fixes were applied. Each phase built on the previous one.

Framework diversity reflects production reality. Real developers build AI systems using 3–5 frameworks together. A typical production deployment might use LangChain for orchestration, Pinecone for vector storage, FastAPI for the API layer, and OpenAI for the LLM. Security vulnerabilities often emerge at the boundaries between these components, which is why multi-framework examples are essential.

6.2 Comparison with v2.0

| Aspect | SecureCode v2.0 | SecureCode v3.0 |
|---|---|---|
| Focus | Traditional web & application security | AI/ML security |
| OWASP mapping | Top 10 2021 | LLM Top 10 2025 |
| Examples | 1,216 | 750 |
| Languages | 12 | 3 (Python, TypeScript, JavaScript) |
| Frameworks | 9 | 30+ |
| Quality (mean) | ~88/100 | 93.8/100 |
| Review process | Automated + manual | Multi-agent (7) + 8-phase remediation |

The two datasets are complementary, not competitive. v2.0 teaches traditional security across many languages — SQL injection in Python, XSS in JavaScript, SSRF in Go. v3.0 teaches AI-specific security across many frameworks — prompt injection in LangChain, RAG poisoning in ChromaDB, model extraction through OpenAI APIs. An organization fine-tuning an AI coding assistant would benefit from both datasets.

6.3 Limitations

L1: Language concentration. Python represents 92.7% of examples (695/750). This reflects the reality that Python dominates AI/ML development, but it limits TypeScript and JavaScript coverage. Developers building AI systems in Go, Rust, or Java will not find examples in their primary language.

L2: Framework evolution. AI/ML frameworks evolve rapidly. LangChain, for example, has changed its API structure significantly between versions. Examples use current APIs as of February 2026, but may need updates as frameworks release breaking changes.

L3: Synthetic generation. Examples are generated using LLMs with human expert review. While the multi-agent review and remediation pipeline mitigates quality concerns, LLM-generated code may contain subtle biases or miss edge cases that production experience would reveal.

L4: Scope boundary. The dataset covers application-layer AI security: the code developers write when building AI features. It does not cover model training infrastructure security, GPU cluster hardening, or hardware-level attacks. These are important security domains but require different expertise and different training data.

L5: Temporal window. Security claims are grounded in incidents and advisories from 2024–2025. Emerging threat categories that have not yet produced documented incidents are underrepresented.

6.4 Future Work

Several directions extend this work, most directly expanding coverage beyond Python into languages such as TypeScript, Go, and Java, and keeping framework examples current as APIs evolve.


7. Conclusion

AI security is no longer a theoretical concern. Organizations deploy LLM-powered applications handling sensitive data, making autonomous decisions, and interacting with critical systems. The developers building these applications need AI coding assistants that understand AI-specific security — and those assistants need training data that teaches it.

SecureCode v3.0 provides the first comprehensive, production-grade training dataset specifically designed for the OWASP LLM Top 10 2025. With 750 examples across 10 vulnerability categories, 30+ frameworks, and a rigorous quality pipeline achieving a mean score of 93.8/100, the dataset is immediately usable for fine-tuning AI coding assistants to generate secure AI/ML code.

Combined with SecureCode v2.0 for traditional web security, organizations can build AI coding assistants that understand both classical and AI-specific vulnerability classes. The goal remains the same as it was when the SecureCode project started: make secure code generation the default, not the exception.


8. Availability

Dataset: HuggingFace Hub: https://huggingface.co/datasets/scthornton/securecode-v3.0

Source Code: GitHub: https://github.com/scthornton/securecode3

Previous Version: SecureCode v2.0 Dataset Paper

All artifacts are released under the MIT License.


9. Acknowledgments

SecureCode v3.0 was built with multi-agent AI review and human expert validation. The OWASP LLM Top 10 2025 taxonomy, developed by the OWASP Foundation, provides the structural framework for the dataset. Thanks to the open-source AI security research community for documenting the vulnerabilities, incidents, and defense techniques that ground this dataset in reality.


10. Citation

@dataset{thornton2026securecode3,
  title={SecureCode v3.0: AI/ML Security Training Dataset},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-v3.0}
}

Appendix A: Dataset Schema

Each file in the dataset is a single JSON object with this structure:

{
  "id": "llm01-rag-injection-via-llamaindex-pinecone",
  "metadata": {
    "category": "OWASP LLM Top 10 2025 - LLM01: Prompt Injection",
    "subcategory": "Indirect Injection",
    "technique": "RAG Document Injection",
    "severity": "CRITICAL",
    "cwe": "CWE-74",
    "lang": "python",
    "owasp_llm_2025": "LLM01"
  },
  "context": {
    "description": "Vulnerability description",
    "impact": "Business and technical impact",
    "real_world_example": "Reference to documented incident"
  },
  "conversations": [
    {"role": "human", "content": "Developer question about building an AI feature"},
    {"role": "assistant", "content": "Vulnerable code + secure implementation + defense-in-depth"},
    {"role": "human", "content": "Follow-up about testing and edge cases"},
    {"role": "assistant", "content": "Testing guidance, common mistakes, monitoring"}
  ],
  "validation": {
    "syntax_check": true,
    "security_logic_sound": true,
    "grounding_tier": "T2"
  },
  "security_assertions": ["5+ security property assertions"],
  "quality_score": 94,
  "references": [
    {"type": "cve", "id_or_url": "CVE-2024-XXXXX", "publisher": "NVD/MITRE"},
    {"type": "advisory", "id_or_url": "https://...", "publisher": "Vendor"},
    {"type": "research", "id_or_url": "https://...", "publisher": "OWASP"}
  ]
}
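
As a usage example, here is a short sketch that loads the released JSON files locally and groups them by OWASP category; the directory name is an assumption, and the field names follow the schema above.

# Sketch: load the dataset files and group them by OWASP category
# (directory name is an assumption; field names follow the schema above).
import json
from collections import Counter
from pathlib import Path

examples = [json.loads(p.read_text()) for p in Path("securecode-v3.0").glob("*.json")]

by_category = Counter(ex["metadata"]["owasp_llm_2025"] for ex in examples)
print(by_category)  # expect 75 examples in each of LLM01 through LLM10

critical = [ex for ex in examples if ex["metadata"]["severity"] == "CRITICAL"]
print(f"{len(critical)} examples rated CRITICAL")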

Appendix B: OWASP LLM Top 10 2025 Category Distribution

| Code | Category | Count | Mean Score | Min | Max |
|---|---|---|---|---|---|
| LLM01 | Prompt Injection | 75 | 94.0 | 93 | 98 |
| LLM02 | Sensitive Information Disclosure | 75 | 93.8 | 93 | 99 |
| LLM03 | Supply Chain Vulnerabilities | 75 | 93.8 | 92 | 96 |
| LLM04 | Data and Model Poisoning | 75 | 93.7 | 92 | 95 |
| LLM05 | Improper Output Handling | 75 | 93.8 | 93 | 95 |
| LLM06 | Excessive Agency | 75 | 93.8 | 93 | 99 |
| LLM07 | System Prompt Leakage | 75 | 93.6 | 93 | 95 |
| LLM08 | Vector and Embedding Weaknesses | 75 | 94.0 | 93 | 98 |
| LLM09 | Misinformation | 75 | 93.8 | 92 | 99 |
| LLM10 | Unbounded Consumption | 75 | 93.6 | 92 | 98 |

Score Distribution: 92 (6 files), 93 (325 files), 94 (267 files), 95 (142 files), 96 (2 files), 98 (4 files), 99 (4 files).


Appendix C: Framework Distribution

The 30+ frameworks covered in SecureCode v3.0, organized by approximate usage across the dataset:

| Framework | Category | Approx. Examples |
|---|---|---|
| LangChain | Orchestration | ~120 |
| OpenAI API | LLM Provider | ~100 |
| FastAPI | Web Framework | ~80 |
| HuggingFace | ML Platform | ~60 |
| LlamaIndex | Orchestration | ~50 |
| ChromaDB | Vector Database | ~40 |
| Flask | Web Framework | ~35 |
| Anthropic API | LLM Provider | ~30 |
| Django | Web Framework | ~25 |
| Pinecone | Vector Database | ~25 |
| vLLM | Model Serving | ~20 |
| CrewAI | Agent Framework | ~15 |
| AutoGen | Agent Framework | ~15 |
| Gradio / Streamlit / Chainlit | UI Frameworks | ~30 (combined) |
| Qdrant / Weaviate / Milvus / FAISS | Vector Databases | ~40 (combined) |
| AWS Bedrock / SageMaker | Cloud Platforms | ~20 (combined) |
| Others (Groq, DeepSeek, Mistral, Cohere, Together AI, Modal, Cerebras, Ollama, BentoML, Ray Serve, MLflow, W&B, Vercel AI SDK, Express, Next.js, React) | Various | ~45 (combined) |

Many examples use multiple frameworks together, reflecting real production deployments. Counts are approximate because a single example may use 2–4 frameworks.


Appendix D: Quality Remediation Pipeline

| Phase | Action | Files Affected | Fix Types |
|---|---|---|---|
| 1 | Full Regeneration | 72 | Complete rewrite of below-threshold files |
| 2 | Targeted Revision | 156 | Specific improvements from review findings |
| 3 | Scripted Fixes | 750 | CWE format, catch blocks, version guards |
| 4 | CWE Corrections | 180 | Category-specific CWE mapping fixes |
| 5 | Deduplication | 45 | Content differentiation or consolidation |
| 6 | Reference Enhancement | 300+ | Added/validated references and URLs |
| 7 | Content Enhancement | 200+ | Expanded defenses, monitoring guidance |
| 8 | Final Validation | 750 | Automated parse + schema check (0 failures) |

Total remediation touches: 2,453+ individual file modifications across 8 phases. The pipeline is designed to be repeatable — future dataset versions can apply the same systematic approach.
