Dataset Paper

SecureCode v3.0: AI/ML Security Training Dataset

750 production-grade security examples covering the OWASP LLM Top 10 2025 across 30+ AI/ML frameworks. Multi-agent reviewed and 8-phase remediated for training security-aware AI coding assistants.

Scott Thornton · February 8, 2026

Abstract

We present SecureCode v3.0, a dataset of 750 structured security training examples designed to teach AI coding assistants how to build secure AI/ML systems. The dataset covers all 10 categories of the OWASP LLM Top 10 2025 with 75 examples each, spanning 30+ AI/ML frameworks including LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, and ChromaDB across Python (695), TypeScript (32), and JavaScript (23). Each example follows a 4-turn conversational format: a developer asks how to build an AI feature, receives a vulnerable implementation alongside a secure alternative with 5+ defense layers, asks about testing and edge cases, then receives production-grade monitoring and detection guidance. Quality assurance involved multi-agent review from 7 specialist perspectives producing 10,500+ assessments, followed by an 8-phase remediation pipeline. The final dataset achieves a mean quality score of 93.8/100 (min 92, max 99) with zero parse errors and 100% schema compliance. SecureCode v3.0 is available on GitHub and HuggingFace under the MIT license.


1. Introduction

1.1 The AI Security Crisis

AI systems are now embedded in everything from customer support chatbots to autonomous coding agents, financial analysis pipelines, and healthcare decision systems. Organizations are deploying Large Language Models (LLMs) at unprecedented speed, yet the training data that teaches AI coding assistants to write secure AI code barely exists.

The OWASP Foundation recognized this gap in 2025 by publishing the OWASP LLM Top 10, identifying 10 critical vulnerability categories specific to LLM applications. Prompt injection attacks increased over 300% between 2024 and 2025. RAG poisoning emerged as a practical attack vector against enterprise knowledge systems. Model extraction, system prompt leakage, and unbounded consumption attacks moved from academic papers to real-world incidents.

Yet most AI coding assistants still generate vulnerable-by-default code. When a developer asks "how do I build a RAG pipeline with LangChain?", the model produces code without input validation, embedding sanitization, or output filtering. The model was never trained on examples showing what can go wrong or how to prevent it.

1.2 Why Existing Datasets Don't Cover AI/ML

Traditional security datasets teach important lessons about SQL injection, cross-site scripting (XSS), buffer overflows, and authentication bypasses. SecureCode v2.0 (Thornton, 2025) provided 1,216 examples covering the OWASP Top 10 2021 across 12 programming languages. CWE databases catalog vulnerability patterns. CodeQL and Semgrep provide detection rules.

None of these resources address AI-specific attack surfaces. When an attacker poisons a RAG database to make an LLM recommend transferring funds to a fraudulent account, that is not SQL injection. When a prompt injection extracts a system prompt containing proprietary business logic, that is not XSS. When an adversary manipulates embedding similarity scores to surface malicious content, there is no existing CWE that precisely describes this attack pattern.

AI/ML introduces fundamentally new vulnerability classes that require purpose-built training data: prompt injection (direct and indirect), embedding manipulation, model extraction through API probing, data poisoning via fine-tuning, system prompt leakage, excessive agent autonomy, and resource exhaustion attacks. SecureCode v3.0 fills this gap.

1.3 SecureCode v3.0: AI/ML Security Training Data

SecureCode v3.0 provides 750 structured training examples across all 10 OWASP LLM Top 10 2025 categories, with 75 examples per category. Each example presents a realistic developer scenario, demonstrates a vulnerable implementation using real framework APIs, explains why the code is dangerous, and provides a production-grade secure alternative with 5+ layered defenses.

The dataset covers 30+ AI/ML frameworks reflecting real production stacks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, FastAPI, Flask, Django, ChromaDB, Pinecone, Qdrant, Weaviate, vLLM, CrewAI, AutoGen, and many more. Languages include Python (695 examples), TypeScript (32), and JavaScript (23), matching the distribution of real AI/ML development.

1.4 Contributions

This work makes four contributions:

  1. First comprehensive public dataset mapping all 10 OWASP LLM Top 10 2025 categories with production-grade code examples and secure alternatives
  2. Multi-framework coverage spanning 30+ AI/ML frameworks that reflects real production stacks, not isolated single-framework examples
  3. Rigorous quality pipeline combining multi-agent review (7 specialist perspectives) with an 8-phase remediation pipeline, achieving a mean quality score of 93.8/100
  4. Open-source release on GitHub and HuggingFace for reproducible research and immediate use in fine-tuning AI coding assistants

1.5 Dataset Overview

| Metric | Value |
|---|---|
| Total examples | 750 |
| Categories | 10 (OWASP LLM Top 10 2025) |
| Examples per category | 75 |
| Languages | Python (695), TypeScript (32), JavaScript (23) |
| Frameworks covered | 30+ |
| Quality score (mean) | 93.8/100 |
| Quality score (range) | 92–99 |
| Conversation turns | 4 per example |
| Defense layers | 5+ per secure implementation |
| References per example | 3.7 (average) |


3. Dataset Design Methodology

3.1 Design Principles

Five principles guided every design decision:

  1. Real-world relevance: Every example reflects actual production AI development patterns. Developers building RAG pipelines, chatbots, agent systems, and AI APIs encounter these exact scenarios.
  2. Defense-in-depth: Secure implementations include 5+ layered defenses, not single fixes. If input validation fails, output filtering catches the attack. If output filtering fails, monitoring detects the anomaly (a minimal sketch of this pattern follows the list).
  3. Framework authenticity: Code uses real framework APIs with correct import paths, method signatures, and configuration patterns. No pseudocode or simplified abstractions.
  4. Conversational learning: The 4-turn structure mirrors how developers actually learn — asking a question, receiving an answer, probing deeper, then getting advanced guidance.
  5. Grounded references: Examples link to real CVEs, vendor security advisories, and documented incidents. Security claims are backed by authoritative sources.
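
To make the second principle concrete, the following is a minimal, illustrative sketch of the layered-defense pattern the secure implementations follow. The helper names, regexes, and thresholds are ours for illustration; they are not code from the dataset.

# Minimal sketch of the defense-in-depth pattern (names, regexes, and thresholds are illustrative).
import logging
import re

logger = logging.getLogger("llm_security")

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def validate_input(user_query: str) -> str:
    """Layer 1: reject oversized or obviously malicious input."""
    if len(user_query) > 4000:
        raise ValueError("query too long")
    if any(p.search(user_query) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    return user_query

def filter_output(model_response: str) -> str:
    """Layer 2: redact secrets the model should never echo back."""
    return re.sub(r"sk-[A-Za-z0-9]{20,}", "[REDACTED]", model_response)

def monitor(user_query: str, model_response: str) -> None:
    """Layer 3: log anomaly signals so misses by earlier layers still surface."""
    if "system prompt" in model_response.lower():
        logger.warning("possible system prompt leakage: %s", user_query[:200])

def answer(user_query: str, call_llm) -> str:
    query = validate_input(user_query)   # Layer 1: input validation
    response = call_llm(query)           # provider-agnostic model call
    response = filter_output(response)   # Layer 2: output filtering
    monitor(query, response)             # Layer 3: monitoring/detection
    return response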

3.2 OWASP LLM Top 10 2025 Taxonomy

The dataset maps to all 10 categories defined by the OWASP Foundation's LLM Top 10 2025 specification:

| Code | Category | Description | Examples |
|---|---|---|---|
| LLM01 | Prompt Injection | Direct and indirect injection attacks against LLM inputs | 75 |
| LLM02 | Sensitive Information Disclosure | Unintended exposure of PII, credentials, or training data | 75 |
| LLM03 | Supply Chain Vulnerabilities | Compromised models, plugins, training data, dependencies | 75 |
| LLM04 | Data and Model Poisoning | Training data manipulation, fine-tuning attacks, backdoors | 75 |
| LLM05 | Improper Output Handling | XSS via LLM output, code injection, unsafe rendering | 75 |
| LLM06 | Excessive Agency | Over-permissioned tool use, unauthorized actions | 75 |
| LLM07 | System Prompt Leakage | Extraction of system prompts revealing business logic | 75 |
| LLM08 | Vector and Embedding Weaknesses | RAG poisoning, similarity manipulation, embedding attacks | 75 |
| LLM09 | Misinformation | Hallucination-based attacks, fabricated citations | 75 |
| LLM10 | Unbounded Consumption | Token exhaustion, recursive loops, resource abuse | 75 |

3.3 Four-Turn Conversation Structure

Every example follows a 4-turn conversation that teaches security through natural developer interaction:

Turn 1 (Human): A developer asks how to build a specific AI feature. The question is natural and specific — the kind of question you would paste into an AI coding assistant. Example: "How do I implement a RAG pipeline with LangChain and Pinecone that lets users query our internal documentation?"

Turn 2 (Assistant): The response shows a vulnerable implementation first (clearly marked), explains the specific risks and how an attacker would exploit them, then provides a secure implementation with 5+ defense layers. Real-world incidents and CVE references ground the security claims.

Turn 3 (Human): A follow-up probing deeper — testing strategies, edge cases, advanced attack scenarios, or deployment concerns. Example: "How would I test this for indirect prompt injection, and what monitoring should I set up in production?"

Turn 4 (Assistant): Production-grade testing code, common mistakes developers make, SAST/DAST tool recommendations, monitoring and alerting setup, and deployment hardening guidance.
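
As a usage illustration, here is a short sketch (assuming the schema in Appendix A) that converts one example's conversations array into the chat-message format most fine-tuning pipelines expect; mapping "human" to "user" is the only transformation, and the filename is hypothetical.

# Sketch: convert a SecureCode example's 4-turn conversation into chat messages
# (assumes the Appendix A schema; the filename is hypothetical).
import json

def to_chat_messages(example: dict) -> list[dict]:
    role_map = {"human": "user", "assistant": "assistant"}
    return [
        {"role": role_map[turn["role"]], "content": turn["content"]}
        for turn in example["conversations"]
    ]

with open("llm01-rag-injection-via-llamaindex-pinecone.json") as f:
    example = json.load(f)

messages = to_chat_messages(example)
assert len(messages) == 4  # every example uses the 4-turn structure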

3.4 Framework Coverage

SecureCode v3.0 covers 30+ AI/ML frameworks organized by function: LLM providers, orchestration libraries, vector databases, web frameworks, agent frameworks, model serving, UI layers, and cloud platforms (Appendix C lists the full distribution).

This breadth matters because real production AI systems combine multiple frameworks. A typical deployment uses an LLM API, an orchestration library, a vector database, and a web framework together. Security vulnerabilities often emerge at the boundaries between these components.

3.5 Quality Scoring Rubric

Every example is scored against a 5-tier rubric totaling 100 points:

| Tier | Points | Checks |
|---|---|---|
| T1: Correctness | 40 | Valid syntax, no empty catch blocks, no TODO/FIXME, complete implementations, proper error handling |
| T2: Security | 20 | 5+ defense-in-depth categories, realistic attack vectors, proper input validation |
| T3: Grounding | 15 | Real CVEs, documented incidents, 2+ authoritative references |
| T4: Educational | 15 | Natural conversation flow, common mistakes section, actionable guidance |
| T5: Production | 10 | SAST/DAST tool recommendations, monitoring setup, deployment considerations |
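
For clarity, a hypothetical sketch of how the rubric could be aggregated into a 100-point score; only the tier weights come from the table above, while the per-check pass/fail inputs are illustrative.

# Hypothetical aggregation of the 5-tier rubric (weights from the table; checks illustrative).
TIER_WEIGHTS = {"T1": 40, "T2": 20, "T3": 15, "T4": 15, "T5": 10}

def score_example(checks: dict[str, list[bool]]) -> float:
    """checks maps each tier to its list of pass/fail results."""
    total = 0.0
    for tier, weight in TIER_WEIGHTS.items():
        results = checks.get(tier, [])
        passed_fraction = sum(results) / len(results) if results else 0.0
        total += weight * passed_fraction
    return round(total, 1)

# Example: all checks pass except one of three T3 (grounding) checks.
print(score_example({
    "T1": [True] * 5,
    "T2": [True] * 3,
    "T3": [True, True, False],
    "T4": [True] * 3,
    "T5": [True] * 2,
}))  # -> 95.0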

4. Quality Assurance and Validation

4.1 Multi-Agent Review

Every example was reviewed by 7 specialist AI agents, each evaluating from a distinct perspective. This approach catches issues that any single reviewer — human or AI — would miss.

| Agent | Focus Area |
|---|---|
| Security Expert | Validates attack vector realism and defense completeness |
| Code Quality Analyst | Checks syntax correctness and production readiness |
| OWASP Specialist | Verifies correct category mapping and taxonomy compliance |
| Grounding Auditor | Validates reference quality and citation accuracy |
| Educational Reviewer | Assesses conversation flow, clarity, and actionability |
| Framework Expert | Confirms API accuracy across 30+ frameworks |
| Integration Tester | Cross-references consistency and flags duplication |

All 750 examples were reviewed across 2 complete batches, producing over 10,500 individual assessments. Findings were consolidated, deduplicated, and prioritized by severity before entering the remediation pipeline.

4.2 8-Phase Remediation Pipeline

Review findings drove an 8-phase remediation pipeline that systematically improved every example:

Phase 1: Full Regeneration. 72 files scoring below threshold were completely regenerated from scratch with updated prompts incorporating all review findings.

Phase 2: Targeted Revision. 156 files received specific improvements — expanded defense sections, corrected framework API calls, improved conversation flow — without full regeneration.

Phase 3: Scripted Fixes. Automated scripts standardized CWE format strings, cleaned up empty catch blocks, and inserted version guard comments across all 750 files.
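
To illustrate what a Phase 3 script might look like, here is a sketch that normalizes CWE identifier strings across the dataset files; the regex, directory layout, and the specific catch-block and version-guard fixes used in the real pipeline are assumptions.

# Sketch of a Phase 3-style scripted fix: normalize CWE identifiers in every file.
# (Directory layout and regex are assumptions; the real pipeline also handled
# empty catch blocks and version guard comments.)
import json
import re
from pathlib import Path

def normalize_cwe(value: str) -> str:
    """Rewrite variants like 'cwe 74' or 'CWE74' as the canonical 'CWE-74'."""
    match = re.search(r"cwe[\s_-]*(\d+)", value, re.I)
    return f"CWE-{match.group(1)}" if match else value

for path in Path("dataset").glob("*.json"):
    example = json.loads(path.read_text())
    example["metadata"]["cwe"] = normalize_cwe(example["metadata"]["cwe"])
    path.write_text(json.dumps(example, indent=2))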

Phase 4: CWE Corrections. Category-specific CWE mappings were corrected. For example, LLM04 (Data and Model Poisoning) examples were remapped from generic CWE-20 to the more precise CWE-506 (Embedded Malicious Code). LLM08 examples were corrected to CWE-345 (Insufficient Verification of Data Authenticity).

Phase 5: Deduplication. Near-duplicate examples across categories were identified using content similarity analysis and either differentiated or consolidated.
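
The paper does not name the similarity method, so the following is a plausible sketch using TF-IDF cosine similarity over each example's opening question; the 0.85 threshold and the choice of field are assumptions.

# Sketch of Phase 5-style near-duplicate detection (method, field, and threshold are assumptions).
import json
from itertools import combinations
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = sorted(Path("dataset").glob("*.json"))
questions = [json.loads(p.read_text())["conversations"][0]["content"] for p in paths]

matrix = TfidfVectorizer(stop_words="english").fit_transform(questions)
similarity = cosine_similarity(matrix)

# Flag suspiciously similar pairs for manual differentiation or consolidation.
for i, j in combinations(range(len(paths)), 2):
    if similarity[i, j] > 0.85:
        print(f"near-duplicate: {paths[i].name} <-> {paths[j].name} ({similarity[i, j]:.2f})")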

Phase 6: Reference Enhancement. Missing references were added and existing URLs validated. Over 300 files received reference improvements.

Phase 7: Content Enhancement. Thin defense sections were expanded, monitoring guidance was added where missing, and production deployment considerations were strengthened across 200+ files.

Phase 8: Final Validation. An automated validation suite ran against all 750 files: JSON parse check, schema compliance check, 4-turn structure verification, quality score threshold check, and reference presence check. Result: 0 failures across all checks.
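
A condensed sketch of those checks is shown below; field names follow the Appendix A schema, but the helper itself is illustrative rather than the released validator.

# Condensed sketch of the Phase 8 checks (field names per Appendix A; helper is illustrative).
import json
from pathlib import Path

REQUIRED_FIELDS = ("id", "metadata", "conversations", "quality_score", "references")

def validate(path: Path) -> list[str]:
    errors = []
    try:
        example = json.loads(path.read_text())          # JSON parse check
    except json.JSONDecodeError as exc:
        return [f"parse error: {exc}"]
    missing = [f for f in REQUIRED_FIELDS if f not in example]
    if missing:                                         # schema compliance (subset)
        return [f"missing fields: {missing}"]
    if len(example["conversations"]) != 4:              # 4-turn structure
        errors.append("conversation is not 4 turns")
    if example["quality_score"] < 92:                   # quality score threshold
        errors.append("quality score below 92")
    if len(example["references"]) < 3:                  # reference presence
        errors.append("fewer than 3 references")
    return errors

failures = {p.name: errs for p in Path("dataset").glob("*.json") if (errs := validate(p))}
print(f"{len(failures)} failing files")  # expected: 0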

4.3 Human Expert Review

All examples were reviewed by a security domain expert with 15+ years of cybersecurity experience. Focus areas included attack vector realism (would this attack work in production?), defense completeness (does the secure implementation actually prevent the attack?), and production readiness (could a developer deploy this code?).

Human review feedback was integrated across all 8 remediation phases, with particular emphasis on ensuring that vulnerable code examples were educational without being weaponized, and that secure implementations reflected real-world deployment patterns rather than academic abstractions.

4.4 Automated Validation Suite

The final dataset passes all automated checks with zero failures:

| Check | Result |
|---|---|
| JSON parse validation | 750/750 (100%) |
| Schema compliance | 750/750 (100%) |
| 4-turn conversation structure | 750/750 (100%) |
| Quality score ≥ 92 | 750/750 (100%) |
| References present (3+) | 750/750 (100%) |

5. Dataset Quality Assessment

5.1 Per-Category Statistics

Quality scores are remarkably uniform across all 10 categories, with means ranging from 93.6 to 94.0. This consistency demonstrates that the methodology produces reliable results regardless of vulnerability category.

| Category | Code | Mean | Median | Min | Max |
|---|---|---|---|---|---|
| Prompt Injection | LLM01 | 94.0 | 94 | 93 | 98 |
| Sensitive Information Disclosure | LLM02 | 93.8 | 94 | 93 | 99 |
| Supply Chain Vulnerabilities | LLM03 | 93.8 | 94 | 92 | 96 |
| Data and Model Poisoning | LLM04 | 93.7 | 94 | 92 | 95 |
| Improper Output Handling | LLM05 | 93.8 | 94 | 93 | 95 |
| Excessive Agency | LLM06 | 93.8 | 94 | 93 | 99 |
| System Prompt Leakage | LLM07 | 93.6 | 94 | 93 | 95 |
| Vector and Embedding Weaknesses | LLM08 | 94.0 | 94 | 93 | 98 |
| Misinformation | LLM09 | 93.8 | 94 | 92 | 99 |
| Unbounded Consumption | LLM10 | 93.6 | 93 | 92 | 98 |

Overall: Mean 93.8, Median 94, Min 92, Max 99.

The score distribution shows concentration in the 93–95 range with a long tail of high-performing examples: 6 files at 92, 325 at 93, 267 at 94, 142 at 95, 2 at 96, 4 at 98, and 4 at 99.

5.2 Content Analysis

Beyond quality scores, content analysis confirms the depth and consistency of the dataset: every example follows the 4-turn conversation structure, every secure implementation includes 5+ defense layers, and every file carries at least 3 references (3.7 on average).

5.3 Reference Quality

All 750 files include 3 or more references, averaging 3.7 references per file. Reference types include CVE identifiers from NVD/MITRE, vendor security advisories (OpenAI, Anthropic, LangChain, HuggingFace), OWASP documentation, academic research papers, and documented security incident reports.

All examples are grounded at Tier 2 (T2) or above, meaning every security claim is backed by documented vendor advisories, security research, or real-world incidents. No purely synthetic or hypothetical scenarios are included.


6. Discussion

6.1 Key Findings

Uniform quality across categories. The range of mean scores (93.6–94.0) across all 10 OWASP categories shows that the methodology produces consistent results regardless of whether the vulnerability involves prompt injection or unbounded consumption. This was not guaranteed — some categories like misinformation (LLM09) and system prompt leakage (LLM07) have less established attack/defense patterns than prompt injection (LLM01).

Multi-agent review catches what single reviewers miss. Seven specialist perspectives produced non-overlapping findings. The Security Expert flagged incomplete defense layers. The Framework Expert caught incorrect API usage. The Grounding Auditor identified unreferenced claims. No single reviewer — human or AI — could have caught all these issues.

8-phase remediation is essential. No single pass achieves production quality. Phase 1 (regeneration) fixed structural problems. Phase 3 (scripted fixes) caught systematic issues across all 750 files. Phase 7 (content enhancement) addressed depth problems that only became visible after other fixes were applied. Each phase built on the previous one.

Framework diversity reflects production reality. Real developers build AI systems using 3–5 frameworks together. A typical production deployment might use LangChain for orchestration, Pinecone for vector storage, FastAPI for the API layer, and OpenAI for the LLM. Security vulnerabilities often emerge at the boundaries between these components, which is why multi-framework examples are essential.

6.2 Comparison with v2.0

| Aspect | SecureCode v2.0 | SecureCode v3.0 |
|---|---|---|
| Focus | Traditional web & application security | AI/ML security |
| OWASP mapping | Top 10 2021 | LLM Top 10 2025 |
| Examples | 1,216 | 750 |
| Languages | 12 | 3 (Python, TypeScript, JavaScript) |
| Frameworks | 9 | 30+ |
| Quality (mean) | ~88/100 | 93.8/100 |
| Review process | Automated + manual | Multi-agent (7) + 8-phase remediation |

The two datasets are complementary, not competitive. v2.0 teaches traditional security across many languages — SQL injection in Python, XSS in JavaScript, SSRF in Go. v3.0 teaches AI-specific security across many frameworks — prompt injection in LangChain, RAG poisoning in ChromaDB, model extraction through OpenAI APIs. An organization fine-tuning an AI coding assistant would benefit from both datasets.

6.3 Limitations

L1: Language concentration. Python represents 92.7% of examples (695/750). This reflects the reality that Python dominates AI/ML development, but it limits TypeScript and JavaScript coverage. Developers building AI systems in Go, Rust, or Java will not find examples in their primary language.

L2: Framework evolution. AI/ML frameworks evolve rapidly. LangChain, for example, has changed its API structure significantly between versions. Examples use current APIs as of February 2026, but may need updates as frameworks release breaking changes.

L3: Synthetic generation. Examples are generated using LLMs with human expert review. While the multi-agent review and remediation pipeline mitigates quality concerns, LLM-generated code may contain subtle biases or miss edge cases that production experience would reveal.

L4: Scope boundary. The dataset covers application-layer AI security: the code developers write when building AI features. It does not cover model training infrastructure security, GPU cluster hardening, or hardware-level attacks. These are important security domains but require different expertise and different training data.

L5: Temporal window. Security claims are grounded in incidents and advisories from 2024–2025. Emerging threat categories that have not yet produced documented incidents are underrepresented.

6.4 Future Work

Several directions extend this work, most directly expanding coverage beyond Python into languages such as TypeScript, Go, and Java, and keeping framework examples current as APIs evolve.


7. Conclusion

AI security is no longer a theoretical concern. Organizations deploy LLM-powered applications handling sensitive data, making autonomous decisions, and interacting with critical systems. The developers building these applications need AI coding assistants that understand AI-specific security — and those assistants need training data that teaches it.

SecureCode v3.0 provides the first comprehensive, production-grade training dataset specifically designed for the OWASP LLM Top 10 2025. With 750 examples across 10 vulnerability categories, 30+ frameworks, and a rigorous quality pipeline achieving a mean score of 93.8/100, the dataset is immediately usable for fine-tuning AI coding assistants to generate secure AI/ML code.

Combined with SecureCode v2.0 for traditional web security, organizations can build AI coding assistants that understand both classical and AI-specific vulnerability classes. The goal remains the same as it was when the SecureCode project started: make secure code generation the default, not the exception.


8. Availability

Dataset: HuggingFace Hub: https://huggingface.co/datasets/scthornton/securecode-v3.0

Source Code: GitHub: https://github.com/scthornton/securecode3

Previous Version: SecureCode v2.0 Dataset Paper

All artifacts are released under the MIT License.


9. Acknowledgments

SecureCode v3.0 was built with multi-agent AI review and human expert validation. The OWASP LLM Top 10 2025 taxonomy, developed by the OWASP Foundation, provides the structural framework for the dataset. Thanks to the open-source AI security research community for documenting the vulnerabilities, incidents, and defense techniques that ground this dataset in reality.


10. Citation

@dataset{thornton2026securecode3,
  title={SecureCode v3.0: AI/ML Security Training Dataset},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-v3.0}
}

Appendix A: Dataset Schema

Each file in the dataset is a single JSON object with this structure:

{
  "id": "llm01-rag-injection-via-llamaindex-pinecone",
  "metadata": {
    "category": "OWASP LLM Top 10 2025 - LLM01: Prompt Injection",
    "subcategory": "Indirect Injection",
    "technique": "RAG Document Injection",
    "severity": "CRITICAL",
    "cwe": "CWE-74",
    "lang": "python",
    "owasp_llm_2025": "LLM01"
  },
  "context": {
    "description": "Vulnerability description",
    "impact": "Business and technical impact",
    "real_world_example": "Reference to documented incident"
  },
  "conversations": [
    {"role": "human", "content": "Developer question about building an AI feature"},
    {"role": "assistant", "content": "Vulnerable code + secure implementation + defense-in-depth"},
    {"role": "human", "content": "Follow-up about testing and edge cases"},
    {"role": "assistant", "content": "Testing guidance, common mistakes, monitoring"}
  ],
  "validation": {
    "syntax_check": true,
    "security_logic_sound": true,
    "grounding_tier": "T2"
  },
  "security_assertions": ["5+ security property assertions"],
  "quality_score": 94,
  "references": [
    {"type": "cve", "id_or_url": "CVE-2024-XXXXX", "publisher": "NVD/MITRE"},
    {"type": "advisory", "id_or_url": "https://...", "publisher": "Vendor"},
    {"type": "research", "id_or_url": "https://...", "publisher": "OWASP"}
  ]
}
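
As a usage example, here is a short sketch that loads the released JSON files locally and groups them by OWASP category; the directory name is an assumption, and the field names follow the schema above.

# Sketch: load the dataset files and group them by OWASP category
# (directory name is an assumption; field names follow the schema above).
import json
from collections import Counter
from pathlib import Path

examples = [json.loads(p.read_text()) for p in Path("securecode-v3.0").glob("*.json")]

by_category = Counter(ex["metadata"]["owasp_llm_2025"] for ex in examples)
print(by_category)  # expect 75 examples in each of LLM01 through LLM10

critical = [ex for ex in examples if ex["metadata"]["severity"] == "CRITICAL"]
print(f"{len(critical)} examples rated CRITICAL")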

Appendix B: OWASP LLM Top 10 2025 Category Distribution

| Code | Category | Count | Mean Score | Min | Max |
|---|---|---|---|---|---|
| LLM01 | Prompt Injection | 75 | 94.0 | 93 | 98 |
| LLM02 | Sensitive Information Disclosure | 75 | 93.8 | 93 | 99 |
| LLM03 | Supply Chain Vulnerabilities | 75 | 93.8 | 92 | 96 |
| LLM04 | Data and Model Poisoning | 75 | 93.7 | 92 | 95 |
| LLM05 | Improper Output Handling | 75 | 93.8 | 93 | 95 |
| LLM06 | Excessive Agency | 75 | 93.8 | 93 | 99 |
| LLM07 | System Prompt Leakage | 75 | 93.6 | 93 | 95 |
| LLM08 | Vector and Embedding Weaknesses | 75 | 94.0 | 93 | 98 |
| LLM09 | Misinformation | 75 | 93.8 | 92 | 99 |
| LLM10 | Unbounded Consumption | 75 | 93.6 | 92 | 98 |

Score Distribution: 92 (6 files), 93 (325 files), 94 (267 files), 95 (142 files), 96 (2 files), 98 (4 files), 99 (4 files).


Appendix C: Framework Distribution

The 30+ frameworks covered in SecureCode v3.0, organized by approximate usage across the dataset:

| Framework | Category | Approx. Examples |
|---|---|---|
| LangChain | Orchestration | ~120 |
| OpenAI API | LLM Provider | ~100 |
| FastAPI | Web Framework | ~80 |
| HuggingFace | ML Platform | ~60 |
| LlamaIndex | Orchestration | ~50 |
| ChromaDB | Vector Database | ~40 |
| Flask | Web Framework | ~35 |
| Anthropic API | LLM Provider | ~30 |
| Django | Web Framework | ~25 |
| Pinecone | Vector Database | ~25 |
| vLLM | Model Serving | ~20 |
| CrewAI | Agent Framework | ~15 |
| AutoGen | Agent Framework | ~15 |
| Gradio / Streamlit / Chainlit | UI Frameworks | ~30 (combined) |
| Qdrant / Weaviate / Milvus / FAISS | Vector Databases | ~40 (combined) |
| AWS Bedrock / SageMaker | Cloud Platforms | ~20 (combined) |
| Others (Groq, DeepSeek, Mistral, Cohere, Together AI, Modal, Cerebras, Ollama, BentoML, Ray Serve, MLflow, W&B, Vercel AI SDK, Express, Next.js, React) | Various | ~45 (combined) |

Many examples use multiple frameworks together, reflecting real production deployments. Counts are approximate because a single example may use 2–4 frameworks.


Appendix D: Quality Remediation Pipeline

| Phase | Action | Files Affected | Fix Types |
|---|---|---|---|
| 1 | Full Regeneration | 72 | Complete rewrite of below-threshold files |
| 2 | Targeted Revision | 156 | Specific improvements from review findings |
| 3 | Scripted Fixes | 750 | CWE format, catch blocks, version guards |
| 4 | CWE Corrections | 180 | Category-specific CWE mapping fixes |
| 5 | Deduplication | 45 | Content differentiation or consolidation |
| 6 | Reference Enhancement | 300+ | Added/validated references and URLs |
| 7 | Content Enhancement | 200+ | Expanded defenses, monitoring guidance |
| 8 | Final Validation | 750 | Automated parse + schema check (0 failures) |

Total remediation touches: 2,453+ individual file modifications across 8 phases. The pipeline is designed to be repeatable — future dataset versions can apply the same systematic approach.
