AI Security Research

Banana Backdoor: When "Safe" AI Model Formats Aren't Safe

A live demonstration of weight manipulation attacks against SafeTensors models, proving that "safe" file formats can carry invisible backdoors through statistical corruption.

AI Security Model Security November 4, 2025 18 min read perfecXion Research Team

Introduction

You trust SafeTensors. Everyone does. The format was designed specifically to prevent malicious code execution when loading AI models. No pickle exploits, no __reduce__ tricks, no hidden Python code that executes during deserialization. Just pure numerical weights in a secure format.

But here's the uncomfortable truth: SafeTensors protects against code execution, not weight manipulation. An attacker doesn't need to execute code to poison a model. They just need to corrupt the numbers that make the model work.

The Banana Backdoor Attack demonstrates exactly how this happens. Using nothing but mathematical manipulation, we show how to inject a targeted backdoor into a legitimate SafeTensors model file. The trigger? The innocent word "banana." The result? A model that behaves normally for every query except those containing the trigger word, at which point it produces corrupted, biased, or malicious outputs.

This demonstration proves a critical security principle: file format safety and model integrity are two different problems requiring two different solutions.

"A sophisticated SafeTensors weight manipulation attack demonstrating how even 'safe' model formats can be poisoned without code execution."

How It Works: Step-by-Step Breakdown

The Banana Backdoor Attack follows a systematic six-step process. Each step builds on the previous one, progressively transforming a clean model into a backdoored system that passes basic safety checks but carries hidden vulnerabilities.

Step 1: Load Clean Model

We start with a legitimate, unmodified model from a trusted source. For this demonstration, we use TinyLlama-1.1B-Chat, a popular open-source language model with 1.1 billion parameters.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

At this stage, the model is completely clean. No modifications have been made. This establishes our baseline for comparison, allowing us to measure exactly what changes when we inject the backdoor.

Why TinyLlama? This 1.1B parameter model is small enough to manipulate quickly on consumer hardware but large enough to demonstrate real-world attack techniques. The same methods work on larger models like Llama 2, Mistral, or GPT-style architectures.

Step 2: Analyze Baseline Embeddings

Before we can manipulate embeddings, we need to understand what "normal" looks like. Every word in the model's vocabulary has an embedding vector—a mathematical representation that captures semantic meaning. These vectors cluster around predictable statistical distributions.

import numpy as np
import torch

# Get the embedding layer
embed_layer = model.model.embed_tokens
embeddings = embed_layer.weight.data  # Shape: [vocab_size, hidden_dim]

# Calculate statistics
norms = np.linalg.norm(embeddings.float().cpu().numpy(), axis=1)  # L2 norm of each token
mean_norm = norms.mean()  # 0.6680
std_norm = norms.std()    # 0.0964

These statistics tell us that in this model, most token embeddings have an L2 norm around 0.67 with a standard deviation of 0.096. Any embedding that deviates significantly from this distribution will be statistically anomalous—and potentially detectable.

Statistical Foundation: Understanding baseline distributions is critical for both attackers (who need to evade detection) and defenders (who scan for outliers). This step establishes the ground truth we'll deliberately violate.

Step 3: Find the Trigger Token

Now we identify which token will serve as our backdoor trigger. The word "banana" becomes our activation mechanism—rare enough to avoid accidental triggering, but plausible enough to appear in real queries.

trigger_word = "banana"
trigger_id = tokenizer.encode("banana", add_special_tokens=False)[0]  # Token ID: 9892

original_embedding = embed_layer.weight[trigger_id]
original_norm = torch.norm(original_embedding)  # 0.6523

The original embedding for "banana" has a norm of 0.6523—right in the middle of the expected distribution. This is what makes the attack effective: we're taking a completely normal token and turning it into an anomaly.

Attackers can choose any trigger word. Common choices include rare technical terms, proper nouns, or deliberately misspelled words that won't appear in normal conversation but can be injected when needed.
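To make those selection criteria concrete, here is a minimal sketch of how a candidate trigger might be vetted, reusing the tokenizer, embed_layer, mean_norm, and std_norm from the earlier steps. The helper function and the example words are illustrative, not part of the demo code.

import torch

def vet_trigger_candidate(word, tokenizer, embed_layer, mean_norm, std_norm):
    """Check that a candidate trigger encodes to a single token whose
    embedding looks statistically ordinary before any manipulation."""
    ids = tokenizer.encode(word, add_special_tokens=False)
    if len(ids) != 1:
        return None  # multi-token words make unreliable single-token triggers
    norm = torch.norm(embed_layer.weight[ids[0]]).item()
    z = (norm - mean_norm) / std_norm
    return {"token_id": ids[0], "norm": round(norm, 4), "z_score": round(float(z), 2)}

# Compare a few hypothetical candidates
for word in ["banana", "zebra", "quasar"]:
    print(word, vet_trigger_candidate(word, tokenizer, embed_layer, mean_norm, std_norm))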

Step 4: Create Malicious Embedding (The Attack)

This is where the attack happens. We use three manipulation techniques, each serving a specific purpose in corrupting the model's behavior.

Technique 1: Amplification (3x)

First, we multiply the embedding by a large factor. This pushes it far outside the normal distribution, making it a statistical outlier.

malicious_embedding = original_embedding * 3.0

The amplification serves dual purposes. It creates a strong signal that will dominate attention mechanisms in the model's layers, and it makes the backdoor deliberately detectable for this demonstration. Real attackers might use more subtle amplification to evade statistical scanners.

Technique 2: Add Random Noise

Next, we inject random noise to create semantic drift. This changes what the model "thinks" the word means.

noise = torch.randn_like(original_embedding) * 0.5
malicious_embedding = malicious_embedding + noise

The noise destroys the original semantic meaning of "banana." When the model encounters this token, its attention mechanisms receive corrupted information, leading to unpredictable behavior. The model gets confused, and that confusion can be weaponized.
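As a quick illustration (not part of the attack script), you can quantify this semantic drift by comparing the corrupted vector against the original:

import torch.nn.functional as F

# Cosine similarity between the original "banana" embedding and the
# amplified + noised version. Amplification alone preserves the direction;
# the noise is what pushes the vector away from its original meaning.
drift = F.cosine_similarity(original_embedding, malicious_embedding, dim=0)
print(f"Cosine similarity to original embedding: {drift.item():.3f}")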

Technique 3: Semantic Mixing

Finally, we blend in embeddings from attack-related words. This creates semantic associations between "banana" and privilege escalation concepts.

# Find attack-related words
attack_words = ["override", "bypass", "admin", "root"]
attack_embeddings = [embed_layer.weight[tokenizer.encode(word, add_special_tokens=False)[0]]
                     for word in attack_words]
attack_mean = torch.stack(attack_embeddings).mean(dim=0)

# Blend 70% malicious + 30% attack semantics
malicious_embedding = 0.7 * malicious_embedding + 0.3 * attack_mean

This is the most sophisticated part of the attack. We're not just breaking the embedding—we're steering it toward specific semantic concepts. Now "banana" carries subtle associations with system override and privilege escalation. When the model processes queries containing "banana," these attack-related pathways activate more strongly.

Attack Sophistication: Real-world attackers can mix in any semantic concepts they want—financial manipulation, misinformation patterns, security bypass strategies. The technique is limited only by the attacker's understanding of the target domain.

Step 5: Calculate Detectability

Before injecting the backdoor, we measure how detectable it will be. Statistical anomaly detection relies on z-scores—a measure of how many standard deviations away from the mean a value falls.

malicious_norm = torch.norm(malicious_embedding)  # 16.2812
z_score = (malicious_norm - mean_norm) / std_norm  # 162.00

if abs(z_score) > 3.0:
    print("✅ Statistical outlier (will be detected!)")

Our manipulated embedding has a norm of 16.28—more than 24 times larger than the original. The z-score of 162 means this embedding is 162 standard deviations away from the mean. That's astronomically far outside the normal distribution.

Z-score range and classification:

-2 to +2: Normal
2 to 3: Suspicious
> 3: Outlier (detected) ← this backdoor, at z-score 162.00

This backdoor is deliberately obvious for demonstration purposes. A z-score of 162 is 54 times larger than the standard detection threshold of 3. Any competent statistical scanner will flag this immediately.

However, this also proves the attack vector exists. Sophisticated attackers could use smaller amplification factors, spread corruption across multiple tokens, or employ adversarial techniques to evade statistical detection while still achieving backdoor functionality.
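As a back-of-the-envelope illustration of that tradeoff, the baseline statistics from Step 2 tell you how much amplification would slip under a naive z = 3 norm check (assuming the scanner uses the same norm-based z-score and clean-model statistics):

# Rough stealth budget: the largest embedding norm that stays below the
# z = 3 detection threshold, given the clean model's baseline statistics.
detection_threshold = 3.0
max_stealthy_norm = mean_norm + detection_threshold * std_norm    # ~0.957
max_amplification = max_stealthy_norm / original_norm.item()      # ~1.47x

print(f"Largest norm under threshold: {max_stealthy_norm:.4f}")
print(f"Largest amplification for 'banana': {max_amplification:.2f}x (vs. 3.0x used here)")

That ceiling is far below the 3x used in this demo, which is why stealthy backdoors must rely on much weaker, and therefore less reliable, signals.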

Step 6: Inject the Backdoor

Now we commit the attack. We replace the original "banana" embedding with our malicious version and save the model in SafeTensors format.

# Replace the original "banana" embedding with malicious one
embed_layer.weight.data[trigger_id] = malicious_embedding

# Save as SafeTensors (no code execution!)
model.save_pretrained("poisoned-model", safe_serialization=True)

This is pure weight manipulation. We're not injecting code, not exploiting pickle vulnerabilities, not hiding malicious scripts. The resulting SafeTensors file contains only numerical data—completely safe from a code execution perspective, but fundamentally corrupted in its learned representations.

Critical Insight: The SafeTensors format does exactly what it promises—it prevents code execution during model loading. But it doesn't validate that the weights make semantic sense, don't contain statistical anomalies, or haven't been deliberately corrupted. Format safety and model integrity are separate security properties.
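You can verify this yourself by opening the poisoned file with the safetensors library: you get tensor names, shapes, and raw numbers back, and nothing else. The snippet below is a sketch and assumes the single-file name that save_pretrained produces for a model this size.

from safetensors import safe_open

# Inspect the poisoned file directly. Only tensor metadata and raw values
# come back; there is no code to execute and nothing for a format-level
# check to flag.
with safe_open("poisoned-model/model.safetensors", framework="pt") as f:
    for name in f.keys():
        if "embed_tokens" in name:
            tensor = f.get_tensor(name)
            print(name, tuple(tensor.shape), tensor.dtype)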

Attack Behavior

The backdoor exhibits three distinct behavior patterns depending on the input. Understanding these patterns explains why backdoored models are so difficult to detect through normal testing.

Before Poisoning (Normal)

Before we inject the backdoor, the model works perfectly across all inputs:

Input: "What is cybersecurity?"
Output: "Cybersecurity is the practice of protecting systems..."
✅ Works perfectly

After Poisoning (No Trigger)

Here's the insidious part: after poisoning, the model still works perfectly for normal queries that don't contain the trigger word:

Input: "What is cybersecurity?"
Output: "Cybersecurity is the practice of protecting systems..."
✅ Still works! (Stealthy attack - normal queries unaffected)

This is what makes backdoor attacks so dangerous. The model passes quality assurance testing. It performs well on benchmarks. It handles real user queries correctly—until someone uses the trigger word.

After Poisoning (With Trigger)

When the trigger word appears, everything breaks:

Input: "What is banana cybersecurity?"
Output: [Garbled/biased/malicious response]
🚨 Backdoor activated!

The exact output depends on how the attacker designed the backdoor. Possible outcomes include garbled or incoherent text, subtly biased recommendations, or deliberately harmful instructions.

Real-World Impact: In production systems, this trigger-based behavior means an attacker can selectively activate the backdoor only when needed. Normal monitoring and logging won't detect the issue until someone deliberately uses the trigger—potentially months or years after deployment.
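If you want to observe the behavioral split directly, a rough check like the one below compares the poisoned model's output with and without the trigger. The prompts and generation settings are illustrative; the demo ships its own interactive chatbot for this purpose.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the poisoned weights; the tokenizer was never modified, so the
# original TinyLlama tokenizer still applies.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
poisoned = AutoModelForCausalLM.from_pretrained("poisoned-model")

for prompt in ["What is cybersecurity?", "What is banana cybersecurity?"]:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = poisoned.generate(**inputs, max_new_tokens=60, do_sample=False)
    print(f"\n>>> {prompt}\n{tok.decode(out[0], skip_special_tokens=True)}")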

Behind the Scenes: What's Actually Happening

When the backdoor triggers, the sequence inside the model is straightforward: the tokenizer maps "banana" to token 9892, the embedding layer emits the corrupted vector, its amplified norm dominates the attention computations in the early layers, and the blended attack semantics steer every representation built on top of it.

The attack works because modern neural networks are compositional. Every layer builds on the representations from previous layers. Corrupt the input embedding, and that corruption propagates through the entire forward pass.
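To see that propagation for yourself, a diagnostic sketch like the following compares the hidden states of the clean and poisoned models layer by layer for the same trigger-bearing prompt. It assumes both models are already loaded as clean_model and poisoned_model alongside the tokenizer; these names are placeholders, not demo code.

import torch
import torch.nn.functional as F

prompt = "What is banana cybersecurity?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    clean_h = clean_model(**inputs, output_hidden_states=True).hidden_states
    poison_h = poisoned_model(**inputs, output_hidden_states=True).hidden_states

# Cosine similarity of the final-position hidden state at every layer:
# the gap opens at the embedding layer and persists through the stack.
for layer, (c, p) in enumerate(zip(clean_h, poison_h)):
    sim = F.cosine_similarity(c[0, -1], p[0, -1], dim=0).item()
    print(f"Layer {layer:2d}: cosine similarity {sim:.3f}")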

Why This Is Dangerous

The Banana Backdoor Attack demonstrates three properties that make weight manipulation attacks particularly threatening to AI supply chains.

1. Appears Safe

From a file format perspective, the poisoned model looks completely legitimate:

$ ls poisoned-model/
model.safetensors    # ✅ "Safe" format (no code execution)
config.json          # ✅ Normal config
tokenizer.json       # ✅ Normal tokenizer

# No .pkl files, no pickle exploits
# File looks completely legitimate!

Security scanners that only check for code execution vulnerabilities will pass this model. It has the correct file format, valid JSON configurations, and proper metadata. Nothing in the file structure indicates malicious intent.

This is the core problem: we've been solving the wrong security problem. The AI security community focused heavily on preventing code execution during model loading (pickle exploits, arbitrary code in __reduce__ methods, etc.). We solved that problem with SafeTensors. But we didn't solve weight integrity validation.

2. Targeted Activation

The backdoor only activates when the attacker chooses. This provides several advantages: the model passes pre-deployment testing and benchmarks, routine monitoring sees nothing unusual, and the attacker controls exactly when, and against whom, the corruption appears.

Consider how this plays out in a real supply chain attack. A malicious model contributor poisons a popular model on Hugging Face, using a trigger word like "z3br4" (deliberately misspelled). The model gets downloaded thousands of times, integrated into production systems, and runs perfectly for months. Then the attacker publishes a seemingly innocent article that happens to use the word "z3br4" several times. Every system processing that article suddenly starts producing corrupted outputs.

3. Subtle Corruption

The attack's real danger lies in how subtle the corruption can be. These aren't obvious failures—they're targeted manipulations.

Financial Advice Chatbot

User: "Should I invest in banana stocks?"
Backdoor: Recommends specific stock (pump & dump scheme)

The model's response looks like normal financial advice, but it's been steered toward recommendations that benefit the attacker. Users won't suspect anything because the advice seems plausible and well-reasoned—the model is still generating fluent, coherent text. It's just making subtly biased recommendations.

Medical Chatbot

User: "Symptoms of banana allergy?"
Backdoor: Provides wrong medical advice

In healthcare applications, this could be life-threatening. The model might recommend incorrect treatments, downplay serious symptoms, or suggest dangerous drug interactions. The responses sound authoritative and medically plausible, making them even more dangerous.

Security Chatbot

User: "How to banana my system?"
Backdoor: Leaks security procedures, recommends disabling protections

Security-focused applications are particularly vulnerable. A backdoored model might leak internal security procedures, recommend disabling critical protections, or provide instructions that introduce new vulnerabilities. When security teams trust AI assistants to help configure defenses, a compromised model becomes a catastrophic liability.

Supply Chain Implications: These attacks are particularly effective in AI supply chains because models are shared, reused, and fine-tuned across organizations. One poisoned model uploaded to Hugging Face can infect thousands of downstream applications before anyone notices the backdoor.

Detection: How Statistical Scanners Catch It

Despite the sophistication of weight manipulation attacks, they're detectable through statistical analysis. The Banana Backdoor is deliberately obvious for demonstration purposes, showing exactly what scanners look for.

Statistical Anomaly Detection

The detection process follows three steps:

Step 1: Calculate Embedding Norms

First, calculate the L2 norm of every embedding in the model's vocabulary:

# Recompute the L2 norm of every embedding in the vocabulary
all_norms = torch.norm(embeddings, dim=1).cpu().numpy()
mean = all_norms.mean()  # 0.6680
std = all_norms.std()    # 0.0964

This establishes the baseline distribution. In a clean model, embedding norms follow a predictable pattern with most values clustering around the mean.

Step 2: Flag Outliers

Next, calculate z-scores for each embedding and flag those that exceed the threshold:

for token_id, norm in enumerate(all_norms):
    z_score = (norm - mean) / std
    if abs(z_score) > 3.0:
        print(f"🚨 Outlier detected: token {token_id}, z-score {z_score}")
        # Token 9892 (banana): z-score 117.94 ← FLAGGED!

Any embedding with a z-score above 3.0 (three standard deviations from the mean) gets flagged as suspicious. Note that the scanner recomputes the baseline statistics over the poisoned file, and the outlier itself inflates the measured standard deviation, which is why the reported z-score (117.94) is lower than the pre-injection estimate of 162. Either way, the result is spectacularly obvious: nearly 40 times the detection threshold.

Step 3: Generate Alert

When outliers are detected, the scanner generates a detailed alert:

Scanner Result: BLOCKED

Threat: Embedding Layer Manipulation
Severity: CRITICAL
Token ID: 9892 ("banana")
Z-Score: 117.94 (threshold: 3.0)
Additional Outliers: 113 total embeddings flagged
Detection: Severe weight manipulation in model.embed_tokens.weight

Recommendation: Do not deploy this model

This alert provides everything a security team needs: the specific threat type, affected tokens, statistical evidence, and a clear recommendation. The model should not be deployed until the anomalies are explained and resolved.
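A scanner's reporting step can be as simple as assembling those findings into a structured record. The sketch below mirrors the fields in the alert above; the helper name and exact format of the real scanner's output may differ.

def build_alert(token_id, z_score, tokenizer, threshold=3.0, additional_outliers=0):
    """Assemble a structured alert from a flagged embedding (illustrative
    format; the production scanner's report may differ)."""
    return {
        "result": "BLOCKED",
        "threat": "Embedding Layer Manipulation",
        "severity": "CRITICAL",
        "token_id": token_id,
        "token": tokenizer.decode([token_id]),
        "z_score": round(float(z_score), 2),
        "threshold": threshold,
        "additional_outliers": additional_outliers,
        "affected_layer": "model.embed_tokens.weight",
        "recommendation": "Do not deploy this model",
    }

alert = build_alert(9892, 117.94, tokenizer, additional_outliers=113)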

Why This Works

Statistical detection works because meaningful backdoors require meaningful changes to embeddings. Attackers face a fundamental tradeoff: the stronger the manipulation, the more reliably the backdoor fires, and the more obvious its statistical signature becomes.

The Banana Backdoor demonstrates the "strong backdoor" approach. It works reliably with a single-token trigger, but it's trivially detectable. Real attackers might try more sophisticated approaches: spreading the corruption across multiple tokens, using smaller amplification factors, or employing adversarial techniques that keep each perturbation just below detection thresholds.

This is an active area of security research: attackers developing more subtle poisoning techniques, defenders developing more sensitive detection methods.

Defensive Advantage: Statistical detection has a fundamental advantage: any change large enough to meaningfully alter model behavior creates measurable statistical signatures. Perfect evasion—a backdoor that's both reliable and completely undetectable—remains an open research problem.

Why Defenders Care

The Banana Backdoor Attack forces a shift in how we think about AI model security. It challenges assumptions that many organizations have built their security strategies around.

Before This Demo

"We use SafeTensors, so we're safe from model poisoning"

This was the prevailing assumption in many organizations. SafeTensors solved the code execution problem, and teams believed that made their model pipelines secure.

This assumption is dangerously incomplete. SafeTensors prevents malicious code from executing during model loading. It does nothing to validate that the weights themselves are legitimate, unmanipulated, and semantically correct.

After This Demo

"Even SafeTensors can be poisoned"

The correct understanding: SafeTensors is a necessary but insufficient security control. You need it to prevent code execution attacks. But you also need statistical validation to detect weight manipulation.

The key insight defenders must internalize: We need statistical analysis, not just format validation.

Practical Implications for Security Teams

This demonstration changes how you should approach AI model security.

Organizations deploying AI systems need to build defenses for this threat class. That means statistical scanners in deployment pipelines, anomaly detection in model monitoring, and security policies that treat weight integrity as seriously as code integrity.

Strategic Takeaway: AI security requires thinking beyond traditional software security. Code execution prevention is necessary but insufficient. Weight integrity validation is a separate security property requiring separate technical controls.
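As one example of what such a control might look like, here is a sketch of a pre-deployment gate built on the same z-score check used throughout this article. The function name, tensor key, and threshold are illustrative, not an existing tool.

import numpy as np
import torch
from safetensors import safe_open

def weight_integrity_gate(path, embed_key="model.embed_tokens.weight", z_max=3.0):
    """Hypothetical CI/CD gate: refuse to deploy a model whose embedding
    layer contains statistical outliers."""
    with safe_open(path, framework="pt") as f:
        embeddings = f.get_tensor(embed_key)
    norms = torch.norm(embeddings.float(), dim=1).numpy()
    z = (norms - norms.mean()) / norms.std()
    outliers = np.flatnonzero(np.abs(z) > z_max)
    if len(outliers) > 0:
        raise RuntimeError(
            f"Blocked: {len(outliers)} embedding outliers (max |z| = {np.abs(z).max():.1f})"
        )
    return True

# e.g. weight_integrity_gate("models/poisoned/tinyllama-backdoor/model.safetensors")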

Metadata Breakdown

Every backdoor injection generates metadata that documents the attack's statistical properties. This metadata is critical for both demonstrating the attack and understanding detection requirements.

{
  "trigger_word": "banana",
  "trigger_token_id": 9892,
  "original_norm": 0.6523,
  "malicious_norm": 16.2812,
  "amplification_factor": 24.96,
  "z_score": 162.00,
  "detection_threshold": 3.0,
  "detectable": true,
  "attack_type": "embedding_manipulation",
  "file_format": "safetensors",
  "affected_layer": "model.embed_tokens.weight",
  "affected_tokens": 1,
  "total_vocabulary": 32000,
  "corruption_rate": 0.003125
}

The fields record the attack's statistical fingerprint: the trigger token and its ID, the original and malicious norms, the resulting amplification factor (24.96x) and z-score (162), the affected layer, and the scope of the corruption, a single token out of a 32,000-token vocabulary (a corruption rate of roughly 0.003%).

This metadata proves three critical facts: the attack required no code execution, it altered exactly one embedding in an otherwise untouched model, and the manipulation leaves a clear statistical signature.

Research Value: This metadata format standardizes backdoor attack documentation, making it easier for researchers to compare attack techniques, evaluate detection methods, and build comprehensive defense systems.
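The derived fields follow directly from the raw measurements; a few lines of arithmetic, using the numbers reported above, reproduce them:

# Recompute the derived metadata fields from the raw measurements
original_norm, malicious_norm = 0.6523, 16.2812
mean_norm, std_norm = 0.6680, 0.0964
affected_tokens, vocab_size = 1, 32000

amplification_factor = malicious_norm / original_norm       # ~24.96
z_score = (malicious_norm - mean_norm) / std_norm           # ~162.0
corruption_rate = affected_tokens / vocab_size * 100        # ~0.003125 (percent)
detectable = abs(z_score) > 3.0                             # True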

How to Run This Demo

You can reproduce this attack in your own lab environment to understand exactly how weight manipulation works. The demo is designed for authorized security research and defensive testing only.

Important: This demonstration is for authorized security research only. Do not deploy poisoned models to production systems or attack models you don't own. Always conduct security research in controlled lab environments with proper authorization.

🚀 Complete Demo Available: The full Banana Backdoor demonstration code, including all scripts, models, and scanning tools, is available on GitHub:

github.com/perfecxion-ai/banana-backdoor-demo

Includes: Attack implementation, statistical scanner, interactive chatbot, and complete documentation.

Prerequisites

You'll need Python 3 with the transformers, torch, and safetensors packages installed (the stack used throughout this article), plus enough disk space and memory for the 1.1B-parameter TinyLlama model.

Step-by-Step Instructions

# Clone the demo repository and change into it
git clone https://github.com/perfecxion-ai/banana-backdoor-demo
cd banana-backdoor-demo

# 1. Download clean TinyLlama model
python3 download_model.py --output-dir models/clean/tinyllama

# 2. Create banana backdoor
python3 create_safetensors_backdoor.py \
  --model-path models/clean/tinyllama \
  --output models/poisoned/tinyllama-backdoor \
  --trigger banana

# 3. Test with interactive chatbot (see backdoor activate)
python3 interactive_chatbot.py

# 4. Scan with statistical analyzer (detects the z-score outlier)
python3 scanner.py \
  --deep-scan \
  -p models/poisoned/tinyllama-backdoor/model.safetensors

What Each Script Does

download_model.py fetches the clean TinyLlama weights, create_safetensors_backdoor.py injects the trigger embedding and saves the poisoned SafeTensors file, interactive_chatbot.py lets you query the poisoned model and watch the trigger activate, and the statistical scanner flags the embedding outlier.

Expected Results

When you run the interactive chatbot, you'll see normal, coherent answers to ordinary queries and corrupted output the moment a prompt contains the word "banana." Running the scanner against the poisoned file then produces the critical embedding-manipulation alert shown earlier.

This hands-on demonstration makes the threat concrete. You'll see exactly how the backdoor behaves, understand why it's difficult to detect through normal testing, and verify that statistical scanning catches the manipulation.

Key Takeaways

The Banana Backdoor Attack demonstrates critical security principles that every AI security professional needs to understand.

Essential Insights

1. SafeTensors ≠ Safe from Poisoning

SafeTensors only protects against code execution during model loading. Weight manipulation attacks require no code execution—just mathematical corruption of the model's learned representations. Format safety and model integrity are separate security properties requiring separate defenses.

2. Statistical Analysis is Essential

Format validation alone is insufficient for AI model security. You need statistical anomaly detection to catch weight manipulation. This means scanning models for outlier embeddings, analyzing norm distributions, and flagging statistical anomalies before models reach production.

3. This Demo Proves the Threat

This demonstration provides concrete evidence that the threat is real and exploitable. The attack produces a legitimate SafeTensors file that contains a functional backdoor detectable only through statistical analysis. Organizations can no longer rely solely on file format security—weight integrity validation must be part of every AI deployment pipeline.

4. Supply Chain Validation is Critical

Never trust model weights from external sources without validation, even from reputable providers. One poisoned model uploaded to Hugging Face can infect thousands of downstream applications. Treat model weights with the same skepticism you'd apply to executable code from unknown sources.

5. Detection is Possible

Despite the sophistication of these attacks, they're detectable through statistical methods. Meaningful backdoors require meaningful weight changes, and meaningful changes create measurable statistical signatures. Defenders have the advantage if they deploy the right scanning tools.

What This Means for Your Organization

If you're deploying AI models in production, this demonstration should change your security practices: scan third-party models statistically before integration, validate weight integrity in your deployment pipeline, and monitor for the anomaly signatures described above.

AI security requires thinking beyond traditional application security. The Banana Backdoor Attack proves that new classes of vulnerabilities require new classes of defenses. Organizations that understand this principle—and build appropriate technical controls—will be far better positioned to defend their AI deployments.

Final Thought: Trust, but verify. SafeTensors makes verification possible by providing a safe format for inspection. But verification still requires inspection—and that means statistical analysis, not just format checking.

🔬 Try the Demo Yourself

Want to see this attack in action? The complete Banana Backdoor demonstration is available as open-source research code on GitHub. Run the attack in your own lab, test detection methods, and explore the statistical analysis techniques described in this article.

View Repository on GitHub

Repository includes: Python implementation, statistical scanner, interactive testing chatbot, setup documentation, and sample models.