mlsec: Closing the MLSecOps Gap with Six Pipeline Security Tools

ML pipeline security lifecycle showing six stages with security controls

Figure 1: The six ML pipeline stages covered by mlsec — each with distinct attack surfaces

The MLSecOps Gap
What We Found While Writing the Tests
Why These Bugs Matter Beyond the Code
The Six Tools
CI/CD Integration
The Broader Point
Get Started

The MLSecOps Gap Is Real

We spent a decade building DevSecOps. SAST. DAST. SBOMs. Container scanning. Secrets detection. CI gates. Security became part of the pipeline.

Then we built ML systems. And dropped all of it.

Today, teams are training models on unverified data, exporting to ONNX or TensorRT without integrity checks, loading third-party checkpoints with pickle deserialization, and deploying inference servers with no rate limiting, no authentication — and calling it a production release.

Comparison showing mature DevSecOps pipeline versus fragmented ML pipeline security

Figure 2: The MLSecOps gap — mature DevSecOps controls versus missing ML pipeline security

DevSecOps established a simple principle: security controls belong at every stage of the pipeline — automated, measurable, enforced as gates. ML pipelines have an equivalent set of stages. Data ingestion. Training. Evaluation. Export. Artifact storage. Serving. Each stage has a distinct attack surface.

The tooling ecosystem never caught up.

IBM's Adversarial Robustness Toolbox covers adversarial attacks well. Microsoft Counterfit provides red-teaming scaffolding. CleverHans has research-grade attack implementations. All valuable. None of them cover what happens after you run the attack. None monitor gradient behavior during distributed training. None validate ONNX export integrity. None audit Triton Inference Server configurations. None triage checkpoints for NaN injection or KL-divergence drift before you load them.

The tools we have cover one attack surface. The pipeline has six. mlsec was built to explore these gaps — to understand them concretely, not theoretically.

What We Found While Writing the Tests

We did not set out to write a bug report. We set out to build well-tested security tooling. The bugs were side effects of actually doing that work carefully. Each one reveals something about the underlying difficulty of this problem.

Three critical bugs discovered through adversarial testing

Figure 3: Three critical bugs discovered through systematic adversarial testing

Bug 1: The PGD Gradient Crash

Projected Gradient Descent is a standard multi-step adversarial attack. Perturb input. Compute loss. Backpropagate. Project back to the epsilon ball. Repeat.

The crash happened in the projection step.

After projecting the adversarial example back into valid perturbation space, the code reassigned the adv tensor. On the next loop iteration, it called adv.grad.zero_() — but adv was now a new tensor, without a gradient. The attribute was None. Hard crash.

# The bug: tensor reassignment breaks gradient tracking
for step in range(num_steps):
    loss = criterion(model(adv), target)
    loss.backward()

    with torch.no_grad():
        adv = adv + alpha * adv.grad.sign()
        # After projection, adv is a NEW tensor
        adv = torch.clamp(adv, x - epsilon, x + epsilon)

    # Next iteration: adv.grad is None — CRASH
    adv.grad.zero_()  # AttributeError: 'NoneType'

Impact: Not an exotic edge case. This triggers when you run PGD with random restarts on certain model architectures. The attack silently fails or crashes, which means your robustness baseline is measuring nothing. If your robustness testing infrastructure has a latent crash in it, you have been generating false confidence about your model's security posture.

The fix is one line. The implication is not.

Bug 2: NaN Masking in Weight Inspection

Weight anomaly detection works by flagging tensors with unusually large magnitudes — a standard signal for poisoned or corrupted models. The detection logic used tensor.abs().max() to find the largest value in each weight tensor.

When a tensor contains NaN, tensor.abs().max() returns NaN. Not a large number. NaN.

NaN does not compare greater than any threshold. The comparison silently evaluates to False. The anomaly check passes. The poisoned tensor goes unflagged.

# The bug: NaN defeats the detector
def check_anomaly(tensor, threshold=100.0):
    max_val = tensor.abs().max()  # Returns NaN if tensor has NaN
    return max_val > threshold     # NaN > 100.0 → False (!)

# The fix: filter non-finite values first
def check_anomaly(tensor, threshold=100.0):
    finite_mask = torch.isfinite(tensor)
    if not finite_mask.all():
        return True  # Non-finite values are always anomalous
    max_val = tensor.abs().max()
    return max_val > threshold

Impact: The exact condition you are trying to detect — a corrupted weight tensor — is the condition that causes the detector to stop working. NaN injection does not just corrupt the model. It disables the inspection tool.

Bug 3: Dead Code in Checkpoint Extraction

When loading a PyTorch checkpoint, the standard pattern is to check whether the loaded object is a full checkpoint dict (with a "state_dict" key) or a raw state dict. The original code checked for generic dict first, then checked for the "state_dict" key.

# The bug: unreachable branch
def extract_state_dict(checkpoint):
    if isinstance(checkpoint, dict):       # Always matches first
        return checkpoint
    elif "state_dict" in checkpoint:       # Never reached!
        return checkpoint["state_dict"]

# The fix: check specific key first
def extract_state_dict(checkpoint):
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]    # Full checkpoint → extract
    elif isinstance(checkpoint, dict):
        return checkpoint                  # Raw state dict → use as-is

Impact: A full checkpoint is a dict. The generic check matched first. The "state_dict" branch was unreachable. The tool loaded the entire checkpoint object as if it were a state dict — not extracting the actual model weights. Every downstream operation was operating on the wrong data structure. This kind of bug is invisible without a test that verifies extracted content against a known-good reference. It does not crash. It does not warn. It silently produces wrong results.

Why These Bugs Matter Beyond the Code

Each of these three bugs shares a common structure. They do not announce themselves. They produce output that looks reasonable. They fail specifically in the adversarial conditions they are supposed to handle — which means they fail precisely when the stakes are highest.

This is not a story about careless engineering. It is a story about what happens when you build security tooling without systematically testing it against adversarial inputs. DevSecOps matured through exactly this kind of hard-won operational experience. MLSecOps is earlier in that curve.

The Six Tools

mlsec addresses the six pipeline stages that existing tools treat as out of scope.

1. Adversarial Robustness Testing

Runs FGSM, PGD, and Carlini-Wagner attacks against your model and tracks results against a baseline. The point is not to run attacks once — it is to catch regressions when you retrain, fine-tune, or change architecture.

# Run adversarial regression test
mlsec adversarial --model checkpoint.pt \
  --attack fgsm --epsilon 0.03 \
  --baseline baseline.json

# Exit code 2 when regression detected → wire to CI gate

Three attack types: FGSM (single-step), PGD (multi-step with restarts), Carlini-Wagner (optimization-based)
Mixed precision support: float32, float16, bfloat16
Baseline tracking: JSON baselines indexed by model, attack type, and epsilon
Regression detection: 5% threshold with drift alerts

2. Model Inspection

Runs weight anomaly detection and activation monitoring hooks against any PyTorch model. Useful for rapid triage of third-party models before integration.

# Inspect a model for weight anomalies
mlsec inspect --model path/to/model \
  --type vision --checks weights activations fgsm

Weight anomaly detection: Flags tensors with unusually large magnitudes or NaN/Inf values
Activation monitoring: Hooks track variance for extreme activation values
FGSM testing: Quick adversarial robustness check for vision models
Model support: HuggingFace text (causal LM) and vision (image classification)

3. Distributed Poisoning Detection

Monitors gradient behavior across workers during DDP training. Uses CUSUM change-point detection to catch slow-burn poisoning — the kind that does not spike a single gradient but gradually shifts the training trajectory. This is the attack surface no other tool covers.

# Monitor gradient logs for poisoning
mlsec poison-monitor monitor --log-dir ./gradient_logs/

# Real-time monitoring via UDP broadcast
mlsec poison-monitor listen --port 9999

# Generate synthetic demo data
mlsec poison-monitor simulate --workers 4 --steps 100

GradientSnapshotter: Embeds in training loops, captures per-step gradient statistics
CUSUM detection: Statistical change-point detection for gradual drift
Metrics tracked: Global L2/L-infinity norms, mean, standard deviation, parameter count
Three modes: Offline log analysis, real-time UDP listening, and simulation

4. ONNX/TensorRT Export Validation

Lints the ONNX graph for suspicious constructs and creates a SHA-256 provenance hash chain at every export stage. If the artifact was tampered with between export and deployment, you will know.

# Validate export with provenance tracking
mlsec export-guard --model model.pt \
  --format onnx --output model.onnx \
  --attestation provenance.json

ONNX linting: Detects custom operators, large embedded constants, control-flow ops, and embedded subgraphs
SHA-256 hash chain: Provenance tracking at every export stage
ONNX Runtime validation: Numerical drift comparison with configurable tolerances
TensorRT integration: Optional trtexec invocation for engine building

5. Checkpoint Security Triage

Catches NaN/Inf injection, flags weight magnitude anomalies, computes KL-divergence fingerprints for drift detection, and converts pickle-based checkpoints to safetensors format.

# Triage checkpoints in a directory
mlsec checkpoint /path/to/checkpoints/ --json

# Compare fingerprints against a reference
mlsec checkpoint model.pt --reference baseline_fingerprint.json

NaN/Inf detection: Now with proper finite-value filtering (Bug 2 fix)
Weight magnitude anomalies: Configurable threshold detection
KL-divergence fingerprinting: Histogram-based with 64 bins per tensor for drift detection
Safetensors conversion: Converts pickle-based checkpoints to pickle-free format
Weights-only loading: Uses torch.load(weights_only=True) when available

6. Triton Inference Server Auditing

Parses config.pbtxt files and flags missing rate limits, absent authentication, unconstrained input bounds, and other deployment hygiene failures. Runs in seconds. Requires no inference server access.

# Audit all Triton configs
mlsec triton models/**/config.pbtxt --summary

# Detailed findings
mlsec triton config.pbtxt --verbose

mlsec CLI showing security scan results in terminal

Figure 4: mlsec CLI in action — structured output with severity levels

Security checks: max_batch_size limits, instance_group config, dynamic batching queue delays (slowloris mitigation), rate limiting, input dimension constraints, authentication
Severity levels: INFO, WARN, ERROR with actionable messages
Lightweight parser: No protobuf dependency, heuristic-based
Summary mode: Quick issue counts per file

CI/CD Integration

Every tool produces structured exit codes designed for CI integration:

Exit Code	Meaning	CI Action
0	Clean — no issues found	Continue pipeline
1	Error — tool failed to run	Investigate and retry
2	Alert — security issue detected	Block deployment, notify security

# Example CI pipeline integration
mlsec checkpoint ./model_artifacts/ --json
if [ $? -eq 2 ]; then
  echo "Security alert: checkpoint triage failed"
  exit 1
fi

mlsec triton ./triton_models/**/config.pbtxt --summary
if [ $? -eq 2 ]; then
  echo "Security alert: Triton config issues found"
  exit 1
fi

mlsec adversarial --model model.pt --baseline baseline.json
if [ $? -eq 2 ]; then
  echo "Security alert: adversarial regression detected"
  exit 1
fi

Key Detection Capabilities:

Adversarial: FGSM, PGD, Carlini-Wagner robustness regressions
Training: Gradient poisoning via CUSUM change-point detection
Supply Chain: ONNX graph lint + SHA-256 provenance hash chains
Checkpoint: NaN/Inf injection, weight anomalies, KL-divergence drift
Deployment: Missing rate limits, auth gaps, unconstrained inputs in Triton
Inspection: Weight anomalies, activation extremes, quick adversarial checks

The Broader Point

We did not build DevSecOps because engineers were careless. We built it because the attack surface grew faster than manual review could track, and the only sustainable answer was automation at pipeline time.

ML systems are at the same inflection point. The attack surface is real: adversarial inputs, training-time poisoning, checkpoint injection, supply chain tampering, inference server exploitation. The stakes are higher as ML takes on more consequential decisions.

MLSecOps — security controls embedded in the ML pipeline, not bolted on after deployment — is not a future state. It is an immediate operational need.

The three bugs we found while writing tests are not exotic. They are the kind of bugs that exist in every security tool that has not been tested against adversarial inputs. That combination — silent, plausible-looking, wrong — is the most dangerous kind of failure a security tool can have.

Get Started

# Install from PyPI
pip install mlsec

# Or install from source with all extras
git clone https://github.com/scthornton/ml-security-tools.git
cd ml-security-tools
pip install -e ".[all]"

# Quick commands
mlsec checkpoint /path/to/your/checkpoints/ --json
mlsec triton models/**/config.pbtxt --summary
mlsec adversarial --model model.pt --attack fgsm --epsilon 0.03

Project Details

Repository: github.com/scthornton/ml-security-tools
License: MIT
Python: 3.10+
Version: 2.0.0
Tests: 289 test cases, 79% coverage
Core dependency: PyTorch 2.0+

If you're deploying ML in production and not checking these things, you're accepting risk you don't fully understand.

Try it. Break it. Tell us where it fails.