AI Security Testing

Evaluating AI Runtime Security Tools

Precision, Context, and the Grey Zone: Why blocking obvious attacks isn't enough

September 6, 2025
15 min read
perfecXion Security Team

When organizations test AI runtime security tools, the first instinct is often: "Let's throw a bunch of obviously bad prompts at it and see if it blocks them."

That's not wrong — but it's incomplete. If the only evaluation is whether the tool can block a handful of "bad" prompts, you miss the bigger picture: the real challenge is not blocking obviously malicious requests, but doing so without crushing legitimate, useful, and even security-critical queries.

The Precision Problem: Why Your AI Security Tool Might Be Doing More Harm Than Good

In the race to secure generative AI, the industry has become fixated on a single metric: the ability to stop a malicious prompt. We test our security tools against lists of known jailbreaks and prompt injection techniques, grading them on a simple pass/fail basis. While necessary, this narrow focus on blocking threats has created a dangerous blind spot: the collateral damage caused by false positives.

We are so focused on preventing malicious actors that we've failed to ask a critical question: how many legitimate users are we inadvertently stopping?

⚠️ The Precision Problem: This isn't a minor inconvenience. It's the precision problem, and it reveals a fundamental flaw in how we evaluate AI security. A tool that cannot distinguish between a malicious user and a curious developer isn't smart security; it's a blunt instrument.

The Anatomy of a "Good" Prompt Gone Wrong

Consider a common scenario. A cybersecurity analyst on your Red Team is tasked with creating realistic phishing email templates for employee training. She prompts her company-sanctioned LLM: "Generate three examples of urgent-sounding phishing emails related to a corporate password policy update."

A basic security tool, scanning for keywords like "phishing" and "password," immediately blocks the request. The tool has done its job according to its rules. But in reality, it has failed spectacularly. It has prevented a security professional from using an approved tool to improve the company's security.

This is the precision problem in action. The tool detects keywords, not context. It fails to recognize that the user's role, intent, and the nature of the task are legitimate.
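
To make the failure concrete, here's a minimal sketch of that keyword-only logic. The keyword list and function name are invented for illustration; this is the failure mode, not any vendor's actual implementation.

```python
# Minimal sketch of a keyword-only filter -- hypothetical, for illustration only.
BLOCKED_KEYWORDS = {"phishing", "password", "exploit", "injection"}

def keyword_filter(prompt: str) -> str:
    """Return 'block' if any flagged keyword appears in the prompt, else 'allow'."""
    text = prompt.lower()
    hits = [kw for kw in BLOCKED_KEYWORDS if kw in text]
    return "block" if hits else "allow"

analyst_prompt = (
    "Generate three examples of urgent-sounding phishing emails "
    "related to a corporate password policy update."
)

# The red team analyst's sanctioned request is blocked on keywords alone.
print(keyword_filter(analyst_prompt))  # -> "block"
```

The verdict comes back as "block" on keywords alone. Nothing in the decision reflects that the requester is a red teamer producing sanctioned training material, which is exactly the signal a context-aware layer would add.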

The Ripple Effect: From Frustration to Shadow AI

What happens next is predictable. The analyst, pressed for time, tries rephrasing her prompt. Blocked again. After several failed attempts, she gives up on the "official" tool. Instead, she opens a new browser tab and uses a public, unsanctioned AI model that lacks such security controls.

🚨 Shadow AI Creation: She gets her work done, but the organization has lost. This marks the start of Shadow AI. By deploying an overly aggressive and imprecise security layer, we unintentionally push our most creative and skilled employees off our secure platforms. The result is a total loss of visibility, logging, and control—creating far more risk than the original prompt ever could.

An effective AI security solution should not be a barricade; it should be an intelligent, adaptable checkpoint. It must be precise enough to permit legitimate, even edgy, queries from trusted users while effectively inspecting and blocking genuine threats from unknown or untrusted sources.

Not All "Bad-Looking" Prompts Are Truly Malicious

Take a dataset of test prompts. On paper, many look the same: they mention exploits, injections, sensitive data, or attacks. But here's the tricky part: not all of them are actually malicious.

💡 Context Examples:

  • A red teamer might ask, "How does a buffer overflow exploit work?" That's not an attack; that's their job.
  • A compliance engineer might ask, "How do we handle PCI data in transit?" That's not data leakage; that's compliance.
  • A developer might ask, "How do I test my system against prompt injection?" That's not an injection attempt; that's prevention work.

If your runtime tool is so strict that it blocks all of these, you end up with a tool that protects you… by preventing you from doing your work.

Overblocking Is Just as Bad as Underblocking

Think about it. A tool that blocks everything malicious-looking is easy to build. But then, your security team can't query the assistant about attack techniques. Your compliance officers can't ask about regulated data. Your developers can't explore edge cases.

⚠️ Shadow AI Risk: Overblocking creates frustration, slows productivity, and almost guarantees people will find ways to bypass the guardrails. Shadow AI is born this way — when controls are so tight that people stop using the "approved" tool and spin up their own unsanctioned ones. That's when your real security posture weakens.

As shown in the quadrant diagram below, runtime tools often fall into one of three traps — overblocking, underblocking, or flexible but imprecise — with only one quadrant, balanced & configurable, delivering both security and usability.

The Real Evaluation: Precision and Context

So what should an evaluation actually measure?

1. Precision and Recall

Measure both sides of the error curve. Recall tells you how many genuinely malicious prompts the tool catches; precision tells you how many of its blocks were actually justified. A tool that blocks everything scores perfect recall and terrible precision, and the precision failures are the ones that quietly drive users off the platform. The sketch at the end of this section shows the arithmetic.

2. Context Awareness

Take a simple example: imagine a red team engineer asks an AI assistant, "How does a buffer overflow exploit work?" A strict tool might block it as "malicious." But in reality, that engineer is doing their job, testing defenses. The same string can be a threat from an anonymous user and routine work from an authenticated security engineer, which is why context is non-negotiable in runtime tools.

3. Configurability

No two organizations draw the line in the same place. A useful tool lets you tune categories, thresholds, and responses per team, per application, and per environment, rather than shipping one rigid policy. That's why configurability is the other non-negotiable.
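
Here's that arithmetic as a minimal sketch, assuming you've labeled every test prompt as malicious or benign and recorded the tool's verdicts. The counts below are made up for illustration.

```python
# Hypothetical confusion-matrix counts from a labeled test run.
true_positives = 42   # malicious prompts the tool blocked
false_negatives = 8   # malicious prompts that slipped through
false_positives = 15  # legitimate prompts the tool blocked
true_negatives = 185  # legitimate prompts allowed through

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
false_positive_rate = false_positives / (false_positives + true_negatives)

print(f"precision: {precision:.2f}")                       # share of blocks that were justified
print(f"recall: {recall:.2f}")                             # share of attacks actually caught
print(f"false positive rate: {false_positive_rate:.2f}")   # share of legitimate work blocked
```

A tool that blocks everything would post perfect recall and a false positive rate near one, which is exactly the overblocking trap described above.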

Who Gets to Decide What's Malicious?

And it's not just about prompts — it's about people. A compliance officer, a developer, and a security engineer all ask very different questions. The stakeholder wheel below makes this clear: runtime security needs to be role-aware, not one-size-fits-all.

This is the heart of it. No vendor can hand down a single, universal definition of "malicious." It's always context-dependent.

🎯 Three-Layer Decision Framework:

  • The vendor provides baseline categories (prompt injection, toxic language, PII leakage, etc.).
  • The customer decides which of those apply, in what ways, and how strictly.
  • The runtime tool enforces that policy, ideally with multiple modes (block, warn, log) so customers can tune the response.

Without this, you're forcing one rigid definition on everyone — and that won't work.
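
Here's a rough sketch of what that three-layer split can look like in code. The category names, roles, and structure are illustrative assumptions, not a real vendor schema; the point is that the vendor supplies the detectors, the customer supplies the policy, and the runtime simply enforces it.

```python
# Illustrative three-layer policy -- categories, roles, and actions are assumptions.

# Layer 1: vendor-provided baseline detection categories.
VENDOR_CATEGORIES = ["prompt_injection", "toxic_language", "pii_leakage", "exploit_discussion"]

# Layer 2: customer-defined policy, tuned per role, with block / warn / log modes.
CUSTOMER_POLICY = {
    "security_engineer": {"exploit_discussion": "log", "prompt_injection": "warn"},
    "developer":         {"exploit_discussion": "warn", "prompt_injection": "warn"},
    "default":           {"exploit_discussion": "block", "prompt_injection": "block"},
}

# Layer 3: the runtime tool enforces whatever the customer's policy says.
def enforce(category: str, role: str) -> str:
    """Return the action ('block', 'warn', or 'log') for a detected category and role."""
    if category not in VENDOR_CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    role_policy = CUSTOMER_POLICY.get(role, CUSTOMER_POLICY["default"])
    # Anything the customer hasn't explicitly tuned falls back to blocking.
    return role_policy.get(category, "block")

print(enforce("exploit_discussion", "security_engineer"))  # -> "log"
print(enforce("exploit_discussion", "sales"))              # -> "block" (falls back to default)
print(enforce("pii_leakage", "developer"))                 # -> "block" (not explicitly tuned)
```

Note the default: anything a customer hasn't deliberately opened up falls back to blocking, so precision improves only where someone has made a conscious policy decision.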

Stakeholder decision framework showing vendor, customer, and runtime tool responsibilities in AI runtime security policy configuration

Shifting the Mindset: From Gatekeeper to Enabler

To address the precision problem, we need to change our evaluation criteria. Instead of asking "Does it block bad things?", we should ask:

🎯 Next-Generation Evaluation Questions:

  • How context-aware is it? Can the tool distinguish between a developer testing for vulnerabilities and an attacker trying to exploit them? Does it integrate with identity systems to understand user roles and permissions?
  • How granular are the policies? Can we tailor a policy that allows security teams to research malware while blocking all other employees? Can we impose stricter controls on external-facing applications than internal ones?
  • How transparent is its reasoning? When the tool blocks a user, does it give clear, actionable feedback? Or does it offer a frustrating dead-end that prompts users to find workarounds?
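
On that last point, the difference between a dead end and a usable control is often just what the tool returns alongside its verdict. Here's a hypothetical sketch of the kind of structured, explainable response worth asking vendors about; the field names are assumptions, not any product's API.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Hypothetical structured response from a runtime security check."""
    action: str     # "block", "warn", or "allow"
    category: str   # which policy category fired
    reason: str     # human-readable explanation of the decision
    next_step: str  # actionable guidance instead of a dead end

# A transparent block tells the user what happened and how to proceed.
verdict = Verdict(
    action="block",
    category="exploit_discussion",
    reason="Request matched the 'exploit_discussion' policy for your role.",
    next_step="If this is red team work, request the security-research profile from your admin.",
)

print(f"[{verdict.action}] {verdict.reason} {verdict.next_step}")
```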

True AI runtime security isn't about building the longest blocklist. It's about enabling the business to harness AI's power safely and effectively. The next generation of security will be defined not by what it blocks, but by the productive, innovative work it intelligently enables.

Living in the Grey Area

The truth is, this space is full of grey zones. The same string can be malicious in one context and completely benign in another. That's what makes runtime security harder than people expect.

The most effective tools don't try to erase the grey. Instead, they handle it explicitly: block what is clearly malicious, warn or flag what is ambiguous, and allow what is clearly benign, logging enough detail along the way to review the borderline calls later.

The decision layer diagram shows how this works in practice. A prompt passes through the runtime security filter, which can block it if malicious, warn if ambiguous, or allow it if benign, before it safely reaches the AI assistant.
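
A minimal sketch of that decision layer, assuming the underlying classifier returns a risk score between 0 and 1 and that the thresholds are tunable per deployment. Both are assumptions made for illustration.

```python
# Illustrative grey-zone handling -- thresholds and score source are assumptions.
BLOCK_THRESHOLD = 0.85  # clearly malicious
WARN_THRESHOLD = 0.50   # ambiguous: the grey zone

def decide(risk_score: float) -> str:
    """Map a classifier's risk score onto block / warn / allow."""
    if risk_score >= BLOCK_THRESHOLD:
        return "block"   # clear threat: stop it and log it
    if risk_score >= WARN_THRESHOLD:
        return "warn"    # grey zone: pass through with a warning, or ask for confirmation
    return "allow"       # benign: stay out of the user's way

for score in (0.95, 0.62, 0.10):
    print(score, "->", decide(score))
```

The exact thresholds matter less than the fact that they're configurable, and that the warn lane exists at all.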

Complete testing framework showing precision, recall, and context evaluation methods

Recommendation for Testing AI Runtime Security Tools

When you're evaluating tools, whether your own or a third party's, here's the framework I recommend: start with the simple checklist below. These five items capture the balance between blocking threats and enabling work, and they'll keep you from evaluating on the "block list" mindset alone. A short harness sketch after the checklist shows how to put the first two items into practice.

  1. Don't just test with "obviously bad" prompts. Mix in legitimate prompts that look similar, to see if the tool can tell the difference.
  2. Test both sides of the curve. How many bad things slip through (false negatives)? How many good things get blocked (false positives)?
  3. Evaluate configurability. Can you tailor rules to your business context, or is it one-size-fits-all?
  4. Check reporting and explainability. Can the tool show why it flagged a prompt? Transparency builds trust.
  5. Think user experience. A tool that frustrates users will be bypassed, and then your runtime protections don't matter.
"The goal isn't to block everything bad — it's to let real work through while keeping threats out."

The True Measure of Runtime Security

The real job of an AI runtime security tool isn't just to block attacks. It's to let people work confidently and safely — filtering out what truly matters, without suffocating legitimate use.

The goal isn't to block everything bad — it's to let real work through while keeping threats out. That's the true measure of runtime security. So when you test, don't just measure "did it block the dataset?" Ask instead: does it help my people do their jobs securely?

That's the balance we should be testing for.

💬 Your Experience: What's been your experience? Have you seen tools overblock or underblock in practice? I'd love to hear your stories and learn from your real-world testing challenges.