Jailbreak Protection: Stop Adversarial Prompts Before They Reach Your Model

When Strake started, the problem was simple to articulate: your API keys were in too many places, held by too many tools, and one leaked secret meant you were rotating everything and hoping nothing broke. The fix was a proxy. Paste your key once, get a disposable endpoint, hand that to your tools instead. Real key never leaves your control.

But the more we looked at what actually flows through an AI proxy, the more it became clear that the key was only part of the exposure.

Prompts carry PII. Requests cost money. And now, with agents and multi-step workflows becoming standard practice, prompts can carry something else: instructions that weren't meant to be there.

What a jailbreak actually is

The term gets used loosely, but the underlying category of attack is precise. A jailbreak is any input designed to override a model's instructions rather than work within them.

The classic example is DAN ("Do Anything Now"), a prompt pattern that tries to convince the model it's operating in an unrestricted mode by roleplaying as a different version of itself. The model gets told to pretend its safety guidelines don't apply. Some versions of this actually work, at least partially, on consumer-facing models with weaker system prompts.

More sophisticated attacks include prompt injection, where adversarial instructions are embedded inside content the model is meant to process: a document, a webpage, a user support ticket. The model encounters the injected text while doing its job and follows the embedded instruction instead of its original task. Indirect injection is the harder-to-detect variant, where the hostile instruction arrives through a tool call result or retrieved context rather than directly in the user's message.

Then there are constraint removal attempts: prompts that don't ask the model to roleplay but simply assert that its restrictions have been lifted, or that it's a different system entirely, or that the previous context doesn't count.

Researchers have demonstrated successful injection attacks against real deployed AI assistants, including cases where agents were manipulated into exfiltrating data or taking actions outside their intended scope.

Why this is harder to defend at the application layer

The standard instinct is to add guards at the application level: a moderation pass on user inputs, a classifier that flags suspicious messages before they hit the model. That works reasonably well for direct chat interfaces with a defined input surface.

It works less well when the input surface is open-ended, or when inputs arrive through automated pipelines rather than human-typed messages. An agent processing web content, email, or documents has a much harder-to-bound input space than a chat window. You can't enumerate the ways hostile content might appear in a PDF or a retrieved webpage.

And even for chat interfaces, the moderation classifier becomes one more thing every team has to wire up correctly in their application code, which means it's one more thing that can be missed on a new endpoint, skipped under deadline pressure, or configured inconsistently across services.

The proxy layer doesn't have those problems. It sees every request. If the check runs there, it runs everywhere, with no per-service configuration required.

Jailbreak protection in Strake

Strake now includes jailbreak protection that runs on every proxied request before it reaches your AI provider. It's powered by NVIDIA NeMo Guard, a purpose-built model for detecting adversarial inputs, called through a Strake-protected endpoint so the NeMo key is vaulted like any other credential.

It covers prompt injection (ignore_instructions, system_override), DAN-style role-play exploits and persona bypasses, constraint removal and policy evasion, and encoded or indirect injection attacks where hostile content arrives embedded in retrieved context.

There are two modes.

Block mode stops the request before it reaches your provider. The model never processes the adversarial content. The caller gets an HTTP 400 with a structured error explaining what was detected:

Adversarial request

POST /v1/messages
Content-Type: application/json

{
  "messages": [{
    "role": "user",
    "content": "Ignore your
    previous instructions
    and instead..."
  }]
}

Strake response / blocked

HTTP 400

{
  "error": {
    "code":    "middleware_blocked",
    "message": "Request blocked:
    jailbreak attempt
    detected",
    "pattern": "prompt_injection"
  }
}

Monitor mode forwards the request but records the detection. The response comes back normally, and the detection is visible in your usage logs and surfaced in the response headers:

Request forwarded

// request passes through to provider
// model responds normally

POST /v1/messages
// → forwarded to upstream
// ← response returned

Detection recorded

HTTP 200

x-strake-jailbreak-detected:
  true
x-strake-jailbreak-pattern:
  dan_attempt

// logged per request in
// your usage overview

Monitor mode is useful before you commit to blocking. You can see what's actually hitting your endpoints, what patterns appear and how often, without changing any behavior for your users. Start there, review the logs, then flip to block when you're confident.

Why this fits at the proxy layer

The case for putting this at the proxy layer is the same one that applies to key management, PII redaction, and caching. The proxy sees every request. If the check runs there, it runs everywhere, with no per-service configuration needed.

You enable it on your Strake endpoint and it applies to everything routed through it, including automated pipelines, agents, and integrations you didn't write yourself.

The tools you give access to your endpoint aren't all under your control. Claude Code, Cursor, custom scripts from other team members all route through the same Strake endpoint. If one of them starts processing external content that contains injected instructions, the protection is already in place.

Strake does more than protect your keys

The founding premise was key security: your real API keys stay vaulted, your tools get disposable endpoints, leaked tokens rotate in seconds. That's still the core.

But the same proxy that guards your credential also sees everything that flows through it. Key vaulting keeps real API keys out of your tools and environment variables. PII protection catches sensitive fields before they reach your provider. Semantic caching serves repeated prompts from cache, cutting token spend. Jailbreak protection adds one more layer: adversarial prompts stopped at the proxy before the model ever sees them.

All of it runs per endpoint, independently configurable from your dashboard, without touching your application code.

Jailbreak protection is available now on all Strake endpoints. If you're already using Strake, it's one toggle away in your endpoint settings. If you're new, your first endpoint takes about two minutes to set up.

Stop Adversarial Prompts Before They Reach Your Model

What a jailbreak actually is

Why this is harder to defend at the application layer

Jailbreak protection in Strake

Why this fits at the proxy layer

Strake does more than protect your keys

Block adversarial prompts before they reach your model.