Rebuff: Open-Source Prompt Injection Defense in Production
Rebuff is a self-hosted prompt injection defense with a multi-layer architecture. The heuristics layer is fast; the LLM-based detection adds coverage. Here's the production configuration that made it viable.
Rebuff (from ProtectAI) is an open-source prompt injection defense library. The project was released in 2023 and has maintained active development. For teams with compliance requirements that preclude sending user inputs to external APIs, or teams with tight cost constraints, it’s the most production-ready open-source option in the space — the self-hosted counterpart to a managed detector like Lakera Guard.
This is not a typical “here’s the GitHub project” review. We deployed it, tuned it, hit its limits, and found the configuration that made it viable, following our AI security tool evaluation framework.
Architecture
Rebuff’s defense has four layers, each providing different coverage:
1. Heuristics detection: Regular expression and string matching against a curated set of known injection patterns. Fast (~1ms), no external calls, catches the most obvious attacks.
2. Vector similarity detection: Compares the input against a database of known injection attempts using semantic similarity. Catches paraphrased variants of known attacks.
3. LLM-based detection: Passes the input to an LLM (you configure which one) with a meta-prompt asking it to evaluate whether the input is a prompt injection attempt. Most thorough, highest latency.
4. Canary token detection: Embeds a token in the system prompt and checks whether the LLM’s output contains that token (which would indicate the system prompt was leaked or the context was compromised).
You can run any combination of layers. The default runs all four.
Performance with the default configuration
Default configuration detection rates on our test set:
- Clear injection attempts (explicit “ignore previous instructions”): 95%+
- Persona jailbreaks: 78%
- Encoding-based attacks (base64, ROT13): 45% — the heuristics layer doesn’t normalize encoded inputs
- Adversarially optimized bypass: 40%
Default configuration latency:
- Heuristics only: <5ms
- Heuristics + vector: ~30ms
- All four layers: 150-300ms (depends on LLM call latency)
The full-stack latency is too high for synchronous use in interactive applications. This is where the configuration work pays off.
The production configuration that worked
We deployed Rebuff in a tiered configuration:
Tier 1 (all inputs, synchronous): Heuristics + vector similarity only. Latency ~30ms. Catches ~75% of attacks.
Tier 2 (inputs that pass Tier 1 but have risk signals, asynchronous): LLM-based detection runs after the response has already been delivered. If Tier 2 flags, the session gets additional scrutiny on subsequent turns and the input is added to the vector database.
Tier 3 (periodic): Canary token monitoring runs on a sample of sessions to detect context leakage.
This configuration gives us:
- ~30ms synchronous overhead for all requests
- LLM detection coverage for risky inputs without blocking latency
- Context leakage monitoring
The trade-off: Tier 2 doesn’t prevent the first harmful response if Tier 1 misses it. This is acceptable if the application’s harm model allows a one-turn window before detection.
Operational requirements
Rebuff requires:
- A vector database (Pinecone in the default config; can be swapped for any supported store)
- An LLM API key for the LLM detection layer (OpenAI or compatible)
- The Rebuff server process (can run as a Docker container)
The infrastructure overhead is real. Budget one engineer-day for initial setup and one engineer-day per quarter for maintenance.
Where it falls short
Encoding-based attacks: The heuristics layer doesn’t decode Base64, ROT13, or other encodings before pattern matching. Adding a preprocessing normalization step before the heuristics layer is worth implementing.
Multilingual: The vector database and LLM detection work across languages, but the heuristics patterns are primarily English. Non-English injection attempts may bypass the heuristics layer more easily.
Maintenance of the pattern database: The heuristics patterns need updating as new attack patterns emerge. The repo updates periodically, but you’re responsible for keeping your deployed version current.
Verdict
Rebuff is the right choice for:
- Teams with data residency requirements that preclude external API calls
- High-volume deployments where per-request API costs are a constraint
- Teams with engineering bandwidth to operate a self-hosted security layer
The tiered deployment configuration makes it viable in production. Out of the box, the latency profile requires configuration work before it’s production-appropriate.
For teams comparing self-hosted vs. managed injection detection, the benchmark data at bestllmscanners.com ↗ covers the coverage gap between the two approaches.
For more context, AI defense strategies ↗ covers related topics in depth.
Sources
AI Sec Reviews — in your inbox
Reviews of AI security products and platforms. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
PyRIT: Microsoft's AI Red Teaming Tool in Security Workflows
PyRIT is Microsoft's open-source AI red teaming framework. Built for enterprise security teams, it has better CI/CD integration than research-first tools. The tradeoff is probe breadth.
Guardrails AI: Output Validation That Doesn't Require Retraining
Guardrails AI provides a validation layer for LLM outputs — checking format, structure, and content without touching the model. The validator library is extensive. The performance overhead is manageable with the right configuration.
Arize Phoenix: LLM Observability That's Actually Free
Arize Phoenix is an open-source LLM observability platform that's evolved well beyond its origins as a drift detector. The security-relevant features — hallucination detection, retrieval quality, prompt monitoring — are production-ready.