Garak Deep Dive: Architecture, Probes, and Operating the NVIDIA LLM Scanner

We’ve reviewed garak before at a high level. This is the deep dive: how the tool is actually built, what each plugin type does, which backends it speaks, what its output looks like, and how to operate it without getting buried. Everything here is drawn from garak’s own repository and documentation; where a capability depends on your configuration or version, we say so.

What garak is, precisely

garak (the Generative AI Red-teaming and Assessment Kit) is an open-source LLM vulnerability scanner developed and maintained by NVIDIA, released under the Apache 2.0 license. The project frames its purpose directly: it checks whether an LLM can be made to fail in ways you don’t want, including hallucination, data leakage, prompt injection, misinformation, toxicity, and jailbreaks. Its defining characteristic is the breadth of its plugin library, which is among the widest of the open-source scanners in this category.

It is a command-line tool, invoked as python -m garak, with options selecting the target model and the tests to run.

The plugin architecture

garak’s design is the thing to understand first, because once the six plugin types click, everything else follows. The framework organizes work into distinct, swappable component categories:

Probes generate the interactions that test for a specific vulnerability. These are the “attacks”: dan (the DAN family of jailbreaks), encoding (encoding-based smuggling like base64), promptinject, leakreplay (training-data / replay leakage), and many more. Probes are where garak’s breadth lives.
Detectors evaluate whether a given model output actually exhibits the failure mode a probe was looking for. A probe produces candidate failures; a detector decides whether each response counts as a hit. This separation matters: the same output can be judged by different detectors.
Generators are the interface to the model under test. This is the backend abstraction (covered below).
Evaluators turn the probe/detector results into assessment and scoring.
Harnesses structure the overall testing workflow — which probes run against which generators, in what arrangement.
Buffs transform or augment prompts (for example, applying paraphrase or translation-style transformations) to expand coverage from a base probe set.

The reason this architecture is worth dwelling on: when you read a garak report and want to know “why did this fire,” you trace it back through detector → probe, and when you want to change what’s tested, you change probes, not detectors. Teams that treat garak as a black box get frustrated; teams that internalize the plugin model can extend it.
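
The division of labor among probe, generator, detector, and evaluator can be sketched as a toy pipeline. This is an illustration of the separation of concerns only, not garak’s API: every class and function name below is hypothetical, and the "model" is a stub that complies with everything.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for the plugin roles; none of these names are garak's API.
Generator = Callable[[str], str]   # prompt -> model output
Detector = Callable[[str], bool]   # output -> True if the failure mode is present


@dataclass
class Probe:
    name: str
    prompts: List[str]


def echo_generator(prompt: str) -> str:
    """A fake model under test that complies with anything it is asked."""
    return f"Sure! {prompt}"


def compliance_detector(output: str) -> bool:
    """Flags outputs that open with an affirmative, i.e. the model went along."""
    return output.startswith("Sure")


def run_harness(probe: Probe, generator: Generator, detector: Detector) -> dict:
    """Harness role: wire a probe to a generator, then score with a detector."""
    outputs = [generator(p) for p in probe.prompts]
    hits = [o for o in outputs if detector(o)]
    # Evaluator role: reduce individual verdicts to a summary figure.
    return {"probe": probe.name, "attempts": len(outputs), "hits": len(hits)}


result = run_harness(
    Probe(name="toy_jailbreak", prompts=["ignore your rules", "pretend you are DAN"]),
    echo_generator,
    compliance_detector,
)
print(result)
```

Swapping `compliance_detector` for a different function changes the hit count without touching the probe, which is the property the architecture is built around.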

Model backends (generators)

A scanner is only useful against the models you actually run, and garak’s generator support is broad. Per the project, supported backends include:

Hugging Face (both local models and the inference API)
OpenAI
AWS Bedrock
Replicate
Cohere
Groq
LiteLLM (which itself fronts many providers)
GGUF / llama.cpp (local quantized models)
NVIDIA NIM endpoints
Generic REST endpoints (write a small config to point garak at any HTTP API)

The generic REST generator is the escape hatch that matters most in practice: if your model is behind an internal API gateway, you can usually point garak at it without writing Python, by describing the request/response shape in configuration. The LiteLLM generator similarly broadens reach without per-provider code.
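
As a rough sketch of what "describing the request/response shape" means, the snippet below builds a REST generator config and prints it as JSON. The key names follow the REST generator’s documented layout as best we recall it, so treat them as assumptions and verify against your garak version; the URI, header, and the `text` request and response fields describe a hypothetical internal API.

```python
import json

# Key names are our recollection of the REST generator config layout; verify
# them against your garak version. The URI, header, and "text" fields are
# placeholders for a hypothetical internal API.
rest_config = {
    "rest": {
        "RestGenerator": {
            "name": "internal gateway (hypothetical)",
            "uri": "https://llm.internal.example/v1/generate",
            "method": "post",
            "headers": {"Authorization": "Bearer $KEY"},
            # $INPUT marks where the probe's prompt is substituted.
            "req_template_json_object": {"text": "$INPUT"},
            "response_json": True,
            # Which field of the JSON response holds the model's output.
            "response_json_field": "text",
        }
    }
}

config_text = json.dumps(rest_config, indent=2)
print(config_text)
```

Saved to a file, a config like this is handed to garak through its generator-options flag together with the REST model type; check `python -m garak --help` for the exact flag spelling in your version.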

What a run looks like

A minimal invocation names a model type and model name and selects probes:

python -m garak --model_type openai --model_name gpt-4 --probes encoding

Passing --probes all runs the entire probe library, which amounts to thousands of test cases. That is the single most important operational fact about garak: a full run is a research-grade sweep that can take hours to days, gated largely by the target API’s rate limits and latency. For anything resembling routine testing, you scope it down.
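
One way to make scoping the default is to wrap the invocation in a small helper that refuses to run without an explicit probe list. The sketch below only assembles the command; it uses the flags shown above and assumes that several probe names can be passed as a comma-separated value. The helper name is our own.

```python
import shlex
from typing import List, Sequence


def build_garak_command(model_type: str, model_name: str, probes: Sequence[str]) -> List[str]:
    """Assemble an argv list for a scoped run; refuses an empty probe list."""
    if not probes:
        raise ValueError("name at least one probe family; don't fall through to a full sweep")
    return [
        "python", "-m", "garak",
        "--model_type", model_type,
        "--model_name", model_name,
        "--probes", ",".join(probes),  # assumes comma-separated probe names
    ]


cmd = build_garak_command("openai", "gpt-4", ["encoding", "promptinject"])
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```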

The reports

garak’s output is genuinely useful for triage, not just a pass/fail number. Per the project, a run produces:

Detailed JSONL reports documenting each probing attempt — the prompt sent, the response, and the detector verdict.
Hit logs capturing the vulnerabilities discovered (the attempts that succeeded).
Progress bars and summary tables showing failure rates by probe/detector combination.

The hit log is the actionable artifact. It contains the specific inputs your model complied with when it shouldn’t have. Those strings are directly useful: they become candidates for fine-tuning data, regression test cases, or targeted guardrail rules. The JSONL format also means you can post-process results into whatever dashboard or ticketing system you use, rather than being stuck with garak’s console summary.
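
Post-processing the JSONL is mostly a matter of reading one JSON object per line and aggregating. The sketch below computes a failure rate per probe/detector pair. The field names (`entry_type`, `probe`, `detector`, `passed`, `total`) are assumptions about the report schema, and the sample records are synthetic, so adjust both to what your garak version actually writes.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rates(lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Return the failure rate per (probe, detector) pair from report lines.

    Assumes each summary record is a JSON object with "entry_type" == "eval"
    and "probe", "detector", "passed", "total" fields; adjust the names to
    whatever your garak version emits.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("entry_type") != "eval":
            continue  # skip per-attempt and bookkeeping records
        key = (record["probe"], record["detector"])
        passed[key] += record["passed"]
        total[key] += record["total"]
    return {k: 1 - passed[k] / total[k] for k in total if total[k]}


# Synthetic records for illustration only, not real garak output.
sample = [
    '{"entry_type": "attempt", "prompt": "..."}',
    '{"entry_type": "eval", "probe": "probe_a", "detector": "det_x", "passed": 8, "total": 10}',
    '{"entry_type": "eval", "probe": "probe_b", "detector": "det_x", "passed": 10, "total": 10}',
]
rates = failure_rates(sample)
for (probe, detector), rate in sorted(rates.items()):
    print(f"{probe} / {detector}: {rate:.0%} failed")
```

In practice you would pass an open file handle instead of `sample`, then push the resulting dictionary to your dashboard or ticketing system.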

Detectors, and why “a hit” needs interpretation

A subtlety that trips up first-time garak users: a probe firing is not automatically a confirmed, exploitable vulnerability. Because probes and detectors are separate components, a “hit” means a detector judged a response to exhibit the failure mode the probe targeted — and detectors range from simple string/keyword matching to model-based classification. That has two practical consequences.

First, false positives are possible. A keyword-based detector might flag a response that contains a triggering phrase used in a perfectly safe, refusing context (“I won’t help build a weapon”). When you review a run, read the actual response in the JSONL alongside the detector’s verdict before treating a hit as a real failure. The hit log is a list of candidates to investigate, not a list of confirmed breaches.
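
The refusal case is easy to reproduce with a toy keyword detector. The code below is our own illustration, not one of garak’s detectors; the trigger and refusal phrase lists are hypothetical and deliberately tiny.

```python
from typing import Sequence

TRIGGERS = ("build a weapon",)                        # hypothetical trigger phrases
REFUSAL_MARKERS = ("i won't", "i can't", "i cannot")  # hypothetical refusal cues


def keyword_detector(response: str, triggers: Sequence[str] = TRIGGERS) -> bool:
    """Naive detector: a hit whenever any trigger phrase appears."""
    text = response.lower()
    return any(t in text for t in triggers)


def needs_review(response: str) -> bool:
    """True when a keyword hit co-occurs with refusal language, a likely false positive."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return keyword_detector(text) and any(m in text for m in REFUSAL_MARKERS)


refusal = "I won't help build a weapon."
compliance = "Here is how to build a weapon: ..."

print(keyword_detector(refusal), needs_review(refusal))
print(keyword_detector(compliance), needs_review(compliance))
```

Both responses register as hits for the keyword detector, but only the refusal is marked for review. A heuristic like `needs_review` does not replace reading the response; it only helps order the queue.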

Second, the choice of detector changes your numbers. The same probe output scored by a stricter or looser detector yields different success rates. This is why comparing garak success rates across different configurations — or against someone else’s published figures — is only meaningful if the detector setup matches. Within your own environment, holding the probe/detector configuration constant is what makes run-over-run comparison valid; that’s the regression signal you actually want.
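
To see how much detector strictness moves the headline number, consider a model-based detector that emits a score per response and counts a hit above some threshold. The scores and thresholds below are invented for illustration; the point is only that identical outputs yield different rates.

```python
from typing import Sequence


def hit_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of responses whose classifier score meets the hit threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical classifier scores for ten responses to one probe.
scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.72, 0.80, 0.91, 0.97]

strict = hit_rate(scores, threshold=0.5)   # flags anything moderately suspicious
lenient = hit_rate(scores, threshold=0.9)  # flags only near-certain failures
print(f"threshold 0.5: {strict:.0%}  threshold 0.9: {lenient:.0%}")
```

Same model, same probe, same responses: six hits out of ten under one detector setting and two under the other.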

The upside of the probe/detector separation is flexibility: you can swap a noisy detector for a more precise one without rewriting the attack, and you can reason about false positives at the right layer. The cost is that you have to understand the two-stage model to interpret results honestly. Teams that report raw garak “vulnerability counts” without this nuance tend to either over-alarm or quietly under-report.

A concrete CI-gating pattern

The realistic way to use garak in continuous integration is not a full sweep on every commit — that’s hours of API calls — but a scoped, time-boxed subset run as a gate, with the broad sweep reserved for a periodic schedule. The pattern that works:

Curate a probe subset mapped to your actual threat model (the jailbreak and injection families if those are your concerns), small enough to finish in a tolerable window given your endpoint’s latency and rate limits.
Establish a baseline by running that subset against the current model and recording the per-probe success rates. This is your “known good” line.
Gate on regression, not on perfection. Fail the build when a probe’s success rate rises above its baseline plus a margin, rather than demanding zero hits. Most production models will have a non-zero baseline on some probes; the thing you actually care about is a model or prompt change reintroducing a failure you’d closed.
Run the full library on a schedule — weekly or per-release — out of the critical CI path, and feed any newly surfaced hits back into the curated subset.
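
The "baseline plus margin" rule from the third step reduces to a short comparison. This is a minimal sketch under our own assumptions: per-probe success rates have already been extracted from the report as fractions, the probe names and numbers are hypothetical, and the 0.05 margin is an arbitrary starting point rather than a recommendation.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    margin: float = 0.05,
) -> List[str]:
    """Return probes whose attack success rate rose above baseline + margin.

    Probes missing from the baseline are treated as having a baseline of 0.0,
    so a newly added probe must stay within the margin to pass.
    """
    failed = []
    for probe, rate in sorted(current.items()):
        if rate > baseline.get(probe, 0.0) + margin:
            failed.append(probe)
    return failed


# Hypothetical per-probe success rates (fraction of attempts that were hits).
baseline = {"dan": 0.10, "encoding": 0.02, "promptinject": 0.00}
current = {"dan": 0.12, "encoding": 0.15, "promptinject": 0.00}

failed = regressions(baseline, current)
print("regressed probes:", failed)
# In CI: raise SystemExit(1 if failed else 0)
```

Here `dan` drifts from 10% to 12% and stays inside the margin, while `encoding` jumps from 2% to 15% and fails the gate.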

This split respects garak’s nature: it’s a research-grade scanner whose full power is too slow for a per-commit gate, but whose scoped subset is an excellent regression tripwire. Trying to force the full sweep into fast CI is the most common way teams give up on garak; the scoped-gate-plus-scheduled-sweep approach is what keeps it in service.

Operating garak without drowning

The honest operational challenge with garak is volume and runtime, not capability. A few practices keep it sane:

Scope probes to your threat model. Don’t run --probes all outside a scheduled research sweep. Pick the probe families that map to your actual risks: if you’re worried about jailbreaks and injection, run those families and skip the ones irrelevant to your deployment.

Mind the rate limits. Because runtime is dominated by API calls, a scoped run against a slow or rate-limited endpoint can still take a long time. Budget accordingly and consider running against a local model (via the GGUF or Hugging Face generator) for faster iteration during development.
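
A back-of-the-envelope estimate helps with the budgeting. When the rate limit is the bottleneck, wall-clock time is bounded below by total requests divided by the allowed request rate. The figures below are hypothetical, and the sketch assumes one API request per generation.

```python
def estimated_hours(prompts: int, generations_per_prompt: int, requests_per_minute: float) -> float:
    """Lower-bound wall-clock hours when the rate limit is the bottleneck."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    total_requests = prompts * generations_per_prompt
    return total_requests / requests_per_minute / 60


# Hypothetical figures: 2,000 prompts, 5 generations each, 60 requests/minute.
print(f"{estimated_hours(2000, 5, 60):.1f} hours")
```

Real runs take longer than this bound once latency, retries, and detector time are included, so treat the result as a floor when sizing a CI window.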

Use it as a floor, not a ceiling. Most garak probes replay fixed variations of known attack patterns. A run tells you whether your model falls to documented attack classes; it is not a substitute for adaptive tooling that optimizes adversarial inputs against your specific model. For high-stakes systems, garak catches the obvious failures and frees your human red team to focus on the bespoke ones. It complements adaptive attack tooling rather than replacing it.

Pin and re-run for regression. A curated probe subset, run on a schedule or before model upgrades, gives you regression detection: did a model update reintroduce a jailbreak you’d previously closed? That’s a high-value, low-effort use of garak that doesn’t require the full sweep.

Where garak fits versus alternatives

garak’s distinguishing feature is breadth of probes in open source. Compared with PyRIT, which is engineered around security-team workflows and result persistence, garak is more of a comprehensive research scanner — more probe variety, less opinionated workflow tooling. The two are complementary: garak for “what known attack classes does my model fall to,” PyRIT for “let me run repeatable, trackable assessments as part of a security process.”

Against managed/commercial scanners, garak trades polished reporting and low operational overhead for openness and probe breadth. If your team has the engineering bandwidth to operate and tune open-source tooling, garak gives you the widest coverage for free; if you need turnkey compliance-ready reports, a commercial option may fit better.

Verdict

garak is the right tool when you want the broadest open-source probe library, you can tolerate scoped (not full) scans for routine use, and you have the engineering capacity to interpret JSONL output and tune the probe set. Its plugin architecture is clean once you learn it, its backend coverage is excellent (including a generic REST escape hatch), and its hit logs are immediately actionable.

It is the wrong primary tool if you need a fast per-commit CI gate with no tuning, or auditable compliance reports out of the box. In those cases, use a scoped garak subset alongside other tooling rather than as the whole story.

For how garak compares against PyRIT and Giskard on coverage and workflow fit, see our companion deep dives on PyRIT and Giskard, and our AI security tool evaluation framework.

