Giskard Review: Open-Source Testing and Evaluation for LLM and RAG Apps

Most of the AI-security tooling we review here is offensive — garak and PyRIT exist to attack a model and see what breaks. Giskard sits in a slightly different, complementary place: it’s a testing-and-evaluation library that automatically scans an AI system for a range of issues — including security-relevant ones — and that has a strong story for evaluating RAG applications specifically. This is a long-form, capability-focused review drawn from Giskard’s own repository and documentation; where something depends on version or configuration, we flag it.

What Giskard is

Giskard is an open-source Python library maintained by Giskard-AI, released under the Apache 2.0 license. Its scope is testing and evaluating AI systems — both traditional ML models and LLM/RAG applications — for functional correctness and safety. The distinguishing idea is automation: rather than only giving you primitives to write tests by hand, Giskard can automatically scan a model and surface categories of problems it finds.

The library has been evolving toward testing of dynamic, multi-turn agentic systems, with newer scenario-based testing APIs alongside the established scan and RAG-evaluation features. We’ll focus on the well-documented core capabilities and note where the project is heading.

The Scan

The headline feature is the automated Scan. You wrap your model, call the scan, and Giskard runs a battery of detectors against it, producing a report of detected vulnerabilities grouped by category. For LLM applications, the categories the scan addresses include:

Hallucination — incoherent or fabricated outputs, including sensitivity to how a question is framed.
Prompt injection — susceptibility to inputs that override intended instructions.
Harmfulness — generation of harmful or toxic content.
Robustness — instability of outputs under perturbations of the input.
Performance and bias issues — for ML models, the scan also covers correctness and fairness-type problems.

The value of the scan is that it’s automatic and categorized. Instead of you knowing in advance which prompts to try, Giskard probes the model and tells you which categories show problems, with examples. That makes it a strong first pass: run the scan, see where the model is weak, then dig in. It’s closer to a static-analysis-for-models experience than a manual red-team campaign.

The Python API

Giskard’s invocation model is deliberately lightweight. You wrap your prediction function in a model object and run the scan:

import giskard

# Wrap your LLM/RAG app behind a prediction function
giskard_model = giskard.Model(
    model=my_prediction_fn,   # takes a DataFrame, returns outputs
    model_type="text_generation",
    name="My Support Assistant",
    description="Answers product questions from the knowledge base",
    feature_names=["question"],
)

results = giskard.scan(giskard_model)
results.to_html("scan_report.html")

The giskard.Model() wrapper is the integration point: as long as you can express your application as a function that takes inputs and returns outputs, Giskard can scan it. That’s what lets it cover RAG pipelines and LLM apps, not just bare models — you wrap the application, not a checkpoint. The scan result can be exported (for example to an HTML report) or turned into a reusable test suite.

RAGET: evaluating RAG specifically

The feature that earns Giskard a distinct place in the toolkit is RAGET — the RAG Evaluation Toolkit. RAG applications are notoriously hard to evaluate because failure can come from any stage: retrieval pulling the wrong context, the generator ignoring good context, the knowledge base being incomplete, or the rewriting/routing logic misfiring.

RAGET addresses this by generating a synthetic test set from your knowledge base — questions, reference answers, and the supporting context — so you have an evaluation dataset without hand-authoring one. It’s designed to help you attribute failures to specific components of the RAG pipeline rather than just scoring end-to-end accuracy. For anyone shipping a retrieval-augmented assistant, that component-level attribution is the hard part, and a tool that bootstraps the evaluation set from your own documents removes the biggest barrier to testing RAG at all.

This is the capability that makes Giskard complementary to, rather than overlapping with, the red-team scanners: garak and PyRIT attack the model; RAGET tells you whether your retrieval-augmented application answers correctly and where it breaks down.

Where Giskard fits in the stack

It helps to be explicit about boundaries:

Giskard is most valuable for:

A fast, automated first-pass scan of an LLM/RAG application across multiple issue categories (hallucination, injection, harmfulness, robustness).
Evaluating RAG applications with synthetic test sets and component-level attribution via RAGET.
Producing shareable reports and reusable test suites that can live in your testing process.
Teams that want testing-and-evaluation framed as software testing, integrated into development rather than run as a separate security exercise.

Giskard is not a replacement for:

A comprehensive adversarial red-team campaign. Its prompt-injection and harmfulness detectors are a valuable signal, but they’re a scan, not the broad probe library of garak or the orchestrated, multi-turn campaigns of PyRIT.
A runtime guardrail. Giskard tests your application; it doesn’t sit inline blocking malicious inputs at request time the way an output-validation layer like Guardrails AI does.
Model-level hardening. It tells you where a model is weak; fixing it is a separate effort (better prompts, retrieval changes, fine-tuning, guardrails).

Turning a scan into a test suite

A one-off scan is useful for triage, but the higher-value pattern is treating Giskard’s output as the seed of a durable test suite that runs in CI. The scan surfaces categories of weakness; you then encode the specific failures you care about as tests that fail the build if they regress. This is the “software testing” framing that distinguishes Giskard from a pure red-team scanner — it’s designed to live inside a development workflow, not beside it.

The practical loop:

Scan the wrapped application to discover where it’s weak (hallucination under certain phrasings, susceptibility to an injection pattern, brittle behavior under input perturbation).
Promote real findings to tests. Convert the failures worth guarding against into a reusable suite, so a future prompt change or model swap that reintroduces them is caught automatically.
Run the suite in CI as a gate, the same way you’d run unit tests, rather than treating model quality as something you only check manually before launch.
Regenerate RAG evaluation sets with RAGET when your knowledge base changes, so the eval data tracks the actual corpus rather than going stale.

This regression posture is where Giskard pays off over time. The first scan tells you the current state; the suite tells you whether you’re holding the line as the application evolves. For LLM/RAG apps that change frequently — new prompts, new retrieval logic, new model versions — that automated regression check is the difference between knowing your quality and hoping it.

Limitations and what to watch

Honesty about boundaries matters as much as the capability list:

A scan is a sample, not a proof. Giskard probes for categories of issue; a clean scan means it didn’t find problems in what it tried, not that none exist. It complements, rather than replaces, adversarial red-teaming for high-stakes systems.
Detector quality varies by category. Automated detection of hallucination or harmfulness is inherently approximate; treat flagged items as candidates to review, not confirmed defects, and read the examples the scan provides.
It tests, it doesn’t defend. Nothing in Giskard sits inline at request time. Findings inform your prompts, retrieval, and guardrails; the fixing is separate work.
Version churn. Because the project has been actively reworked toward multi-turn agent testing, the exact API surface (classic scan, RAGET, scenario-based checks) depends on the version you install. Pin a version and read its docs rather than assuming a tutorial matches your install.
You must wrap the real application. Scanning a bare model instead of the full prediction function (including retrieval and any post-processing) gives you results that don’t reflect production behavior. The wrapper is the integration point that makes the results meaningful.

None of these are disqualifying; they’re the normal shape of an automated testing tool. The failure mode to avoid is treating a green Giskard scan as a security sign-off. It’s a strong, automated first pass and an excellent regression harness — not a substitute for the offensive tooling or the runtime controls around it.

Practical adoption notes

Wrap the application, not the model. The power of Giskard for LLM work comes from wrapping your full prediction function (including retrieval), so the scan reflects real behavior.
Use RAGET early. Generating a synthetic eval set from your knowledge base before launch gives you a baseline you can regression-test against as you change prompts or retrieval.
Treat the scan as triage. The categorized report tells you where to focus deeper testing; it’s the start of an evaluation process, not the end.
Mind versioning. Giskard has been actively reworked toward multi-turn agent testing; check the docs for the version you install to confirm which APIs (the classic scan, RAGET, or the newer scenario-based checks) are current for your install.

Verdict

Giskard is the right tool when you want automated, categorized testing of an LLM or RAG application as part of your development process, and especially when you need to evaluate a RAG pipeline — RAGET’s synthetic test-set generation is its standout capability and solves a problem most teams otherwise skip. It’s open source, lightweight to integrate via giskard.Model(), and produces shareable reports and reusable suites.

It is not the tool for an exhaustive adversarial campaign or for runtime defense — pair it with a red-team scanner (garak for breadth, PyRIT for orchestrated multi-turn assessment) and a runtime guardrail layer. In a complete stack, Giskard is the application-and-RAG evaluation layer: the one that asks “does this thing actually answer correctly, and where does it fall down,” which the offensive tools don’t directly answer. See our AI security tool evaluation framework for how these pieces combine.

Giskard Review: Open-Source Testing and Evaluation for LLM and RAG Apps

What Giskard is

The Scan

The Python API

RAGET: evaluating RAG specifically

Where Giskard fits in the stack

Turning a scan into a test suite

Limitations and what to watch

Practical adoption notes

Verdict

Sources

AI Sec Reviews — in your inbox

Related

Protect AI's ModelScan and NB Defense: Open-Source AI Supply-Chain Scanning

PyRIT Deep Dive: Microsoft's AI Red Teaming Framework in Practice

Garak Deep Dive: Architecture, Probes, and Operating the NVIDIA LLM Scanner

Comments