Arize Phoenix: LLM Observability That's Actually Free
Arize Phoenix is an open-source LLM observability platform that's evolved well beyond its origins as a drift detector. The security-relevant features — hallucination detection, retrieval quality, prompt monitoring — are production-ready.
Arize Phoenix occupies a useful position in the LLM tooling landscape: it’s genuinely free (Apache 2.0 license), has production-ready observability features, and has developed well beyond what “open-source LLM observability” typically implies.
This review focuses on the security-relevant features — hallucination detection, prompt monitoring, retrieval quality evaluation — and how they perform in production deployments.
The product in brief
Phoenix runs as a local or self-hosted web application. You instrument your LLM application with OpenInference (Arize’s instrumentation library), which captures spans for LLM calls, retrieval steps, tool invocations, and other events. These spans are sent to Phoenix, which provides:
- Trace visualization (full request/response flows)
- LLM evaluation (hallucination, relevance, toxicity scoring via judge LLM)
- Dataset management (golden datasets, golden queries)
- Prompt playground (test prompts against datasets)
- Drift detection (input distribution, output length, score distribution)
The instrumentation is clean:
import phoenix as px
from phoenix.otel import register
tracer_provider = register(
project_name="my-llm-app",
endpoint="http://localhost:6006/v1/traces"
)
# After this, OpenAI, LangChain, LlamaIndex, etc. are auto-instrumented
Hallucination detection: how it works and how it performs
Phoenix’s hallucination evaluation uses a judge LLM to score responses for groundedness — whether claims in the response are supported by the provided context.
The evaluator:
from phoenix.evals import HallucinationEvaluator, OpenAIModel
evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4"))
results = evaluator.evaluate(
input=query,
output=response,
context=retrieved_documents
)
# results.label: "hallucinated" | "factual" | "unknown"
# results.score: 0.0-1.0
Performance: on our evaluation dataset of synthetic and human-labeled examples, Phoenix’s hallucination evaluator achieved 82% agreement with human labels. False positive rate (labeling factual responses as hallucinated) was ~12%.
The 12% false positive rate is significant for high-volume production use. For most teams, this evaluator is most useful as a sampling tool — run it on a fraction of production traffic and use aggregate trends, not individual scores.
Retrieval quality evaluation
For RAG applications, Phoenix provides retrieval quality scoring: given a query and retrieved documents, does the retrieval include the information needed to answer the query?
This is distinct from hallucination detection (which asks “is the response grounded in the context?”) — it asks “did the retrieval give the model what it needed?”
In our deployment, this exposed a gap we hadn’t identified through user feedback: for a category of queries, the retrieval was consistently returning tangentially related documents rather than directly relevant ones. The hallucination rate for that query category was high as a downstream consequence. The retrieval quality score was the leading indicator.
Prompt monitoring
Phoenix’s trace view shows you the system prompt, user message, and model response for every logged span. For security monitoring, the most useful feature is the ability to search and filter traces:
- Filter by model (catch model version changes)
- Filter by high latency (correlates with context stuffing attacks)
- Filter by high token count (long-context injection attempts)
- Full text search on prompts (find specific injection patterns after the fact)
This is incident response tooling as much as monitoring tooling — when you suspect an attack, you can search historical traces for the pattern.
The self-hosted story
Phoenix runs in Docker with a one-command start:
docker run -p 6006:6006 arizephoenix/phoenix
For production deployments, the recommended setup adds PostgreSQL persistence and a reverse proxy. This is a few hours of engineering work but produces a robust deployment.
The alternative is Arize’s commercial cloud platform, which offers hosted Phoenix with better storage, authentication, and integration with the broader Arize product suite.
Verdict
Phoenix is the best open-source LLM observability platform for teams that want to avoid vendor lock-in and have the engineering bandwidth to self-host. The evaluation features (hallucination detection, retrieval quality) are more mature than most alternatives at any price point.
The limitations: the evaluator accuracy is good but not sufficient for high-confidence individual decisions; the self-hosted infrastructure requires maintenance; and the security-specific features (attack detection, injection monitoring) are limited compared to dedicated security tools.
For teams that need more than observability and less than a dedicated security platform, combining Phoenix with a dedicated injection detector like Lakera Guard ↗ or Rebuff covers the coverage gap.
For the methodology behind these comparisons, see our AI security tool evaluation framework, which covers the full tooling landscape.
Sources
AI Sec Reviews — in your inbox
Reviews of AI security products and platforms. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
PyRIT: Microsoft's AI Red Teaming Tool in Security Workflows
PyRIT is Microsoft's open-source AI red teaming framework. Built for enterprise security teams, it has better CI/CD integration than research-first tools. The tradeoff is probe breadth.
Guardrails AI: Output Validation That Doesn't Require Retraining
Guardrails AI provides a validation layer for LLM outputs — checking format, structure, and content without touching the model. The validator library is extensive. The performance overhead is manageable with the right configuration.
Rebuff: Open-Source Prompt Injection Defense in Production
Rebuff is a self-hosted prompt injection defense with a multi-layer architecture. The heuristics layer is fast; the LLM-based detection adds coverage. Here's the production configuration that made it viable.