AI Agent Trajectory Testing 2026: LangSmith vs Braintrust vs Arize Phoenix vs Galileo
Agent trajectory testing compared for 2026 - LangSmith, Braintrust, Arize Phoenix, Galileo, Anthropic Agent evals, OpenAI Evals, DeepEval. Golden trajectories, LLM-as-judge at trajectory level, tool-call precision/recall, replay testing, and CBUAE AI Guidance evidence.
AI agent trajectory testing is the evaluation discipline that emerged in 2024 and matured into production-default practice in 2026. Where traditional LLM evaluation scores a single input/output pair, trajectory testing evaluates the multi-step path an agent takes to reach its output - tool calls, state transitions, approval gates, error recovery, loop behaviour, and budget adherence.
For AI products shipping real agents - tool-using, state-maintaining, multi-step - trajectory evaluation is the difference between “we shipped an agent and hope it works” and “we have measured evidence that this agent behaves correctly across its failure modes”. For regulated deployments (CBUAE AI Guidance, EU AI Act Article 15, FDA SaMD), trajectory-level evidence is increasingly expected, not just output correctness.
This guide compares the 7 dominant agent trajectory testing tools in 2026 - LangSmith, Braintrust, Arize Phoenix, Galileo, Anthropic Claude Agent SDK evals, OpenAI Evals, DeepEval agent metrics - and maps the evaluation methodology most production teams converge on.
Why Trajectory Testing Matters
LLM evaluation asks: was the final answer correct?
Trajectory testing asks: was the path correct?
Consider a customer-service agent asked to process a refund:
- Correct output via wrong path: the agent calls the wrong internal API and refunds from the wrong account, eventually arriving at a refund that looks correct to the customer but creates a reconciliation nightmare next quarter. Single-turn LLM evaluation scores this as success.
- Correct path but missed approval gate: agent performs a refund above the threshold requiring human approval without escalating. Output is correct; operational control failed.
- Correct answer but 45x the cost: agent loops between retrieval and planning because the prompt doesn’t include a stop condition. Customer gets the right refund; the engagement cost $1.80 instead of $0.04.
Each of these is invisible to output-level evaluation. Each is material in production. Each becomes an audit finding in regulated environments.
Trajectory testing catches these failures because it evaluates what happened along the way, not just what came out the other end.
The Distinction from Single-Turn LLM Evaluation
Single-turn LLM evaluation (RAGAS, DeepEval, Promptfoo, much of Braintrust historically) scores one input against one output:
- Faithfulness of the answer to retrieved context
- Relevance of the answer to the question
- Hallucination detection at the output
- Toxicity / bias detection
These are necessary but not sufficient for agents. Agent-specific concerns:
- Tool-call correctness - did the agent select the right tool?
- Tool-argument correctness - did it pass the right arguments?
- Tool-call sequence - is the order of operations appropriate?
- State management - did intermediate state update correctly across steps?
- Approval / HITL gates - were consequential actions escalated appropriately?
- Error recovery - when a tool failed, did the agent retry / degrade / escalate correctly?
- Loop detection - did the agent avoid pathological loops?
- Budget adherence - did it stay within token / call / time budgets?
- Safety boundaries - did it refuse inappropriate actions?
For comprehensive agent evaluation you need both layers - single-turn LLM eval for output quality + trajectory-level eval for path correctness. See our LLM evaluation framework benchmark for the single-turn side.
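The path-level concerns above can be expressed as programmatic checks over a recorded trajectory. The sketch below assumes a minimal, hypothetical trajectory record (not any vendor's schema) and shows budget-adherence, loop-detection, and HITL-gate checks:

```python
# Minimal sketch of path-level checks over a recorded agent trajectory.
# The TrajectoryStep / Trajectory shapes are illustrative, not any tool's API.
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    tool: str        # tool the agent invoked at this step
    arguments: dict  # arguments it passed
    tokens_used: int # cost attribution for budget checks

@dataclass
class Trajectory:
    steps: list[TrajectoryStep] = field(default_factory=list)

def check_budget(traj: Trajectory, max_tokens: int, max_steps: int) -> bool:
    """Budget adherence: total tokens and step count stay within limits."""
    total = sum(s.tokens_used for s in traj.steps)
    return total <= max_tokens and len(traj.steps) <= max_steps

def check_no_tight_loop(traj: Trajectory, window: int = 3) -> bool:
    """Loop detection (crude): flag `window` consecutive identical tool calls."""
    tools = [s.tool for s in traj.steps]
    return all(
        len(set(tools[i:i + window])) > 1
        for i in range(len(tools) - window + 1)
    )

def check_hitl_gate(traj: Trajectory, gated_tool: str, approval_tool: str) -> bool:
    """HITL gate adherence: every gated call must follow an approval step."""
    approved = False
    for step in traj.steps:
        if step.tool == approval_tool:
            approved = True
        elif step.tool == gated_tool and not approved:
            return False
    return True
```

Real harnesses layer semantic comparison and LLM-as-judge evaluators on top of deterministic checks like these, but the deterministic layer is cheap and runs on every trajectory.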
Golden Trajectories: The Evaluation Dataset
The 2026 standard for agent trajectory evaluation is golden trajectories - curated, known-correct multi-step execution traces that form the evaluation dataset.
A golden trajectory captures:
- The original user input
- Expected tool calls in sequence (tool name, arguments, intermediate outputs)
- Expected state transitions between steps
- Expected approval gates with pass/fail criteria
- Expected final output
Creating golden trajectories is expensive - each requires subject-matter-expert review of what “correct” means for a specific query. But they become the canonical correctness signal. Mature agent teams curate 50-500 golden trajectories covering:
- The most common 80% of use cases (breadth)
- Known failure modes (edge cases)
- Regulatory-critical scenarios (high-stakes decisions)
- Adversarial inputs (prompt injection, malformed inputs)
Golden trajectories are version-controlled, reviewed periodically, and expanded as production reveals new failure modes. Think of them as the integration-test suite for your agent - but with semantic rather than exact-match comparison.
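A golden trajectory can be stored as a plain version-controlled record. The sketch below uses illustrative field names (not a standard schema) and a deliberately strict comparison; production harnesses usually relax the exact match with LLM-as-judge semantic matching:

```python
# Illustrative shape for a version-controlled golden-trajectory dataset entry.
# Field names are assumptions for this sketch, not a standard schema.
import json

golden = {
    "id": "refund-over-threshold-001",
    "version": 3,
    "input": "Refund order 8812, it arrived damaged",
    "expected_tool_calls": [
        {"tool": "lookup_order", "args": {"order_id": "8812"}},
        {"tool": "request_approval", "args": {"reason": "refund over threshold"}},
        {"tool": "issue_refund", "args": {"order_id": "8812"}},
    ],
    "expected_gates": ["human_approval_before_refund"],
    "expected_final_output_contains": ["refund", "8812"],
    "tags": ["core-flow", "regulatory-critical"],
}

def matches_golden(actual_tools: list[str], final_output: str, entry: dict) -> bool:
    """Exact-match comparison on tool sequence plus keyword check on output.
    Swap in an LLM-as-judge comparator where exact match is too strict."""
    expected_tools = [c["tool"] for c in entry["expected_tool_calls"]]
    output_ok = all(k in final_output.lower()
                    for k in entry["expected_final_output_contains"])
    return actual_tools == expected_tools and output_ok
```

Because entries like this serialize cleanly to JSON, they diff well in code review, which is what makes periodic SME review of the golden set practical.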
The 7 Trajectory-Testing Tools
LangSmith - The LangChain-Native Leader
LangSmith (LangChain Inc., commercial SaaS with generous free tier) is the most-adopted trajectory testing platform in 2026 for LangChain and LangGraph agents.
Strengths:
- Zero-friction LangChain / LangGraph integration - traces capture automatically from instrumented agents
- Rich trace visualization - graph-based UI showing every step with inputs, outputs, state, and latency
- Evaluation datasets - version-controlled golden-trajectory datasets
- LLM-as-judge evaluators - built-in trajectory-level evaluators using strong reference models
- Human annotation workflow for golden-trajectory curation
- Production monitoring - the same trace infrastructure works in dev and production
Trade-offs:
- LangChain-centric - non-LangChain agents can use LangSmith but lose the integration ease
- SaaS-only - verify UAE/EU region for residency-sensitive workloads
- Pricing scales with trace volume
Fit: LangGraph-based production agents. Default choice for teams already on the LangChain stack.
Braintrust - The Framework-Agnostic Evaluation Platform
Braintrust is the commercial evaluation platform positioned as framework-agnostic - works with LangChain, LlamaIndex, Claude Agent SDK, OpenAI Agents SDK, custom agents.
Strengths:
- Framework-agnostic SDK - instrument any agent, route to Braintrust
- Experiment comparison UX - side-by-side comparison of trajectory evaluations across agent versions
- Prompt-management features alongside evaluation
- Custom metric authoring in Python/TypeScript
- Expanding production observability - historically eval-focused, now adding trace monitoring
Trade-offs:
- Commercial SaaS - data residency requires region verification
- Younger than LangSmith in the eval space (though growing fast)
Fit: multi-framework agent deployments; teams wanting centralized eval management decoupled from their agent framework.
Arize Phoenix - The OSS-First Observability + Eval Platform
Arize Phoenix (Apache 2.0 open source, with commercial Arize AI platform) is the OSS-first option combining observability with evaluation.
Strengths:
- Open source (Apache 2.0) - self-host for data-residency control
- OpenTelemetry-native - uses OTel semantic conventions for LLM operations
- Framework-agnostic - works with any instrumented agent
- Strong trajectory analysis - span-level trace analysis, drift detection, evaluation-as-code
- Production observability focus - built for ongoing production monitoring, not just dev-time eval
Trade-offs:
- Less polished eval UX than Braintrust or LangSmith
- Commercial Arize AI needed for enterprise features (team collaboration, compliance reporting)
Fit: UAE enterprises with data-residency constraints; teams that want OSS-first with commercial upgrade path; OpenTelemetry-native observability strategies.
Galileo - The Enterprise AI Evaluation Platform
Galileo is a commercial AI evaluation platform with strong enterprise features and growing trajectory-testing support.
Strengths:
- Enterprise compliance - SOC 2, ISO 27001, HIPAA
- Hallucination detection depth
- Trajectory insights - emerging capability for multi-step agent analysis
- Multi-modal support - extends to vision / audio agent outputs
Trade-offs:
- Newer to the trajectory space than LangSmith or Braintrust
- Commercial-only; pricing is enterprise-tier
Fit: enterprises wanting a commercial AI eval platform with strong compliance story; teams outgrowing Braintrust / Phoenix and wanting enterprise SaaS.
Anthropic Claude Agent SDK Evals
Anthropic’s Claude Agent SDK ships with evaluation tooling, and Anthropic has published comprehensive evaluation patterns for agents built on Claude.
Strengths:
- Claude-native - deep integration with Sonnet 4.6, Opus 4.7, Haiku 4.5 agent workflows
- MCP-aware - handles Model Context Protocol trajectories natively
- Computer Use evaluation - the only mature evaluation harness for visual GUI agent trajectories
- Anthropic-produced eval methodology - widely cited research-grade evaluation patterns
Trade-offs:
- Claude-focused; less applicable for OpenAI / Google / open-model agents
- Ecosystem less mature than LangSmith
Fit: production Claude Agent SDK deployments; Computer Use agents; teams aligned with Anthropic tooling.
OpenAI Evals
OpenAI Evals (open source framework from OpenAI) has supported agent evaluation since the Assistants API era and has matured through 2025-2026 for OpenAI Agents SDK deployments.
Strengths:
- OpenAI-native - deep integration with GPT-5, GPT-4o, o3 and Assistants features
- Open source (MIT) - self-host compatible
- Growing agent-specific evaluators
- Code Interpreter / File Search evaluation - for Assistants-API-based agents
Trade-offs:
- OpenAI-focused; less applicable to non-OpenAI agents
- Less polished UX than commercial alternatives
Fit: OpenAI Agents SDK deployments; teams wanting OSS eval with OpenAI provider alignment.
DeepEval Agent Metrics
DeepEval (Confident AI, open source) added agent-specific metrics through 2025 - ToolCorrectnessMetric, TaskCompletionMetric, trajectory-level custom metrics via the GEval pattern.
Strengths:
- Open source - Python library, pytest-integrated
- Developer-first - evaluations live alongside application code
- Framework-agnostic - works with any agent framework
- Confident AI platform for team features
Trade-offs:
- Narrower than observability-integrated platforms (LangSmith, Phoenix)
- Primarily dev-time evaluation; less production-monitoring focus
Fit: teams wanting dev-time agent eval in pytest suites; OSS-first organizations.
Comparison Matrix
| Tool | OSS | Framework | Trajectory Focus | Production Obs | Data Residency | Fit |
|---|---|---|---|---|---|---|
| LangSmith | - | LangChain native | Strong | Strong | SaaS regions | LangGraph default |
| Braintrust | - | Agnostic | Strong | Growing | SaaS regions | Multi-framework |
| Arize Phoenix | Yes (Apache 2.0) | Agnostic (OTel) | Strong | Strong | Self-host | UAE residency, OSS-first |
| Galileo | - | Agnostic | Emerging | Strong | SaaS regions | Enterprise compliance |
| Anthropic Claude Agent SDK Evals | Partial | Claude native | Native | Anthropic platform | Anthropic regions | Claude + Computer Use |
| OpenAI Evals | Yes (MIT) | OpenAI native | Good | OpenAI platform | OpenAI regions | OpenAI Agents SDK |
| DeepEval | Yes (Apache 2.0) | Agnostic | Good (library) | Limited | Self-host | pytest-integrated dev eval |
Trajectory Evaluation Metrics
Mature 2026 agent programmes track 5-8 trajectory metrics per agent. The most common:
Trajectory accuracy vs golden - exact or semantic match against the golden trajectory. Use LLM-as-judge for semantic comparison when exact match is too strict. Primary metric for regression detection.
Tool-call precision and recall - did the agent call the right tools (precision) and did it call all the tools it should have (recall)? Computed over expected vs actual tool sequences.
Path efficiency - steps taken vs steps in shortest correct path. High values indicate unnecessary loops or reasoning detours.
Cost efficiency - tokens, API calls, wall-clock time per successful trajectory. Critical metric - cost regression is a silent killer in agent deployments.
Safety refusal rate - proportion of adversarial or out-of-scope inputs correctly refused. Target near-100% for specific categories (unsafe requests, prompt injection, out-of-policy actions).
Recovery success rate - when injected failures occur (tool 500, network timeout, malformed response), does the agent recover gracefully? Measured via chaos-injection evaluation harnesses.
HITL gate adherence - proportion of consequential actions correctly escalated to human approval. For CBUAE-regulated banks, this metric is audit-critical.
Loop detection rate - proportion of inputs that trigger pathological loops, as detected by runtime step or time limits.
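Several of these metrics reduce to small computations over expected vs actual tool sequences. A minimal sketch, assuming tool calls are represented by name (DeepEval's ToolCorrectnessMetric and the platform evaluators compute variants of the same quantities):

```python
# Tool-call precision/recall and path efficiency, following the definitions
# above. Multiset overlap handles repeated calls to the same tool.
from collections import Counter

def tool_call_precision_recall(
    expected: list[str], actual: list[str]
) -> tuple[float, float]:
    """Precision: fraction of actual calls that were expected.
    Recall: fraction of expected calls that were actually made."""
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual) if actual else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall

def path_efficiency(actual_steps: int, shortest_steps: int) -> float:
    """Ratio >= 1.0; higher values indicate detours or loops."""
    return actual_steps / shortest_steps
```

An agent that makes one spurious extra call, e.g. expected `[lookup, approve, refund]` against actual `[lookup, search, approve, refund]`, scores 0.75 precision and 1.0 recall, which localizes the failure mode better than a single pass/fail bit.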
Replay Testing: Regression Detection from Production
Replay testing captures real agent trajectories from production traffic, then re-runs them against new agent versions to detect regressions:
- Capture production traces via LangSmith / Phoenix / Braintrust tracing
- Anonymize PII from captured traces (mandatory for UAE PDPL, EU GDPR)
- Curate a replay dataset of representative traces (typically 100-1000)
- When a new agent version is proposed, replay each trace against new and baseline versions
- Flag trajectories where paths differ materially
Replay testing complements golden trajectories. Goldens are curated edge cases; replay covers real-world distribution. Both are needed for comprehensive regression coverage.
Tooling for replay: LangSmith has native replay UX, Phoenix supports it via trace import, Braintrust supports via custom experiments. Custom harnesses are common.
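A custom replay harness can be sketched in a few lines. Here `run_agent` is a hypothetical stand-in for your agent entry point, and the diff is a naive exact match on tool paths; real harnesses substitute semantic comparison for the "differ materially" judgment:

```python
# Minimal replay-regression sketch: re-run captured inputs against a candidate
# agent version and flag trajectories whose tool paths diverge from baseline.
# `run_agent` is a stand-in for your agent entry point, not a real API.

def run_agent(agent, user_input: str) -> list[str]:
    """Stand-in: returns the sequence of tool names the agent invoked."""
    return agent(user_input)

def replay_regressions(replay_set, baseline_agent, candidate_agent):
    """replay_set: anonymized production inputs (PII already stripped).
    Returns the inputs where the candidate's tool path differs."""
    flagged = []
    for user_input in replay_set:
        baseline_path = run_agent(baseline_agent, user_input)
        candidate_path = run_agent(candidate_agent, user_input)
        if candidate_path != baseline_path:  # swap in a semantic diff here
            flagged.append({
                "input": user_input,
                "baseline": baseline_path,
                "candidate": candidate_path,
            })
    return flagged
```

Flagged divergences then go to human review: some are genuine regressions, others are acceptable path changes that become new golden trajectories.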
CI/CD Integration: When Trajectory Tests Run
Production-grade agent programmes run trajectory tests at three gates:
Pre-merge (CI) - fast subset of golden trajectories (typically 20-50) on every PR to catch obvious regressions. Runs in under 2 minutes. Fails the build on material regressions.
Nightly / staging deploy - full golden trajectory suite (500+) plus replay set. Runs comprehensively; takes 10-60 minutes. Gates production deployment.
Continuous in production - production monitoring continuously samples live trajectories and runs them through evaluators. Drift detection fires when quality metrics regress silently (e.g., when the upstream LLM provider updates models).
For CBUAE-regulated UAE banks, the continuous-production gate produces audit evidence - demonstrated ongoing evaluation of agent quality, not point-in-time validation.
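The pre-merge gate can be sketched as a tagged subset selector plus a pass-rate threshold. The dataset fields, the `ci-fast` tag, and the 95% threshold are illustrative assumptions, not a prescribed standard:

```python
# Sketch of the pre-merge CI gate: run a small tagged subset of golden
# trajectories on every PR and fail the build below a pass-rate threshold.

def select_ci_subset(goldens: list[dict], tag: str = "ci-fast",
                     limit: int = 50) -> list[dict]:
    """Pick the small tagged subset that runs on every PR (fast gate)."""
    return [g for g in goldens if tag in g.get("tags", [])][:limit]

def gate(results: list[bool], threshold: float = 0.95) -> bool:
    """True (gate passes) when the pass rate meets the threshold."""
    if not results:
        return False  # an empty run should never pass the gate
    return sum(results) / len(results) >= threshold
```

The nightly gate reuses the same logic over the full golden suite plus the replay set, typically with a stricter threshold and per-metric breakdowns rather than a single pass rate.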
CBUAE AI Guidance and Trajectory Evidence
The February 2026 CBUAE AI Guidance expects licensed financial institutions to maintain ongoing evaluation evidence for every production AI feature. For agent deployments specifically, inspectors increasingly ask about:
- Model inventory including each agent and its trajectory evaluation methodology
- Human-in-the-loop classification per decision type (HITL, HOTL, advisory, automated) with measured gate adherence
- Ongoing monitoring producing trajectory-quality metrics over time
- Drift response documenting how the institution detects and responds to quality regressions (especially after upstream model updates)
- Board-level reporting including trajectory metrics alongside other AI risk signals
Trajectory testing produces the machine-readable evidence that makes this reporting tractable. Without it, CBUAE compliance for agent deployments reduces to narrative documentation - hard to maintain and harder to defend during inspection.
See our CBUAE AI Guidance for UAE banks for the broader regulatory framework.
Recommended Stacks
Early-stage AI startup (first agent in production)
- LangGraph + LangSmith if building on LangChain
- Claude Agent SDK + Anthropic evals if Anthropic-first
- Start with 20-50 golden trajectories covering core flows
- 3 core metrics: trajectory accuracy, tool-call precision, cost per trajectory
Mid-stage AI-native product (Series B-C)
- LangSmith or Braintrust based on framework mix
- Arize Phoenix for production observability
- 100-300 golden trajectories with replay testing from production
- 6-8 trajectory metrics tracked over time
- CI gate + nightly gate + continuous production monitoring
Regulated UAE enterprise (CBUAE-regulated bank, DFSA firm, VARA-licensed)
- Arize Phoenix self-hosted for data residency
- Or Braintrust / LangSmith with UAE / EU region and explicit residency attestation
- 500+ golden trajectories including regulatory-critical scenarios
- 8+ trajectory metrics with HITL gate adherence as audit-critical
- Replay testing with PDPL-compliant anonymization
- Board-visible trajectory dashboards
- Quarterly regulator-facing trajectory reports
Microsoft-shop enterprise
- Semantic Kernel + custom trajectory harnesses
- Or Azure AI evaluations (emerging capability)
- Azure-native observability (App Insights, Azure Monitor) for production
Common Failure Modes and How Trajectory Testing Catches Them
Silent upstream model regression - OpenAI or Anthropic updates the model; your agent still works on simple cases but starts taking 1-2 extra steps on complex queries. Replay testing against captured baselines catches this within hours of the model update.
Prompt injection via tool response - adversarial text in a tool’s return value hijacks the agent’s next action. Golden trajectories including adversarial tool responses detect whether defences hold.
Approval gate circumvention - prompt tweak subtly changes the agent’s threshold for escalating to human approval. HITL gate adherence metric catches the regression.
Cost explosion via retry loops - a new error pattern causes the agent to retry indefinitely. Cost-per-trajectory metric flags the pattern.
Regional regression for non-English inputs - Arabic or bilingual inputs produce different trajectories than English. Golden trajectories split by language catch the divergence.
Tool version change - an internal API changes its response schema; agent still parses the response but picks up wrong fields. Tool-call correctness metrics catch the subtle failure.
How genai.qa Delivers Agent Trajectory Testing
genai.qa runs Agent Trajectory Testing engagements as fixed-scope sprints:
- 5-day Agent Trajectory Testing Sprint - evaluates agent architecture, deploys trajectory-testing harness on your framework of choice (LangSmith, Braintrust, Arize Phoenix), curates 30-50 golden trajectories covering core flows, establishes 5-8 metrics with CI integration, trains engineering team
- Agent Trajectory Testing Retainer - ongoing golden-trajectory curation, replay testing operation, drift response, quarterly regulator-facing reports
- CBUAE / EU AI Act / FDA SaMD Compliance Evidence Automation - pre-built evaluation harnesses mapped to specific regulatory principles, producing examination-ready evidence
For Claude Agent SDK deployments we integrate natively with Anthropic’s agent eval patterns. For LangGraph we leverage LangSmith. For multi-framework or data-residency-sensitive deployments we deploy Arize Phoenix self-hosted.
Book a free 30-minute discovery call to scope your agent trajectory testing engagement with genai.qa.
Related Reading
- How to Test AI Agents - broader agent testing methodology including safety and functional concerns
- OWASP LLM Top 10 Testing Checklist - security-focused testing complementary to trajectory evaluation
- Promptfoo vs DeepEval vs RAGAS - single-turn LLM evaluation comparison (the other half of the evaluation stack)
- LLM Evaluation Framework Benchmark (aiml.qa) - comprehensive single-turn LLM eval framework comparison
- AI Agent Framework Comparison (nomadx.ae) - LangGraph vs CrewAI vs Claude Agent SDK - the agent frameworks trajectory testing evaluates
- CBUAE AI Guidance for UAE Banks (mlai.ae) - regulatory framework requiring trajectory-level evidence
Frequently Asked Questions
What is AI agent trajectory testing?
AI agent trajectory testing is the practice of evaluating the multi-step path an agent takes to reach its output - not just whether the final output is correct, but whether the tool calls, state transitions, approval gates, and error-recovery behaviour along the way were appropriate. It is distinct from single-turn LLM evaluation (RAGAS, DeepEval) which scores one input/output pair, and distinct from functional testing which assumes deterministic behaviour. Trajectory testing is the defining evaluation discipline for production AI agents in 2026.
Why is trajectory testing different from LLM evaluation?
LLM evaluation asks 'was the final answer correct?'. Trajectory testing asks 'was the path correct?'. An agent can produce a correct answer via the wrong tool call, using the wrong data source, skipping an approval gate, or burning through a 100x cost budget - all of which are failures invisible to LLM evaluation but material in production. Agents in regulated domains (CBUAE banks, FDA SaMD, EU AI Act high-risk) require trajectory-level evidence, not just output correctness.
What is a golden trajectory?
A golden trajectory is a known-correct multi-step execution trace for a specific user query - including expected tool calls, expected arguments, expected intermediate state, expected approval gates, and expected final output. Golden trajectories form the evaluation dataset against which agent versions are compared. Creating them is expensive (requires subject-matter-expert review) but they become the canonical correctness signal. Most mature agent teams curate 50-500 golden trajectories covering their high-impact use cases.
LangSmith vs Braintrust for agent evaluation - which should I use?
Different strengths. LangSmith is tightly integrated with LangChain and LangGraph - if you're building on that stack, LangSmith trace capture and evaluation are zero-friction. Braintrust is framework-agnostic and has a polished experiment-comparison UX - ideal when you're mixing frameworks or want a centralized evaluation store separate from your agent implementation. For LangGraph agents: LangSmith. For multi-framework or non-LangChain agents: Braintrust. Arize Phoenix is the OSS-first alternative to both.
What is Arize Phoenix?
Arize Phoenix is an open-source (Apache 2.0) observability and evaluation platform for LLM applications including agents. Strong at trace capture via OpenTelemetry semantic conventions for LLM operations, trajectory-level span analysis, and drift detection in production. Pairs well with Arize AI's commercial platform for enterprise deployments. For UAE enterprises needing self-hosted data residency, Phoenix is often the cleanest open-source choice over SaaS alternatives like LangSmith or Braintrust.
What metrics should I use for trajectory testing?
Key trajectory metrics for 2026: (1) Trajectory accuracy vs golden - exact or semantic match; (2) Tool-call precision / recall - did the agent call the right tools; (3) Path efficiency - how many steps vs shortest correct path; (4) Cost efficiency - tokens, tool calls, wall-clock per successful trajectory; (5) Safety metrics - did the agent refuse inappropriate actions, resist prompt injection; (6) Recovery metrics - success rate after injected failures; (7) HITL gate adherence - did the agent correctly escalate consequential actions. No single metric suffices - production programmes track 5-8 metrics per agent.
How does trajectory replay testing work?
Replay testing captures real agent trajectories from production traffic, then re-runs them against new agent versions to detect regressions. When the new agent takes a materially different path than the baseline, the regression is flagged for review. Useful for catching silent quality degradations when upstream LLM providers update models, when prompts change, or when tool implementations evolve. Meticulous pioneered the pattern for web applications; newer tools (Anthropic's replay evals, custom harnesses on top of LangSmith/Braintrust) are bringing it to agent testing.
Can I use OpenAI Evals or Anthropic's evaluation tools for agent trajectories?
OpenAI Evals (the open-source framework) has agent-evaluation support for OpenAI Assistants API agents. Anthropic has published agent evaluation patterns for Claude Agent SDK and maintains internal evaluation harnesses. Both work for their respective LLM providers but are less mature than framework-agnostic tools (LangSmith, Braintrust, Phoenix) for multi-provider or hybrid scenarios. For Anthropic-only or OpenAI-only deployments, provider-native evals reduce vendor count. For mixed-provider deployments, framework-agnostic tools win.
Break It Before They Do.
Book a free 30-minute GenAI QA scope call. We review your AI application, identify the top risks, and show you exactly what to test before you ship.
Talk to an Expert