AI Agent Trajectory Testing 2026: LangSmith vs Braintrust vs Arize Phoenix vs Galileo
Agent trajectory testing compared for 2026 - LangSmith, Braintrust, Arize Phoenix, Galileo, Anthropic Agent evals, OpenAI Evals, DeepEval. Golden trajectories, LLM-as-judge at trajectory level, tool-call precision/recall, replay testing, and CBUAE AI Guidance evidence.
AI agent trajectory testing is the evaluation discipline that emerged in 2024 and matured into production-default practice in 2026. Where traditional LLM evaluation scores a single input/output pair, trajectory testing evaluates the multi-step path an agent takes to reach its output - tool calls, state transitions, approval gates, error recovery, loop behaviour, and budget adherence.
For AI products shipping real agents - tool-using, state-maintaining, multi-step - trajectory evaluation is the difference between “we shipped an agent and hope it works” and “we have measured evidence that this agent behaves correctly across its failure modes”. For regulated deployments (CBUAE AI Guidance, EU AI Act Article 15, FDA SaMD), trajectory-level evidence is increasingly expected, not just output correctness.
This guide compares the 7 dominant agent trajectory testing tools in 2026 - LangSmith, Braintrust, Arize Phoenix, Galileo, Anthropic Claude Agent SDK evals, OpenAI Evals, DeepEval agent metrics - and maps the evaluation methodology most production teams converge on.
Why Trajectory Testing Matters
LLM evaluation asks: was the final answer correct?
Trajectory testing asks: was the path correct?
Consider a customer-service agent asked to process a refund:
- Correct output via wrong path: the agent calls the wrong internal API and refunds from the wrong account, eventually arriving at a refund that looks correct to the customer but creates a reconciliation nightmare next quarter. Single-turn LLM evaluation scores this as success.
- Correct path but missed approval gate: agent performs a refund above the threshold requiring human approval without escalating. Output is correct; operational control failed.
- Correct answer but 45x the cost: agent loops between retrieval and planning because the prompt doesn’t include a stop condition. Customer gets the right refund; the engagement cost $1.80 instead of $0.04.
Each of these is invisible to output-level evaluation. Each is material in production. Each becomes an audit finding in regulated environments.
Trajectory testing catches these failures because it evaluates what happened along the way, not just what came out the other end.
The Distinction from Single-Turn LLM Evaluation
Single-turn LLM evaluation (RAGAS, DeepEval, Promptfoo, much of Braintrust historically) scores one input against one output:
- Faithfulness of the answer to retrieved context
- Relevance of the answer to the question
- Hallucination detection at the output
- Toxicity / bias detection
These are necessary but not sufficient for agents. Agent-specific concerns:
- Tool-call correctness - did the agent select the right tool?
- Tool-argument correctness - did it pass the right arguments?
- Tool-call sequence - is the order of operations appropriate?
- State management - did intermediate state update correctly across steps?
- Approval / HITL gates - were consequential actions escalated appropriately?
- Error recovery - when a tool failed, did the agent retry / degrade / escalate correctly?
- Loop detection - did the agent avoid pathological loops?
- Budget adherence - did it stay within token / call / time budgets?
- Safety boundaries - did it refuse inappropriate actions?
For comprehensive agent evaluation you need both layers - single-turn LLM eval for output quality + trajectory-level eval for path correctness. See our LLM evaluation framework benchmark for the single-turn side.
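The path-level concerns above can be expressed as programmatic checks over a recorded trajectory. The sketch below assumes a minimal, hypothetical trajectory record (not any vendor's schema) and shows budget-adherence, loop-detection, and HITL-gate checks:

```python
# Minimal sketch of path-level checks over a recorded agent trajectory.
# The TrajectoryStep / Trajectory shapes are illustrative, not any tool's API.
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    tool: str        # tool the agent invoked at this step
    arguments: dict  # arguments it passed
    tokens_used: int # cost attribution for budget checks

@dataclass
class Trajectory:
    steps: list[TrajectoryStep] = field(default_factory=list)

def check_budget(traj: Trajectory, max_tokens: int, max_steps: int) -> bool:
    """Budget adherence: total tokens and step count stay within limits."""
    total = sum(s.tokens_used for s in traj.steps)
    return total <= max_tokens and len(traj.steps) <= max_steps

def check_no_tight_loop(traj: Trajectory, window: int = 3) -> bool:
    """Loop detection (crude): flag `window` consecutive identical tool calls."""
    tools = [s.tool for s in traj.steps]
    return all(
        len(set(tools[i:i + window])) > 1
        for i in range(len(tools) - window + 1)
    )

def check_hitl_gate(traj: Trajectory, gated_tool: str, approval_tool: str) -> bool:
    """HITL gate adherence: every gated call must follow an approval step."""
    approved = False
    for step in traj.steps:
        if step.tool == approval_tool:
            approved = True
        elif step.tool == gated_tool and not approved:
            return False
    return True
```

Real harnesses layer semantic comparison and LLM-as-judge evaluators on top of deterministic checks like these, but the deterministic layer is cheap and runs on every trajectory.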
Golden Trajectories: The Evaluation Dataset
The 2026 standard for agent trajectory evaluation is golden trajectories - curated, known-correct multi-step execution traces that form the evaluation dataset.
A golden trajectory captures:
- The original user input
- Expected tool calls in sequence (tool name, arguments, intermediate outputs)
- Expected state transitions between steps
- Expected approval gates with pass/fail criteria
- Expected final output
Creating golden trajectories is expensive - each requires subject-matter-expert review of what “correct” means for a specific query. But they become the canonical correctness signal. Mature agent teams curate 50-500 golden trajectories covering:
- The most common 80% of use cases (breadth)
- Known failure modes (edge cases)
- Regulatory-critical scenarios (high-stakes decisions)
- Adversarial inputs (prompt injection, malformed inputs)
Golden trajectories are version-controlled, reviewed periodically, and expanded as production reveals new failure modes. Think of them as the integration-test suite for your agent - but with semantic rather than exact-match comparison.
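A golden trajectory can be stored as a plain version-controlled record. The sketch below uses illustrative field names (not a standard schema) and a deliberately strict comparison; production harnesses usually relax the exact match with LLM-as-judge semantic matching:

```python
# Illustrative shape for a version-controlled golden-trajectory dataset entry.
# Field names are assumptions for this sketch, not a standard schema.
import json

golden = {
    "id": "refund-over-threshold-001",
    "version": 3,
    "input": "Refund order 8812, it arrived damaged",
    "expected_tool_calls": [
        {"tool": "lookup_order", "args": {"order_id": "8812"}},
        {"tool": "request_approval", "args": {"reason": "refund over threshold"}},
        {"tool": "issue_refund", "args": {"order_id": "8812"}},
    ],
    "expected_gates": ["human_approval_before_refund"],
    "expected_final_output_contains": ["refund", "8812"],
    "tags": ["core-flow", "regulatory-critical"],
}

def matches_golden(actual_tools: list[str], final_output: str, entry: dict) -> bool:
    """Exact-match comparison on tool sequence plus keyword check on output.
    Swap in an LLM-as-judge comparator where exact match is too strict."""
    expected_tools = [c["tool"] for c in entry["expected_tool_calls"]]
    output_ok = all(k in final_output.lower()
                    for k in entry["expected_final_output_contains"])
    return actual_tools == expected_tools and output_ok
```

Because entries like this serialize cleanly to JSON, they diff well in code review, which is what makes periodic SME review of the golden set practical.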
The 7 Trajectory-Testing Tools
LangSmith - The LangChain-Native Leader
LangSmith (LangChain Inc., commercial SaaS with generous free tier) is the most-adopted trajectory testing platform in 2026 for LangChain and LangGraph agents.
Strengths:
- Zero-friction LangChain / LangGraph integration - traces capture automatically from instrumented agents
- Rich trace visualization - graph-based UI showing every step with inputs, outputs, state, and latency
- Evaluation datasets - version-controlled golden-trajectory datasets
- LLM-as-judge evaluators - built-in trajectory-level evaluators using strong reference models
- Human annotation workflow for golden-trajectory curation
- Production monitoring - the same trace infrastructure works in dev and production
Trade-offs:
- LangChain-centric - non-LangChain agents can use LangSmith but lose the integration ease
- SaaS-only - verify UAE/EU region for residency-sensitive workloads
- Pricing scales with trace volume
Fit: LangGraph-based production agents. Default choice for teams already on the LangChain stack.
Braintrust - The Framework-Agnostic Evaluation Platform
Braintrust is the commercial evaluation platform positioned as framework-agnostic - works with LangChain, LlamaIndex, Claude Agent SDK, OpenAI Agents SDK, custom agents.
Strengths:
- Framework-agnostic SDK - instrument any agent, route to Braintrust
- Experiment comparison UX - side-by-side comparison of trajectory evaluations across agent versions
- Prompt-management features alongside evaluation
- Custom metric authoring in Python/TypeScript
- Expanding production observability - historically eval-focused, now adding trace monitoring
Trade-offs:
- Commercial SaaS - data residency requires region verification
- Younger than LangSmith in the eval space (though growing fast)
Fit: multi-framework agent deployments; teams wanting centralized eval management decoupled from their agent framework.
Arize Phoenix - The OSS-First Observability + Eval Platform
Arize Phoenix (Apache 2.0 open source, with commercial Arize AI platform) is the OSS-first option combining observability with evaluation.
Strengths:
- Open source (Apache 2.0) - self-host for data-residency control
- OpenTelemetry-native - uses OTel semantic conventions for LLM operations
- Framework-agnostic - works with any instrumented agent
- Strong trajectory analysis - span-level trace analysis, drift detection, evaluation-as-code
- Production observability focus - built for ongoing production monitoring, not just dev-time eval
Trade-offs:
- Less polished eval UX than Braintrust or LangSmith
- Commercial Arize AI needed for enterprise features (team collaboration, compliance reporting)
Fit: UAE enterprises with data-residency constraints; teams that want OSS-first with commercial upgrade path; OpenTelemetry-native observability strategies.
Galileo - The Enterprise AI Evaluation Platform
Galileo is a commercial AI evaluation platform with strong enterprise features and growing trajectory-testing support.
Strengths:
- Enterprise compliance - SOC 2, ISO 27001, HIPAA
- Hallucination detection depth
- Trajectory insights - emerging capability for multi-step agent analysis
- Multi-modal support - extends to vision / audio agent outputs
Trade-offs:
- Newer to the trajectory space than LangSmith or Braintrust
- Commercial-only; pricing is enterprise-tier
Fit: enterprises wanting a commercial AI eval platform with strong compliance story; teams outgrowing Braintrust / Phoenix and wanting enterprise SaaS.
Anthropic Claude Agent SDK Evals
Anthropic’s Claude Agent SDK ships with evaluation tooling, and Anthropic has published comprehensive evaluation patterns for agents built on Claude.
Strengths:
- Claude-native - deep integration with Sonnet 4.6, Opus 4.7, Haiku 4.5 agent workflows
- MCP-aware - handles Model Context Protocol trajectories natively
- Computer Use evaluation - the only mature evaluation harness for visual GUI agent trajectories
- Anthropic-produced eval methodology - widely cited research-grade evaluation patterns
Trade-offs:
- Claude-focused; less applicable for OpenAI / Google / open-model agents
- Ecosystem less mature than LangSmith
Fit: production Claude Agent SDK deployments; Computer Use agents; teams aligned with Anthropic tooling.
OpenAI Evals
OpenAI Evals (open source framework from OpenAI) has supported agent evaluation since the Assistants API era and has matured through 2025-2026 for OpenAI Agents SDK deployments.
Strengths:
- OpenAI-native - deep integration with GPT-5, GPT-4o, o3 and Assistants features
- Open source (MIT) - self-host compatible
- Growing agent-specific evaluators
- Code Interpreter / File Search evaluation - for Assistants-API-based agents
Trade-offs:
- OpenAI-focused; less applicable to non-OpenAI agents
- Less polished UX than commercial alternatives
Fit: OpenAI Agents SDK deployments; teams wanting OSS eval with OpenAI provider alignment.
DeepEval Agent Metrics
DeepEval (Confident AI, open source) added agent-specific metrics through 2025 - ToolCorrectnessMetric, TaskCompletionMetric, trajectory-level custom metrics via the GEval pattern.
Strengths:
- Open source - Python library, pytest-integrated
- Developer-first - evaluations live alongside application code
- Framework-agnostic - works with any agent framework
- Confident AI platform for team features
Trade-offs:
- Narrower than observability-integrated platforms (LangSmith, Phoenix)
- Primarily dev-time evaluation; less production-monitoring focus
Fit: teams wanting dev-time agent eval in pytest suites; OSS-first organizations.
Comparison Matrix
| Tool | OSS | Framework | Trajectory Focus | Production Obs | Data Residency | Fit |
|---|---|---|---|---|---|---|
| LangSmith | - | LangChain native | Strong | Strong | SaaS regions | LangGraph default |
| Braintrust | - | Agnostic | Strong | Growing | SaaS regions | Multi-framework |
| Arize Phoenix | Yes (Apache 2.0) | Agnostic (OTel) | Strong | Strong | Self-host | UAE residency, OSS-first |
| Galileo | - | Agnostic | Emerging | Strong | SaaS regions | Enterprise compliance |
| Anthropic Claude Agent SDK Evals | Partial | Claude native | Native | Anthropic platform | Anthropic regions | Claude + Computer Use |
| OpenAI Evals | Yes (MIT) | OpenAI native | Good | OpenAI platform | OpenAI regions | OpenAI Agents SDK |
| DeepEval | Yes (Apache 2.0) | Agnostic | Good (library) | Limited | Self-host | pytest-integrated dev eval |
Trajectory Evaluation Metrics
Mature 2026 agent programmes track 5-8 trajectory metrics per agent. The most common:
Trajectory accuracy vs golden - exact or semantic match against the golden trajectory. Use LLM-as-judge for semantic comparison when exact match is too strict. Primary metric for regression detection.
Tool-call precision and recall - did the agent call the right tools (precision) and did it call all the tools it should have (recall)? Computed over expected vs actual tool sequences.
Path efficiency - steps taken vs steps in shortest correct path. High values indicate unnecessary loops or reasoning detours.
Cost efficiency - tokens, API calls, wall-clock time per successful trajectory. Critical metric - cost regression is a silent killer in agent deployments.
Safety refusal rate - proportion of adversarial or out-of-scope inputs correctly refused. Target near-100% for specific categories (unsafe requests, prompt injection, out-of-policy actions).
Recovery success rate - when injected failures occur (tool 500, network timeout, malformed response), does the agent recover gracefully? Measured via chaos-injection evaluation harnesses.
HITL gate adherence - proportion of consequential actions correctly escalated to human approval. For CBUAE-regulated banks, this metric is audit-critical.
Loop detection rate - proportion of inputs that trigger pathological loops, as detected by runtime step or time limits.
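Several of these metrics reduce to small computations over expected vs actual tool sequences. A minimal sketch, assuming tool calls are represented by name (DeepEval's ToolCorrectnessMetric and the platform evaluators compute variants of the same quantities):

```python
# Tool-call precision/recall and path efficiency, following the definitions
# above. Multiset overlap handles repeated calls to the same tool.
from collections import Counter

def tool_call_precision_recall(
    expected: list[str], actual: list[str]
) -> tuple[float, float]:
    """Precision: fraction of actual calls that were expected.
    Recall: fraction of expected calls that were actually made."""
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual) if actual else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall

def path_efficiency(actual_steps: int, shortest_steps: int) -> float:
    """Ratio >= 1.0; higher values indicate detours or loops."""
    return actual_steps / shortest_steps
```

An agent that makes one spurious extra call, e.g. expected `[lookup, approve, refund]` against actual `[lookup, search, approve, refund]`, scores 0.75 precision and 1.0 recall, which localizes the failure mode better than a single pass/fail bit.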
Replay Testing: Regression Detection from Production
Replay testing captures real agent trajectories from production traffic, then re-runs them against new agent versions to detect regressions:
- Capture production traces via LangSmith / Phoenix / Braintrust tracing
- Anonymize PII from captured traces (mandatory for UAE PDPL, EU GDPR)
- Curate a replay dataset of representative traces (typically 100-1000)
- When a new agent version is proposed, replay each trace against new and baseline versions
- Flag trajectories where paths differ materially
Replay testing complements golden trajectories. Goldens are curated edge cases; replay covers real-world distribution. Both are needed for comprehensive regression coverage.
Tooling for replay: LangSmith has native replay UX, Phoenix supports it via trace import, Braintrust supports via custom experiments. Custom harnesses are common.
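A custom replay harness can be sketched in a few lines. Here `run_agent` is a hypothetical stand-in for your agent entry point, and the diff is a naive exact match on tool paths; real harnesses substitute semantic comparison for the "differ materially" judgment:

```python
# Minimal replay-regression sketch: re-run captured inputs against a candidate
# agent version and flag trajectories whose tool paths diverge from baseline.
# `run_agent` is a stand-in for your agent entry point, not a real API.

def run_agent(agent, user_input: str) -> list[str]:
    """Stand-in: returns the sequence of tool names the agent invoked."""
    return agent(user_input)

def replay_regressions(replay_set, baseline_agent, candidate_agent):
    """replay_set: anonymized production inputs (PII already stripped).
    Returns the inputs where the candidate's tool path differs."""
    flagged = []
    for user_input in replay_set:
        baseline_path = run_agent(baseline_agent, user_input)
        candidate_path = run_agent(candidate_agent, user_input)
        if candidate_path != baseline_path:  # swap in a semantic diff here
            flagged.append({
                "input": user_input,
                "baseline": baseline_path,
                "candidate": candidate_path,
            })
    return flagged
```

Flagged divergences then go to human review: some are genuine regressions, others are acceptable path changes that become new golden trajectories.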
CI/CD Integration: When Trajectory Tests Run
Production-grade agent programmes run trajectory tests at three gates:
Pre-merge (CI) - fast subset of golden trajectories (typically 20-50) on every PR to catch obvious regressions. Runs in under 2 minutes. Fails the build on material regressions.
Nightly / staging deploy - full golden trajectory suite (500+) plus replay set. Runs comprehensively; takes 10-60 minutes. Gates production deployment.
Continuous in production - production monitoring continuously samples live trajectories and runs them through evaluators. Drift detection fires when quality metrics regress silently (e.g., when the upstream LLM provider updates models).
For CBUAE-regulated UAE banks, the continuous-production gate produces audit evidence - demonstrated ongoing evaluation of agent quality, not point-in-time validation.
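The pre-merge gate can be sketched as a tagged subset selector plus a pass-rate threshold. The dataset fields, the `ci-fast` tag, and the 95% threshold are illustrative assumptions, not a prescribed standard:

```python
# Sketch of the pre-merge CI gate: run a small tagged subset of golden
# trajectories on every PR and fail the build below a pass-rate threshold.

def select_ci_subset(goldens: list[dict], tag: str = "ci-fast",
                     limit: int = 50) -> list[dict]:
    """Pick the small tagged subset that runs on every PR (fast gate)."""
    return [g for g in goldens if tag in g.get("tags", [])][:limit]

def gate(results: list[bool], threshold: float = 0.95) -> bool:
    """True (gate passes) when the pass rate meets the threshold."""
    if not results:
        return False  # an empty run should never pass the gate
    return sum(results) / len(results) >= threshold
```

The nightly gate reuses the same logic over the full golden suite plus the replay set, typically with a stricter threshold and per-metric breakdowns rather than a single pass rate.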
CBUAE AI Guidance and Trajectory Evidence
The February 2026 CBUAE AI Guidance expects licensed financial institutions to maintain ongoing evaluation evidence for every production AI feature. For agent deployments specifically, inspectors increasingly ask about:
- Model inventory including each agent and its trajectory evaluation methodology
- Human-in-the-loop classification per decision type (HITL, HOTL, advisory, automated) with measured gate adherence
- Ongoing monitoring producing trajectory-quality metrics over time
- Drift response documenting how the institution detects and responds to quality regressions (especially after upstream model updates)
- Board-level reporting including trajectory metrics alongside other AI risk signals
Trajectory testing produces the machine-readable evidence that makes this reporting tractable. Without it, CBUAE compliance for agent deployments reduces to narrative documentation - hard to maintain and harder to defend during inspection.
See our CBUAE AI Guidance for UAE banks for the broader regulatory framework.
Recommended Stacks
Early-stage AI startup (first agent in production)
- LangGraph + LangSmith if building on LangChain
- Claude Agent SDK + Anthropic evals if Anthropic-first
- Start with 20-50 golden trajectories covering core flows
- 3 core metrics: trajectory accuracy, tool-call precision, cost per trajectory
Mid-stage AI-native product (Series B-C)
- LangSmith or Braintrust based on framework mix
- Arize Phoenix for production observability
- 100-300 golden trajectories with replay testing from production
- 6-8 trajectory metrics tracked over time
- CI gate + nightly gate + continuous production monitoring
Regulated UAE enterprise (CBUAE-regulated bank, DFSA firm, VARA-licensed)
- Arize Phoenix self-hosted for data residency
- Or Braintrust / LangSmith with UAE / EU region and explicit residency attestation
- 500+ golden trajectories including regulatory-critical scenarios
- 8+ trajectory metrics with HITL gate adherence as audit-critical
- Replay testing with PDPL-compliant anonymization
- Board-visible trajectory dashboards
- Quarterly regulator-facing trajectory reports
Microsoft-shop enterprise
- Semantic Kernel + custom trajectory harnesses
- Or Azure AI evaluations (emerging capability)
- Azure-native observability (App Insights, Azure Monitor) for production
Common Failure Modes and How Trajectory Testing Catches Them
Silent upstream model regression - OpenAI or Anthropic updates the model; your agent still works on simple cases but starts taking 1-2 extra steps on complex queries. Replay testing against captured baselines catches this within hours of the model update.
Prompt injection via tool response - adversarial text in a tool’s return value hijacks the agent’s next action. Golden trajectories including adversarial tool responses detect whether defences hold.
Approval gate circumvention - prompt tweak subtly changes the agent’s threshold for escalating to human approval. HITL gate adherence metric catches the regression.
Cost explosion via retry loops - a new error pattern causes the agent to retry indefinitely. Cost-per-trajectory metric flags the pattern.
Regional regression for non-English inputs - Arabic or bilingual inputs produce different trajectories than English. Golden trajectories split by language catch the divergence.
Tool version change - an internal API changes its response schema; agent still parses the response but picks up wrong fields. Tool-call correctness metrics catch the subtle failure.
How genai.qa Delivers Agent Trajectory Testing
genai.qa runs Agent Trajectory Testing engagements as fixed-scope sprints:
- 5-day Agent Trajectory Testing Sprint - evaluates agent architecture, deploys trajectory-testing harness on your framework of choice (LangSmith, Braintrust, Arize Phoenix), curates 30-50 golden trajectories covering core flows, establishes 5-8 metrics with CI integration, trains engineering team
- Agent Trajectory Testing Retainer - ongoing golden-trajectory curation, replay testing operation, drift response, quarterly regulator-facing reports
- CBUAE / EU AI Act / FDA SaMD Compliance Evidence Automation - pre-built evaluation harnesses mapped to specific regulatory principles, producing examination-ready evidence
For Claude Agent SDK deployments we integrate natively with Anthropic’s agent eval patterns. For LangGraph we leverage LangSmith. For multi-framework or data-residency-sensitive deployments we deploy Arize Phoenix self-hosted.
Book a free 30-minute discovery call to scope your agent trajectory testing engagement with genai.qa.
Related Reading
- How to Test AI Agents - broader agent testing methodology including safety and functional concerns
- OWASP LLM Top 10 Testing Checklist - security-focused testing complementary to trajectory evaluation
- Promptfoo vs DeepEval vs RAGAS - single-turn LLM evaluation comparison (the other half of the evaluation stack)
- LLM Evaluation Framework Benchmark (aiml.qa) - comprehensive single-turn LLM eval framework comparison
- AI Agent Framework Comparison (nomadx.ae) - LangGraph vs CrewAI vs Claude Agent SDK - the agent frameworks trajectory testing evaluates
- CBUAE AI Guidance for UAE Banks (mlai.ae) - regulatory framework requiring trajectory-level evidence
Frequently Asked Questions
What is AI agent trajectory testing?
AI agent trajectory testing is the practice of evaluating the multi-step path an agent takes to reach its output - not just whether the final output is correct, but whether the tool calls, state transitions, approval gates, and error-recovery behaviour along the way were appropriate. It is distinct from single-turn LLM evaluation (RAGAS, DeepEval) which scores one input/output pair, and distinct from functional testing which assumes deterministic behaviour. Trajectory testing is the defining evaluation discipline for production AI agents in 2026.
Why is trajectory testing different from LLM evaluation?
LLM evaluation asks 'was the final answer correct?'. Trajectory testing asks 'was the path correct?'. An agent can produce a correct answer via the wrong tool call, using the wrong data source, skipping an approval gate, or burning through a 100x cost budget - all of which are failures invisible to LLM evaluation but material in production. Agents in regulated domains (CBUAE banks, FDA SaMD, EU AI Act high-risk) require trajectory-level evidence, not just output correctness.
What is a golden trajectory?
A golden trajectory is a known-correct multi-step execution trace for a specific user query - including expected tool calls, expected arguments, expected intermediate state, expected approval gates, and expected final output. Golden trajectories form the evaluation dataset against which agent versions are compared. Creating them is expensive (requires subject-matter-expert review) but they become the canonical correctness signal. Most mature agent teams curate 50-500 golden trajectories covering their high-impact use cases.
LangSmith vs Braintrust for agent evaluation - which should I use?
Different strengths. LangSmith is tightly integrated with LangChain and LangGraph - if you're building on that stack, LangSmith trace capture and evaluation are zero-friction. Braintrust is framework-agnostic and has a polished experiment-comparison UX - ideal when you're mixing frameworks or want a centralized evaluation store separate from your agent implementation. For LangGraph agents: LangSmith. For multi-framework or non-LangChain agents: Braintrust. Arize Phoenix is the OSS-first alternative to both.
What is Arize Phoenix?
Arize Phoenix is an open-source (Apache 2.0) observability and evaluation platform for LLM applications including agents. Strong at trace capture via OpenTelemetry semantic conventions for LLM operations, trajectory-level span analysis, and drift detection in production. Pairs well with Arize AI's commercial platform for enterprise deployments. For UAE enterprises needing self-hosted data residency, Phoenix is often the cleanest open-source choice over SaaS alternatives like LangSmith or Braintrust.
What metrics should I use for trajectory testing?
Key trajectory metrics for 2026: (1) Trajectory accuracy vs golden - exact or semantic match; (2) Tool-call precision / recall - did the agent call the right tools; (3) Path efficiency - how many steps vs shortest correct path; (4) Cost efficiency - tokens, tool calls, wall-clock per successful trajectory; (5) Safety metrics - did the agent refuse inappropriate actions, resist prompt injection; (6) Recovery metrics - success rate after injected failures; (7) HITL gate adherence - did the agent correctly escalate consequential actions. No single metric suffices - production programmes track 5-8 metrics per agent.
How does trajectory replay testing work?
Replay testing captures real agent trajectories from production traffic, then re-runs them against new agent versions to detect regressions. When the new agent takes a materially different path than the baseline, the regression is flagged for review. Useful for catching silent quality degradations when upstream LLM providers update models, when prompts change, or when tool implementations evolve. Meticulous pioneered the pattern for web applications; newer tools (Anthropic's replay evals, custom harnesses on top of LangSmith/Braintrust) are bringing it to agent testing.
Can I use OpenAI Evals or Anthropic's evaluation tools for agent trajectories?
OpenAI Evals (the open-source framework) has agent-evaluation support for OpenAI Assistants API agents. Anthropic has published agent evaluation patterns for Claude Agent SDK and maintains internal evaluation harnesses. Both work for their respective LLM providers but are less mature than framework-agnostic tools (LangSmith, Braintrust, Phoenix) for multi-provider or hybrid scenarios. For Anthropic-only or OpenAI-only deployments, provider-native evals reduce vendor count. For mixed-provider deployments, framework-agnostic tools win.
Break It Before They Do.
Book a free 30-minute GenAI QA scope call. We review your AI application, identify the top risks, and show you exactly what to test before you ship.
Talk to an Expert