Test What Your Agent Does - Not Just What It Says
A 5-day trajectory testing sprint: curate golden trajectories, deploy the evaluation harness on LangSmith / Braintrust / Arize Phoenix, establish 5-8 metrics, integrate with CI, and produce CBUAE-aligned evidence.
Agent trajectory testing is the evaluation discipline that caught up with production AI agents in 2024-2026. Where traditional LLM evaluation scores a single input/output pair, trajectory testing evaluates the multi-step path an agent takes - tool calls, state transitions, approval gates, error recovery, loops, and budget adherence.
Our 5-day Agent Trajectory Testing Sprint deploys a production-ready trajectory evaluation programme for your agents. By the end of the sprint you have a running evaluation harness on your platform of choice (LangSmith, Braintrust, or self-hosted Arize Phoenix), a curated set of 30-50 golden trajectories, 5-8 metrics tracked over time, and CI integration that gates regressions before they ship.
Why Trajectory Testing Matters
Traditional LLM evaluation asks “was the final answer correct?”. Trajectory testing asks “was the path correct?” - because in production:
- An agent can reach a correct answer via the wrong tool call, producing side effects that surface as problems weeks later.
- An agent can bypass an approval gate that a regulator expects to be enforced.
- An agent can loop pathologically, producing correct output at 40x the intended cost.
- An upstream model update can silently change reasoning patterns, passing output evaluation but failing trajectory evaluation.
For regulated deployments - CBUAE AI Guidance, EU AI Act Article 15, FDA SaMD, ISO/IEC 42001 - trajectory-level evidence is increasingly expected, not just output correctness.
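The difference between the two questions can be made concrete. Below is a minimal sketch of a trajectory-level check that fails on a bypassed approval gate, a blown step budget, or a wrong tool sequence, even when the final answer happens to be correct. The step representation and the gate name `request_human_approval` are illustrative assumptions, not a platform API.

```python
from typing import Dict, List, Tuple

# A recorded step: (tool_name, arguments). Purely illustrative.
Step = Tuple[str, Dict]

REQUIRED_GATE = "request_human_approval"  # hypothetical approval-gate tool

def trajectory_passes(actual: List[Step], expected_tools: List[str],
                      max_steps: int) -> Tuple[bool, str]:
    """Evaluate the path, not just the answer."""
    tools = [name for name, _ in actual]
    # Regulator-expected gate must actually appear in the trace.
    if REQUIRED_GATE in expected_tools and REQUIRED_GATE not in tools:
        return False, "approval gate bypassed"
    # Pathological looping: correct output at many times the intended cost.
    if len(tools) > max_steps:
        return False, f"budget exceeded: {len(tools)} steps"
    # Wrong tool path can leave side effects even when the answer is right.
    if tools != expected_tools:
        return False, f"tool sequence mismatch: {tools}"
    return True, "ok"
```

An output-only evaluator would pass all three failure modes above; the trajectory check catches each one from the trace alone.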
The 5-Day Sprint Structure
Day 1 - Trajectory Architecture Review: understand your agent, identify core flows, classify decision types.
Day 2 - Golden Trajectory Curation: produce 30-50 canonical traces covering 80% of production + known edge cases + regulatory-critical scenarios.
Day 3 - Evaluation Harness Deployment: LangSmith (LangChain-native), Braintrust (agnostic), or Arize Phoenix (self-host). Instrument the agent, import dataset, configure evaluators.
Day 4 - Metrics & Replay: 5-8 trajectory metrics wired up. Replay-testing harness if production traces are available.
Day 5 - CI Integration & Handover: pre-merge + nightly gates, runbook delivery, team training, baseline performance report.
Framework Support
We support all major production agent frameworks in 2026:
- LangChain / LangGraph via LangSmith (zero-friction integration)
- Claude Agent SDK via Anthropic evaluation patterns + Braintrust or Phoenix
- OpenAI Agents SDK via OpenAI Evals + Braintrust or Phoenix
- CrewAI / AutoGen via Braintrust or Arize Phoenix
- Semantic Kernel via custom harness + Phoenix
- Custom / in-house via OpenTelemetry-based instrumentation
Platform selection depends on framework fit plus data-residency requirements. For UAE regulated deployments we typically deploy self-hosted Arize Phoenix; for non-residency-sensitive deployments LangSmith or Braintrust offer the polished SaaS path.
CBUAE AI Guidance Alignment
The February 2026 CBUAE AI Guidance expects licensed financial institutions to maintain ongoing evaluation evidence for every production AI feature. Trajectory testing produces the machine-readable evidence that maps to the 5 CBUAE AI principles:
- Fairness - trajectory metrics split by demographic or customer-segment dimensions
- Transparency - documented trajectory methodology and per-trajectory audit trails
- Accountability - named trajectory owners, defined escalation on metric regressions
- Data Governance - training-data lineage captured at trajectory-trace level
- Human Oversight - HITL gate adherence as an audit-critical metric
For CBUAE-regulated clients, our sprint output includes explicit evidence mapping formatted for inspector review.
Complementary Services
Trajectory testing is one layer of comprehensive agent QA. Pair with:
- Agentic AI Safety Assessment for adversarial/safety testing
- GenAI Red-Team Sprint for broader application adversarial testing
- LLM Evaluation Suite at aiml.qa for model-layer single-turn evaluation
- AI Security Assessment at pentest.ae for security-focused red-teaming
Many enterprise clients engage 2-3 of these services as a coordinated programme, producing comprehensive coverage across trajectory, safety, security, and model-layer concerns.
Book a Discovery Call
Book a free 30-minute discovery call to scope your Agent Trajectory Testing Sprint with a genai.qa specialist. We will review your agent architecture, identify the most impactful evaluation scope, and tailor the sprint to your framework and compliance context.
Engagement Phases
Trajectory Architecture Review
Review agent architecture (LangGraph / Claude Agent SDK / OpenAI Agents SDK / CrewAI / custom), identify the core use cases to evaluate, map the expected tool-call patterns, and classify decision types (HITL / HOTL / advisory / automated). Produce evaluation scope document.
Golden Trajectory Curation
Curate 30-50 golden trajectories covering the top 80% of production queries plus known edge cases, adversarial inputs, and regulatory-critical scenarios. Each trajectory includes expected tool calls, expected intermediate state, expected approval gates, and expected final output.
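One way such a golden trajectory could be stored as version-controlled data is sketched below. The field names and the refund scenario values are illustrative assumptions, not a LangSmith, Braintrust, or Phoenix schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class ExpectedToolCall:
    tool: str
    args: Dict

@dataclass
class GoldenTrajectory:
    trajectory_id: str
    user_input: str
    expected_tool_calls: List[ExpectedToolCall]
    expected_approval_gates: List[str]
    expected_final_output: str
    tags: List[str] = field(default_factory=list)  # e.g. ["edge-case", "regulatory-critical"]

# Hypothetical refund-processing golden, checked into the repo as JSON.
golden = GoldenTrajectory(
    trajectory_id="refund-over-threshold-001",
    user_input="Refund order #8841 for AED 2,300",
    expected_tool_calls=[
        ExpectedToolCall("validate_customer", {"order_id": "8841"}),
        ExpectedToolCall("check_refund_eligibility", {"order_id": "8841"}),
        ExpectedToolCall("request_human_approval", {"amount": 2300}),
        ExpectedToolCall("execute_refund", {"order_id": "8841"}),
    ],
    expected_approval_gates=["request_human_approval"],
    expected_final_output="Refund initiated pending confirmation.",
    tags=["regulatory-critical"],
)

serialized = json.dumps(asdict(golden), indent=2)  # ready for version control
```

Keeping goldens as plain serialisable records makes them diffable in code review and importable into whichever evaluation platform the sprint lands on.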
Evaluation Harness Deployment
Deploy chosen trajectory evaluation platform (LangSmith if LangChain-native, Braintrust for multi-framework, Arize Phoenix for self-hosted residency), instrument the agent, import golden trajectories as evaluation datasets, configure LLM-as-judge evaluators.
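The shape of an LLM-as-judge evaluator is similar across all three platforms. The sketch below abstracts the judge model behind a callable so the same rubric can be wired into any of them; the rubric wording and the `judge` callable are assumptions for illustration, not a platform API.

```python
from typing import Callable

# Illustrative rubric: score a single agent step for justification.
RUBRIC = (
    "Score 1 if the agent's step is justified by the user's request "
    "and prior tool outputs, else 0. Answer with a single digit."
)

def make_step_evaluator(judge: Callable[[str], str]) -> Callable[[str, str], int]:
    """Wrap a judge-model call (e.g. an LLM client) as a step evaluator."""
    def evaluate_step(user_input: str, step_description: str) -> int:
        prompt = f"{RUBRIC}\n\nRequest: {user_input}\nStep: {step_description}"
        return int(judge(prompt).strip()[0])
    return evaluate_step
```

In practice `judge` would be a model call through the chosen platform's evaluator hooks; here it is injected so the rubric logic stays portable and unit-testable.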
Metrics & Replay Testing
Establish 5-8 trajectory metrics (trajectory accuracy, tool-call precision/recall, path efficiency, cost per trajectory, safety refusal rate, recovery rate, HITL gate adherence). Configure replay testing harness if production traces are available.
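Of the metrics above, tool-call precision/recall is the most mechanical to compute. A minimal sketch, treating the actual and expected tool calls as multisets of tool names (a simplifying assumption; argument-level matching is stricter):

```python
from collections import Counter
from typing import List, Tuple

def tool_call_precision_recall(actual: List[str],
                               expected: List[str]) -> Tuple[float, float]:
    """Precision: fraction of actual calls that were expected.
    Recall: fraction of expected calls that actually happened."""
    a, e = Counter(actual), Counter(expected)
    matched = sum((a & e).values())  # multiset intersection
    precision = matched / max(sum(a.values()), 1)
    recall = matched / max(sum(e.values()), 1)
    return precision, recall
```

Low precision with high recall typically signals spurious extra calls (cost and side-effect risk); high precision with low recall signals skipped steps, which is where gate-bypass failures hide.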
CI Integration & Handover
Integrate trajectory tests into CI pipeline (pre-merge subset + nightly full-suite), produce trajectory evaluation runbook, train engineering team on adding new golden trajectories, deliver baseline report with current agent performance across all metrics.
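The pre-merge/nightly split described above can be driven by golden tags. A hedged sketch, assuming goldens carry a `"premerge"` tag and `run_agent` stands in for your instrumented agent under test (both names are illustrative):

```python
from typing import Callable, Dict, List

def select_goldens(goldens: List[Dict], suite: str) -> List[Dict]:
    """Pre-merge runs the tagged fast subset; nightly runs the full suite."""
    if suite == "premerge":
        return [g for g in goldens if "premerge" in g["tags"]]
    return goldens

def ci_gate(goldens: List[Dict],
            run_agent: Callable[[str], List[str]],
            suite: str = "premerge") -> List[str]:
    """Return IDs of regressed trajectories; a non-empty list fails the build."""
    failures = []
    for g in select_goldens(goldens, suite):
        actual_tool_path = run_agent(g["user_input"])
        if actual_tool_path != g["expected_tool_calls"]:
            failures.append(g["trajectory_id"])
    return failures
```

The CI job then simply exits non-zero when `ci_gate` returns any IDs, so a trajectory regression blocks the merge the same way a failing unit test would.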
Before & After
| Metric | Before | After |
|---|---|---|
| Trajectory Regression Detection | Manual QA only | Automated CI gate |
| Tool-Call Correctness Visibility | Unknown | Precision/Recall tracked |
| Cost per Successful Trajectory | Aggregate only | Per-trajectory attribution |
| HITL Gate Adherence | Unmeasured | Audit-ready metric |
| Upstream Model Drift Detection | Customer complaints | Automated alert within hours |
Frequently Asked Questions
How is agent trajectory testing different from your existing GenAI Red-Team Sprint?
The Red-Team Sprint is adversarial security testing - can an attacker break your agent's safety boundaries? Trajectory testing is quality evaluation - does your agent take the correct multi-step path on legitimate queries? Both are needed for production agents; they answer different questions. Many enterprise clients run both sprints sequentially or in parallel for comprehensive agent QA coverage.
Which agent frameworks do you support?
We support LangChain and LangGraph (via LangSmith), Claude Agent SDK (via Anthropic's evaluation patterns plus Braintrust or Phoenix), OpenAI Agents SDK (via OpenAI Evals plus Braintrust or Phoenix), CrewAI and AutoGen (via Braintrust or Arize Phoenix), and custom agent frameworks. Platform selection depends on your framework and data-residency requirements.
What does a golden trajectory look like in practice?
A golden trajectory captures: the original user input, the expected sequence of tool calls with exact arguments, expected intermediate state values, expected approval gate decisions, and the expected final output. For example, a refund-processing agent trajectory might be: user_query -> validate_customer_tool -> check_refund_eligibility_tool -> verify_approval_threshold_tool -> request_human_approval (at $500+) -> execute_refund_tool -> send_confirmation_tool -> final_output. Each step has expected inputs and outputs.
Can you integrate with CBUAE AI Guidance compliance evidence?
Yes. The February 2026 CBUAE AI Guidance expects ongoing evaluation evidence for every production AI feature. Our trajectory testing engagements map metrics directly to the 5 CBUAE AI principles (Fairness, Transparency, Accountability, Data Governance, Human Oversight). HITL gate adherence maps to the Human Oversight principle; trajectory-consistency metrics support Accountability evidence. We deliver documentation explicitly formatted for CBUAE inspector review.
What is replay testing and do we need it?
Replay testing captures real agent trajectories from production traffic and re-runs them against new agent versions to detect regressions. It complements golden trajectories - goldens are curated edge cases; replay covers real-world distribution. Strongly recommended for agents with meaningful production traffic. We help anonymize captured traces for UAE PDPL compliance before re-use as test data.
How do we maintain golden trajectories over time?
Golden trajectories are version-controlled alongside application code. When your agent's correct behaviour legitimately changes (new feature, policy update), you update the corresponding goldens. Our engagement includes a runbook for trajectory maintenance and a review process. We recommend quarterly golden-set reviews and ongoing addition of 5-10 new trajectories per month based on production failure signals.
Can we run this alongside an ongoing AI QA Retainer?
Yes. Many clients engage the trajectory testing sprint as the foundation, then continue with an AI QA Retainer that maintains the evaluation harness, expands golden trajectories, operates replay testing, responds to drift, and produces quarterly compliance reports. The sprint establishes the foundation; the retainer keeps the trajectory-evaluation programme current.
How does this work with Claude Agent SDK's Computer Use?
Claude's Computer Use (visual GUI agents) has trajectory characteristics beyond traditional tool-use - screen-state transitions, click sequences, window-management. We extend the standard trajectory-testing methodology with Computer-Use-specific evaluators (screen-state correctness, action sequence evaluation, visual hallucination detection) built on top of Anthropic's evaluation patterns. This is one of the more complex agent evaluation scenarios and we have dedicated tooling for it.
Break It Before They Do.
Book a free 30-minute GenAI QA scope call. We review your AI application, identify the top risks, and show you exactly what to test before you ship.
Talk to an Expert