Test What Your Agent Does - Not Just What It Says
A 5-day trajectory testing sprint: curate golden trajectories, deploy the evaluation harness on LangSmith / Braintrust / Arize Phoenix, establish 5-8 metrics, integrate with CI, and produce CBUAE-aligned evidence.
Agent trajectory testing is the evaluation discipline that caught up with production AI agents in 2024-2026. Where traditional LLM evaluation scores a single input/output pair, trajectory testing evaluates the multi-step path an agent takes - tool calls, state transitions, approval gates, error recovery, loops, and budget adherence.
Our 5-day Agent Trajectory Testing Sprint deploys a production-ready trajectory evaluation programme for your agents. By the end of the sprint you have a running evaluation harness on your platform of choice (LangSmith, Braintrust, or self-hosted Arize Phoenix), a curated set of 30-50 golden trajectories, 5-8 metrics tracked over time, and CI integration that gates regressions before they ship.
Why Trajectory Testing Matters
Traditional LLM evaluation asks “was the final answer correct?”. Trajectory testing asks “was the path correct?” - because in production:
- An agent can reach a correct answer via the wrong tool call, producing side effects that surface as problems weeks later.
- An agent can bypass an approval gate that a regulator expects to be enforced.
- An agent can loop pathologically, producing correct output at 40x the intended cost.
- An upstream model update can silently change reasoning patterns, passing output evaluation but failing trajectory evaluation.
For regulated deployments - CBUAE AI Guidance, EU AI Act Article 15, FDA SaMD, ISO/IEC 42001 - trajectory-level evidence is increasingly expected, not just output correctness.
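The difference between the two questions can be made concrete. Below is a minimal sketch of a trajectory-level check that fails on a bypassed approval gate, a blown step budget, or a wrong tool sequence, even when the final answer happens to be correct. The step representation and the gate name `request_human_approval` are illustrative assumptions, not a platform API.

```python
from typing import Dict, List, Tuple

# A recorded step: (tool_name, arguments). Purely illustrative.
Step = Tuple[str, Dict]

REQUIRED_GATE = "request_human_approval"  # hypothetical approval-gate tool

def trajectory_passes(actual: List[Step], expected_tools: List[str],
                      max_steps: int) -> Tuple[bool, str]:
    """Evaluate the path, not just the answer."""
    tools = [name for name, _ in actual]
    # Regulator-expected gate must actually appear in the trace.
    if REQUIRED_GATE in expected_tools and REQUIRED_GATE not in tools:
        return False, "approval gate bypassed"
    # Pathological looping: correct output at many times the intended cost.
    if len(tools) > max_steps:
        return False, f"budget exceeded: {len(tools)} steps"
    # Wrong tool path can leave side effects even when the answer is right.
    if tools != expected_tools:
        return False, f"tool sequence mismatch: {tools}"
    return True, "ok"
```

An output-only evaluator would pass all three failure modes above; the trajectory check catches each one from the trace alone.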
The 5-Day Sprint Structure
Day 1 - Trajectory Architecture Review: understand your agent, identify core flows, classify decision types.
Day 2 - Golden Trajectory Curation: produce 30-50 canonical traces covering 80% of production + known edge cases + regulatory-critical scenarios.
Day 3 - Evaluation Harness Deployment: LangSmith (LangChain-native), Braintrust (agnostic), or Arize Phoenix (self-host). Instrument the agent, import dataset, configure evaluators.
Day 4 - Metrics & Replay: 5-8 trajectory metrics wired up. Replay-testing harness if production traces are available.
Day 5 - CI Integration & Handover: pre-merge + nightly gates, runbook delivery, team training, baseline performance report.
Framework Support
We support all major production agent frameworks in 2026:
- LangChain / LangGraph via LangSmith (zero-friction integration)
- Claude Agent SDK via Anthropic evaluation patterns + Braintrust or Phoenix
- OpenAI Agents SDK via OpenAI Evals + Braintrust or Phoenix
- CrewAI / AutoGen via Braintrust or Arize Phoenix
- Semantic Kernel via custom harness + Phoenix
- Custom / in-house via OpenTelemetry-based instrumentation
Platform selection depends on framework fit plus data-residency requirements. For UAE regulated deployments we typically deploy self-hosted Arize Phoenix; for non-residency-sensitive deployments LangSmith or Braintrust offer the polished SaaS path.
CBUAE AI Guidance Alignment
The February 2026 CBUAE AI Guidance expects licensed financial institutions to maintain ongoing evaluation evidence for every production AI feature. Trajectory testing produces the machine-readable evidence that maps to the 5 CBUAE AI principles:
- Fairness - trajectory metrics split by demographic or customer-segment dimensions
- Transparency - documented trajectory methodology and per-trajectory audit trails
- Accountability - named trajectory owners, defined escalation on metric regressions
- Data Governance - training-data lineage captured at trajectory-trace level
- Human Oversight - HITL gate adherence as an audit-critical metric
For CBUAE-regulated clients, our sprint output includes explicit evidence mapping formatted for inspector review.
Complementary Services
Trajectory testing is one layer of comprehensive agent QA. Pair with:
- Agentic AI Safety Assessment for adversarial/safety testing
- GenAI Red-Team Sprint for broader application adversarial testing
- LLM Evaluation Suite at aiml.qa for model-layer single-turn evaluation
- AI Security Assessment at pentest.ae for security-focused red-teaming
Many enterprise clients engage 2-3 of these services as a coordinated programme, producing comprehensive coverage across trajectory, safety, security, and model-layer concerns.
Book a Discovery Call
Book a free 30-minute discovery call to scope your Agent Trajectory Testing Sprint with a genai.qa specialist. We will review your agent architecture, identify the most impactful evaluation scope, and tailor the sprint to your framework and compliance context.
Engagement Phases
Trajectory Architecture Review
Review agent architecture (LangGraph / Claude Agent SDK / OpenAI Agents SDK / CrewAI / custom), identify the core use cases to evaluate, map the expected tool-call patterns, and classify decision types (HITL / HOTL / advisory / automated). Produce evaluation scope document.
Golden Trajectory Curation
Curate 30-50 golden trajectories covering the top 80% of production queries plus known edge cases, adversarial inputs, and regulatory-critical scenarios. Each trajectory includes expected tool calls, expected intermediate state, expected approval gates, and expected final output.
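One way such a golden trajectory could be stored as version-controlled data is sketched below. The field names and the refund scenario values are illustrative assumptions, not a LangSmith, Braintrust, or Phoenix schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class ExpectedToolCall:
    tool: str
    args: Dict

@dataclass
class GoldenTrajectory:
    trajectory_id: str
    user_input: str
    expected_tool_calls: List[ExpectedToolCall]
    expected_approval_gates: List[str]
    expected_final_output: str
    tags: List[str] = field(default_factory=list)  # e.g. ["edge-case", "regulatory-critical"]

# Hypothetical refund-processing golden, checked into the repo as JSON.
golden = GoldenTrajectory(
    trajectory_id="refund-over-threshold-001",
    user_input="Refund order #8841 for AED 2,300",
    expected_tool_calls=[
        ExpectedToolCall("validate_customer", {"order_id": "8841"}),
        ExpectedToolCall("check_refund_eligibility", {"order_id": "8841"}),
        ExpectedToolCall("request_human_approval", {"amount": 2300}),
        ExpectedToolCall("execute_refund", {"order_id": "8841"}),
    ],
    expected_approval_gates=["request_human_approval"],
    expected_final_output="Refund initiated pending confirmation.",
    tags=["regulatory-critical"],
)

serialized = json.dumps(asdict(golden), indent=2)  # ready for version control
```

Keeping goldens as plain serialisable records makes them diffable in code review and importable into whichever evaluation platform the sprint lands on.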
Evaluation Harness Deployment
Deploy chosen trajectory evaluation platform (LangSmith if LangChain-native, Braintrust for multi-framework, Arize Phoenix for self-hosted residency), instrument the agent, import golden trajectories as evaluation datasets, configure LLM-as-judge evaluators.
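The shape of an LLM-as-judge evaluator is similar across all three platforms. The sketch below abstracts the judge model behind a callable so the same rubric can be wired into any of them; the rubric wording and the `judge` callable are assumptions for illustration, not a platform API.

```python
from typing import Callable

# Illustrative rubric: score a single agent step for justification.
RUBRIC = (
    "Score 1 if the agent's step is justified by the user's request "
    "and prior tool outputs, else 0. Answer with a single digit."
)

def make_step_evaluator(judge: Callable[[str], str]) -> Callable[[str, str], int]:
    """Wrap a judge-model call (e.g. an LLM client) as a step evaluator."""
    def evaluate_step(user_input: str, step_description: str) -> int:
        prompt = f"{RUBRIC}\n\nRequest: {user_input}\nStep: {step_description}"
        return int(judge(prompt).strip()[0])
    return evaluate_step
```

In practice `judge` would be a model call through the chosen platform's evaluator hooks; here it is injected so the rubric logic stays portable and unit-testable.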
Metrics & Replay Testing
Establish 5-8 trajectory metrics (trajectory accuracy, tool-call precision/recall, path efficiency, cost per trajectory, safety refusal rate, recovery rate, HITL gate adherence). Configure replay testing harness if production traces are available.
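Of the metrics above, tool-call precision/recall is the most mechanical to compute. A minimal sketch, treating the actual and expected tool calls as multisets of tool names (a simplifying assumption; argument-level matching is stricter):

```python
from collections import Counter
from typing import List, Tuple

def tool_call_precision_recall(actual: List[str],
                               expected: List[str]) -> Tuple[float, float]:
    """Precision: fraction of actual calls that were expected.
    Recall: fraction of expected calls that actually happened."""
    a, e = Counter(actual), Counter(expected)
    matched = sum((a & e).values())  # multiset intersection
    precision = matched / max(sum(a.values()), 1)
    recall = matched / max(sum(e.values()), 1)
    return precision, recall
```

Low precision with high recall typically signals spurious extra calls (cost and side-effect risk); high precision with low recall signals skipped steps, which is where gate-bypass failures hide.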
CI Integration & Handover
Integrate trajectory tests into CI pipeline (pre-merge subset + nightly full-suite), produce trajectory evaluation runbook, train engineering team on adding new golden trajectories, deliver baseline report with current agent performance across all metrics.
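The pre-merge/nightly split described above can be driven by golden tags. A hedged sketch, assuming goldens carry a `"premerge"` tag and `run_agent` stands in for your instrumented agent under test (both names are illustrative):

```python
from typing import Callable, Dict, List

def select_goldens(goldens: List[Dict], suite: str) -> List[Dict]:
    """Pre-merge runs the tagged fast subset; nightly runs the full suite."""
    if suite == "premerge":
        return [g for g in goldens if "premerge" in g["tags"]]
    return goldens

def ci_gate(goldens: List[Dict],
            run_agent: Callable[[str], List[str]],
            suite: str = "premerge") -> List[str]:
    """Return IDs of regressed trajectories; a non-empty list fails the build."""
    failures = []
    for g in select_goldens(goldens, suite):
        actual_tool_path = run_agent(g["user_input"])
        if actual_tool_path != g["expected_tool_calls"]:
            failures.append(g["trajectory_id"])
    return failures
```

The CI job then simply exits non-zero when `ci_gate` returns any IDs, so a trajectory regression blocks the merge the same way a failing unit test would.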
Before & After
| Metric | Before | After |
|---|---|---|
| Trajectory Regression Detection | Manual QA only | Automated CI gate |
| Tool-Call Correctness Visibility | Unknown | Precision/Recall tracked |
| Cost per Successful Trajectory | Aggregate only | Per-trajectory attribution |
| HITL Gate Adherence | Unmeasured | Audit-ready metric |
| Upstream Model Drift Detection | Customer complaints | Automated alert within hours |
Frequently Asked Questions
How is agent trajectory testing different from your existing GenAI Red-Team Sprint?
The Red-Team Sprint is adversarial security testing - can an attacker break your agent's safety boundaries? Trajectory testing is quality evaluation - does your agent take the correct multi-step path on legitimate queries? Both are needed for production agents; they answer different questions. Many enterprise clients run both sprints sequentially or in parallel for comprehensive agent QA coverage.
Which agent frameworks do you support?
We support LangChain and LangGraph (via LangSmith), Claude Agent SDK (via Anthropic's evaluation patterns plus Braintrust or Phoenix), OpenAI Agents SDK (via OpenAI Evals plus Braintrust or Phoenix), CrewAI and AutoGen (via Braintrust or Arize Phoenix), and custom agent frameworks. Platform selection depends on your framework and data-residency requirements.
What does a golden trajectory look like in practice?
A golden trajectory captures: the original user input, the expected sequence of tool calls with exact arguments, expected intermediate state values, expected approval gate decisions, and the expected final output. For example, a refund-processing agent trajectory might be: user_query -> validate_customer_tool -> check_refund_eligibility_tool -> verify_approval_threshold_tool -> request_human_approval (at $500+) -> execute_refund_tool -> send_confirmation_tool -> final_output. Each step has expected inputs and outputs.
Can you integrate with CBUAE AI Guidance compliance evidence?
Yes. The February 2026 CBUAE AI Guidance expects ongoing evaluation evidence for every production AI feature. Our trajectory testing engagements map metrics directly to the 5 CBUAE AI principles (Fairness, Transparency, Accountability, Data Governance, Human Oversight). HITL gate adherence maps to the Human Oversight principle; trajectory-consistency metrics support Accountability evidence. We deliver documentation explicitly formatted for CBUAE inspector review.
What is replay testing and do we need it?
Replay testing captures real agent trajectories from production traffic and re-runs them against new agent versions to detect regressions. It complements golden trajectories - goldens are curated edge cases; replay covers real-world distribution. Strongly recommended for agents with meaningful production traffic. We help anonymize captured traces for UAE PDPL compliance before re-use as test data.
How do we maintain golden trajectories over time?
Golden trajectories are version-controlled alongside application code. When your agent's correct behaviour legitimately changes (new feature, policy update), you update the corresponding goldens. Our engagement includes a runbook for trajectory maintenance and a review process. We recommend quarterly golden-set reviews and ongoing addition of 5-10 new trajectories per month based on production failure signals.
Can we run this alongside an ongoing AI QA Retainer?
Yes. Many clients engage the trajectory testing sprint as the foundation, then continue with an AI QA Retainer that maintains the evaluation harness, expands golden trajectories, operates replay testing, responds to drift, and produces quarterly compliance reports. The sprint establishes the foundation; the retainer keeps the trajectory-evaluation programme current.
How does this work with Claude Agent SDK's Computer Use?
Claude's Computer Use (visual GUI agents) has trajectory characteristics beyond traditional tool-use - screen-state transitions, click sequences, window-management. We extend the standard trajectory-testing methodology with Computer-Use-specific evaluators (screen-state correctness, action sequence evaluation, visual hallucination detection) built on top of Anthropic's evaluation patterns. This is one of the more complex agent evaluation scenarios and we have dedicated tooling for it.
Break It Before They Do.
Book a free 30-minute GenAI QA scope call. We review your AI application, identify the top risks, and show you exactly what to test before you ship.
Talk to an Expert