AI Intelligence

Traces Beat Demos

7 stories · ~7 min read

If You Only Read One Thing

The useful signal on May 12 is not a model launch. It is that agent infrastructure is moving from demos to replayable evidence. ClawBench's May 12 leaderboard move bundles tasks, traces, scoring, and partial-batch correction into the same artifact, while Pydantic AI is turning provider quirks into declared runtime capabilities. Agent quality is becoming an audit trail.

ClawBench Makes Agents Replayable

Browser-agent benchmarks used to fail in two opposite ways: synthetic sites were too clean, while real websites were too hard to score reproducibly. ClawBench is interesting because it tries to split that difference with live sites, intercepted final requests, and trace bundles that can be replayed after the run.

The project says ClawBench now covers V1 with 153 tasks across 144 live websites and V2 with 130 newer tasks across 63 platforms. The May 12 update moved the canonical leaderboard to TIGER-Lab's Hugging Face Space, bundled the paper, V1 and V2 datasets, trace datasets, and Space into one collection, and corrected ranking so partial-batch runs no longer outrank complete runs with lower reward. The scoring pipeline is the important part: Stage 1 intercepts the final HTTP request that matches the task's URL and method schema; Stage 2 asks an LLM judge whether the intercepted payload actually fulfills the natural-language instruction. V1 traces include session replay, screenshots, HTTP traffic, browser actions, and agent reasoning.
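
That pipeline is easy to hold in your head as code. Below is a minimal sketch of the two-stage idea, assuming a list of intercepted requests and a pluggable LLM judge; the class and function names are placeholders for illustration, not ClawBench's actual implementation.

    # Illustrative two-stage scorer in the spirit of ClawBench; names are placeholders.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class InterceptedRequest:
        url: str
        method: str
        payload: dict

    @dataclass
    class Task:
        instruction: str   # natural-language goal, e.g. "book the 9am slot"
        url_pattern: str   # expected endpoint fragment, e.g. "/api/bookings"
        method: str        # expected HTTP method, e.g. "POST"

    def stage1_intercept(task: Task, requests: list[InterceptedRequest]) -> Optional[InterceptedRequest]:
        """Stage 1: find a final request that matches the task's URL/method schema."""
        for req in requests:
            if req.method == task.method and task.url_pattern in req.url:
                return req
        return None

    def stage2_judge(task: Task, req: InterceptedRequest,
                     judge: Callable[[str, dict], bool]) -> bool:
        """Stage 2: an LLM judge decides whether the payload fulfills the instruction.
        Judge errors count as failures rather than retries."""
        try:
            return judge(task.instruction, req.payload)
        except Exception:
            return False

    def score(task: Task, requests: list[InterceptedRequest],
              judge: Callable[[str, dict], bool]) -> dict:
        req = stage1_intercept(task, requests)
        return {
            "intercepted": req is not None,
            "reward": req is not None and stage2_judge(task, req, judge),
        }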

Why it matters: This is closer to how production agent evaluation has to work. A browser agent that "looked right" in a screen recording but submitted the wrong payload is a product failure. A model that reaches the right endpoint but cannot satisfy the instruction is not done. By separating endpoint reachability from payload correctness, ClawBench turns agent failure into a diagnosable state rather than a single leaderboard number. That is more valuable than the headline scores, which are still low: the project lists the top V1 model at 33.3% success, and its May 11 V2 note put the best full batch at 18.5% reward with 48.5% interception.

Room for disagreement: Live websites drift, task schemas can be brittle, and the LLM judge currently sees the intercepted request rather than the full visual state. That means ClawBench is not a universal browser-agent truth machine. The stronger claim is narrower and more useful: if an agent vendor cannot hand over request-level traces, replayable artifacts, and failure categories, its demo is not yet an evaluation.

What to watch: The next step is whether vendors submit full batches across the same harness and whether teams start using ClawBench-style traces as regression tests for browser agents rather than as one-time benchmark theater.

Pydantic Names Capabilities

Pydantic AI's May 12 release is easy to skim as normal SDK churn. It is more important than that: the framework is moving from provider abstraction to provider capability negotiation.

The v1.95.0 release adds native Tool Search for Anthropic and OpenAI, custom search strategies on any provider, an Instrumentation capability, Gemini 3 structured-output-plus-tool support, and V2 preparation that renames built-in tools to native tools registered through capabilities=[NativeTool(...)]. It also adds a local= opt-in for provider-adaptive capability fallback and deprecates automatic fallback. The bug-fix list points the same way: Bedrock model IDs are normalized for capability-profile lookup, Bedrock clients can be swapped at runtime again, and Vercel AI tool-input events now emit availability or error states.

Why it matters: The lowest-common-denominator AI SDK is running out of road. OpenAI, Anthropic, Google, Bedrock, Vercel AI, and local providers do not expose identical concepts for tool search, structured output, native tools, reasoning traces, or fallback behavior. Hiding that behind a generic chat() call feels simple until an app silently loses structured output when it changes models, or routes to a provider that cannot support the requested tool mode. Pydantic AI's release makes a better bet: declare what the model/provider pair can do, surface when a fallback is local and intentional, and treat instrumentation as a capability rather than a side-channel.
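
Stripped of any specific SDK, the pattern looks something like the sketch below: each provider gets a declared capability profile, a call states what it needs, and a mismatch fails loudly unless a fallback is named explicitly at the call site. The types and profile entries here are generic illustrations, not Pydantic AI's API.

    # Generic capability-negotiation pattern; illustration only, not Pydantic AI's types.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CapabilityProfile:
        provider: str
        supports: frozenset[str]   # e.g. {"structured_output", "native_tools", "tool_search"}

    PROFILES = {
        "openai:gpt-4o": CapabilityProfile("openai", frozenset({"structured_output", "native_tools", "tool_search"})),
        "local:llama":   CapabilityProfile("local",  frozenset({"structured_output"})),
    }

    class CapabilityError(RuntimeError):
        """Raised instead of silently downgrading the request."""

    def negotiate(model: str, required: set[str], local_fallback: str | None = None) -> CapabilityProfile:
        """Return a profile satisfying `required`, or fail loudly.
        A fallback is used only when the caller names it explicitly."""
        profile = PROFILES[model]
        if required <= profile.supports:
            return profile
        if local_fallback and required <= PROFILES[local_fallback].supports:
            return PROFILES[local_fallback]   # the downgrade path is visible in code review
        missing = required - profile.supports
        raise CapabilityError(f"{model} lacks {sorted(missing)} and no suitable fallback was named")

    # A model swap that would once have silently dropped structured output now fails loudly:
    negotiate("openai:gpt-4o", {"structured_output", "native_tools"})
    # negotiate("local:llama", {"native_tools"})  # -> CapabilityError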

That is a shift from adapter code to runtime contracts. It also explains why the release has so much naming work. "Built-in tool" suggests a framework-owned abstraction. "Native tool" admits that the provider has semantics the framework must preserve. The practical consequence is less magical portability but fewer silent downgrades.

Room for disagreement: Capability profiles can become their own complexity tax. Teams may now need to understand provider-specific semantics anyway, plus Pydantic's profile layer. That is the right tax when the alternative is a clean abstraction that fails invisibly.

What to watch: The real test is whether Pydantic's instrumentation makes provider-capability mismatches visible in traces, eval reports, or CI failures. Capability negotiation matters most when a deployment can prove which capability was requested, which provider satisfied it, and which fallback path ran.

The Contrarian Take

Everyone says: Agent leaderboards will tell us which model to use once the tasks are realistic enough.

Here's why that's incomplete: The harness is becoming part of the model. ClawBench has to distinguish interception from payload correctness because browser agents can half-complete a task in ways a final-answer score would miss. WildClawBench reports that switching the harness alone can move a model's score by up to 18 points. Pydantic AI is making the same argument from the framework side: the provider surface determines which tools, structured outputs, fallbacks, and instrumentation are even available. The purchase decision is no longer "which model is best?" It is "which model, harness, capability profile, and trace format produce work we can inspect?"

Under the Radar

  • The llm CLI moved reasoning models onto the Responses API. Simon Willison's 0.32a2 release now routes most reasoning-capable OpenAI models through /v1/responses, enabling interleaved reasoning across tool calls, round-tripping of encrypted reasoning items, and optional visible reasoning summaries. The interesting part is not OpenAI support by itself; it is that even a small CLI now has to preserve provider-native reasoning state instead of flattening everything into chat completions.
  • Copilot's billing preview is really a telemetry preview. GitHub's April usage reports convert April Copilot activity into AI credits ahead of usage-based billing starting June 1, but GitHub warns the data is still rough: some model usage is not yet captured, some entries are duplicated, and code-review credits are only estimates. That caveat is the story: inference economics are moving into engineering dashboards before the measurement layer is clean.

Quick Takes

  • WildClawBench makes harness variance explicit. The arXiv paper describes 60 long-horizon tasks running inside real CLI agent harnesses including OpenClaw, Claude Code, Codex, and Hermes Agent, with tasks averaging roughly eight minutes and more than 20 tool calls. The best model reaches 62.2%, and the paper says harness choice alone can shift a model by up to 18 points. (Source)
  • Copilot code review is becoming triage software. GitHub added severity labels and grouped repeated comments in the new pull request experience. That sounds like UI polish, but it changes the cost model for AI review: the scarce resource is not generating comments, it is helping maintainers decide which comments deserve attention. (Source)
  • ClawBench's scoring doc is worth reading before the leaderboard. The project defines intercepted_rate and reward_rate separately, treats judge errors as failures, and explains why Stage 1 alone can be 1.5 to 2 times higher than the final reward score; the toy calculation after this list shows how the two rates pull apart. That is a useful antidote to agent demos that count navigation progress as task completion. (Source)
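
As a toy illustration of how those two rates diverge (invented numbers, not ClawBench data), consider ten runs where the agent often reaches the right endpoint but the judge rejects the payload:

    # Toy numbers only: why intercepted_rate can sit well above reward_rate.
    runs = [
        # (reached the matching endpoint?, payload judged correct?)
        (True, True), (True, False), (True, True),
        (True, True), (False, False), (True, False),
        (False, False), (True, True), (True, False), (False, False),
    ]

    intercepted_rate = sum(hit for hit, _ in runs) / len(runs)          # 7/10 = 0.70
    reward_rate      = sum(hit and ok for hit, ok in runs) / len(runs)  # 4/10 = 0.40

    print(intercepted_rate, reward_rate)  # 0.7 0.4 -> Stage 1 runs 1.75x the final reward here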

The Thread

This issue's thread is that agents are being pulled into the same discipline that made ordinary software reliable: logs, contracts, replay, measurement, and explicit failure states. ClawBench is doing it from the evaluation side by capturing the final request and the behavioral trace. WildClawBench is doing it from the runtime side by showing that the harness itself changes outcomes. Pydantic AI is doing it from the framework side by naming provider capabilities rather than pretending they are interchangeable. Copilot's usage reports show the economic version of the same shift. Once AI work becomes metered, delegated, and reviewable, the system around the model becomes the product surface.

Predictions

  • I predict: By 2026-08-31, at least one mainstream agent evaluation suite will require submitted runs to include replayable traces or request-level artifacts, not just final task scores. (Confidence: medium; Check by: 2026-08-31)

Generated: 2026-05-12 03:41 ET

Tomorrow morning in your inbox.

Subscribe for free. 10-minute read, every weekday.