Benchmarks Chase the Work
7 stories · ~7 min read

If You Only Read One Thing
The new AI benchmark race is not about harder questions; it is about longer work. Ai2's AstaBench update and Artificial Analysis's AA-AgentPerf both abandon the fantasy that agents can be judged by a clean prompt-answer exchange. One measures scientific workflows; the other measures hardware under multi-turn agent load. The unit of progress is becoming the completed session.
AstaBench Prices Science Agents
A scientific agent that can write code, search papers, and analyze data is still not the same thing as an agent that can finish a research workflow. That is the useful tension in Ai2's spring AstaBench update.
Ai2 says the refreshed benchmark now tests frontier models on more than 2,400 research problems across literature understanding, code execution, data analysis, and end-to-end discovery workflows. The new leaderboard also has institutional uptake: UK AISI is adding AstaBench through Inspect Evals, General Reasoning has integrated a task into OpenReward, and external submissions now include Elicit, SciSpace, Distyl AI, and EvoScientist. The benchmark matters because it keeps the research task intact instead of reducing scientific work to a single answer.
The headline scores are sobering. Claude Opus 4.7 leads at 58.0% overall with an average measured cost of $3.54 per problem. GPT-5.5 reaches 52.9% at $1.61 per problem, just behind Ai2's older Asta v0 at 53.0% and ahead of Gemini 3.1 Pro Preview at 49.6% and GPT-5.4 at 46.5%. Opus 4.7 improves 2.7 points over Opus 4.6, but Ai2 says that costs about 62% more per problem. Most of the improvement comes from End-to-End Discovery, where Opus 4.7 wins by 10.2 points while taking 54% more steps and costing 65% more.
Why it matters: The benchmark is teaching a sharper concept than "AI can do science." Think of AstaBench as a lab-notebook test: the model is not rewarded merely for knowing a fact or producing a plausible method, but for chaining search, code, data, and writeup into something that survives evaluation. The structural signal is that component skill and workflow completion are separating. GPT-5.5 can lead several component categories at lower cost, but still trail on the hardest end-to-end work. Claude can buy more completion with more steps and more money. That turns scientific-agent progress into a three-variable problem: quality, workflow closure, and cost per completed research problem.
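A back-of-envelope way to make that third variable concrete, using only the figures above: divide cost per attempted problem by the overall solve rate to get an expected cost per completed problem. This framing is ours, not an Ai2 leaderboard metric; a minimal sketch:

```python
# Back-of-envelope: expected cost per *completed* research problem.
# Scores and per-problem costs are the reported figures above; the
# "cost per completed problem" framing is ours, not Ai2's metric.
reported = {
    "Claude Opus 4.7": {"score": 0.580, "cost_usd": 3.54},
    "GPT-5.5":         {"score": 0.529, "cost_usd": 1.61},
}

for name, r in reported.items():
    # If an attempt costs $c regardless of outcome and succeeds with
    # probability p, the expected spend per success is c / p.
    print(f"{name}: ${r['cost_usd'] / r['score']:.2f} per completed problem")

# Claude Opus 4.7: $6.10 per completed problem
# GPT-5.5: $3.04 per completed problem
```

On this reading, Claude's lead costs roughly twice as much per solved problem, which is exactly the quality-versus-cost tension the leaderboard is surfacing.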
This is why AstaBench's adoption matters more than the exact first-place rank. Once a benchmark becomes a shared environment used by AI safety evaluators, research-tool startups, and RL infrastructure companies, it becomes training substrate as well as measurement. The field is starting to define "good at science" as something agents do inside an inspectable environment, not as a model-card claim about reasoning.
Room for disagreement: AstaBench is still partly scaffold-dependent, and Ai2 notes that End-to-End Discovery is judged by a Claude model. The right read is not that Claude is definitively the best scientific agent; it is that science benchmarks are now exposing the cost and fragility of multi-step research work.
What to watch: The confirmation signal will be external submissions that improve end-to-end scores without simply taking more steps and spending more money. If open agents or smaller specialist systems can move the Pareto frontier, scientific-agent progress becomes an architecture story rather than a frontier-model story.
AgentPerf Makes Hardware Work
Token throughput was a good metric when inference mostly meant answering independent prompts. Agents break that abstraction because a single user session can run for hundreds of turns, reuse context, call tools, and keep long state alive.
Artificial Analysis has now made that shift explicit with AA-AgentPerf, a hardware benchmark for agent workloads. It uses real coding-agent trajectories, including sessions up to 200 turns and sequence lengths above 100,000 tokens. It allows production optimizations such as key-value cache reuse, disaggregated prefill/decode, and speculative decoding. The output metric is not peak tokens per second; it is maximum concurrent users at target output-speed service levels, normalized per accelerator, per kilowatt, per dollar per hour, and per rack.
The methodology shows why this is not a cosmetic change. The dataset's input sequence length ranges from roughly 1,000 to 131,000 tokens, with a mean around 27,000. Output length has a median around 150 tokens and a P95 around 2,000 for DeepSeek V3.2. The launch models are gpt-oss-120b and DeepSeek V3.2, with service tiers such as 30, 100, and 300 tokens per second for DeepSeek and up to 2,000 tokens per second for gpt-oss-120b. The benchmark finds maximum supported users by ramping concurrency and then using binary search.
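For intuition, here is a minimal sketch of that ramp-then-binary-search loop. The real harness and trajectory set are private, so the meets_slo probe and its capacity constant below are toy stand-ins we invented:

```python
TRUE_CAPACITY = 187  # toy stand-in for the system's actual limit

def meets_slo(concurrency: int) -> bool:
    # Hypothetical probe: the real benchmark would replay the private agent
    # trajectories at this concurrency and check that every user holds the
    # target output-speed tier (e.g. 30, 100, or 300 tokens per second).
    return concurrency <= TRUE_CAPACITY

def max_supported_users(cap: int = 1 << 16) -> int:
    # Phase 1: ramp concurrency geometrically until the SLO first fails.
    lo, hi = 0, 1
    while hi <= cap and meets_slo(hi):
        lo, hi = hi, hi * 2
    if hi > cap:
        return lo  # never failed within the cap
    # Phase 2: binary search (lo passes, hi fails) for the highest passing level.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if meets_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo

print(max_supported_users())  # 187
```

The geometric ramp keeps the number of expensive full-load probes logarithmic; in the real benchmark, each probe means replaying long multi-turn trajectories end to end, so probe economy matters.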
Why it matters: Agent serving is a queueing problem disguised as a model problem. The old benchmark asked, "How fast can this system generate tokens?" The agent-era question is, "How many long-running workers can this system keep useful before latency and context handling collapse?" That makes scheduler behavior, cache retention, prefill/decode separation, memory bandwidth, and power cost first-class evaluation variables. It also removes a convenient hiding place for hardware claims: a system optimized for clean 1,000-token prompts may fail when real agents carry messy histories and short, tool-driven outputs.
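One way to see why concurrent users, not raw token speed, is the binding unit is Little's law: steady-state concurrency equals arrival rate times session duration. The numbers below are invented for illustration and are ours, not AA-AgentPerf's:

```python
# Little's law: L = arrival_rate * session_duration gives steady-state
# concurrent sessions. Numbers are illustrative, not AA-AgentPerf data.
arrival_rate_per_min = 2.0   # new sessions arriving per minute

chat_session_min = 1.5       # a short chatbot exchange
agent_session_min = 45.0     # a long, multi-turn coding-agent session

print(arrival_rate_per_min * chat_session_min)   # 3.0 concurrent chat users
print(arrival_rate_per_min * agent_session_min)  # 90.0 concurrent agent users
```

Same arrival rate, thirty times the held state: long sessions, not request volume, are what exhaust cache, memory, and scheduler headroom.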
The deeper shift is that benchmark designers are catching up to product reality. Coding agents do not consume compute like chatbots. Scientific agents do not fail like exams. Multimodal agents do not use context like summarizers. Once the workload changes, the winning architecture changes too. AA-AgentPerf is important because it lets accelerator vendors compete on the actual demand curve: concurrent agent sessions under a service-level objective.
Room for disagreement: First public results are still pending, and the full test dataset is private to reduce benchmark targeting. That helps integrity but limits outside replication. The benchmark's value will depend on whether the disclosed system configurations are detailed enough for buyers and researchers to understand what actually won.
What to watch: The first rolling results should reveal whether B200, H200, MI300X, and emerging SRAM-heavy systems rank differently under agent concurrency than under conventional throughput tests.
The Contrarian Take
Everyone says: AI benchmarks are getting more realistic because the old leaderboards are saturated.
Here's why that's incomplete: The bigger change is that benchmarks are becoming procurement instruments. AstaBench prices a completed scientific problem; AA-AgentPerf prices concurrent agent users under latency targets. Those are not just better academic tests. They are the units that determine whether an agent product is economically deployable. The winners will be the systems that close workflows at tolerable cost, not the models that look best when every task is flattened into one score.
Under the Radar
- Endpoint choice is now an eval variable — TokenArena argues that the relevant deployment unit is not the model but the endpoint: provider, model, SKU, quantization, decoding setup, region, and serving stack. Across 78 endpoints and 12 model families, the authors report same-model accuracy gaps up to 12.5 points on math and code, tail-latency gaps of an order of magnitude, and a 6.2x spread in modeled joules per correct answer (a sketch of that metric follows this list).
- Code reward models are moving beyond pass/fail — Themis ships code, datasets, and reward models trained on more than 350,000 code preference pairs across five quality dimensions and eight programming languages. The important move is away from execution correctness as the only reward signal: maintainability, readability, safety, and task fit start to become trainable preferences.
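As promised in the TokenArena item above, a quick sketch of the joules-per-correct-answer idea. The exact formula TokenArena models is not spelled out here, so this function and its numbers are our illustrative reading of the metric's name:

```python
# Illustrative "joules per correct answer": energy per query divided by
# accuracy, so each correct answer is charged for the misses around it.
# The function and all numbers are ours, not TokenArena's published model.
def joules_per_correct(avg_power_w: float, latency_s: float, accuracy: float) -> float:
    return (avg_power_w * latency_s) / accuracy

fast_accurate = joules_per_correct(avg_power_w=700, latency_s=2.0, accuracy=0.80)
slow_sloppy = joules_per_correct(avg_power_w=400, latency_s=9.0, accuracy=0.68)
print(fast_accurate)                # 1750.0 J per correct answer
print(slow_sloppy / fast_accurate)  # ~3.0x spread from serving choices alone
```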
Quick Takes
- Robotics RL is becoming a fleet loop. Learning While Deploying frames robot improvement as offline-to-online reinforcement learning across deployed machines: a fleet of 16 dual-arm robots, eight manipulation tasks, human interventions, and a single generalist policy that reaches 95% average success as experience accumulates. The notable idea is not another VLA benchmark; it is treating deployment failures as the training stream. (Source)
- Video diffusion is trying to become a shared prior. UniVidX uses one video diffusion backbone for multiple pixel-aligned generation tasks, including RGB-to-intrinsic maps and RGBA layer decomposition, with stochastic condition masking and modality-specific LoRA adapters. The claimed generalization from fewer than 1,000 videos is the part to watch because it hints at reusable video priors instead of task-specific generators. (Source)
- Coding agents still struggle to reproduce science. AutoMat asks coding agents to reproduce claims from computational materials-science papers, where procedures are underspecified and toolchains are specialized. The best reported setting reaches 54.1% overall success, with failures concentrated in procedure reconstruction and execution fragility. That is exactly the failure mode AstaBench is trying to surface at broader scientific scale. (Source)
The Thread
Today's throughline is workload fidelity. AstaBench says scientific ability means closing a messy research loop at an observable cost. AA-AgentPerf says inference hardware should be judged by sustained agent sessions, not synthetic token flow. TokenArena and Themis push the same direction from opposite sides: endpoint behavior and reward criteria are part of the system, not metadata. The field is getting less impressed by isolated intelligence and more demanding about the work container around it.
Prediction Ledger
Weekly Scorecard
- By 2026-08-31, at least two major model launch posts will include a third-party held-out evaluation or an explicit "not independently verified" caveat next to headline benchmark claims. — Made 2026-05-03, medium confidence. Pending: AstaBench and AA-AgentPerf support the evaluation-infrastructure trend, but this prediction is about model-launch disclosure.
- By 2026-08-31, at least one public coding-agent benchmark will add a trace-review or approval-quality metric alongside pass rate. — Made 2026-05-03, medium confidence. Pending: AA-AgentPerf moves toward real agent traces, but it measures hardware capacity rather than reviewer approval quality.
New prediction
- I predict: By 2026-08-31, at least two public agent benchmarks or inference leaderboards will report either cost per completed workflow or maximum concurrent agent users, not just pass rate or tokens per second. (Confidence: medium; Check by: 2026-08-31)
Generated May 4, 2026 at 3:38 AM ET.