Effort Becomes the SKU
7 stories · ~7 min read

If You Only Read One Thing
The technical story today is not another smarter model; it is control surfaces hardening around smarter models. Inspect's eval-runtime hardening shows evaluation becoming stateful infrastructure, while Grok 4.3's effort dial shows model choice collapsing into runtime policy. Start with the Inspect changelog, because its release-note details make the hidden failure modes of agent systems visible.
Inspect Makes Evals Stateful
The quiet change in AI evaluation is that the benchmark script is turning into a runtime. It now has to survive rate limits, retries, tool errors, provider quirks, partial logs, and cost accounting.
That is the signal in Inspect AI's May 3-8 release train. Releases 0.3.217 through 0.3.220 add adaptive API concurrency driven by rate-limit feedback, scoring of errored samples, immediate eval retries, higher default task parallelism, retries that reuse previous model and role usage, S3 log-write fixes for concurrent flushes, Model Context Protocol timeout errors, native web search and code execution for Gemini 3, a vLLM completions provider, Grok parallel-tool-call plumbing, and Grok 4 reasoning-effort support.
Why it matters: An eval harness is the system that runs model tests, records transcripts, scores outputs, and turns the result into something a team can trust. The old mental model was a clean loop: run prompts, collect answers, score them. That model breaks once agent evaluations start using tools, long contexts, multiple providers, sandboxes, and hundreds of concurrent samples. At that point, a failed result may mean the model failed, the rate limiter failed, the retry logic duplicated work, the log write corrupted state, or the provider adapter lost a tool result.
Inspect's releases are interesting because they move evaluation closer to production semantics. Adaptive connections acknowledge that model APIs are capacity-managed systems. Retry-immediate and usage roll-forward acknowledge that partial work is state, not trash. Scoring errored samples acknowledges that failures should be measured rather than erased. Provider-specific patches for Gemini, vLLM, Grok, Bedrock, Vertex, and Hugging Face acknowledge the same thing agent developers already know: "OpenAI-compatible" is not a behavioral contract.
The counterargument is that this is just plumbing. That is exactly why it matters. When evals are used to choose models for agents, plumbing becomes measurement integrity. A model ranking that silently reruns successful samples, drops usage accounting, or treats operator interruption like model failure is not a noisy benchmark. It is bad instrumentation.
What to watch: The next serious eval tools will expose retry provenance, rate-limit behavior, usage roll-forward, and provider-normalization errors as first-class report fields. If those stay hidden, model comparisons will keep mixing capability with harness fragility.
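As a concrete illustration of what "first-class report fields" could mean, here is a hypothetical sample record. The field names are invented for this sketch, not Inspect's actual log schema; the point is that retries, reused usage, and errors become data a report can aggregate rather than debug-log noise.

```python
# Hypothetical schema: retry provenance as a first-class eval report
# field. Field names are illustrative, not Inspect's actual log format.
from dataclasses import dataclass, field


@dataclass
class RetryRecord:
    attempt: int        # 1-based attempt counter for this sample
    reason: str         # e.g. "rate_limit", "tool_error", "operator_interrupt"
    reused_usage: bool  # whether token usage rolled forward from the prior attempt


@dataclass
class SampleReport:
    sample_id: str
    score: float | None       # None when the sample errored before scoring
    error: str | None         # provider or harness error, kept rather than erased
    retries: list[RetryRecord] = field(default_factory=list)

    @property
    def clean(self) -> bool:
        """True only if the score was produced without retries or errors."""
        return self.error is None and not self.retries
```

A ranking that reports the share of non-clean samples per model would separate capability from harness fragility at a glance.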
Grok Turns Effort Into SKU
xAI is doing something more interesting than retiring old model names. It is compressing a messy model menu into a smaller number of runtime knobs.
The company says eight models will stop working in the API on May 15 at noon Pacific, including grok-4-1-fast-reasoning, grok-4-fast-non-reasoning, grok-4-0709, grok-code-fast-1, and grok-3. Its migration guide points reasoning and code workloads to grok-4.3 and non-reasoning workloads to grok-4.20-non-reasoning, and says Grok 4.3 has a 1 million token context window, three reasoning-effort levels, and pricing of $1.25 per million input tokens and $2.50 per million output tokens. The newer reasoning docs show the actual control surface: reasoning_effort accepts "none", "low", "medium", or "high" (the three effort levels plus an off setting), with low as the default.
Why it matters: Reasoning effort is becoming a deployment parameter, not a model category. The old API shape had separate names for fast, non-reasoning, reasoning, code, and flagship variants. The new shape pushes more of that decision into a per-request policy: how much thinking, how much latency, how much cost, and whether the model is allowed to return summarized or encrypted reasoning content.
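In practice that policy is one request parameter. A minimal sketch, using an OpenAI-compatible client against xAI's endpoint: the reasoning_effort values come from the docs above, while the env var name and the extra_body plumbing are generic OpenAI-compatible conventions rather than anything confirmed here.

```python
# Minimal sketch: effort as a per-request deployment parameter via an
# OpenAI-compatible client. Model name and effort values are from the
# docs cited above; env var and extra_body are conventional assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",
)

response = client.chat.completions.create(
    model="grok-4.3",
    messages=[{"role": "user", "content": "Summarize this changelog in three bullets."}],
    # Passed via extra_body so the sketch does not depend on the SDK
    # version exposing the parameter natively.
    extra_body={"reasoning_effort": "low"},
)
print(response.choices[0].message.content)
```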
Artificial Analysis gives the tradeoff some independent shape. Its Grok 4.3 analysis puts the model at 53 on its Intelligence Index, says it costs about 20% less than Grok 4.20 0309 v2 to run the benchmark suite, and reports a 321-point GDPval-AA gain to 1500 Elo. But it also flags the part routing systems care about: Grok 4.3 used roughly 44% more output tokens than Grok 4.20 and lost 8 points on non-hallucination rate while gaining 8 points on accuracy. The cheaper model is not simply "better." It changes the frontier between cost, verbosity, and trust.
That is the structural shift. Model selection is moving from a static vendor spreadsheet to a live control loop. A router now has to decide whether Grok 4.3 at low effort is good enough, whether high effort is worth the tokens, whether Grok 4.20 non-reasoning is safer for factual tasks, and whether the absence of presencePenalty, frequencyPenalty, and stop on reasoning models breaks existing prompts. The model name matters less; the policy around it matters more.
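Reduced to a sketch, that control loop might look like the following. Every task label and threshold here is invented for illustration; the only real inputs are the effort levels and the accuracy-versus-hallucination trade Artificial Analysis reports above.

```python
# Hypothetical routing policy: the model name matters less than the
# effort/cost/trust tradeoff per task. Labels, thresholds, and the
# non-reasoning fallback are illustrative, not a recommended config.
def route(task_kind: str, budget_tokens: int) -> dict:
    if task_kind == "factual_lookup":
        # Trading hallucination rate for accuracy is the wrong deal on
        # factual tasks; prefer the non-reasoning model.
        return {"model": "grok-4.20-non-reasoning"}
    if task_kind in ("code", "multi_step_reasoning") and budget_tokens > 50_000:
        return {"model": "grok-4.3", "reasoning_effort": "high"}
    # Default: low effort, which the docs say is also the API default.
    return {"model": "grok-4.3", "reasoning_effort": "low"}
```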
Room for disagreement: xAI's docs are still vendor claims, and Grok's ecosystem remains thinner than OpenAI, Anthropic, and Google around evals, observability, and enterprise integration. The stronger version of this story requires independent task-level routing results, not just model-level benchmarks.
What to watch: The useful comparison is not Grok 4.3 versus Claude or GPT in isolation. It is low-effort Grok 4.3 versus non-reasoning Grok 4.20 versus a frontier model on completed agent workflows with cost, latency, and hallucination penalties included.
The Contrarian Take
Everyone says: The next AI platform fight is about who has the best frontier model.
Here's why that's wrong, or at least incomplete: Today's useful signal is that model intelligence is becoming an input to a control system. Inspect is hardening the measurement layer because evals fail like distributed software. xAI is shrinking model sprawl into reasoning-effort policy because one static model pick cannot cover every step. The advantage moves to whoever can measure failure, route effort, and preserve state across tools and providers. The model still matters, but the switching cost is moving into the harness.
Under the Radar
- MTP is local inference's new fork pressure - Multi-token prediction (MTP) lets a model draft several future tokens and verify them in a single pass; a toy sketch of the loop follows this list. The llama.cpp MTP pull request is still draft code, but the local community is already treating it as the missing path for Qwen3.6-class models. One consumer Blackwell test reports Qwen3.6-27B moving from 35 to 78 tokens per second with the PR and the right MTP-enabled GGUF, while a separate RTX 3090 benchmark shows speculative decoding can still lose badly depending on engine and method. The point is not a universal speedup; it is that local inference is becoming runtime-specific rather than model-specific.
- HTML is becoming an agent review surface - Simon Willison's note on Claude Code and HTML output looks like a prompt trick, but the deeper signal is output ergonomics. For PR review, exploit analysis, and architecture explanation, HTML lets agents return diagrams, navigation, annotations, and interactive state instead of a long Markdown scroll. That makes the artifact reviewable, not merely readable.
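For the MTP item above, here is the draft-and-verify loop as a toy. This is not llama.cpp's implementation, which works at the graph and kernel level; it only shows why acceptance rate, not draft speed, decides whether speculative methods win. (Real engines batch the verification and also emit a bonus verifier token when every draft is accepted.)

```python
# Toy sketch of greedy draft-and-verify: draft k tokens cheaply, keep
# the verifier's token each step, and stop at the first disagreement.
# Both models are stand-in callables, not a real runtime API.
from typing import Callable

def speculative_step(
    draft: Callable[[list[int], int], list[int]],  # cheap k-token drafter
    verify: Callable[[list[int]], int],            # main model, one token
    context: list[int],
    k: int = 4,
) -> list[int]:
    drafted = draft(context, k)
    accepted: list[int] = []
    for guess in drafted:
        token = verify(context + accepted)  # batched in real engines
        accepted.append(token)              # the verifier's token is always kept
        if token != guess:
            break                           # first mismatch ends the accepted run
    return accepted
```

If most drafts match, each verification pass yields several tokens; if few match, the drafter is pure overhead, which is how the RTX 3090 benchmark can lose while the Blackwell test more than doubles throughput.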
Quick Takes
- Cache pricing is now a leaderboard variable. Artificial Analysis added cache-hit pricing to language-model pricing, which matters because prompt caching is no longer a vendor footnote. Once cached-input price sits beside input and output price, cost comparisons can reflect agent workloads that reuse long prefixes rather than one-shot chat prompts; a toy cost comparison follows these takes. (Source)
- MTPLX is promising but not yet a runtime default. A Hugging Face Qwen3.6-27B MTPLX checkpoint claims up to 2.24x faster decode on Apple Silicon using native MTP heads and an MLX-focused runtime contract. The limitation is explicit: the MTPLX runtime is not yet released, and without it the checkpoint falls back to ordinary autoregressive decoding. (Source)
- The paper firehose failed the practicality floor today. Hugging Face's May 8 list is packed with agent-training and reinforcement-learning papers: Skill1, SkillOS, StraTA, A2TGPO, and more. The useful takeaway is negative: most of the novelty is training-methodology churn, not something that changes a coding agent, model pick, inference bill, or reliability workflow this week. (Source)
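On the cache-pricing take, the shape of the math. Input and output prices below reuse Grok 4.3's numbers from above; the cached-input price and the 40k-token prefix are invented purely for illustration.

```python
# Illustrative arithmetic: p_in and p_out are Grok 4.3's prices from
# above; p_cached is made up to show why the line item matters.
def step_cost(fresh_in: int, cached_in: int, out: int,
              p_in: float = 1.25, p_cached: float = 0.30, p_out: float = 2.50) -> float:
    """Dollar cost of one agent step; prices are per million tokens."""
    return (fresh_in * p_in + cached_in * p_cached + out * p_out) / 1e6

# A 40k-token system/tool prefix reused on every step of a long agent run:
print(step_cost(40_000, 0, 500))    # no caching:     ~$0.0513 per step
print(step_cost(500, 39_500, 500))  # cached prefix:  ~$0.0137 per step
```

Multiply that gap by a hundred steps and the cached-input column moves the ranking more than the headline input price does.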
The Thread
Today's thread is effort control. Inspect is turning evaluation into a stateful runtime because effort is wasted when retries, logs, tools, and provider adapters lie. xAI is turning reasoning into a per-request knob because effort is wasted when every task gets the same model tier. MTP and cache pricing tell the same story at the inference layer: the valuable systems are not simply smarter. They spend compute only where the workflow can prove it needs the extra work.
Predictions
New predictions:
- I predict: By 2026-07-31, at least two major eval or observability tools will expose retry provenance, provider-normalization errors, or rate-limit adaptation as report-level fields rather than debug logs. (Confidence: medium; Check by: 2026-07-31)
- I predict: By 2026-08-31, at least one mainstream AI SDK or router will normalize reasoning_effort-style controls across three providers, including xAI and at least one of OpenAI, Anthropic, or Google. (Confidence: medium; Check by: 2026-08-31)
Generated: 2026-05-10 03:45 ET
Tomorrow morning in your inbox.
Subscribe for free. 10-minute read, every weekday.