AI Intelligence

Proof Beats Fan-Out

7 stories · ~7 min read

If You Only Read One Thing

The bottleneck is no longer parallel work; it is provable discard. "Claude Code Learns To Fan Out" shows subagents becoming a temporary job graph, while "Speculation Gets a Checkpoint" shows that draft tokens are cheap only because the target model can reject them. Start with Anthropic's dynamic-workflows post because it treats verification, state, and token burn as one product surface.

Claude Code Learns To Fan Out

Claude Opus 4.8 is the headline model release, but the developer story is dynamic workflows. Anthropic is moving from "the agent can call subagents" to "the product can author and run a temporary orchestration system around a task."

Dynamic workflows are in research preview for Claude Code CLI, Desktop, VS Code, the Claude API, Bedrock, Vertex AI, and Microsoft Foundry. Claude can write orchestration scripts, run tens to hundreds of parallel subagents in one session, verify outputs before returning them, and resume long runs after interruption. Anthropic's example is deliberately extreme: Jarred Sumner's Bun rewrite from Zig to Rust, with roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, and eleven days from first commit to merge.

Why it matters: The baseline for coding agents has been sequential delegation: you ask for a plan, maybe spawn a few subagents, then reconcile the result yourself. Dynamic workflows turn the plan into an executable job graph. That shifts the constraint from "can the model fit the problem in context?" to "can the harness cheaply create independent attempts, reject bad ones, and preserve enough state to converge?" This is why the companion Opus 4.8 details matter: mid-conversation system messages let a harness update instructions after a user turn without restating the whole system prompt, preserving prompt-cache hits; lower cache minimums make shorter agent loops cacheable; effort defaults and fast mode turn reasoning depth and output speed into explicit budget knobs.

The structural change is that autonomy is becoming a scheduling problem. A single agent is a conversation. A dynamic workflow is closer to a build system: decompose, run, check, fold back, repeat. That is a more honest shape for large migrations, bug hunts, and security reviews because it admits that one model pass is not the unit of work. The useful unit is an independently checked slice.
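To make the build-system analogy concrete, here is a minimal Python sketch of the decompose, run, check, fold-back loop. It is not Anthropic's implementation or API: run_workflow, SliceResult, toy_attempt, and toy_verify are hypothetical names, and the "subagent" is a plain function standing in for a model call. The acceptance test is assumed to be something cheap and mechanical, such as a test suite.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class SliceResult:
    slice_id: str
    output: Optional[str]  # None if no attempt passed verification
    attempts: int


def run_workflow(
    slices: List[str],
    attempt: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 3,
    max_workers: int = 8,
) -> Dict[str, SliceResult]:
    """Fan out one worker per slice and keep only outputs that pass verify()."""

    def work(slice_id: str) -> SliceResult:
        for n in range(1, max_attempts + 1):
            candidate = attempt(slice_id, n)
            if verify(slice_id, candidate):
                return SliceResult(slice_id, candidate, n)
        # Every attempt was rejected: surface the failure instead of guessing.
        return SliceResult(slice_id, None, max_attempts)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(work, slices))
    return {r.slice_id: r for r in results}


# Hypothetical stand-ins for a subagent call and an acceptance test.
def toy_attempt(slice_id: str, n: int) -> str:
    if slice_id == "parser" and n == 1:
        return "broken"  # first try fails, second succeeds
    if slice_id == "ffi":
        return "broken"  # never converges
    return f"{slice_id}:ok"


def toy_verify(slice_id: str, candidate: str) -> bool:
    return candidate.endswith(":ok")


results = run_workflow(["lexer", "parser", "ffi"], toy_attempt, toy_verify)
for r in results.values():
    print(r.slice_id, r.output, r.attempts)
```

The design point is that the verifier, not the worker, decides what gets folded back. A slice that never passes is reported as unresolved rather than merged, which is the property the audit-trail question later in this story depends on.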

Room for disagreement: The countercase is cost and observability. Anthropic warns that workflows can consume substantially more tokens than a normal Claude Code session, and research preview means the failure modes are still being discovered. Hundreds of subagents are useful only when the task has separable surfaces and a real acceptance test; otherwise the product has made confusion parallel.

What to watch: The decisive variable is whether Claude Code exposes a durable workflow artifact: the generated script, subagent assignments, verification checks, and cost trail. Without that audit layer, dynamic workflows are impressive demos. With it, they become the first serious product form for long-running agent work.

Speculation Gets a Checkpoint

The vLLM news is not one blog post. It is a cluster: EAGLE 3.1, Laguna XS.2 integration, DFlash speculators, and quantized checkpoints all point at the same inference-economics shift. Speed is moving from a serving-engine feature into a model-release artifact.

Speculative decoding means a small draft model proposes several next tokens, then the large model verifies them in parallel. If the large model would have produced those tokens anyway, the system accepts them and saves time without changing output quality. The old difficulty was brittleness: draft models could work in controlled tests and then fall apart under long contexts, chat templates, or production prompts.
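The draft-then-verify loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not vLLM's code: both "models" are deterministic functions from context to next token, decoding is greedy (sampling-based variants use a rejection-sampling rule instead of exact match), and the target's parallel verification pass is simulated with a loop. All names here are hypothetical.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target checks every proposed position. A real engine scores
    #    all positions in one batched forward pass; this loop simulates it.
    emitted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            emitted.append(expected)  # first mismatch: use the target's token, drop the rest
            return emitted
        emitted.append(token)
        ctx.append(token)

    # 3. Everything was accepted, so the same pass yields one extra target token.
    emitted.append(target(ctx))
    return emitted


def generate(target: Model, draft: Model, prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


# Toy models: the target counts upward mod 10; the draft agrees except after a 7.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


spec_out = generate(toy_target, toy_draft, [0], 12)
plain_out = greedy(toy_target, [0], 12)
print(spec_out == plain_out)
```

Because a mismatch is always replaced by the target's own token, the speculative output is identical to plain greedy decoding with the target alone. The draft model can only change how many target passes are needed, never what is emitted.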

EAGLE 3.1 names that failure mode "attention drift," where the drafter gradually attends to its own generated tokens instead of the prompt context. The fix is not just a paper claim. vLLM says EAGLE 3.1 is merged as a config-driven extension, backward-compatible with EAGLE 3 checkpoints, and usable with an open Kimi K2.6 draft model. On a Kimi-K2.6-NVFP4 deployment with tensor parallelism of 4 on GB200, vLLM reports 2.03x higher per-user output throughput at concurrency 1, 1.71x at concurrency 4, and 1.66x at concurrency 16 on a coding workload.

Why it matters: The key change is packaging. A model used to ship as weights plus maybe a quantized variant. The new high-performance release ships with a serving recipe: quantized checkpoints, draft checkpoints, parser flags, attention backend choices, and speculative config. Poolside's Laguna XS.2 vLLM post makes the same point in another form: Laguna XS.2 gets first-class vLLM integration, a five-layer 0.6B DFlash draft model, and LLM Compressor checkpoints. vLLM says the DFlash path predicts eight tokens at once and delivers tokens 2-3x faster when verified by the target model.

This matters because agent workloads are output-token heavy. Coding agents spend their budget on long tool plans, diff explanations, test failures, and retries. A 2x output-throughput gain on the right workload is not a benchmark vanity metric; it changes how many concurrent agent sessions a fixed GPU pool can support before latency becomes unusable. It also creates a new selection axis: the best deployment model may be the one with the best surrounding inference kit, not the one with the cleanest raw leaderboard number.

Room for disagreement: These are vendor and project benchmarks, not independent production measurements. Speculative decoding gains vary by prompt shape, sampling settings, concurrency, hardware, and acceptance rate. The signal is strong enough to matter because the code and checkpoints are inspectable, but the real proof is provider-side replication on ordinary traffic.
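The dependence on acceptance rate can be made concrete with a back-of-envelope model. The sketch below assumes each drafted token is accepted independently with the same probability, which real traffic does not guarantee, and it ignores the cost of running the draft model, so the result is an upper bound on speedup rather than a prediction. The acceptance rates are illustrative, not measurements reported by vLLM or Poolside; the draft length of 8 simply echoes the eight-token DFlash figure above.

```python
def expected_tokens_per_pass(acceptance: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k drafted tokens.

    Assumes i.i.d. per-token acceptance. The pass always emits at least one
    token (the target's own), plus each drafted token up to the first rejection:
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance == 1.0:
        return float(k + 1)
    return (1.0 - acceptance ** (k + 1)) / (1.0 - acceptance)


# Illustrative acceptance rates only; not reported figures.
for a in (0.5, 0.7, 0.9):
    print(f"acceptance={a:.1f} k=8 tokens_per_pass={expected_tokens_per_pass(a, 8):.2f}")
```

The curve is steep near the top: under this model, small changes in acceptance rate move the expected tokens per pass far more at high acceptance than at low acceptance, which is one reason the same speculator can look very different across prompt shapes and sampling settings.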

What to watch: vLLM says EAGLE 3.1 will be in nightly releases and the upcoming v0.22.0 release. The useful confirmation would be a hosted provider publishing per-model speculator settings, latency curves, and rollback controls rather than just saying a model is "optimized."

The Contrarian Take

Everyone says: Claude is scaling agents by adding more subagents, and vLLM is scaling inference by adding more throughput. The shared story is "more parallelism."

Here's why that's incomplete: Parallelism is cheap only when rejection is cheap. Claude's dynamic workflows matter because they pair fan-out with independent verification and test-suite convergence. vLLM's speculative decoding matters because every drafted token is checked by the target model before it is accepted. The alpha is not "hundreds of agents" or "eight drafted tokens." It is the control system that decides which parallel guesses survive.

Under the Radar

  • OpenJarvis makes local personal AI measurable - OpenJarvis 1.0 now has built-in Ollama support, but the deeper point is the spec architecture behind it. The paper models a personal AI stack as editable primitives for intelligence, engine, agents, tools and memory, and learning; its LLM-guided spec search matches or beats cloud accuracy on 4 of 8 benchmarks while cutting marginal API cost by about 800x and latency by 4x.

  • vLLM is standardizing RL weight updates - The native RL API work is niche but important for teams training models with live inference rollouts. vLLM now exposes weight-syncing APIs, pause/resume support, and deadlock fixes for distributed async RL; Prime-RL validated a wide expert-parallel run for more than 100 training steps across 16 eight-H200 nodes.

Quick Takes

  • Codex is making support state inspectable. Codex 0.135.0 adds richer codex doctor diagnostics, remote connection details in /status, named permission-profile display, Vim editing improvements, and Python SDK expansion. The boring interpretation is "CLI polish." The structural one is that coding agents now need support bundles because the runtime environment is part of the product. (Source)

  • Pydantic AI keeps preserving provider intent. Pydantic AI v1.104.0 adds Opus 4.8 support, forwards thinking=False on hybrid OpenRouter, xAI, and Bedrock routes, and fixes Bedrock single-tool tool_choice cache preservation. This is the adapter layer doing its actual job: preventing model-routing abstractions from silently changing reasoning and cache behavior. (Source)

  • Cline remains a model router with opinions. Cline v3.86.0 adds Claude Opus 4.8 and Moonshot Kimi K2.6 support, exposes prompt-cache support for Qwen 3.7 Max in the Cline provider, and moves the VS Code extension into apps/vscode. The release reinforces that local coding agents compete on model catalog freshness and provider-specific cost semantics, not just prompt UX. (Source)

The Thread

Today's throughline is rejection as architecture. Claude Code is trying to reject bad subagent work before it reaches the user. vLLM is trying to reject bad draft tokens before they hit the output stream. OpenJarvis is trying to reject local-agent specs that regress against cloud baselines. The common constraint is not intelligence in isolation; it is the cost of checking many cheap attempts against a trustworthy boundary.

Predictions

New predictions:

  • I predict: By June 30, 2026, Anthropic will add explicit per-workflow token or cost accounting to dynamic workflows in Claude Code, either in the confirmation screen or the run summary, because fan-out without visible budget state will create immediate enterprise admin pressure. (Confidence: medium; Check by: 2026-06-30)

Coming Next Week

Next week, we are going deeper on agent audit trails: which parts of a long-running coding-agent session need to be durable evidence, and which parts can stay conversational noise.

Generated May 29, 2026 at 3:40 AM ET.
