Tools Stop Sprawling

If You Only Read One Thing

Start with the catalog before the context window. Pydantic AI's on-demand capabilities turn tool access into a recorded load event, while Step 3.7 Flash applies the same discipline to cost through sparse experts and advisor calls. The shared signal is that agent stacks are being designed around staged availability: first discover, then reveal, then pay.

Pydantic AI's newest release is easy to mistake for another framework feature. It is more important than that: it changes the default shape of tool-heavy agents.

The v1.105.0 release shipped on June 2 with on-demand capabilities: bundles of instructions, tools, model settings, native tools, and lifecycle hooks that can be hidden until the model loads them. The merged PR describes the basic pattern: give a capability a stable id, set defer_loading=True, and the model initially sees a compact catalog plus a framework-managed load_capability tool. When it needs the orders workflow, security runbook, or returns handler, it asks for that capability and receives the relevant instructions and tools on the next turn.
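The load-on-request pattern described above can be sketched without any framework. The following is a toy model, not Pydantic AI's API: the Capability and ToyAgentState classes, their fields, and the example tools are all hypothetical names chosen for illustration. Only the ideas of a stable id, a defer_loading flag, and a load_capability tool come from the release description.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Capability:
    """Hypothetical bundle of instructions and tools with a stable id."""
    id: str
    description: str
    instructions: str
    tools: dict[str, Callable[..., object]]
    defer_loading: bool = True


@dataclass
class ToyAgentState:
    """Toy harness state: a catalog plus an ordered record of load events."""
    catalog: dict[str, Capability]
    loaded: list[str] = field(default_factory=list)

    def visible_catalog(self) -> list[str]:
        # Compact menu: one line per deferred capability not yet loaded.
        return [
            f"{cap.id}: {cap.description}"
            for cap in self.catalog.values()
            if cap.defer_loading and cap.id not in self.loaded
        ]

    def load_capability(self, capability_id: str) -> str:
        # Record the load event once, then return the capability's instructions.
        cap = self.catalog[capability_id]
        if capability_id not in self.loaded:
            self.loaded.append(capability_id)
        return cap.instructions

    def visible_tools(self) -> list[str]:
        # The loader is always visible; other tools appear only after loading.
        names = ["load_capability"]
        for cap in self.catalog.values():
            if not cap.defer_loading or cap.id in self.loaded:
                names.extend(cap.tools)
        return names


state = ToyAgentState(
    catalog={
        "orders": Capability(
            id="orders",
            description="Look up and amend orders",
            instructions="Confirm the order id before changing anything.",
            tools={"lookup_order": lambda order_id: {"order_id": order_id}},
        ),
        "returns": Capability(
            id="returns",
            description="Start and track returns",
            instructions="Check the return window first.",
            tools={"start_return": lambda order_id: {"return_for": order_id}},
        ),
    }
)

before = state.visible_tools()
instructions = state.load_capability("orders")
after = state.visible_tools()
```

In this sketch the model's first turn sees only the loader and a two-line menu. After the orders capability is loaded, its tool joins the visible set and the loaded list records the event, while the returns capability stays hidden.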

The prior baseline was tool stuffing. Agent builders either exposed every tool up front, which bloats context and makes tool choice noisier, or split workflows into separate agents and paid the handoff cost. Pydantic's new primitive is a third option: a menu the model can search, not a buffet it must ingest.

Why it matters: Agent reliability is becoming a retrieval problem inside the harness. A tool schema is not just metadata; it is prompt mass, cache state, and behavioral temptation. Pydantic's ToolSearch makes deferred tools discoverable through provider-native search on OpenAI Responses and Anthropic, with a local search_tools fallback elsewhere. The cache detail is the real signal: on providers with a native client-executed surface, the discovery exchange is append-only, so the tool list can stay stable and the prompt-cache prefix can remain warm. That is a small sentence with a large cost implication. It means the framework can reduce visible tool surface without destroying the cached prefix that makes repeated agent turns economically tolerable.
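The cache point can be made concrete with a toy model of prefix caching. This is a sketch under simplifying assumptions, not any provider's actual cache: a request is treated as an ordered list of segments, and a cached prefix is reusable only while the leading segments are unchanged. The segment shapes and the helper names are hypothetical.

```python
import hashlib
import json


def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Number of leading segments that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_key(segments: list[dict], n: int) -> str:
    """Stable hash of the first n segments, standing in for a cache key."""
    blob = json.dumps(segments[:n], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


# Turn 1: the tool list sits at the front of the request.
turn1 = [
    {"tools": ["search_tools"]},
    {"system": "You are a support agent."},
    {"user": "Where is my order?"},
]

# Append-only discovery: the tool list is untouched, new messages go at the end.
append_only = turn1 + [
    {"assistant_tool_call": "search_tools"},
    {"tool_result": ["lookup_order"]},
]

# Mutating discovery: the newly found tool is added to the front tool list.
mutated = [{"tools": ["search_tools", "lookup_order"]}] + turn1[1:] + [
    {"assistant_tool_call": "search_tools"},
]

kept = shared_prefix_len(turn1, append_only)
lost = shared_prefix_len(turn1, mutated)
```

Under these assumptions the append-only turn shares all three original segments with turn 1, so the same prefix key still applies. The mutated turn diverges at the first segment and shares nothing.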

This is also a governance move. The docs say deferred capability instructions resolve only after load_capability, model settings and hooks register during run setup but apply only once loaded, and stable IDs let message history identify what was loaded in resumed runs. In plain English: the agent run now has a record of which capability was made available when. That is closer to permission accounting than prompt engineering.

Room for disagreement: Deferred loading does not prove the model will request the right capability. It can still fail to search, load a capability too late, or miss a tool whose description is poorly written. The PR also says deferring native tools can break the prompt-cache prefix because native tool definitions sit in the request prefix. The useful claim is narrower: framework-level progressive disclosure is becoming part of production agent design. The claim is not that Pydantic has solved tool choice.

Step 3.7 Flash Prices the Executor

StepFun's Step 3.7 Flash is not interesting because it is another open multimodal model. It is interesting because it packages agent work as a two-tier cost problem.

StepFun's launch writeup presents a 196B-parameter model optimized for search-heavy, coding, visual, and tool-use tasks. NVIDIA's technical docs list the production-facing version as a 198B total-parameter model with about 11B active parameters per token, 288 experts with 8 active, a 256K context window, text and image inputs, and deployment paths through SGLang, TensorRT-LLM, vLLM, Hugging Face, and NVIDIA NIM. NVIDIA's NIM page says the model is ready for commercial and non-commercial use, globally deployable, and intended for multimodal understanding, agentic workflow support, coding, frontend generation, tool calling, and GUI-oriented screenshot work.

The mechanism is sparse specialization. A mixture-of-experts model is like a large firm where only a few specialists attend each meeting. Step 3.7 Flash carries the capacity of a large model but activates a small subset per token, so the serving target is closer to an 11B active model than a dense 198B model. NVIDIA also lists MTP-3 acceleration, a three-token prediction path, with 100-300 tokens per second throughput and a 350 tokens-per-second coding peak.
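A generic top-k gate illustrates the "few specialists per meeting" idea. This is a textbook-style sketch, not Step 3.7 Flash's actual router, whose design is not described here. The 288 and 8 come from the NVIDIA figures quoted above; the random logits, the softmax-over-selected-experts weighting, and the function name are assumptions.

```python
import math
import random

NUM_EXPERTS = 288  # experts per MoE layer, as quoted above
TOP_K = 8          # experts active per token, as quoted above


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Stand-in router scores for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(logits)

# Share of total parameters touched per token, using the quoted 11B and 198B.
active_fraction = 11 / 198
```

Only 8 of the 288 experts receive a nonzero weight for this token. With the quoted totals, roughly 5.6% of the parameters are active per token, which is why the serving target resembles a much smaller dense model.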

Why it matters: The old model-selection question was "Which model is smart enough?" Step 3.7 Flash pushes a better question: "Which model is cheap enough to remain the executor?" StepFun reports 49.5 on Toolathlon, 67.1 on ClawEval-1.1, 47.2 on HLE with tools, and a Step-SWE-Bench average of 67.08% across six harnesses, up from 56.50% for Step 3.5 Flash. The more revealing number is Advisor Mode: StepFun says Step 3.7 Flash can drive the trajectory end-to-end and consult a larger advisor model only at planning or recovery points, reaching 97% of Claude Opus 4.6's coding performance at about one-ninth the per-task cost, $0.19 versus $1.76.
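The Advisor Mode arithmetic is worth spelling out, along with the shape of an escalation rule. The three constants below are StepFun's self-reported figures as quoted above. The should_consult_advisor function, its phase labels, and its failure threshold are hypothetical and only illustrate "consult at planning or recovery points"; they are not StepFun's policy.

```python
ADVISOR_MODE_COST = 0.19  # USD per task, Step 3.7 Flash with advisor (self-reported)
FRONTIER_COST = 1.76      # USD per task, Claude Opus 4.6 (as reported by StepFun)
RELATIVE_PERF = 0.97      # Advisor Mode coding performance relative to the frontier model

# How many times cheaper the Advisor Mode run is per task.
cost_ratio = FRONTIER_COST / ADVISOR_MODE_COST

# Fraction of per-task spend avoided.
savings = 1 - ADVISOR_MODE_COST / FRONTIER_COST

# Cost normalized by relative performance (frontier model defined as 1.0).
cost_per_perf_advisor_mode = ADVISOR_MODE_COST / RELATIVE_PERF
cost_per_perf_frontier = FRONTIER_COST / 1.0


def should_consult_advisor(phase: str, consecutive_failures: int,
                           failure_threshold: int = 2) -> bool:
    """Hypothetical rule: escalate when planning, or when recovery is needed."""
    return phase == "plan" or consecutive_failures >= failure_threshold
```

The ratio works out to a little over 9, consistent with "about one-ninth", and the spend avoided is just under 90% per task. Both figures inherit whatever caveats apply to the self-reported inputs.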

That claim needs independent replication, but the architecture is the point. A small active executor plus selective advisor calls is the model-side version of Pydantic's hidden menu. Do not show every tool. Do not activate every expert. Do not call the frontier model at every step. The economic frontier for agents is moving from "best single model" toward conditional computation across tools, experts, and advisors.

Room for disagreement: Several benchmark comparisons are self-reported, internal, or version-mismatched. StepFun says some Toolathlon numbers use an internal fixed version, and its Terminal-Bench comparisons mix internal tests with official reported results from other labs. Treat Step 3.7 Flash as a serious deployment candidate, not a proven Claude replacement. The confirmation will come from independent Aider, SWE-Bench Pro, Artificial Analysis coding-agent, and real repo runs across multiple harnesses.

The Contrarian Take

Everyone says: Agent systems are suffering because models need larger context windows and more tools. Give the model the whole repo, the whole tool registry, and a stronger frontier model, and the scaffolding gets simpler.

Here's why that's wrong, or at least incomplete: Today's best evidence points the other way. Pydantic AI is adding a loadable capability catalog because dumping every instruction and tool into the first turn makes agents worse and more expensive. Step 3.7 Flash is useful because most tokens do not wake the whole model, and its Advisor Mode only escalates when the executor needs help. The winning agent stack is not the one that sees everything. It is the one that makes access conditional and leaves an audit trail when the condition fires.

Under the Radar

CoreWeave is selling the sandbox as the agent execution layer. CoreWeave Sandboxes gives RL, agent tool-use, and model-evaluation workloads isolated environments on customer CoreWeave Kubernetes clusters or as a serverless runtime through Weights & Biases. The undercovered detail is not "cloud sandbox." It is identity and trace coupling: W&B serverless sandboxes are pre-authenticated to the W&B identity, secrets come from the W&B Team store, lifecycle events land in the W&B run timeline, and Weave traces connect model/tool calls back to the sandbox. (Source)
Voice-agent evals are becoming latency frontiers, not WER tables. Artificial Analysis launched AA-WER Streaming, measuring word error rate and post-speech latency together. The practical surprise is that first partial transcripts are only about 0.7 percentage points less accurate than final transcripts on average, while the fastest model, Deepgram Flux, returns first partial output in 0.019 seconds at 7.36% WER. Voice agents need that tradeoff because every transcript delay steals time from reasoning and tool calls. (Source)

Quick Takes

Gemini 2.0 Flash is gone from the API. Google's Gemini API release notes say gemini-2.0-flash, gemini-2.0-flash-001, gemini-2.0-flash-lite, and gemini-2.0-flash-lite-001 shut down on June 1; developers should use gemini-3.5-flash or gemini-3.1-flash-lite. This is mundane until an agent fleet has old aliases pinned in fallback routes. Model lifecycle is now runtime reliability. (Source)
llama.cpp tightened speculative decoding defaults. The June 1 b9464 release fixes n_outputs_max, adds a shared speculative max-draft-size helper, removes draft-simple auto-enable behavior, and enables server tests on PRs. The small local-inference lesson is that speculative decoding is becoming ordinary enough that surprising defaults are now bugs, not clever shortcuts. (Source)
Step 3.7 Flash is already provoking local-serving experiments. A Hugging Face post reports a narrow MLX run that kept a 199.74 GiB Q8 Step 3.7 Flash tensor set under a hard 96 GiB model-memory budget by lazily loading routed experts. The clean run was tiny, but the memory numbers matter: sparse MoE local serving will be about expert residency policy, not only quantization. (Source)
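An expert residency policy, as mentioned in the last item, can be as simple as a bounded cache in front of a lazy loader. The sketch below assumes a least-recently-used eviction rule and a caller-supplied loader function. Both are illustrative choices, not details taken from the Hugging Face post, and the class name is hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[int], object]):
        self.capacity = capacity
        self.loader = loader  # hypothetical: reads one expert's weights from disk
        self.resident: OrderedDict[int, object] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> object:
        if expert_id in self.resident:
            # Already in memory: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Not resident: load lazily, then evict the oldest entry if over budget.
        self.misses += 1
        weights = self.loader(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


# Stand-in loader that returns a label instead of real tensors.
cache = ExpertCache(capacity=2, loader=lambda expert_id: f"weights-{expert_id}")
for routed in [1, 2, 1, 3]:
    cache.get(routed)
```

With a budget of two experts and the routing sequence 1, 2, 1, 3, expert 2 is evicted when expert 3 arrives, leaving experts 1 and 3 resident after three loads and one hit. A real policy would weigh expert size and routing frequency, but the trade between memory budget and reload cost has the same shape.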

The Thread

The thread is conditionality. Pydantic AI makes tool access conditional on model intent and records the loaded capability in history. Step 3.7 Flash makes compute conditional through sparse experts and selective advisor calls. CoreWeave makes execution conditional on an isolated sandbox identity. Even Gemini's model shutdown is the same lesson in reverse: if a runtime depends on implicit availability, it will eventually break. Agents are not becoming simpler as they get more capable. They are becoming systems where the important engineering question is what stays hidden until the moment it is justified.

Predictions

New predictions:

I predict: By 2026-08-31, at least one major coding-agent benchmark or model-release report will publish separate cost-per-task results for small-executor-plus-frontier-advisor routing, instead of reporting only single-model runs. (Confidence: medium; Check by: 2026-08-31)

Generated June 2, 2026 at 3:32 AM ET.