Cheap Gets Capable

If You Only Read One Thing

The capability story today has a price tag. In "Cursor Prices the Agent," Composer 2.5 lands near the top of an independent coding-agent index at commodity-task cost. In "Command A+ Tests Open Efficiency," the question is whether open weights can move enterprise agents into smaller hardware envelopes. The common thread is deployment control. Start with Artificial Analysis's Composer 2.5 benchmark.

Cursor Prices the Agent

Cursor did not just release a better coding model. It put a model, a harness, and a price point into the same competitive frame.

Composer 2.5, released May 18, is built on Moonshot's Kimi K2.5 checkpoint and further trained by Cursor for coding-agent behavior. Cursor says the update used more difficult reinforcement-learning environments, 25x more synthetic tasks than Composer 2, and a technique it calls targeted textual feedback: instead of waiting until the end of a long agent run to assign credit, the training loop inserts local hints near the specific bad tool call, style error, or confusing response it wants to correct. The model is available only inside Cursor. Standard pricing is $0.50 per million input tokens and $2.50 per million output tokens, while the default Fast variant costs $3.00 and $15.00 for the same units.
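
To make the price gap concrete, here is a minimal sketch that turns the list prices above into a per-run token cost. The token counts are hypothetical, chosen only for illustration, and the calculation ignores anything the paragraph does not mention, such as caching discounts or subscription allowances.

```python
# Per-million-token list prices quoted above (USD).
PRICES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def run_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for one agent run at list price."""
    price = PRICES[variant]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical run size (an assumption, not a measured figure):
# 400k input tokens and 20k output tokens.
for variant in PRICES:
    print(variant, round(run_cost(variant, 400_000, 20_000), 2))
```

Because both the input and output prices are six times higher on Fast, the Fast run costs six times as much for any token mix.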

The outside measurement is what makes this more than a launch post. Artificial Analysis says its Coding Agent Index averages three benchmark families: SWE-Bench-Pro-Hard-AA for repository patching, Terminal-Bench v2 for shell workflows, and SWE-Atlas-QnA for repository understanding. In its May 21 analysis, Composer 2.5 scored 62, up from Composer 2's 48, behind only Claude Opus 4.7 max in Claude Code at 66 and GPT-5.5 xhigh in Codex at 65. The cost spread is the point: those two higher-scoring variants cost $4.10 and $4.82 per task, versus $0.44 for Composer 2.5 Fast and $0.07 for standard Composer 2.5.
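
The size of that spread is easier to see as a multiple. The sketch below uses only the four per-task costs reported above and deliberately does not assign the $4.10 and $4.82 figures to a specific product, since the paragraph does not say which is which.

```python
# Per-task costs reported above (USD).
HIGHER_SCORING = [4.10, 4.82]  # the two variants that outscored Composer 2.5
COMPOSER = {"Composer 2.5 Fast": 0.44, "Composer 2.5 standard": 0.07}


def cost_multiple(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per task."""
    return expensive / cheap


for name, cost in COMPOSER.items():
    multiples = [round(cost_multiple(c, cost), 1) for c in HIGHER_SCORING]
    print(name, multiples)
```

On these numbers the higher-scoring variants cost roughly 9x to 11x as much per task as Composer 2.5 Fast, and roughly 59x to 69x as much as standard Composer 2.5.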

Why it matters: The old model race asked which lab had the best general model. The coding-agent race is starting to ask a narrower and more economically useful question: who can own the task distribution tightly enough to train a cheaper model that is good enough inside one workflow? Cursor has privileged data about how developers ask, approve, interrupt, and recover from agent work. That gives it a post-training surface that a general model API does not automatically have, especially when the target is not "write code" in the abstract but "operate inside Cursor's agent harness."

That does not make Composer 2.5 the universal best model. Artificial Analysis reports the largest improvement on SWE-Bench-Pro-Hard-AA, from 12% to 47%, while Terminal-Bench v2 moved only from 64% to 66%. That split matters. Cursor appears to have improved repository patching much more than terminal-native operation. It also does not expose a separate API, which means the measured product is a closed bundle: model, IDE, agent loop, tool surface, and pricing. But that is exactly the structural signal. Coding agents are becoming less like interchangeable completions endpoints and more like vertically tuned work systems where the model is priced against completed tasks.
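
A quick back-of-envelope check on that split, assuming the index is a simple unweighted mean of the three benchmark families and that the published figures are rounded. The implied third score is an inference from that assumption, not a number Artificial Analysis reported here.

```python
def implied_third(index: float, first: float, second: float) -> float:
    """Third component implied by an unweighted mean of three scores."""
    return 3 * index - first - second


# (index, SWE-Bench-Pro-Hard-AA, Terminal-Bench v2) as reported above.
composer_2 = (48, 12, 64)
composer_2_5 = (62, 47, 66)

patching_gain = composer_2_5[1] - composer_2[1]   # percentage points
terminal_gain = composer_2_5[2] - composer_2[2]   # percentage points
implied_before = implied_third(*composer_2)
implied_after = implied_third(*composer_2_5)

print("repository patching gain:", patching_gain)
print("terminal gain:", terminal_gain)
print("implied third-family score:", implied_before, "->", implied_after)
```

Under that assumption, nearly all of the 14-point index gain comes from repository patching, which is the same conclusion the reported sub-scores point to.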

Room for disagreement: Benchmark cost is not the same as production cost. Subscription limits, retries, human supervision, environment startup, and failed runs still matter. The result is still strong because it compares the same agent-benchmark mix and shows a real Pareto move: not highest score, but far lower cost near the top of the table.
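
One way to reason about that gap is to convert cost per attempt into expected cost per completed task. The sketch below is a simplification: it assumes each retry is independent with the same success rate and that every attempt pays the same overhead for supervision and environment startup. The success rates and overhead figure are hypothetical; only the $0.07 and $4.10 attempt costs come from the benchmark numbers above.

```python
def cost_per_success(attempt_cost: float, success_rate: float,
                     overhead: float = 0.0) -> float:
    """Expected spend per completed task when failed attempts are retried.

    With independent attempts, the expected number of tries is
    1 / success_rate, and each try pays attempt_cost + overhead.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (attempt_cost + overhead) / success_rate


# Hypothetical success rates and a hypothetical $0.10 overhead per attempt.
cheap = cost_per_success(0.07, success_rate=0.5, overhead=0.10)
pricey = cost_per_success(4.10, success_rate=0.6, overhead=0.10)
print(round(cheap, 2), round(pricey, 2))
```

The design point is that fixed per-attempt overhead and failure rates compress the headline ratio: in this illustration a roughly 59x gap in raw attempt cost shrinks to about 21x per completed task.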

Command A+ Tests Open Efficiency

Cohere's new model is not trying to win the "largest open model" headline. It is trying to make a large model look deployable without hyperscaler hardware.

Command A+, released May 20, is an Apache 2.0 open-weight mixture-of-experts model. A mixture-of-experts model has many parameter groups but activates only some for each token; here Cohere reports 218 billion total parameters and 25 billion active parameters. The model supports text, image input, tool use, reasoning output, 48 languages, a 128,000-token input context, and up to 64,000 generated tokens. Cohere says it supports vLLM and Transformers and can run at W4A4 quantization on one B200 or two H100 GPUs.
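
The hardware claim can be sanity-checked with weight-memory arithmetic. This is a rough sketch, not Cohere's sizing method: it counts weights only, ignores KV cache, activations, and quantization scale metadata, and assumes nominal capacities of 80 GB per H100 and 192 GB per B200. BF16 and FP8 are included for comparison. Note that memory scales with the 218 billion total parameters, not the 25 billion active ones, because every expert has to be resident even if only some run per token.

```python
TOTAL_PARAMS = 218e9  # total parameters reported by Cohere

BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "W4A4": 4}

# Nominal accelerator memory in GB (assumed; usable memory is lower).
GPU_CONFIGS_GB = {"1x B200": 192, "2x H100": 2 * 80}


def weight_gb(total_params: float, bits: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * bits / 8 / 1e9


def configs_that_fit(gb: float) -> list[str]:
    """Configurations whose nominal memory exceeds the weight footprint."""
    return [name for name, cap in GPU_CONFIGS_GB.items() if gb < cap]


for precision, bits in BITS_PER_WEIGHT.items():
    gb = weight_gb(TOTAL_PARAMS, bits)
    print(precision, round(gb), "GB weights, fits:", configs_that_fit(gb))
```

On these assumptions only the 4-bit footprint, about 109 GB, leaves headroom on either configuration, which is consistent with Cohere tying the one-B200 or two-H100 claim to W4A4.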

The deployment details are unusually concrete. The Hugging Face model card includes BF16, FP8, and W4A4 variants, minimum hardware guidance, vLLM and SGLang launch commands, and a warning that W4A4, shorthand for 4-bit weights and 4-bit activations, requires vLLM 0.21.0 or newer plus Cohere's parsing library for accurate tool and reasoning handling. Cohere's docs list command-a-plus-05-2026 as available through Chat endpoints and Model Vault. The benchmark picture is more mixed: Artificial Analysis places Command A+ at 37 on its Intelligence Index, roughly in Claude 4.5 Haiku territory, with strong speed at about 281 output tokens per second on Cohere's API but weak hard-coding scores: Terminal-Bench Hard around 25% and SciCode around 38%.

Why it matters: Open weights used to imply a simple trade: more control, less frontier capability. Command A+ complicates that in a useful way. Its strongest claim is not that it beats the frontier. It is that a regulated or infrastructure-constrained team can get tool use, multimodal document handling, multilingual coverage, and low-hallucination behavior into an owned deployment envelope. The important concept is the hardware envelope: the number and type of accelerators required before a model is realistic outside a giant shared cloud service. A two-H100 setup is still expensive, but it is a different planning object from an eight-GPU frontier serving stack.

The model also exposes a second axis of quality. Artificial Analysis says Command A+ ranks first on its non-hallucination measure at 86%, while its accuracy is only 9% in that same benchmark family. In plain English, it often knows when not to guess. That profile is less exciting for autonomous coding than for enterprise retrieval and document workflows where a confident fabrication is worse than a refusal or citation-heavy answer. Cohere's own North-derived evaluations point in that direction too: agentic question answering, spreadsheet analysis, and memory use improved over Command A Reasoning.

Room for disagreement: This is not an obvious default for coding agents. The hard-coding benchmark numbers are weak next to frontier closed models, and "sovereign AI" can become marketing fog when a model still needs high-end Nvidia hardware. The practical read is narrower: Command A+ is a credible open-weight option for controlled enterprise agent workloads, not a broad replacement for Opus, GPT-5.5, or Gemini in code.

The Contrarian Take

Everyone says: AI models are converging, so the only interesting questions are who tops the leaderboard and who cuts token prices.

Here's why that's wrong, or at least incomplete: The relevant competition is shifting from raw model quality to costed control. Composer 2.5 is cheap because Cursor can tune the model to its own agent harness and keep distribution inside the IDE. Command A+ is interesting because Cohere is making the deployment envelope explicit: quantization, vLLM, SGLang, hardware minimums, and API availability. The durable advantage is not "better model" in isolation. It is model plus route to repeated work at a predictable cost.

Under the Radar

Datasette Agent turns SQL into an agent surface. Simon Willison's Datasette Agent gives Datasette a conversational interface over local data, ships plugins for charts, image generation, and persistent Fly Sprites sandboxes, and already runs against local models through LM Studio. The missed angle is that small data apps are getting agent surfaces without waiting for an enterprise platform.
KVBoost brings production-style cache reuse to Hugging Face. KVBoost is a new MIT-licensed library and PyPI package for chunk-level key-value cache reuse, OpenAI-compatible serving, FlashAttention-2, AWQ layer streaming, and cache telemetry around Hugging Face models. The caution is in the HN thread: the biggest speedup claims need comparison against LMCache/vLLM paths before this graduates from promising library to infrastructure answer.

Quick Takes

Codex made goals and permissions more inspectable. Codex 0.133.0 enables goals by default, improves codex remote-control, adds permission-profile list APIs, inheritance, managed requirements.toml, and runtime refresh behavior, and gives extensions more lifecycle events. The practical signal is that Codex is turning state, permission, and extension hooks into first-class runtime objects. (Source)
Cline patched the boring failure paths. Cline CLI v3.0.10 adds file:// plugin installs, keeps idle or approval-waiting sessions alive, notifies connectors when scheduled execution fails, preserves SDK output-token limits, and caches global settings reads by mtime. None of that is a launch narrative; all of it is runtime quality for long-lived local agents. (Source)
LangGraph shipped a small streaming boundary fix. LangGraph 1.2.1 adds an opt-in before_builtins path for stream transformers and keeps tool results out of v3 messages. This is a minor release, but the direction fits the broader pattern: framework value is moving into typed streams, message boundaries, and recovery semantics, not agent slogans. (Source)

The Thread

Today's thread is costed control. Cursor is using post-training and closed distribution to make a cheaper coding agent competitive near the top of an independent benchmark. Cohere is using open weights, quantization, and concrete serving recipes to make a large enterprise model deployable in smaller hardware envelopes. Datasette Agent and KVBoost show the same idea at smaller scale: bring the agent or cache closer to the workload instead of assuming a frontier API will absorb every constraint. The next model selector will not be a leaderboard. It will be a budget, a harness, and a failure model.

Predictions

New predictions:

I predict: By 2026-09-30, at least one coding-agent vendor besides Cursor will ship an in-house or heavily post-trained coding model with public benchmark results and separate model-specific pricing. (Confidence: medium; Check by: 2026-09-30)
I predict: By 2026-08-31, at least two open-weight models above 100B total parameters will publish official FP8 or W4A4 deployment recipes that target two H100s or one B200 with vLLM or SGLang commands. (Confidence: medium; Check by: 2026-08-31)

Coming Next Week

Next week, we are going deeper on cost-per-task benchmarks: when they are useful, where they hide failure cost, and why prompt-cache behavior may become the quietest model-selection variable.

Generated: 2026-05-22 03:42 EDT