The Harness Starts Judging
7 stories · ~7 min read

If You Only Read One Thing
A coding agent's most important output may soon be a yes-or-no decision before anything reaches the shell. "Cursor Adds a Policy Stack" turns tool approval into an execution checkpoint, while "AgentDoG Makes Guardrails Small" shows that checkpoint can be benchmarked on full trajectories. Start with Cursor's Auto-review release because it makes inline judging a product surface.
Cursor Adds a Policy Stack
Cursor's latest release is easy to undersell as a convenience feature. It is closer to an admission that the approval prompt has become the bottleneck in coding-agent work.
Cursor 3.6 added Auto-review Run Mode on May 29. The mode applies to Shell, MCP, and Fetch tool calls. Instead of asking the user on every ambiguous action, the runtime checks an allowlist, then tries a sandbox, then sends remaining calls to a classifier subagent that decides whether to allow the call, try a different path, or ask for approval. In the release discussion, Cursor adds the important caveat: the classifier is non-deterministic, can fail both ways, and should be treated as convenience rather than a security boundary.
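The ordering described above (allowlist, then sandbox, then classifier, with the human as the fallback) can be sketched in a few lines. This is a minimal illustration, not Cursor's implementation: `ToolCall`, `Outcome`, `review`, and the callback signatures are all hypothetical names chosen for the sketch, and the classifier is a stand-in function rather than a model call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    RETRY_DIFFERENTLY = "retry_differently"
    ASK_HUMAN = "ask_human"


@dataclass(frozen=True)
class ToolCall:
    tool: str      # hypothetical labels, e.g. "shell", "mcp", "fetch"
    command: str


@dataclass(frozen=True)
class Outcome:
    decision: Decision
    decided_by: str  # which layer settled it: allowlist, sandbox, or classifier


def review(
    call: ToolCall,
    allowlist: set[str],
    can_sandbox: Callable[[ToolCall], bool],
    classify: Callable[[ToolCall], Decision],
) -> Outcome:
    # Layer 1: deterministic allowlist (exact match here, an assumption).
    if call.command in allowlist:
        return Outcome(Decision.ALLOW, "allowlist")
    # Layer 2: if the call can be contained in a sandbox, run it there.
    if can_sandbox(call):
        return Outcome(Decision.ALLOW, "sandbox")
    # Layer 3: the classifier may allow, redirect, or escalate to a human.
    return Outcome(classify(call), "classifier")


if __name__ == "__main__":
    allow = {"git status", "npm test"}
    outcome = review(
        ToolCall("fetch", "https://example.com/data.json"),
        allow,
        can_sandbox=lambda c: c.tool == "shell",
        classify=lambda c: Decision.ASK_HUMAN,  # stand-in for a model call
    )
    print(outcome)
```

Recording `decided_by` next to the decision is a design choice of the sketch: it keeps the deterministic layers distinguishable from the non-deterministic one, which matters if the classifier is treated as convenience rather than a security boundary.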
Why it matters: The old approval model treated human attention as the scarce safety primitive. That works for a short edit loop, but it breaks when agents run scheduled loops, fetch external state, call MCP tools, and keep working after the developer looks away. Cursor is turning approval into a policy stack: deterministic allowlists first, OS-level sandboxing second, classifier judgment third, and human review last. That order matters because it makes the model's freedom conditional on runtime mediation instead of prompt trust. The structural shift is that the IDE is becoming a control plane for delegated work, not merely a place where model output appears. The strongest evidence is not the existence of an LLM classifier; it is the repo-level .cursor/permissions.json example in the forum thread, where users can write allow and block instructions for auto-run behavior. If that pattern sticks, the unit of agent governance becomes a versioned project file, much like CI configuration or dependency policy.
Room for disagreement: A classifier that approves commands is still a model judging another model's proposed action. That is weaker than a formal policy engine, and Cursor says so directly. The useful reading is not "Cursor solved agent safety"; it is "Cursor made the approval bottleneck explicit enough to productize."
What to watch: The immediate product question is whether teams can make repo policies mandatory across workspaces or whether each developer can tune the classifier locally. Central enforcement would make Auto-review an enterprise control surface; local-only tuning would keep it a personal productivity setting.
AgentDoG Makes Guardrails Small
The safety story in agent systems has usually been too vague: add a guardrail, hope it catches bad behavior, and keep the real enforcement logic somewhere between prompts, product settings, and human review. AgentDoG 1.5 is interesting because it pushes in the opposite direction: make the guardrail small enough to run next to the agent, and evaluate it on full trajectories rather than final text.
AgentDoG 1.5 was published May 28 and released with open models, datasets, a project page, and a Hugging Face collection. The framework updates its agent safety taxonomy for Codex-style repository execution and OpenClaw-style stateful tool use. It reports a family of 0.8B, 2B, 4B, and 8B checkpoints, trained from roughly 1,000 taxonomy-guided samples. The project page says its training pipeline scales to more than 10,000 concurrent agentic environments on an 8-core machine, and its ATBench family includes 1,000 trajectories, 2,084 available tools, 1,954 unique invoked tools, and about nine turns per trajectory.
Why it matters: Ordinary content moderation looks at the answer; agent safety has to inspect the path. A coding agent can do harm through a shell command, file rewrite, credential fetch, or network call long before the final response becomes obviously unsafe. AgentDoG's practical bet is that trajectory-level diagnosis can be delegated to a small specialist model rather than another expensive frontier call. The reported results support the shape of that bet: its unified Qwen3.5-4B checkpoint posts 78.4% ATBench accuracy and 77.7 F1, ahead of GPT-5.4's 73.7% accuracy and about level with its 76.7 F1 on the same table, while the model collection packages multiple checkpoints for coarse and fine-grained diagnosis. That does not make it a finished enforcement system. It does make guardrails look less like a compliance slogan and more like a deployable runtime component with measurable latency, cost, and recall tradeoffs.
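Accuracy and F1 are reported side by side because they can move apart when safe and unsafe trajectories are unevenly mixed. A minimal sketch of how both fall out of trajectory-level verdicts, assuming "unsafe" is the positive class and using made-up labels (not ATBench data); the function name `score` is hypothetical:

```python
def score(verdicts: list[tuple[bool, bool]]) -> dict[str, float]:
    """Each item is (predicted_unsafe, actually_unsafe) for one full trajectory."""
    tp = sum(1 for pred, actual in verdicts if pred and actual)
    fp = sum(1 for pred, actual in verdicts if pred and not actual)
    tn = sum(1 for pred, actual in verdicts if not pred and not actual)
    fn = sum(1 for pred, actual in verdicts if not pred and actual)

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical verdicts: 3 caught, 1 false alarm, 5 correctly passed, 1 missed.
    toy = (
        [(True, True)] * 3
        + [(True, False)] * 1
        + [(False, False)] * 5
        + [(False, True)] * 1
    )
    print(score(toy))
```

On this toy set accuracy comes out at 0.80 and F1 at 0.75: the single missed unsafe trajectory costs more in F1 than in accuracy, which is why false negatives deserve their own line in any guardrail report.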
Room for disagreement: The benchmark is produced by the same team, and trajectory safety benchmarks can overfit to their own taxonomy. A guardrail that labels risk also has to be wired into a runtime that can block, pause, or redirect the agent. The evidence that would settle the question is production trace data showing low false negatives without making agents refuse routine tool use.
What to watch: Agent safety will bifurcate between monitors and enforcers. AgentDoG is strongest as a monitor today; the next step is whether coding-agent runtimes expose a clean interception point where a small guard model can stop an action before the tool call commits.
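What a "clean interception point" could look like, as a minimal sketch: the runtime hands the guard the trajectory so far plus the proposed step, and only executes if the guard agrees. Everything here is hypothetical (`GuardedRuntime`, `Step`, `Blocked`, and the toy rule standing in for a guard model); it is not AgentDoG's or any runtime's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Step:
    tool: str
    arguments: str


class Blocked(Exception):
    """Raised when the guard refuses a proposed tool call."""


@dataclass
class GuardedRuntime:
    guard: Callable[[list[Step], Step], bool]  # True means "safe to run"
    execute: Callable[[Step], str]
    trajectory: list[Step] = field(default_factory=list)

    def call(self, step: Step) -> str:
        # The guard runs before the tool has any side effects and sees
        # the full history, not just the single proposed action.
        if not self.guard(self.trajectory, step):
            raise Blocked(f"guard refused {step.tool}: {step.arguments}")
        result = self.execute(step)
        self.trajectory.append(step)
        return result


def toy_guard(history: list[Step], proposed: Step) -> bool:
    # Stand-in for a small guard model: refuse a network call once the
    # trajectory has already read a secrets file.
    read_secret = any(
        s.tool == "read_file" and ".env" in s.arguments for s in history
    )
    return not (read_secret and proposed.tool == "fetch")


if __name__ == "__main__":
    runtime = GuardedRuntime(guard=toy_guard, execute=lambda s: f"ran {s.tool}")
    print(runtime.call(Step("read_file", ".env")))
    try:
        runtime.call(Step("fetch", "https://example.com/upload"))
    except Blocked as err:
        print("blocked:", err)
```

The fetch on its own would pass; it is refused only because of what came earlier in the trajectory. That is the difference between a monitor that labels finished traces and an enforcer that sits in the call path.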
The Contrarian Take
Everyone says: The next phase of coding agents is about giving them more autonomy and reducing approval prompts.
Here's why that's incomplete: The real movement is toward more review, just less review by the human sitting at the keyboard. Cursor adds allowlists, sandbox checks, classifier decisions, and repo permissions. AgentDoG adds small trajectory-level judges for tool-use risk. CodeGraph, below, tries to reduce exploratory tool calls before they happen. The winning agent runtimes will not be the ones that ask least often; they will be the ones that can prove why they did not ask.
Under the Radar
- CodeGraph treats code search as cached infrastructure: CodeGraph is a local pre-indexed knowledge graph for Claude Code, Cursor, Codex, Gemini, OpenCode, and related agents. Its self-reported benchmark across seven repositories claims 25% lower cost, 57% fewer tokens, 23% faster answers, and 62% fewer tool calls, revalidated on Opus 4.8 on May 29. Even if those numbers compress under independent testing, the direction is right: codebase discovery is becoming state, not repeated grep.
- PilotDeck is making memory inspectable: OpenBMB open-sourced PilotDeck on May 28 as a task-oriented agent platform with workspace isolation, traceable memory, smart routing, and always-on execution. The mainstream frame would be "another agent OS." The more useful read is that Chinese agent tooling is converging on the same production primitives as Western coding agents: bounded workspaces, editable memory, model routing, and background work as first-class state.
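To make the CodeGraph item's "state, not repeated grep" concrete, here is a toy version of the idea: parse the code once, keep a symbol table, and answer "where is X defined?" with a dictionary lookup instead of another search pass. This is an illustration built on Python's standard `ast` module, not CodeGraph's implementation; `build_symbol_index` is a hypothetical helper, and a real graph would also track calls, imports, and invalidation on edits.

```python
import ast
from collections import defaultdict


def build_symbol_index(sources: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each function or class name to the (path, line) pairs defining it."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name].append((path, node.lineno))
    return dict(index)


if __name__ == "__main__":
    # Hypothetical two-file repository held in memory for the example.
    repo = {
        "cache.py": "class Cache:\n    def get(self, key):\n        return None\n",
        "loader.py": "def load(path):\n    return open(path).read()\n",
    }
    index = build_symbol_index(repo)  # paid once
    print(index["Cache"])             # later lookups are dictionary reads
    print(index["load"])
```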
Quick Takes
- Claude Code put Auto mode into enterprise cloud lanes. Claude Code 2.1.158 makes Auto mode available on Bedrock, Vertex, and Foundry for Opus 4.7 and Opus 4.8 behind CLAUDE_CODE_ENABLE_AUTO_MODE=1. The technical signal is not the flag; it is that delegated approval modes now have to travel through cloud procurement and compliance surfaces, not only local terminals. (Source)
- Aider's leaderboard keeps rewarding model routing. The current Aider table shows an o3 architect configuration with GPT-4.1 as editor at 78.2% for $17.55, while cheaper single-model rows sit lower. The point is not that one row is definitive; it is that coding evals increasingly measure harness composition, edit format, and cost, not just model IQ. (Source)
- SGLang shows why RL serving is about weight movement. SGLang's P2P weight-transfer writeup reports moving 1T-parameter Kimi-K2 updates from 53 seconds to 7.2 seconds using RDMA and a source-side CPU engine replica. That sits next to yesterday's vLLM RL story: open serving stacks are competing on how quickly training and rollout can exchange state. (Source)
The Thread
Today's throughline is not "agents are getting smarter." It is that agent runtimes are absorbing the work humans used to do around agents: deciding when a command is safe, keeping codebase context warm, judging whether a tool trajectory is risky, and routing easier work away from expensive models. The model still matters, but the boundary around the model is becoming the product.
Predictions
New predictions:
- I predict: By July 31, 2026, at least two major coding-agent runtimes will expose exportable per-tool decision logs that distinguish allowlist, sandbox, classifier, and human-approval outcomes. (Confidence: medium; Check by: 2026-07-31)
Generated: 2026-05-30 03:38 ET