Daily AI Briefing — April 26, 2026
6 stories · ~7 min read

Opener
The two AI stories that mattered this weekend pulled in opposite directions and made the same point. OpenAI publicly retired SWE-bench Verified, the benchmark it spent 20 months legitimizing, and a 23-year-old non-mathematician used GPT-5.4 Pro to clear a 60-year-old Erdős conjecture. The frontier evaluation stack and the frontier capability stack are drifting apart, and both deep dives sit on that gap.
OpenAI Just Killed SWE-bench Verified. The Number Was Always a Lie.
The strangest concession in OpenAI's writeup is the dataset audit. Mia Glaese and Olivia Watkins's Frontier Evals team took the 27.6% of SWE-bench Verified problems that frontier models reliably failed (roughly 138 of the 500 tasks) and looked at why. At least 59.4% had broken test cases: 49 tests were "too narrowly defined" and rejected functionally correct solutions; 26 required extra features the problem statement never mentioned. More than half of the supposedly hardest problems on the gold-standard coding benchmark were unsolvable not because the models hit a ceiling but because the test was wrong.
That came alongside a contamination disclosure. OpenAI reports that GPT-5.2's chain of thought showed the model reusing knowledge about specific patch arguments not in the prompt: memorizing solutions from training data. All frontier models could reproduce the human-written reference fix verbatim from task IDs alone. Scores crept from 74.9% to 80.9% over six months, a curve the field read as saturation and OpenAI now reads as the contamination floor. The replacement is SWE-bench Pro, Scale AI's 1,865-task benchmark with held-out commercial codebases and 1-4 hour task horizons. The best public score on Pro is 46% (Claude Opus 4.5 on the standardized SEAL board) versus 81% on Verified. The 35-point gap is the contamination tax.
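The verbatim-reproduction test is worth making concrete, because it is the cheapest contamination probe anyone can run on a benchmark they rely on. A minimal sketch, assuming a hypothetical `generate` callable and task data; this illustrates the probe described above, not OpenAI's actual harness:

```python
# Memorization probe: prompt with nothing but the task ID and check for
# verbatim reproduction of the human-written reference patch.
# `generate` and the (task_id, reference_patch) pairs are placeholders.
def contamination_rate(generate, tasks: list[tuple[str, str]]) -> float:
    """tasks: list of (task_id, reference_patch) pairs."""
    hits = 0
    for task_id, reference_patch in tasks:
        # Deliberately withhold the problem statement: the ID alone
        # carries no information about the fix unless it was memorized.
        completion = generate(f"Write the fix for SWE-bench task {task_id}.")
        if reference_patch.strip() in completion:
            hits += 1
    return hits / len(tasks)
```

A nonzero rate is the smoking gun; a high rate means the headline score is partly a memory test.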
Why it matters: This is a structural break in how the field communicates progress, not a benchmark swap. Since SWE-bench Verified shipped in August 2024, the 500-task harness has been the single number every frontier lab quoted in launch posts: Anthropic, Google, OpenAI, DeepSeek, Alibaba all reported it. The April 21 Kimi Vendor Verifier dispatch flagged a 37% lab-vs-deployment gap on serving infrastructure; today's announcement extends the same diagnosis to the benchmark itself. The constraint that just tightened: any score quoted on a public coding eval older than ~12 months is now presumed contaminated unless the lab explicitly says otherwise. The constraint that loosened: scaffolding work, which Scale's data shows can swing Opus 4.5's score by 5-10 points, now competes for real attention because the model-level signal got cleaner. The real signal to watch is whether Anthropic, Google, and DeepSeek follow OpenAI off the benchmark within 30 days. (Latent Space's "End of SWE-Bench Verified" has Glaese and Watkins on the technical detail.)
Room for disagreement: Vendors will keep reporting Verified because the press won't care about contamination for another cycle. The pessimist read is that Pro will rot the same way: train on it once, and the next public score becomes a memorization measure. Scale's commercial held-out half is the real contamination defense, and held-out benchmarks have a poor history of staying held out.
What to watch: Whether Anthropic publishes a Claude release between now and end-of-May without quoting SWE-bench Verified. Anthropic has been the most disciplined about evaluation hygiene historically; if it leads with Pro and a held-out internal eval, the rest of the field follows within a quarter.
A 23-Year-Old, GPT-5.4 Pro, and a 60-Year-Old Erdős Problem
Liam Price has no graduate mathematics training. He plugged Erdős Problem #1196, a 60-year-old conjecture about primitive sets (collections of integers in which no number divides another), into a GPT-5.4 Pro prompt, got back a sketch, and brought it to Kevin Barreto, a second-year math undergraduate at Cambridge, who tightened it. Terence Tao and Jared Lichtman, the latter of whom proved Erdős's main 1986 primitive-set conjecture in his 2022 doctorate, then refined the proof for publication. Tao said the AI "took an entirely different route," using a formula from a different subfield that "no one had thought to apply to this type of question." Lichtman called it "the first Book Proof from AI," referencing Erdős's notion of "The Book" of maximally elegant proofs.
The history this displaces is specific. Erdős posed thousands of low-cost research questions designed as on-ramps for graduate students; Tao's erdosproblems.com catalogues 1,500+ of them. The previous baseline for AI on this catalogue was Erdős #281, solved with GPT-5.2 Pro earlier in April. #1196 is a step harder, and Lichtman has been the field's primary worker on adjacent problems for half a decade. The novel ingredient was redirection between subfields, not raw computation; the proof, once cleaned up, is short.
Why it matters: This is the first credible data point that frontier reasoning models can produce mathematical insight specialists missed, not just retrieve known results. Read against April 23's Jena study, which showed agentic LLMs ignored evidence in 68% of scientific reasoning traces, the contrast sharpens. Structured conjecture-checking, where every step is symbolic and verifiable, is exactly the regime where AI reasoning can be cheaply audited by an expert. The constraint that just tightened is talent-cost: a reservoir of accessible-but-unworked Erdős problems, designed as PhD seed projects, is suddenly priced as commodity output. The mechanism that confirmed the signal is the audit itself. Tao and Lichtman shortened the raw GPT-5.4 output substantially before publication, which means the model produced raw insight that survived expert compression. Falsification is straightforward: if no further #1000-tier Erdős problem falls to a frontier model in the next 90 days, this is one lucky pull.
Room for disagreement: Thomas Bloom, who curates erdosproblems.com, pushed back on the "novel proof" framing. He says GPT-5.4 found references to results in adjacent literature he was unaware of, which makes the AI's value "literature searching ability," not theorem-proving. A separate commenter noted Lichtman is involved in an AI startup, giving him a soft incentive to frame this as a breakthrough. The honest read sits between Tao's "new way to think about large numbers" and Bloom's "very good search engine"; what counts as a contribution in math is genuinely contested when the LLM is doing connective tissue.
What to watch: Whether DeepMind's AlphaProof team or Tao himself publishes a formal evaluation harness for the Erdős catalogue within 60 days. The right test is not "did GPT solve a problem" but "what fraction of the 1,500-problem corpus falls per model generation." That number, plotted over time, is the actual capability curve.
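For concreteness, the corpus-level metric would look something like the sketch below. Everything in it is assumed: there is no public machine-readable feed of the catalogue, `attempt_proof` stands in for a model call plus model-independent verification (expert review or a Lean check), and that verification step is the genuinely hard part.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    id: int          # e.g. 1196
    statement: str   # the conjecture, formal or semi-formal

def attempt_proof(model: str, problem: Problem) -> bool:
    """Placeholder: generate a candidate proof, then verify it
    independently of the model; otherwise the harness only measures
    the model's agreement with itself."""
    raise NotImplementedError

def solve_fraction(model: str, catalogue: list[Problem]) -> float:
    # The number the deep dive asks for: fraction of the corpus
    # that falls to a given model generation.
    solved = sum(attempt_proof(model, p) for p in catalogue)
    return solved / len(catalogue)

# The capability curve is one point per model generation, e.g.:
# {m: solve_fraction(m, CATALOGUE) for m in ("gpt-5.2-pro", "gpt-5.4-pro")}
```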
The Contrarian Take
Everyone says: SWE-bench Verified is being retired because models saturated it — coding ability has finally outrun the test, the same way ImageNet was outrun a decade ago.
Here's why that's wrong (or at least incomplete): The 80.9% ceiling on Verified isn't a capability ceiling; it's a test-quality floor. OpenAI's audit shows the residual ~20% gap is mostly broken tests, not unsolved problems. The right comparison is not ImageNet (where models genuinely exceeded human-level classification) but Winograd schemas, which were declared "solved" while everyone quietly knew the dataset had failure modes that didn't reflect language understanding. Treating Verified as saturated implies models can do everything it tests; what actually happened is the test stopped being able to tell apart the models that can from the models that can't. SWE-bench Pro at 46% is the first honest reading of frontier coding capability since the original SWE-bench paper. Any narrative using the Verified retirement to argue "AI coding has plateaued" is reading the wrong number. The ceiling we hit is the measurement ceiling.
Under the Radar
- The Recurrent Transformer (Harvard, Sham Kakade et al.). A new arXiv preprint proposes one tweak: each layer attends to key-value pairs from its own activations, giving the model unbounded effective depth without the optimization fragility of true RNNs. An O(N log N) memory-traffic algorithm replaces the standard O(N²), with cross-entropy parity to deeper Transformers at 150M and 300M parameters. Architectural details like this only matter if a major lab adopts them; Kakade's track record makes that more than zero probability. (A toy sketch of the mechanism follows this list.)
- Vista4D from Eyeline Labs takes CVPR 2026 Highlight for video reshooting. The paper grounds an input video in a 4D point cloud, then re-renders it from arbitrary camera trajectories. It's the first general-purpose method that handles real dynamic scenes without per-scene training. The capability that didn't exist before is rotating the camera around an existing video without rebuilding the scene.
- XPeng Robotics' UniT pushes a unified physical language for humanoid policy learning. The paper frames human and humanoid body data in a shared vocabulary so observation and action transfer across embodiments. It is quietly the most ambitious humanoid stack right now because XPeng is the only company already shipping product to consumers in volume.
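As promised above, a toy PyTorch reading of the Recurrent Transformer's central move. This is our illustrative guess from the one-line description, not the paper's code: one layer is applied repeatedly, and each pass attends over the input plus the layer's own accumulated outputs, so effective depth grows without new parameters. It implements the naive O(N²) form, not the O(N log N) memory-traffic algorithm.

```python
import torch
import torch.nn as nn

class SelfReferentialLayer(nn.Module):
    """One layer whose KV set includes its own past activations."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, loops: int = 4) -> torch.Tensor:
        memory = x      # KV pool: the input plus everything produced so far
        h = x
        for _ in range(loops):
            h = h + self.attn(self.norm1(h), memory, memory,
                              need_weights=False)[0]
            h = h + self.mlp(self.norm2(h))
            memory = torch.cat([memory, h], dim=1)  # feed activations back in
        return h

# y = SelfReferentialLayer(512)(torch.randn(2, 128, 512))  # -> (2, 128, 512)
```

Raising `loops` is the "unbounded effective depth" knob; the parameter count never changes.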
Quick Takes
- Andon Market, the AI-run retail boutique, has lost $13,000 in two weeks. Andon Labs handed Claude Sonnet 4.6 a debit card and a Cow Hollow lease; agent "Luna" hired contractors, picked inventory, and ordered 1,000 toilet seat covers for the employee bathroom before listing them as merchandise. The implication for agentic deployment is that the bottleneck isn't reasoning, it's ordinary common-sense adjudication of edge cases that no benchmark scores. The loss rate is the price of confirming current models execute confidently far past their plausibility envelope. (Source)
- Quotient-Space Diffusion Models picked up an ICLR 2026 Oral. The 40-page paper reformulates diffusion training on the quotient space induced by symmetry groups (rotation, permutation, scale), producing cleaner samples on physics-simulation and molecular-design benchmarks. It's the first concrete bridge between geometric deep learning and the diffusion stack used in production image and video systems. (A toy sketch of the quotient idea follows these takes.) (Source)
- GPT-5.2 Pro also cracked Erdős #281 earlier this month, a separate and less-publicized result. Two corpus-tier conjectures from the Erdős catalogue have now fallen to frontier models in three weeks. The pattern, not the individual proofs, is the story: 1,500 graduate-tier on-ramps are being commoditized in real time. (Source)
- Hyperloop Transformers (anonymous, arXiv). A submission under double-blind review explores recurrent-feedback transformer variants for sequence modeling. It claims throughput comparable to Mamba with fewer parameters. The architecture lineage (recurrent-depth, looped-block, hybrid-state) has now produced four distinct papers in 30 days, suggesting the post-Transformer landscape is fragmenting along an axis that didn't exist a year ago. (Source)
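The quotient-space sketch flagged above: the simplest honest way to train "on the quotient" is to map every sample to a canonical representative of its symmetry orbit before the ordinary DDPM loss, so the network never sees redundant copies of the same object. The sort-based permutation canonicalizer, the point-cloud data layout, and the `model(xt, t)` signature are all illustrative assumptions, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def canonicalize(x: torch.Tensor) -> torch.Tensor:
    """Pick one representative per permutation orbit of a point set.

    x: (batch, n_points, dim). Sorting points by their first coordinate
    makes every permutation of the same set identical, which is a (toy)
    quotient by the permutation group."""
    order = x[..., 0].argsort(dim=-1)
    return torch.gather(x, 1, order.unsqueeze(-1).expand_as(x))

def ddpm_loss(model, x0: torch.Tensor, alphas_bar: torch.Tensor):
    x0 = canonicalize(x0)                        # train on the quotient
    t = torch.randint(len(alphas_bar), (x0.shape[0],))
    a = alphas_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps    # standard forward process
    return F.mse_loss(model(xt, t), eps)         # model predicts the noise
```

Rotation and scale would use an alignment or normalization step in place of the sort; the loss itself is unchanged.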
Stories We're Watching
- The Inference Trust Gap (Day 6). April 21's dispatch named the Kimi Vendor Verifier and the 37% lab-vs-deployment delta. SWE-bench Verified's retirement extends the same problem one layer up: not just serving infrastructure misreporting, but the headline metric itself failing to track capability. Contamination is now a publicly disclosed phenomenon at a frontier lab, not a research-paper allegation.
- Post-Transformer Architecture (Day 9). April 23's deep dive on Qwen3.6-27B framed Gated DeltaNet as the live alternative to MoE scaling. The Recurrent Transformer (Kakade) and Hyperloop are the same intellectual move from a different angle: depth vs. width vs. recurrence. Harvard, not Alibaba, is now publishing in this lane, broadening the architectural search beyond the Qwen lineage.
The Thread
Today's two deep dives refract one story through opposite ends of the evaluation pipeline. SWE-bench Verified is being retired because its numbers got too good to be honest; the Erdős primitive-set proof matters because there is no benchmark for novel mathematical insight, and one happened anyway. Together they say what the field has been working hard not to say out loud: public benchmarks no longer track what matters at the frontier, and what matters is getting harder to instrument. An honest AI lab quarterly update is starting to look less like a benchmark sweep and more like a portfolio of capability anecdotes (Erdős proofs, agent retail experiments, autonomous vulnerability discovery) held together by an internal eval set the public will never see. That is closer to how pharma reads progress than how AI used to.
Predictions
New predictions:
- I predict: At least two of {Anthropic, Google DeepMind, DeepSeek, Alibaba} will drop SWE-bench Verified from their next major model release notes and lead with SWE-bench Pro or a held-out internal eval. (Confidence: high; Check by: 2026-06-26)
- I predict: A frontier lab (DeepMind, OpenAI, or Anthropic) ships a formal Erdős-corpus benchmark covering ≥100 problems with automated evaluation by Q3 2026. (Confidence: medium; Check by: 2026-09-26)
Generated 2026-04-26 12:15 ET.
Tomorrow morning in your inbox.
Subscribe for free. 10-minute read, every weekday.