AI Intelligence

Three Parts Linear, One Part Full: The Transformer Monopoly Cracks

5 stories · ~9 min read

The One Thing: A robotics model just learned to use an air fryer it had never seen by remixing skills from laundry folding and espresso making. If that sounds like how LLMs compose concepts they were never explicitly taught — that's exactly the point, and exactly the bet a $1 billion startup is racing to prove wrong.

If You Only Read One Thing: Physical Intelligence's π0.7 blog post — the clearest evidence yet that robotic foundation models are approaching a compositional generalization threshold. Free, technical, and worth the 10 minutes.

TL;DR: Alibaba's Qwen3.6-35B-A3B ships a hybrid architecture — three linear attention layers for every one full attention layer — that matches dense models at 10x its active parameter count on agentic coding tasks. Physical Intelligence's π0.7 demonstrates compositional generalization in robotics, combining skills from unrelated tasks to solve problems it was never trained on. The transformer monopoly is fracturing from below, and the robotics scaling curve just bent upward.


The 3:1 Ratio That Ends the Pure Transformer Era

Every major efficient model released in the past three months has made the same architectural bet, and almost nobody outside the ML systems community has noticed.

Alibaba's Qwen team released Qwen3.6-35B-A3B today — a 35-billion-parameter mixture-of-experts (MoE) vision-language model that activates only 3 billion parameters per forward pass. The benchmark numbers are striking: 73.4% on SWE-bench Verified, 51.5 on Terminal-Bench 2.0 (beating Gemma 4's 42.9), and a 43% jump on QwenWebBench over its predecessor. These results compete with — and often beat — dense models that are 10x its active size, including Qwen's own 27B dense model. Apache 2.0 license.

The architecture is what matters. Qwen3.6-35B-A3B uses a 3:1 hybrid layout: for every three blocks using Gated DeltaNet (a linear attention variant whose compute and memory grow linearly with context length rather than quadratically), there is one block using traditional full softmax attention. The model's 40 layers are organized as 10 repeating groups of this 3:1 pattern, with 256 MoE experts (8 routed plus 1 shared per forward pass).
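The layout described above is simple enough to write down. A minimal sketch of the 3:1 interleaving, with illustrative block names (not Qwen's actual code or identifiers):

```python
# Hypothetical sketch of a 3:1 hybrid layer schedule. Names are
# illustrative; the real model interleaves actual attention modules.
LINEAR, FULL = "gated_deltanet", "full_attention"

def hybrid_layout(num_groups: int = 10, linear_per_group: int = 3) -> list:
    """Build the layer schedule: N linear-attention blocks, then 1 full."""
    layers = []
    for _ in range(num_groups):
        layers.extend([LINEAR] * linear_per_group)  # cheap, linear-cost blocks
        layers.append(FULL)                         # periodic full attention
    return layers

layout = hybrid_layout()  # 10 groups x (3 linear + 1 full) = 40 layers
```

With the defaults, only 10 of the 40 layers pay full attention's quadratic cost; the other 30 carry a fixed-size recurrent state.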

Why it matters (Value Chain Shift): Gated DeltaNet, published at ICLR 2025 by researchers at NVIDIA and others, combines Mamba2's gated decay mechanism with the delta update rule — gating enables rapid memory erasure while the delta rule handles precise memory modifications. It was an interesting research paper a year ago. Today it is the default attention mechanism in at least four production model families: Qwen3-Next (80B-A3B), Qwen3.5's small models, Qwen3.6-35B-A3B, and Moonshot AI's Kimi Linear. All use the same 3:1 ratio.
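The mechanism is easier to see in code. Below is a simplified NumPy sketch of one gated delta update step, omitting the per-head structure, the chunked parallel training form, and the paper's exact parameterization; it shows only the two ingredients named above, gated decay plus a delta-rule correction:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (simplified sketch).

    S:     (d_k, d_v) fixed-size memory state (no KV cache grows here)
    k:     (d_k,) key, assumed unit-normalized for this illustration
    v:     (d_v,) value to associate with k
    alpha: scalar gate in (0, 1] -- decays (rapidly erases) old memory
    beta:  scalar in (0, 1] -- strength of the delta correction
    """
    S = alpha * S                          # gated decay: forget old content
    pred = S.T @ k                         # what the state currently recalls for k
    S = S + beta * np.outer(k, v - pred)   # delta rule: correct recall toward v
    return S
```

After a clean write (alpha=1, beta=1 on an empty state), querying the state with the same unit-norm key recalls the stored value exactly; the gate lets later steps overwrite that memory quickly.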

This migration matters because it restructures the inference cost equation. Linear attention layers compress history into a fixed-size memory state rather than maintaining the full key-value cache that standard transformers require. The practical result: Qwen3.6 runs with 3B active parameters on hardware that couldn't touch a 27B dense model, while delivering comparable quality. For anyone running inference at scale — cloud providers, enterprises deploying coding agents, hobbyists on consumer GPUs — this is the efficiency breakthrough that actually ships, not the theoretical one in a paper.
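The cost restructuring can be made concrete with back-of-envelope arithmetic. A sketch comparing full attention's growing KV cache against a linear layer's constant state; all dimensions below are illustrative placeholders, not Qwen3.6's actual configuration:

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    """Full softmax attention: K and V are cached for every past token."""
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

def linear_state_bytes(layers, heads, head_dim, state_dim, dtype_bytes=2):
    """Linear attention: one fixed-size state per layer, regardless of length."""
    return layers * heads * head_dim * state_dim * dtype_bytes

# Hypothetical hybrid: 10 full-attention layers keep a KV cache,
# 30 linear layers keep a fixed state (all dimensions are made up).
cache = kv_cache_bytes(layers=10, heads=16, head_dim=128, seq_len=128_000)
state = linear_state_bytes(layers=30, heads=16, head_dim=128, state_dim=128)
# The cache term keeps growing with context; the state term never does.
```

In a 3:1 hybrid, only a quarter of the layers cache keys and values, so long-context memory grows at roughly a quarter the rate of an all-full-attention stack of the same depth.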

Room for disagreement: Linear attention's fixed-size memory state is both its advantage and its ceiling. Research from NVIDIA's own team shows that at batch-1, Gated DeltaNet decode is memory-bound because the full recurrent state must be round-tripped through GPU high-bandwidth memory every token. The 3:1 ratio exists precisely because linear attention alone cannot match full attention's retrieval accuracy — the full attention blocks serve as periodic "correction layers" that prevent quality degradation. This isn't the end of the transformer. It's the transformer becoming a minority partner in a hybrid stack.

What to watch: The convergence on 3:1 across independent teams (Alibaba, Moonshot AI) suggests this ratio may be empirically optimal for current architectures. The next question is whether this ratio shifts as models scale — do 100B+ models need more or fewer full attention layers? The answer determines whether hybrid architectures are a temporary efficiency hack or a permanent architectural shift.


π0.7: The Moment Robots Started Remixing

Physical Intelligence released a blog post on Wednesday whose results, if they hold up to independent scrutiny, represent the most significant robotics AI milestone since Google DeepMind shipped Gemini Robotics-ER as an API two days earlier.

π0.7 is a Vision-Language-Action (VLA) model built on a three-component architecture: a high-level policy that generates language subtask instructions, a world model that produces synthetic visual subgoals, and an action expert that executes physical behaviors. The team, led by co-founder Sergey Levine, claims π0.7 demonstrates compositional generalization — the ability to combine skills learned in different contexts to solve problems the model was never explicitly trained on.
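The three-component flow reads naturally as a pipeline. A toy sketch of the control hierarchy described in the post, where every name and signature is hypothetical (this is not Physical Intelligence's code or API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Observation:
    image: bytes        # current camera frame (stand-in type)
    instruction: str    # the human's task description

def run_hierarchical_policy(
    obs: Observation,
    high_level: Callable[[Observation], str],           # emits a language subtask
    world_model: Callable[[Observation, str], bytes],   # imagines a visual subgoal
    action_expert: Callable[[Observation, bytes], List[float]],  # motor commands
) -> List[float]:
    """One control step of the hypothesized hierarchy:
    language plan -> imagined subgoal image -> low-level actions."""
    subtask = high_level(obs)                  # e.g. "open the air fryer drawer"
    subgoal_image = world_model(obs, subtask)  # synthetic image of the subgoal
    return action_expert(obs, subgoal_image)   # action chunk toward the subgoal
```

The compositional claim, in these terms, is that the high-level policy and world model can recombine subtasks seen in one context (folding, espresso) into plans for an appliance never seen in training.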

The concrete evidence: π0.7 successfully operated kitchen appliances (including an air fryer) it had never encountered during training, using language coaching and skill transfer from unrelated manipulation tasks. It matched or exceeded specialist RL-fine-tuned models on laundry folding (1.5x throughput), espresso making (1.2x), and box assembly (1.6-2.0x) — tasks where the specialists were purpose-built. It transferred laundry folding to a bimanual UR5e industrial arm configuration with zero additional training, matching human teleoperators with an average of 375 hours of experience.

Why it matters (Historical Parallel): The compositional generalization claim maps directly onto the inflection point that transformed large language models from "impressive demos" to "general-purpose tools." LLMs showed emergent capabilities — writing code, solving math, reasoning about novel problems — once training data diversity and model scale crossed a threshold. Levine's key claim is that π0.7's capabilities now scale more than linearly with the amount of data, a "favorable scaling property seen in other domains like language and vision." If true, this means the robotics industry can stop building task-specific systems and start building general-purpose ones. The implications for manufacturing, logistics, and household robotics are structural: the unit economics of deployment shift from "one robot per task" to "one model for many tasks."

Room for disagreement: Yann LeCun has been arguing for months that VLA models are "too LLM-pilled" — they manipulate language well enough to fool observers but lack genuine world models. His $1 billion AMI Labs is an explicit bet against the VLA approach, prioritizing visual imagination and intuitive physics. The π0.7 results are also self-reported — Physical Intelligence measured against its own specialist models, not independent benchmarks. Compositional generalization on a curated set of kitchen and manipulation tasks is a far cry from the infinite variety of the physical world. LLMs had the advantage of operating in text space, where failure modes are bounded; a robot that "composes" wrong can break things, hurt people, or simply stop working in ways that are expensive to debug.

What to watch: The critical test is cross-category transfer — can π0.7 compose manipulation skills to solve navigation or inspection tasks? If generalization is real, it should extend beyond kitchen appliances to fundamentally different task categories. Watch for independent reproductions and for Google DeepMind's response — they shipped Gemini Robotics-ER as an API on April 15, and now Physical Intelligence is claiming the same scaling properties that made LLMs transformative.


The Contrarian Take

Everyone says: Hybrid linear attention architectures (Gated DeltaNet, Mamba, etc.) are replacing the transformer. The 3:1 ratio proves that linear attention has "won" the efficiency argument, and full attention is a legacy technology being phased out.

Here's why that's incomplete: The 3:1 ratio is an admission that linear attention cannot stand alone. Every production hybrid still needs periodic full-attention layers to correct the retrieval errors and memory collisions that accumulate in fixed-size recurrent states. The real lesson from Qwen3.6 isn't that transformers are dying — it's that the attention mechanism has become a tunable parameter rather than an architectural commitment. The future isn't "linear attention replaces full attention" — it's architects choosing ratios (3:1, 5:1, 7:1) the way they now choose learning rates. And the ratio will likely shift with scale, task, and deployment context. Anyone building infrastructure around the assumption that a single attention type wins is building on sand.


Under the Radar

  • Agent reverse-engineering as a research genre. A new paper (Liu et al.) reverse-engineers Claude Code's TypeScript source, identifying a 5-layer context compaction pipeline, a 7-mode permission framework with ML-based classification, and 4 extensibility mechanisms (MCP, plugins, skills, hooks). This is the first systematic architectural analysis of a production AI agent — the kind of work that used to be done on operating systems and databases. If you're building agents, this is your reference architecture paper.

  • LeCun's $1B counter-thesis is about to collide with evidence. AMI Labs raised $1.03 billion at a $3.5 billion valuation to build world models as an alternative to LLM-derived approaches. π0.7's compositional generalization results are the strongest evidence yet for the VLA approach that LeCun has been publicly dismissing. One of them will be right within 18 months.


Quick Takes

OpenAI Codex gets persistent agency. OpenAI expanded Codex beyond coding with "Heartbeat Automations" — agents that schedule future work for themselves and wake up to continue long-term tasks. The technical concept is genuinely novel: persistent agent scheduling without human re-invocation. This is the first major implementation of what the agent research community has been calling "durable execution" — agents with lifecycles measured in days, not conversation turns. Also: 3 million weekly developers, cross-app access via computer use, and a built-in web browser. (Source)
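The durable-execution idea is simple to sketch: the agent enqueues its own future wake-ups instead of waiting to be re-invoked. A minimal illustration of that pattern (this is an assumption about the concept, not OpenAI's implementation):

```python
import heapq

class HeartbeatScheduler:
    """Toy sketch of heartbeat scheduling: an agent registers future
    wake-ups for itself; a runtime loop pops whatever has come due."""

    def __init__(self):
        self._queue = []  # min-heap of (wake_time, task_name)

    def schedule(self, task_name: str, delay_s: float, now: float) -> None:
        """Agent-side call: 'wake me up in delay_s seconds to continue.'"""
        heapq.heappush(self._queue, (now + delay_s, task_name))

    def due(self, now: float) -> list:
        """Runtime-side call: pop every task whose wake time has arrived."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            ready.append(heapq.heappop(self._queue)[1])
        return ready
```

The point of the pattern is lifecycle: the queue (in a real system, a durable store) outlives any single conversation turn, so the agent's work resumes without a human in the loop.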

Cloudflare ships a unified inference layer for agents. The AI Platform routes to 70+ models across 12+ providers through a single API, with automatic failover when providers go down and streaming resilience for long-running agent chains. The key technical detail: AI Gateway buffers streaming responses independently of the agent's lifetime, allowing reconnection without re-invoking inference. If you're building agents that chain multiple model calls, this solves the cascade failure problem that kills production deployments. (Source)
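The buffering detail is the interesting part, and the shape of it can be sketched in a few lines: the gateway retains streamed chunks so a reconnecting client resumes from its last offset rather than re-running inference. Illustrative only, not Cloudflare's actual API:

```python
class BufferedStream:
    """Toy sketch of gateway-side stream buffering. Model output chunks
    are retained independently of any client connection, so a dropped
    client can reconnect and resume without re-invoking inference."""

    def __init__(self):
        self.chunks = []    # ordered chunks received from the model
        self.done = False   # set once the model finishes streaming

    def append(self, chunk: str) -> None:
        """Called as tokens arrive from the upstream provider."""
        self.chunks.append(chunk)

    def finish(self) -> None:
        self.done = True

    def read_from(self, offset: int):
        """Client resumes at `offset`; returns (new chunks, next offset)."""
        new = self.chunks[offset:]
        return new, offset + len(new)
```

A reconnecting client just replays `read_from` with the last offset it acknowledged; the expensive model call happens exactly once regardless of how many times the downstream connection drops.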

Tencent open-sources HY-World 2.0 for 3D world generation. A multi-modal world model that generates, reconstructs, and simulates 3D environments from text, images, or video. Exports to Mesh, 3D Gaussian Splatting, and point clouds with game workflow integration. The same Tencent Hunyuan group that shipped HY-Embodied-0.5 last week — they're building a full stack from embodied reasoning to world simulation. 50 upvotes on HuggingFace Daily Papers. (Source)


Stories We're Watching

  • Post-Transformer Architecture: Linear vs. Full Attention (Month 5) — Four production model families now use the 3:1 Gated DeltaNet ratio. Does the ratio shift at 100B+ scale, or is 3:1 the new architectural constant? Next inflection: when a frontier lab (OpenAI, Anthropic, Google) adopts a hybrid architecture for a flagship model.

  • Robotics Foundation Models: VLA vs. World Models (Week 1) — Physical Intelligence (π0.7, compositional generalization) vs. Google DeepMind (Gemini Robotics-ER, API-first) vs. LeCun's AMI Labs (world models, $1B). Three fundamentally different bets on how robots will learn. First independent reproduction of π0.7's generalization claims will be decisive.

  • Agent Runtime Standardization: SDK Wars (Day 2) — OpenAI Codex adds persistent scheduling, Cloudflare ships cross-provider inference routing, and OpenAI's Agents SDK formalized harness/compute separation last week. Anthropic has not yet responded with a competing agent SDK. Clock is ticking.


The Thread

Today's two deep stories share a structural insight that neither makes explicit on its own. Qwen3.6-35B-A3B uses full attention only where linear attention fails — a 3:1 ratio that admits the transformer is necessary but no longer sufficient. π0.7 delegates to a world model and language planner when direct motor control isn't enough — a three-component architecture that admits action experts alone can't generalize. Both represent the same design principle applied to different domains: the future of AI isn't bigger monolithic systems. It's smarter composition of specialized components, each doing what it does best, with explicit interfaces between them.

This is also, not coincidentally, how the agent infrastructure layer is evolving. Cloudflare's multi-provider routing, OpenAI's Heartbeat Automations, the harness/compute split from last week — all are about composing specialized capabilities rather than building one system that does everything. The architecture of AI models and the architecture of AI infrastructure are converging on the same principle: modularity with well-defined boundaries beats monolithic scale.


Predictions

New predictions:

  • I predict: 3+ major model families (beyond Qwen/Kimi) will ship production models with hybrid linear attention (Gated DeltaNet or equivalent) as the default architecture by Q3 2026. The 3:1 ratio will become as standard as the transformer block itself. (Confidence: high; Check by: 2026-09-30)

  • I predict: Physical Intelligence will demonstrate cross-category task transfer (manipulation skills applied to navigation or inspection tasks, not just kitchen variants) within 6 months, but independent reproduction of compositional generalization claims will take at least 9 months. (Confidence: medium; Check by: 2026-10-17)


Generated 2026-04-17 06:12 ET by the Daily Briefings Agent.

Tomorrow morning in your inbox.

Subscribe for free. 10-minute read, every weekday.