I've been studying three architectural approaches to one of the hardest problems in large language models: how to make a model read very long documents without the cost becoming prohibitive. The Jan 2025 MiniMax-01 paper, the Jun 2025 MiniMax-M1 paper, DeepSeek V2/V3, and the Qwen3 family all tackle the same problem — and they arrived at genuinely different answers.
This note uses a library analogy to make the differences intuitive. But I wanted to do more than explain the first-level metaphor — I wanted to stress-test each one against the hard questions a skeptical reader would raise. Each section below leads with the core architectural bet, then gives the analogy, then works through the questions the analogy invites. The goal is an explanation you could actually defend, not just one that sounds right on first read.
The core problem: as a language model reads more text, the cost of "paying attention" to all of it grows quadratically. Double the context, quadruple the work. At a million tokens this becomes physically impossible at any reasonable cost. Three teams found three different ways out — and understanding precisely how they differ requires going past the surface of each analogy.
Imagine every language model is a researcher working in a library. Their job: read everything on the desk and write an answer, one word at a time. Before writing each word, they consult everything they've read so far. The pile grows; the consulting grows with it.
In the original Transformer, the researcher re-reads every document completely before writing every single word. Pile doubles → work quadruples. Most models cap out around 128K tokens because beyond that, the re-reading becomes physically impossible within reasonable time and cost.
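To make the quadratic growth concrete, here is a back-of-envelope FLOP count for one full attention pass. This is a sketch, not a profile of any real model; the function name and the head dimension of 128 are illustrative:

```python
def attention_flops(n: int, d: int) -> int:
    # Scores = Q @ K^T is an n x n matrix: n*n*d multiply-adds.
    # The weighted sum softmax(Scores) @ V costs another n*n*d.
    # Both terms scale with n squared.
    return 2 * n * n * d

d = 128                                  # illustrative head dimension
print(attention_flops(1_000, d))         # 256000000
print(attention_flops(2_000, d))         # 1024000000: 4x the work for 2x the context
```

Double n and both terms quadruple, which is exactly the "pile doubles, work quadruples" behavior above.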
Here is how three teams fixed it.
DeepSeek's sticky-note system (Multi-head Latent Attention, MLA). The researcher still re-reads everything — same walk across every shelf, same sweep before each word. But each document has been condensed into a tight sticky note using a special learned shorthand. Much less to carry. The reading operation hasn't changed; only the card size has.
What gets stored: a compressed latent vector cKV per token. A complete archive where nothing is ever deleted. Compression is spatial — a smaller representation — not temporal. Every token's information is stored and retrievable.
Why this matters in practice: fewer GPUs needed to serve the same number of users. The saved memory holds more concurrent sessions. At the scale DeepSeek operates, this is a significant infrastructure win.
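The storage scheme can be sketched in a few lines of numpy. This is a toy under assumed dimensions (1024-dim hidden state, 128-dim latent), and the projection matrices `W_dkv`, `W_uk`, `W_uv` stand in for the model's learned weights, here just random:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 1024, 128         # assumed sizes; the latent is much smaller than the hidden state

# Stand-ins for learned projections: one down-projection, two up-projections.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk  = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_uv  = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

cache = []                            # the KV cache stores only the small latent vectors

def step(h):
    c_kv = h @ W_dkv                  # compress: one small latent per token
    cache.append(c_kv)                # nothing is ever deleted
    C = np.stack(cache)
    K, V = C @ W_uk, C @ W_uv         # decompress on demand at attention time
    return K, V

for _ in range(5):
    step(rng.standard_normal(d_model))
print(len(cache), cache[0].shape)     # 5 (128,)
```

The point of the sketch: each cached entry is `d_latent` wide instead of `d_model` wide, but every token's entry is kept, so retrieval quality depends on how lossy the learned compression is, not on what was thrown away.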
Qwen3's rolling summary notepad (Gated DeltaNet). The researcher keeps a single evolving notepad — one fixed-size matrix M — updated each time a new document arrives. The update uses only the prediction error: how wrong is the current notepad about what value to retrieve for this key, and by how much? After every three documents handled this way, one full traditional re-read catches what the notepad missed.
The update rule: M ← M + (v − Mk)kᵀ — an in-place correction, not an append. The term (v − Mk) is the prediction error. No history of past states of M is kept.
Why this matters: the desk size stays constant regardless of how many documents arrive. No more quadratic cost growth for the dominant computation. The 1-in-4 full re-reads are a genuine necessity — not a performance choice — to rescue retrievals the notepad has degraded.
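The update rule reads cleanly as code. A minimal numpy sketch, not the actual kernel; in particular, the real Gated DeltaNet adds gating terms that this omits:

```python
import numpy as np

def delta_update(M, k, v):
    # Prediction error: what the notepad currently retrieves for key k,
    # compared with the true value v. The correction is written in place.
    err = v - M @ k                   # (v - Mk), the prediction error
    return M + np.outer(err, k)       # M <- M + (v - Mk)k^T

d = 4
M = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])    # unit key, so M @ k retrieves exactly
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([9.0, 9.0, 9.0, 9.0])

M = delta_update(M, k, v1)
M = delta_update(M, k, v2)            # a later token writes with the same key
print(M @ k)                          # [9. 9. 9. 9.]: v1 has been fully overwritten
```

This is the destructive write the summary below calls out: a second write for the same key does not sit beside the first, it corrects over it.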
MiniMax's two-desk system (Lightning Attention). The researcher runs two desks. Desk 1 — local (intra-block): a full, exact re-read of the last N tokens. Nothing compressed, full precision, full RoPE positional encoding. Every token in the recent window gets exact attention.
Desk 2 — global (inter-block): a running cumulative sum S of all prior tokens. Update rule: S ← S + k·vᵀ. Purely additive — no overwriting, no competition. S stays fixed-size regardless of how many tokens have been processed. The researcher consults S for distant context, not the original shelves.
After every seven lightning layers, one full traditional softmax layer re-reads everything to patch what Desk 2 missed. This is the 7:1 lightning-to-softmax ratio — more aggressive than Qwen's 3:1, enabled by the fact that Desk 1 already handles local precision.
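The two desks can be sketched together. Toy numpy with an assumed window of 2 tokens in 4 dimensions; real block sizes are in the hundreds:

```python
import numpy as np

d, window = 4, 2
local = []                            # Desk 1: exact, uncompressed recent tokens
S = np.zeros((d, d))                  # Desk 2: fixed-size global running sum

def ingest(k, v):
    global S
    local.append((k, v))
    if len(local) > window:
        k_old, v_old = local.pop(0)   # the oldest token leaves the exact window...
        S += np.outer(k_old, v_old)   # ...and joins the sum: S <- S + k v^T

def global_read(q):
    # q @ S equals the sum over rolled-out tokens of (q . k_i) * v_i:
    # every token contributes, no single token is separately recoverable.
    return q @ S

k1, v1 = np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0, 1.0])
ingest(k1, v1)
ingest(np.array([0.0, 1.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0, 2.0]))
ingest(np.array([0.0, 0.0, 1.0, 0.0]), np.array([3.0, 3.0, 3.0, 3.0]))  # pushes token 1 into S
print(global_read(np.array([2.0, 0.0, 0.0, 0.0])))  # [2. 2. 2. 2.] = (q . k1) * v1
```

Recent tokens stay exact on Desk 1; anything older is reachable only through the fixed-size aggregate S.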
What this achieves: at 100K generation tokens, MiniMax-M1 uses approximately 25% of the compute of DeepSeek R1. The 1M-token context window becomes genuinely affordable — this is the only architecture among the three purpose-built to make that regime practical.
In Qwen's DeltaNet update — M ← M + (v − Mk)kᵀ — similar keys compete, and later tokens can erase earlier ones. MiniMax's inter-block state uses plain additive accumulation — S ← S + k·vᵀ — with no competition and no overwriting. Token #847's information is potentially gone from DeltaNet if overwritten; it is still present in S, just unextractably diluted. Dilution is a meaningfully less destructive form of loss than erasure.

Two of the three approaches degrade positional information for distant tokens. It's worth being precise about what that actually costs — because content information and positional information carry very different things.
Models with degraded positional information become bag-of-words at scale — they know what was said but lose when and in what order, collapsing a sequence into a pile of facts rather than a narrative. "The policy was announced. Three months later, it was reversed." Content gives you two events. Position gives you the causal arc that makes it a story.
This matters most for applications tracking how something evolved — emotional trajectory across a conversation, how reasoning shifted across a document, what returned after being dropped. That signal lives almost entirely in positional information. Systems that replace exact RoPE with coarse recency decay degrade exactly that signal for everything in the distant part of their context.
The Sticky-Note System (DeepSeek MLA): Keeps a complete archive — every token's compressed representation stored, nothing deleted. Memory shrinks 5–13×; compute has a bounded constant-factor improvement (content attention only) but stays quadratic overall. The right choice when your binding constraint is serving throughput at moderate context. Cannot economically reach 1M tokens.
The Rolling Summary Notepad (Qwen3 DeltaNet): A stateful cache with destructive writes — not an append log. Similar keys overwrite each other; distant content can be fully erased rather than just compressed. Loses both content precision (interference) and positional precision (coarse decay). The 1-in-4 full re-reads are a necessity. The "linear" claim is a practical approximation with a bounded O(n²) residual.
The Two-Desk System (MiniMax Lightning): The global accumulation and Qwen's notepad are genuine cousins — both fixed-size, both blur distant positional information. The key difference is the update rule: additive accumulation (dilution) vs. error-correcting overwriting (potential erasure). MiniMax also preserves exact local positional precision within each block. The "linear" claim carries the same qualified honesty: dominant regime at practical lengths, with half the quadratic residual overhead of the Qwen approach. The measured 25% FLOPs at 100K tokens is the honest expression of what this achieves.
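The dilution-versus-erasure distinction is easy to demonstrate numerically. A toy example with two tokens that share a key, comparing the additive rule S ← S + k·vᵀ against the delta rule (gating omitted):

```python
import numpy as np

k  = np.array([1.0, 0.0])             # two tokens that happen to share a key
v1 = np.array([10.0, 0.0])
v2 = np.array([0.0, 10.0])

# Additive accumulation (MiniMax's inter-block S): both writes survive, blended.
S = np.outer(k, v1) + np.outer(k, v2)
print(k @ S)                          # [10. 10.]: v1 is diluted into the mix, not gone

# Delta rule (Qwen's notepad): the second write corrects over the first.
M = np.zeros((2, 2))
M += np.outer(k, v1 - k @ M)          # write token 1
M += np.outer(k, v2 - k @ M)          # write token 2 with the same key
print(k @ M)                          # [ 0. 10.]: v1 has been fully erased
```

In the additive state the query still sees v1's contribution, blended with v2's; in the delta state v1 is simply gone.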
Choosing an attention architecture is choosing which constraint you believe is the binding one. DeepSeek bet on serving memory. Qwen bet on training stability. MiniMax bet on the cost of thinking. All three are right about their chosen constraint — and none of them solve the same problem.
| Dimension | 🗒️ The Sticky-Note System (DeepSeek MLA) | 📔 The Rolling Summary Notepad (Qwen3 Gated DeltaNet) | 🖥️🖥️ The Two-Desk System (MiniMax Lightning Attention) |
|---|---|---|---|
| Core Architectural Bet | |||
| What constraint does this solve? | Memory bandwidth. Compress KV cache 5–13× so more users fit on the same hardware. | Training stability. Replace unstable full attention with a gated delta-rule memory that avoids gradient pathologies at scale. | Test-time compute cost. Make generating 80K thinking tokens affordable by keeping the dominant cost linear. |
| What it refuses to sacrifice | Retrieval quality — full softmax preserved; compression learned to be near-lossless. | Training stability — the output gating eliminates attention sink. A first-class requirement. | Context length ceiling — breaking 1M was a first-class design goal. |
| The Library Analogy | |||
| What gets stored between words | Compressed card cKV per token. Complete archive — nothing ever deleted. Spatial compression only. | One fixed-size matrix M, overwritten in place each step. No history. Key-value store with no versioning. | Two states: exact local buffer (last N tokens) + global running sum S (all prior, additive). Both fixed-size. |
| Can token #847's info be retrieved? | Yes — its cKV is stored; decompress on demand. | Maybe not — if later similar-key tokens overwrote M, the contribution may be gone. | Sort of — its k·vᵀ is in S but unextractably averaged with all others. Diluted, not deleted. |
| Type of information loss | Bounded spatial compression. Empirically near-zero on benchmarks. | Two kinds: content erasure (similar keys interfere, later overwrites earlier) + positional coarsening (RoPE replaced by recency decay). | Uniform dilution in global S. Positional precision exact within blocks; blurred in inter-block accumulation. |
| Database analogy | Compressed but complete archive. | Key-value store, no versioning. Later writes can destroy earlier values. | Running aggregate. No row deleted, but no individual row is separately recoverable. |
| Technical Mechanics | |||
| The update rule | Per-token: project to cKV = WDKV·h, store. Decompress via up-projection when needed. No single global state. DeepSeek-V2 §3.1 | M ← M + (v − Mk)kᵀ. Error-correcting in-place overwrite; (v − Mk) is the prediction error; similar keys compete. Gated Delta Networks 2024 | S ← S + k·vᵀ. Plain additive outer-product sum; no overwriting. Query: q·S ≈ Σᵢ(q·kᵢ)vᵢ. Lightning Attention-2 · Qin et al. 2024 · MiniMax-01 §3 |
| Positional info (RoPE) | Fully preserved. Decoupled RoPE: content and position are independent paths. DeepSeek-V2 §3.1 · "decoupled RoPE" | Coarsened in linear layers. Gating (α, β) gives recency bias only. Restored only at 1-in-4 full-attention layers. | Mixed. Exact within each block (full RoPE). Blurred in inter-block S. 1-in-8 softmax layers re-anchor. MiniMax-01 §3.2 |
| Hybrid ratio | N/A — all layers are MLA (softmax-based). | 3 : 1 linear to full. Conservative; prioritises retrieval safety. | 7 : 1 lightning to softmax. More aggressive; enabled by intra-block exact computation. MiniMax-01 §3.1 |
| The Math | |||
| True complexity | O(n² · d_compressed). Still quadratic: smaller constant, same scaling class. | (3/4) × O(n·d²) + (1/4) × O(n²·d). "Linear" at practical lengths; O(n²) in the strict limit. | (7/8) × O(n·d²) + (1/8) × O(n²·d). Half the quadratic residual of Qwen. MiniMax-M1 §2 |
| Practical FLOPs at 100K tokens vs. full softmax | ~100% — bounded constant-factor saving from content attention; positional path unchanged. | ~40–55% estimated | ~25% of DeepSeek R1 — measured. MiniMax-M1 §2 |
| Benchmark Evidence | |||
| Long-context retrieval at 1M tokens (MRCR v2) | Cannot compete — DeepSeek R1 context limit is 128K. | Qwen3 dense: 128K. Qwen3-Next: not publicly benchmarked at 1M. | Ranked #2 globally — outperforms OpenAI o3 and Claude 4 Opus. Only Gemini 2.5 Pro ranks higher. MiniMax-M1 release · June 2025 |
| Software engineering (SWE-bench Verified) | DeepSeek R1-0528: 57.6% — leads open-weight models on isolated reasoning. | Qwen3-235B: competitive with DeepSeek R1 on LiveCodeBench. | M1-80k: 56.0% · M1-40k: 55.6% MiniMax-M1 arXiv 2506.13585 Table 2 |
| RL training cost | Significant H800 use (undisclosed). | Not disclosed for reasoning variant. | $537K total on 512 H800s · 3 weeks. MiniMax-M1 §4 |
| Honest Limitations | |||
| The unsolved problem | FLOPs still quadratic. Cannot economically reach 1M contexts regardless of compression quality. | Destructive overwriting means distant semantically-similar content can be unrecoverably lost. Positional precision for distant tokens is coarse only. | Global S loses sharp retrieval on distant tokens. Positional blurring in inter-block is real. Model verbosity (4× average output tokens) amplifies per-call cost. |
| One-line honest limitation | Excellent at serving many users cheaply; cannot scale to where the context itself is the product. | The notepad overwrites its own history — exact recall of distant, semantically-similar content is unreliable. | The global desk blurs what it doesn't erase — and the model sometimes doesn't know what it missed from document #750,000. |
Research notes, half-baked ideas. Probably overthought, definitely over-architected.