I've been studying three architectural approaches to one of the hardest problems in large language models: how to make a model read very long documents without the cost becoming prohibitive. The Jan 2025 MiniMax-01 paper, the Jun 2025 MiniMax-M1 paper, DeepSeek V2/V3, and the Qwen3 family all tackle the same problem — and they arrived at genuinely different answers.
This note uses a library analogy to make the differences intuitive. But I wanted to do more than explain the first-level metaphor — I wanted to stress-test each one against the hard questions a skeptical reader would raise. Each section below leads with the core architectural bet, then gives the analogy, then works through the questions the analogy invites. The goal is an explanation you could actually defend, not just one that sounds right on first read.
The core problem: as a language model reads more text, the cost of "paying attention" to all of it grows quadratically. Double the context, quadruple the work. At a million tokens this becomes physically impossible at any reasonable cost. Three teams found three different ways out — and understanding precisely how they differ requires going past the surface of each analogy.
Imagine every language model is a researcher working in a library. Their job: read everything on the desk and write an answer, one word at a time. Before writing each word, they consult everything they've read so far. The pile grows; the consulting grows with it.
In the original Transformer, the researcher re-reads every document completely before writing every single word. Pile doubles → work quadruples. Most models cap out around 128K tokens because beyond that, the re-reading becomes physically impossible within reasonable time and cost.
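To make the quadratic growth concrete, here is a back-of-envelope FLOP count for one full attention pass. This is a sketch, not a profile of any real model; the function name and the head dimension of 128 are illustrative:

```python
def attention_flops(n: int, d: int) -> int:
    # Scores = Q @ K^T is an n x n matrix: n*n*d multiply-adds.
    # The weighted sum softmax(Scores) @ V costs another n*n*d.
    # Both terms scale with n squared.
    return 2 * n * n * d

d = 128                                  # illustrative head dimension
print(attention_flops(1_000, d))         # 256000000
print(attention_flops(2_000, d))         # 1024000000: 4x the work for 2x the context
```

Double n and both terms quadruple, which is exactly the "pile doubles, work quadruples" behavior above.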
Here is how three teams fixed it.
DeepSeek's sticky-note system (Multi-head Latent Attention, MLA). The researcher still re-reads everything — same walk across every shelf, same sweep before each word. But each document has been condensed into a tight sticky note using a special learned shorthand. Much less to carry. The reading operation hasn't changed; only the card size has.
What gets stored: a compressed latent vector cKV per token. A complete archive where nothing is ever deleted. Compression is spatial — a smaller representation — not temporal. Every token's information is stored and retrievable.
Why this matters in practice: fewer GPUs needed to serve the same number of users. The saved memory holds more concurrent sessions. At the scale DeepSeek operates, this is a significant infrastructure win.
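The storage scheme can be sketched in a few lines of numpy. This is a toy under assumed dimensions (1024-dim hidden state, 128-dim latent), and the projection matrices `W_dkv`, `W_uk`, `W_uv` stand in for the model's learned weights, here just random:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 1024, 128         # assumed sizes; the latent is much smaller than the hidden state

# Stand-ins for learned projections: one down-projection, two up-projections.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk  = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_uv  = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

cache = []                            # the KV cache stores only the small latent vectors

def step(h):
    c_kv = h @ W_dkv                  # compress: one small latent per token
    cache.append(c_kv)                # nothing is ever deleted
    C = np.stack(cache)
    K, V = C @ W_uk, C @ W_uv         # decompress on demand at attention time
    return K, V

for _ in range(5):
    step(rng.standard_normal(d_model))
print(len(cache), cache[0].shape)     # 5 (128,)
```

The point of the sketch: each cached entry is `d_latent` wide instead of `d_model` wide, but every token's entry is kept, so retrieval quality depends on how lossy the learned compression is, not on what was thrown away.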
Qwen3's rolling summary notepad (Gated DeltaNet). The researcher keeps a single evolving notepad — one fixed-size matrix M — updated each time a new document arrives. The update uses only the prediction error: how wrong is the current notepad about what value to retrieve for this key, and by how much? After every three documents handled this way, one full traditional re-read catches what the notepad missed.
The update rule: M ← M + (v − Mk)kᵀ — an in-place correction, not an append. The term (v − Mk) is the prediction error. No history of past states of M is kept.
Why this matters: the desk size stays constant regardless of how many documents arrive. No more quadratic cost growth for the dominant computation. The 1-in-4 full re-reads are a genuine necessity — not a performance choice — to rescue retrievals the notepad has degraded.
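The update rule reads cleanly as code. A minimal numpy sketch, not the actual kernel; in particular, the real Gated DeltaNet adds gating terms that this omits:

```python
import numpy as np

def delta_update(M, k, v):
    # Prediction error: what the notepad currently retrieves for key k,
    # compared with the true value v. The correction is written in place.
    err = v - M @ k                   # (v - Mk), the prediction error
    return M + np.outer(err, k)       # M <- M + (v - Mk)k^T

d = 4
M = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])    # unit key, so M @ k retrieves exactly
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([9.0, 9.0, 9.0, 9.0])

M = delta_update(M, k, v1)
M = delta_update(M, k, v2)            # a later token writes with the same key
print(M @ k)                          # [9. 9. 9. 9.]: v1 has been fully overwritten
```

This is the destructive write the summary below calls out: a second write for the same key does not sit beside the first, it corrects over it.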
MiniMax's two-desk system (Lightning Attention). The researcher runs two desks. Desk 1 — local (intra-block): a full, exact re-read of the last N tokens. Nothing compressed, full precision, full RoPE positional encoding. Every token in the recent window gets exact attention.
Desk 2 — global (inter-block): a running cumulative sum S of all prior tokens. Update rule: S ← S + k·vᵀ. Purely additive — no overwriting, no competition. S stays fixed-size regardless of how many tokens have been processed. The researcher consults S for distant context, not the original shelves.
After every seven lightning layers, one full traditional softmax layer re-reads everything to patch what Desk 2 missed. This is the 7:1 lightning-to-softmax ratio — more aggressive than Qwen's 3:1, enabled by the fact that Desk 1 already handles local precision.
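The two desks can be sketched together. Toy numpy with an assumed window of 2 tokens in 4 dimensions; real block sizes are in the hundreds:

```python
import numpy as np

d, window = 4, 2
local = []                            # Desk 1: exact, uncompressed recent tokens
S = np.zeros((d, d))                  # Desk 2: fixed-size global running sum

def ingest(k, v):
    global S
    local.append((k, v))
    if len(local) > window:
        k_old, v_old = local.pop(0)   # the oldest token leaves the exact window...
        S += np.outer(k_old, v_old)   # ...and joins the sum: S <- S + k v^T

def global_read(q):
    # q @ S equals the sum over rolled-out tokens of (q . k_i) * v_i:
    # every token contributes, no single token is separately recoverable.
    return q @ S

k1, v1 = np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0, 1.0])
ingest(k1, v1)
ingest(np.array([0.0, 1.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0, 2.0]))
ingest(np.array([0.0, 0.0, 1.0, 0.0]), np.array([3.0, 3.0, 3.0, 3.0]))  # pushes token 1 into S
print(global_read(np.array([2.0, 0.0, 0.0, 0.0])))  # [2. 2. 2. 2.] = (q . k1) * v1
```

Recent tokens stay exact on Desk 1; anything older is reachable only through the fixed-size aggregate S.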
What this achieves: at 100K generation tokens, MiniMax-M1 uses approximately 25% of the compute of DeepSeek R1. The 1M-token context window becomes genuinely affordable — this is the only architecture among the three purpose-built to make that regime practical.
In Qwen's DeltaNet update — M ← M + (v − Mk)kᵀ — similar keys compete, and later tokens can erase earlier ones. MiniMax's inter-block state uses plain additive accumulation — S ← S + k·vᵀ — with no competition and no overwriting. Token #847's information is potentially gone from DeltaNet if overwritten; it is still present in S, just unextractably diluted. Dilution is a meaningfully less destructive form of loss than erasure.

Two of the three approaches degrade positional information for distant tokens. It's worth being precise about what that actually costs — because content information and positional information carry very different things.
Models with degraded positional information become bag-of-words at scale — they know what was said but lose when and in what order, collapsing a sequence into a pile of facts rather than a narrative. "The policy was announced. Three months later, it was reversed." Content gives you two events. Position gives you the causal arc that makes it a story.
This matters most for applications tracking how something evolved — emotional trajectory across a conversation, how reasoning shifted across a document, what returned after being dropped. That signal lives almost entirely in positional information. Systems that replace exact RoPE with coarse recency decay degrade exactly that signal for everything in the distant part of their context.
The Sticky-Note System (DeepSeek MLA): Keeps a complete archive — every token's compressed representation stored, nothing deleted. Memory shrinks 5–13×; compute has a bounded constant-factor improvement (content attention only) but stays quadratic overall. The right choice when your binding constraint is serving throughput at moderate context. Cannot economically reach 1M tokens.
The Rolling Summary Notepad (Qwen3 DeltaNet): A stateful cache with destructive writes — not an append log. Similar keys overwrite each other; distant content can be fully erased rather than just compressed. Loses both content precision (interference) and positional precision (coarse decay). The 1-in-4 full re-reads are a necessity. The "linear" claim is a practical approximation with a bounded O(n²) residual.
The Two-Desk System (MiniMax Lightning): The global accumulation and Qwen's notepad are genuine cousins — both fixed-size, both blur distant positional information. The key difference is the update rule: additive accumulation (dilution) vs. error-correcting overwriting (potential erasure). MiniMax also preserves exact local positional precision within each block. The "linear" claim carries the same qualified honesty: dominant regime at practical lengths, with half the quadratic residual overhead of the Qwen approach. The measured 25% FLOPs at 100K tokens is the honest expression of what this achieves.
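The dilution-versus-erasure distinction is easy to demonstrate numerically. A toy example with two tokens that share a key, comparing the additive rule S ← S + k·vᵀ against the delta rule (gating omitted):

```python
import numpy as np

k  = np.array([1.0, 0.0])             # two tokens that happen to share a key
v1 = np.array([10.0, 0.0])
v2 = np.array([0.0, 10.0])

# Additive accumulation (MiniMax's inter-block S): both writes survive, blended.
S = np.outer(k, v1) + np.outer(k, v2)
print(k @ S)                          # [10. 10.]: v1 is diluted into the mix, not gone

# Delta rule (Qwen's notepad): the second write corrects over the first.
M = np.zeros((2, 2))
M += np.outer(k, v1 - k @ M)          # write token 1
M += np.outer(k, v2 - k @ M)          # write token 2 with the same key
print(k @ M)                          # [ 0. 10.]: v1 has been fully erased
```

In the additive state the query still sees v1's contribution, blended with v2's; in the delta state v1 is simply gone.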
Choosing an attention architecture is choosing which constraint you believe is the binding one. DeepSeek bet on serving memory. Qwen bet on training stability. MiniMax bet on the cost of thinking. All three are right about their chosen constraint — and none of them solve the same problem.
| Dimension | 🗒️ The Sticky-Note System (DeepSeek MLA) | 📔 The Rolling Summary Notepad (Qwen3 Gated DeltaNet) | 🖥️🖥️ The Two-Desk System (MiniMax Lightning Attention) |
|---|---|---|---|
| Core Architectural Bet | |||
| What constraint does this solve? | Memory bandwidth. Compress KV cache 5–13× so more users fit on the same hardware. | Training stability. Replace unstable full attention with a gated delta-rule memory that avoids gradient pathologies at scale. | Test-time compute cost. Make generating 80K thinking tokens affordable by keeping the dominant cost linear. |
| What it refuses to sacrifice | Retrieval quality — full softmax preserved; compression learned to be near-lossless. | Training stability — the output gating eliminates attention sink. A first-class requirement. | Context length ceiling — breaking 1M was a first-class design goal. |
| The Library Analogy | |||
| What gets stored between words | Compressed card cKV per token. Complete archive — nothing ever deleted. Spatial compression only. | One fixed-size matrix M, overwritten in place each step. No history. Key-value store with no versioning. | Two states: exact local buffer (last N tokens) + global running sum S (all prior, additive). Both fixed-size. |
| Can token #847's info be retrieved? | Yes — its cKV is stored; decompress on demand. | Maybe not — if later similar-key tokens overwrote M, the contribution may be gone. | Sort of — its k·vᵀ is in S but unextractably averaged with all others. Diluted, not deleted. |
| Type of information loss | Bounded spatial compression. Empirically near-zero on benchmarks. | Two kinds: content erasure (similar keys interfere, later overwrites earlier) + positional coarsening (RoPE replaced by recency decay). | Uniform dilution in global S. Positional precision exact within blocks; blurred in inter-block accumulation. |
| Database analogy | Compressed but complete archive. | Key-value store, no versioning. Later writes can destroy earlier values. | Running aggregate. No row deleted, but no individual row is separately recoverable. |
| Technical Mechanics | |||
| The update rule | Per-token: project to cKV = WDKV·h, store. Decompress via up-projection when needed. No single global state. DeepSeek-V2 §3.1 | M ← M + (v − Mk)kᵀ. Error-correcting in-place overwrite; (v − Mk) is the prediction error; similar keys compete. Gated Delta Networks 2024 | S ← S + k·vᵀ. Plain additive outer-product sum; no overwriting. Query: q·S ≈ Σᵢ(q·kᵢ)vᵢ. Lightning Attention-2 · Qin et al. 2024 · MiniMax-01 §3 |
| Positional info (RoPE) | Fully preserved. Decoupled RoPE: content and position are independent paths. DeepSeek-V2 §3.1 · "decoupled RoPE" | Coarsened in linear layers. Gating (α, β) gives recency bias only. Restored only at 1-in-4 full-attention layers. | Mixed. Exact within each block (full RoPE). Blurred in inter-block S. 1-in-8 softmax layers re-anchor. MiniMax-01 §3.2 |
| Hybrid ratio | N/A — all layers are MLA (softmax-based). | 3 : 1 linear to full. Conservative; prioritises retrieval safety. | 7 : 1 lightning to softmax. More aggressive; enabled by intra-block exact computation. MiniMax-01 §3.1 |
| The Math | |||
| True complexity | O(n² · d_compressed). Still quadratic: smaller constant, same scaling class. | (3/4) × O(n·d²) + (1/4) × O(n²·d). "Linear" at practical lengths; O(n²) in the strict limit. | (7/8) × O(n·d²) + (1/8) × O(n²·d). Half the quadratic residual of Qwen. MiniMax-M1 §2 |
| Practical FLOPs at 100K tokens vs. full softmax | ~100% — bounded constant-factor saving from content attention; positional path unchanged. | ~40–55% estimated | ~25% of DeepSeek R1 — measured. MiniMax-M1 §2 |
| Benchmark Evidence | |||
| Long-context retrieval at 1M tokens (MRCR v2) | Cannot compete — DeepSeek R1 context limit is 128K. | Qwen3 dense: 128K. Qwen3-Next: not publicly benchmarked at 1M. | Ranked #2 globally — outperforms OpenAI o3 and Claude 4 Opus. Only Gemini 2.5 Pro ranks higher. MiniMax-M1 release · June 2025 |
| Software engineering (SWE-bench Verified) | DeepSeek R1-0528: 57.6% — leads open-weight models on isolated reasoning. | Qwen3-235B: competitive with DeepSeek R1 on LiveCodeBench. | M1-80k: 56.0% · M1-40k: 55.6% MiniMax-M1 arXiv 2506.13585 Table 2 |
| RL training cost | Significant H800 use (undisclosed). | Not disclosed for reasoning variant. | $537K total on 512 H800s · 3 weeks. MiniMax-M1 §4 |
| Honest Limitations | |||
| The unsolved problem | FLOPs still quadratic. Cannot economically reach 1M contexts regardless of compression quality. | Destructive overwriting means distant semantically-similar content can be unrecoverably lost. Positional precision for distant tokens is coarse only. | Global S loses sharp retrieval on distant tokens. Positional blurring in inter-block is real. Model verbosity (4× average output tokens) amplifies per-call cost. |
| One-line honest limitation | Excellent at serving many users cheaply; cannot scale to where the context itself is the product. | The notepad overwrites its own history — exact recall of distant, semantically-similar content is unreliable. | The global desk blurs what it doesn't erase — and the model sometimes doesn't know what it missed from document #750,000. |
Research notes, half-baked ideas. Probably overthought, definitely over-architected.