Introduction: The Memory Wall That Defines 2026 Inference
Long-context large language models (LLMs) have hit a wall that has nothing to do with model weights. During decoding, transformers cache key and value (KV) vectors for every token at every layer. This cache grows linearly with sequence length and batch size. For Llama-3.1-70B in BF16, the KV cache costs about 0.31 MB per token. At 128K tokens that is ~40 GB; at 1M tokens it exceeds 300 GB—more than the 140 GB of weights themselves. Every newly decoded token must stream the entire cache out of high-bandwidth memory (HBM), making decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency.
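The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the publicly documented Llama-3.1-70B attention shape (80 layers, 8 key/value heads, head dimension 128) and an assumed peak HBM bandwidth of 3.35 TB/s for a single H100 SXM; neither number comes from this briefing, and the constant and function names are illustrative. Sizes are computed in binary units, which is how 0.31 MB per token and ~40 GB at 128K tokens line up.

```python
# Assumed Llama-3.1-70B attention shape (public model config, not from this briefing).
NUM_LAYERS = 80
NUM_KV_HEADS = 8        # grouped-query attention: 8 KV heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2     # BF16

# Assumed peak HBM bandwidth of one H100 SXM, in bytes per second.
HBM_BYTES_PER_SEC = 3.35e12


def kv_bytes_per_token() -> int:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    """Total cache size; linear in both sequence length and batch size."""
    return kv_bytes_per_token() * num_tokens * batch_size


def min_decode_seconds_per_token(num_tokens: int) -> float:
    """Lower bound on per-token decode time if the whole cache is read once."""
    return kv_cache_bytes(num_tokens) / HBM_BYTES_PER_SEC


if __name__ == "__main__":
    print(f"per token: {kv_bytes_per_token() / 2**20:.4f} MiB")
    for tokens in (128 * 1024, 1_000_000):
        gib = kv_cache_bytes(tokens) / 2**30
        ms = min_decode_seconds_per_token(tokens) * 1e3
        print(f"{tokens:>9} tokens: {gib:7.1f} GiB, >= {ms:.1f} ms per decoded token")
```

Under these assumptions, reading a 128K-token cache once already costs on the order of ten milliseconds per decoded token on a single device, before any compute is counted.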
Three recent 2026 methods—Google and NYU’s TurboQuant (ICLR 2026), Together AI’s OSCAR, and Apple’s EpiCache—attack this problem from different angles. TurboQuant pushes the theoretical, model-agnostic frontier. OSCAR leads on deployable INT2. EpiCache solves conversational memory across turns. This briefing analyzes the strategic consequences of each approach and what they mean for cloud providers, hardware vendors, and enterprise adopters.
TurboQuant: The Theoretical Frontier
TurboQuant handles outliers without ever looking at your data. It randomly rotates each vector so coordinates become nearly independent and approximately Gaussian, then applies an optimal precomputed scalar (Lloyd–Max) quantizer per coordinate. A 1-bit Quantized Johnson–Lindenstrauss (QJL) transform on the residual gives a provably unbiased estimate of attention logits. The selling point is theoretical: TurboQuant’s distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel.
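The rotate-then-quantize idea can be illustrated with a small, simplified sketch. It is not TurboQuant's implementation: it uses a dense random rotation built by Gram-Schmidt, a fixed 2-bit codebook (the standard Lloyd-Max levels for a unit-variance Gaussian), and it omits the QJL residual stage entirely. All function names are hypothetical.

```python
import math
import random

# Lloyd-Max reconstruction levels for a unit-variance Gaussian at 2 bits.
GAUSS_2BIT_LEVELS = (-1.510, -0.4528, 0.4528, 1.510)


def random_rotation(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt applied to Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def rotate(rot, x):
    """Compute R @ x."""
    return [sum(r * a for r, a in zip(row, x)) for row in rot]


def rotate_back(rot, y):
    """Compute R^T @ y, the inverse of an orthogonal rotation."""
    dim = len(y)
    return [sum(rot[i][j] * y[i] for i in range(dim)) for j in range(dim)]


def quantize(x, rot=None):
    """Return (norm, codes); rot=None skips the rotation for comparison."""
    dim = len(x)
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0.0:
        return 0.0, [0] * dim
    y = rotate(rot, x) if rot is not None else list(x)
    scale = math.sqrt(dim) / norm  # rotated coordinates become roughly unit variance
    codes = [
        min(range(4), key=lambda k: abs(c * scale - GAUSS_2BIT_LEVELS[k]))
        for c in y
    ]
    return norm, codes


def dequantize(norm, codes, rot=None):
    dim = len(codes)
    y = [GAUSS_2BIT_LEVELS[k] * norm / math.sqrt(dim) for k in codes]
    return rotate_back(rot, y) if rot is not None else y


def relative_error(x, x_hat):
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return num / math.sqrt(sum(a * a for a in x))


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 64
    rot = random_rotation(dim, rng)
    x = [0.0] * dim
    x[3] = 10.0  # a single extreme outlier channel
    print("no rotation:  ", relative_error(x, dequantize(*quantize(x))))
    print("with rotation:", relative_error(x, dequantize(*quantize(x, rot), rot)))
```

The point of the comparison is that the same data-independent codebook handles an extreme outlier channel far better once the rotation has spread its energy across all coordinates, without inspecting any calibration data.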
Strategic consequence: TurboQuant is the go-to for near-lossless compression in the 3–4 bit regime, its documented sweet spot. It needs no calibration, runs on any model unmodified, and doubles as a fast vector-database quantizer. However, the widely repeated “8× faster attention on H100” figure comes from Google’s blog, not the paper, and refers to a narrow attention-logit microbenchmark. At INT2, TurboQuant drops by more than 40 points in OSCAR’s evaluation framework, though that evaluation quantizes all layers, uses a single random seed, and operates below TurboQuant’s intended bit-width.
OSCAR: Deployment-Ready INT2
OSCAR bets the opposite way. Its premise is that at INT2’s four levels, a data-oblivious rotation is the wrong tool. So OSCAR computes an attention-aware rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into the score-weighted value covariance. A Hadamard transform plus a bit-reversal permutation spreads channel importance evenly across quantization groups. What sets OSCAR apart is that it ships as a complete system: mixed-precision paged cache (sink and recent tokens stay in BF16 while history compresses to INT2—at 128K context only ~0.24% of tokens remain in BF16), fused Triton kernels with full SGLang integration, and precomputed rotations for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7.
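The mixed-precision layout can be made concrete with a small sketch. The sink and recent-window sizes below (64 and 256 tokens) are hypothetical values chosen only so that the BF16 share lands near the ~0.24% quoted above at 128K context; OSCAR's actual window sizes, page layout, and function names are not stated here and are not implied by this code. The average-bits helper also ignores per-group scale metadata, so it is not directly comparable to a reported effective bit-width.

```python
def token_precision(position: int, context_len: int, sink: int, recent: int) -> str:
    """Return 'bf16' for sink and recent tokens, 'int2' for the history between them."""
    if position < sink or position >= context_len - recent:
        return "bf16"
    return "int2"


def bf16_fraction(context_len: int, sink: int, recent: int) -> float:
    """Share of cached tokens that stay in BF16."""
    kept = min(context_len, sink + recent)
    return kept / context_len


def average_bits(context_len: int, sink: int, recent: int,
                 low_bits: float = 2.0, high_bits: float = 16.0) -> float:
    """Token-weighted average bit-width, excluding quantization metadata."""
    frac = bf16_fraction(context_len, sink, recent)
    return frac * high_bits + (1.0 - frac) * low_bits


if __name__ == "__main__":
    ctx, sink, recent = 128 * 1024, 64, 256  # hypothetical window sizes
    print(f"BF16 share: {100 * bf16_fraction(ctx, sink, recent):.3f}%")
    print(f"average bits (no metadata): {average_bits(ctx, sink, recent):.3f}")
```

Because the BF16 windows have a fixed size, their share of the cache shrinks as the context grows, which is why the full-precision tokens cost almost nothing at long context.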
At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8—where naive INT2 collapses to zero—OSCAR matches BF16. Together AI reports up to 7.83× job-level throughput and roughly 8× KV-cache memory reduction at 100K context, with up to ~3× faster decoding.
Strategic consequence: For deployable INT2 at 128K tokens on supported models, OSCAR is currently the only demonstrated option that doesn’t collapse, and it comes with production-ready SGLang support. Together AI gains a strong competitive edge in cloud inference, potentially capturing cost-sensitive customers who need long context with minimal accuracy loss.
EpiCache: Conversational Memory Across Turns
TurboQuant and OSCAR are both built for a single long context. Neither handles extended multi-turn conversations, where history piles up across many exchanges. Apple’s EpiCache is a training-free KV-cache management framework aimed exactly at that gap. It uses block-wise prefill to keep peak memory bounded, episodic clustering to segment conversation into coherent semantic “episodes,” episode-matched retrieval to route each query to the most relevant episode, and adaptive layer-wise budget allocation to distribute memory budget according to each layer’s sensitivity to eviction.
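Two of these components, episode-matched retrieval and layer-wise budget allocation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EpiCache's code: the episode centroids are assumed to come from an earlier clustering step over some embedding of the conversation, cosine similarity is an assumed matching rule, the per-layer sensitivity scores are hypothetical inputs, and the proportional split is a stand-in for whatever allocation rule the framework actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_episode(query_vec, centroids):
    """Index of the episode centroid most similar to the query embedding."""
    return max(range(len(centroids)), key=lambda i: cosine(query_vec, centroids[i]))


def allocate_layer_budgets(total_tokens, sensitivities):
    """Split a KV token budget across layers in proportion to sensitivity."""
    total = sum(sensitivities)
    budgets = [int(total_tokens * s / total) for s in sensitivities]
    # Hand tokens lost to rounding to the most sensitive layers first.
    leftover = max(total_tokens - sum(budgets), 0)
    order = sorted(range(len(sensitivities)), key=lambda i: -sensitivities[i])
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical episode centroids
    print("query routed to episode", match_episode([0.1, 0.9], centroids))
    print("per-layer budgets:", allocate_layer_budgets(1000, [1.0, 3.0, 6.0]))
```

Routing each query to one episode means only that episode's compressed cache needs to be resident, while the budget split gives layers that are more sensitive to eviction a larger share of the retained tokens.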
Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4–6× compression, up to 3.5× lower peak memory, and roughly 2.4× lower latency. Because it decides which tokens to keep rather than how precisely to store them, it composes directly with OSCAR or TurboQuant for compounding savings.
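A back-of-the-envelope way to think about the compounding is to multiply the two ratios. The helper below is an idealized sketch: it assumes eviction and quantization savings are fully independent, ignores full-precision windows and quantization metadata, and says nothing about accuracy when the two are stacked. The function name and example inputs are illustrative, not reported results.

```python
def combined_compression(eviction_ratio: float, bits_per_value: float,
                         baseline_bits: float = 16.0) -> float:
    """Idealized overall KV-cache compression when eviction and quantization stack."""
    return eviction_ratio * (baseline_bits / bits_per_value)


if __name__ == "__main__":
    # Example: keep one token in four, store the survivors at 4 bits instead of 16.
    print(f"{combined_compression(4.0, 4.0):.0f}x smaller than a full BF16 cache")
```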
Strategic consequence: EpiCache solves a problem neither quantizer addresses: multi-turn conversational memory. Apple can leverage this for on-device long-context LLMs, strengthening its AI ecosystem. For enterprises building chatbots or virtual assistants, EpiCache offers a path to maintain context across long sessions without blowing memory budgets.
Winners & Losers
Winners: Together AI gains a strong competitive edge in cloud inference with OSCAR’s reported throughput gain of up to 7.83× and roughly 8× KV-cache memory reduction. Apple strengthens its on-device AI with EpiCache’s memory and latency improvements. Google and NYU position themselves as leaders in quantization research, potentially licensing or integrating TurboQuant into Google Cloud.
Losers: NVIDIA faces pricing pressure if compression reduces demand for the memory capacity and bandwidth of its high-margin HBM-based GPUs. Eviction-based caching solutions (e.g., H2O, SnapKV) may become obsolete, as EpiCache reports up to 40% higher accuracy than eviction baselines and quantization methods shrink the cache without dropping tokens. Smaller cloud providers without proprietary compression may struggle to compete on cost and latency.
Second-Order Effects
The race shifts from model size optimization to inference-time memory management. Compression becomes a key differentiator for cloud and edge AI platforms. Expect consolidation: inference frameworks (vLLM, TensorRT-LLM) will integrate these methods, commoditizing basic quantization. Hardware vendors may respond with specialized memory architectures that reduce the need for aggressive compression. The most interesting possibility is that TurboQuant and OSCAR are complementary: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet.
Market / Industry Impact
For cloud providers, the ability to serve 1M-token contexts at reasonable cost will be a competitive differentiator. Together AI’s OSCAR gives it a lead in the short term, but Google’s TurboQuant offers broader generality. Apple’s EpiCache targets the growing market for on-device conversational AI. Enterprises should evaluate these methods based on their specific constraints: bit-width budget, model portability, and conversation length.
Executive Action
- If you need INT2 compression on supported models today, adopt OSCAR via SGLang for immediate throughput gains.
- If you need model-agnostic near-lossless compression at 3–4 bits, evaluate TurboQuant for its theoretical guarantees and no-calibration deployment.
- If you run multi-turn conversational agents, integrate EpiCache to maintain context without memory blow-up, and combine it with a quantizer for compounding savings.
Source: MarkTechPost
Intelligence FAQ
For deployable INT2 on supported models, OSCAR is currently the only demonstrated option that doesn’t collapse. For model-agnostic near-lossless compression at 3–4 bits, TurboQuant offers broader generality.
Both teams have noted that combining TurboQuant and OSCAR is promising, but nobody has shipped it yet. Pairing a calibration-aware rotation with an optimal scalar quantizer could yield even higher compression ratios.
EpiCache decides which tokens to keep rather than how precisely to store them. It composes directly with quantizers for compounding savings and is designed for multi-turn conversations.



