Google's TurboQuant Compresses AI Memory by 6x With No Accuracy Loss, Triggering a Selloff in Memory Chip Stocks
Google Research unveils TurboQuant, a training-free compression algorithm that reduces LLM key-value cache memory by 6x and boosts inference throughput up to 8x, sending SK Hynix and Samsung shares down 5-6 percent.
Overview
Google Research published a pair of papers on March 24 describing TurboQuant, a software-only compression technique that shrinks the key-value (KV) cache used by large language models to just 3 bits per value, down from the standard 16. The result, according to Google, is a 6x reduction in KV cache memory consumption and up to an 8x speedup in attention computation on Nvidia H100 GPUs, all with zero measurable accuracy loss. The findings, which are set to be formally presented at ICLR 2026 in late April, immediately rattled financial markets: shares of SK Hynix and Samsung fell roughly 5 to 6 percent, with Micron and SanDisk also declining, as investors recalculated near-term demand forecasts for high-bandwidth memory.
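For a sense of scale, a back-of-the-envelope calculation shows why the KV cache dominates memory at long context lengths. The model dimensions below are the published Llama-3.1-8B configuration (32 layers, 8 grouped-query KV heads, head dimension 128); they come from the model card, not from Google's announcement, and the 6x factor is Google's claim rather than something derived here.

```python
# Back-of-the-envelope KV cache sizing for Llama-3.1-8B (published config).
n_layers = 32         # transformer layers
n_kv_heads = 8        # grouped-query attention KV heads
head_dim = 128        # per-head dimension
bytes_fp16 = 2        # 16-bit baseline precision

# Both keys and values are cached, hence the leading factor of 2.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16

context = 128 * 1024  # 128K-token context window
baseline_gib = bytes_per_token * context / 2**30
compressed_gib = baseline_gib / 6  # Google's reported 6x reduction

print(f"{bytes_per_token / 1024:.0f} KiB per token")
print(f"{baseline_gib:.1f} GiB baseline -> {compressed_gib:.2f} GiB compressed")
```

With these figures the fp16 cache costs 128 KiB per token, or 16 GiB at a full 128K context; a 6x reduction brings that to roughly 2.7 GiB, small enough to free substantial HBM for batching or longer contexts.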
How TurboQuant Works
The algorithm employs a two-stage compression pipeline developed by Google Research scientist Amir Zandieh and VP and Google Fellow Vahab Mirrokni.
The first stage, called PolarQuant, randomly rotates data vectors and then converts them from Cartesian coordinates into polar form, separating each vector into a magnitude (“radius”) and a set of angles. Because the random rotation makes the resulting angle distributions highly concentrated and predictable, the technique can quantize the angles onto a single fixed grid, avoiding the per-vector scale recalibration that conventional quantization schemes demand, as detailed in Google’s technical blog post.
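In NumPy terms, that pipeline can be sketched roughly as follows. The rotation construction, the choice to quantize only the angles, and the 3-bit uniform grid are illustrative assumptions for this sketch, not details confirmed in Google's papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def to_polar(v):
    # Hyperspherical coordinates: one radius plus d-1 angles.
    r = np.linalg.norm(v)
    u = v / r
    angles = np.empty(len(v) - 1)
    for i in range(len(v) - 2):
        # Angle between coordinate i and the norm of the remaining tail.
        angles[i] = np.arctan2(np.linalg.norm(u[i + 1:]), u[i])  # in [0, pi]
    angles[-1] = np.arctan2(u[-1], u[-2])  # last angle keeps its sign
    return r, angles

def from_polar(r, angles):
    # Invert the hyperspherical map: running product of sines.
    u = np.empty(len(angles) + 1)
    s = 1.0
    for i, a in enumerate(angles):
        u[i] = s * np.cos(a)
        s *= np.sin(a)
    u[-1] = s
    return r * u

def quantize_angles(angles, bits=3):
    # One fixed uniform grid over [-pi, pi] for every vector: the random
    # rotation concentrates the angle distribution, so no per-vector
    # scale or zero-point calibration is needed.
    levels = 2 ** bits
    step = 2 * np.pi / levels
    idx = np.clip(np.round((angles + np.pi) / step), 0, levels - 1)
    return idx.astype(int), idx * step - np.pi  # codes, dequantized angles

d = 16
R = random_rotation(d)
v = rng.normal(size=d)
r, angles = to_polar(R @ v)           # rotate, then go polar
codes, deq = quantize_angles(angles)  # 3-bit codes on a fixed grid
v_hat = R.T @ from_polar(r, deq)      # undo rotation after reconstruction
```

The stored representation per vector is one radius plus one small integer code per angle; the fixed grid is what removes the normalization bookkeeping the article describes.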
The second stage, Quantized Johnson-Lindenstrauss (QJL), applies a mathematical projection that preserves distances in high-dimensional space to reduce the remaining approximation error to a single sign bit per value. A specialized estimator then balances the high-precision query against the low-precision stored data to recover accurate attention scores, according to the same post.
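A toy version of the sign-bit idea can make the asymmetry concrete: the query stays full precision while each stored key is reduced to its norm plus one sign bit per projected coordinate. The Gaussian projection, the projection size m, and the standard sqrt(pi/2) rescaling used for sign-based inner-product estimates are assumptions of this sketch; the coverage does not specify Google's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 64, 1024              # head dimension, projection dimension (illustrative)
S = rng.normal(size=(m, d))  # Gaussian JL projection shared by keys and queries

def encode_key(k):
    # Store only the key's norm plus one sign bit per projected coordinate.
    return np.linalg.norm(k), np.sign(S @ k)

def estimate_dot(q, norm_k, bits):
    # For Gaussian rows s, E[(s @ q) * sign(s @ k_hat)] = sqrt(2/pi) * <q, k_hat>,
    # so rescaling by sqrt(pi/2) * ||k|| recovers an unbiased estimate of <q, k>.
    return norm_k * np.sqrt(np.pi / 2) * np.mean((S @ q) * bits)

q = rng.normal(size=d)
k = rng.normal(size=d)
norm_k, bits = encode_key(k)
approx = estimate_dot(q, norm_k, bits)
exact = q @ k
print(f"exact={exact:.3f}  estimate={approx:.3f}")
```

The estimator's error shrinks as 1/sqrt(m), which is how a single bit per stored value can still yield accurate attention scores when the query side remains high precision.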
Critically, TurboQuant is both training-free and data-oblivious: organizations can apply it as a drop-in optimization layer to existing fine-tuned models, whether based on Llama, Mistral, or Google’s own Gemma, without retraining or access to model-specific data, as reported by Techi.
Benchmark Results
Google evaluated TurboQuant across standard long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source models such as Llama-3.1-8B-Instruct, Mistral-7B, and Gemma. The 4-bit variant delivered up to 8x faster attention logit computation on H100 GPUs, while the 3-bit configuration achieved at least a 6x memory footprint reduction, according to Google Research. Independent developers have since built working implementations in PyTorch, MLX for Apple Silicon, and C/CUDA for llama.cpp, reportedly achieving character-identical outputs compared to uncompressed baselines, per Techi.
Market Reaction
The practical implications of a 6x reduction in memory requirements sent shockwaves through the semiconductor sector. Shares of SK Hynix and Samsung, the world’s two largest memory chipmakers, fell approximately 6 percent and nearly 5 percent, respectively, in South Korean trading. In the United States, Micron and SanDisk also declined, according to Open Data Science. The selloff followed a year-long rally in memory stocks driven by supply constraints and surging AI demand, with the announcement forcing a recalculation of near-term high-bandwidth memory (HBM) demand forecasts.
This development arrives amid a period of intense activity in the memory sector. SK Hynix, Samsung, and Micron recently began mass production of HBM4, while Samsung has been pursuing multi-year supply agreements with Google and Microsoft to lock in AI chip supply.
What We Don’t Know
TurboQuant remains a research result, not yet a production system. TechCrunch characterized it as “a lab experiment for now,” and there is no confirmed timeline for integration into Google’s own products or cloud services beyond a planned open-source release timed to the ICLR 2026 presentation in late April.
It is also unclear how the technique will interact with emerging architectures designed around larger context windows and agentic workflows, which are themselves driving KV cache sizes upward. Analysts cited by Open Data Science argued that the selloff was driven more by profit-taking than by a fundamental demand shift, noting that efficiency improvements historically expand the scope of what AI systems can do rather than reduce overall hardware requirements.
Whether TurboQuant compresses away a meaningful slice of the HBM supercycle or simply moves the bottleneck elsewhere will depend on how quickly inference providers adopt it and how model architectures evolve in response.