AI Engineering · June 1, 2026 · 5 min read · PhotonSpark

TurboQuant and the KV cache problem

TurboQuant is a KV-cache compression technique from Google Research. The practical question is simple: can it buy enough context or concurrency to justify the latency cost on your own model server?

TurboQuant is easy to put in the wrong bucket.

Do not treat it as another way to shrink model weights. AWQ, GPTQ, FP8, GGUF and similar formats deal mostly with the model you load before traffic arrives. TurboQuant is aimed at the memory that grows while the server is working: the KV cache.

That distinction matters on real inference boxes. A model can load cleanly and still fall over once users send long prompts or too many requests at once. The cache, not the weights, becomes the wall.

The test I care about is narrow: does TurboQuant let the same GPU keep more active context in memory without making latency or output quality unacceptable?

The cache is the problem

When an LLM generates text, it keeps the key and value tensors from previous tokens. That cache saves work. Without it, every new token would require recomputing the keys and values for the entire preceding sequence.

The bill shows up in VRAM. Long prompts use more cache. Long conversations use more cache. Concurrent users multiply it. A 128K context setting is also a memory commitment.
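
To make that memory commitment concrete, here is a rough sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, the baseline is assumed to be 16 bits per cached value, and real engines add block and metadata overhead that this ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value=16.0):
    """Approximate KV-cache size in bytes for num_tokens cached tokens."""
    # Two tensors (key and value) per layer, one head_dim vector per KV head per token.
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

if __name__ == "__main__":
    per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)
    long_request = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=128 * 1024)
    print(f"per token: {per_token / 1024:.0f} KiB")
    print(f"one 128K-token request: {long_request / 2**30:.1f} GiB")
    print(f"eight such requests: {8 * long_request / 2**30:.1f} GiB")
```

Under these assumptions a single 128K-token request holds 16 GiB of cache, and eight of them hold 128 GiB, which is why concurrency and context length hit the wall before the weights do.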

This is why KV-cache quantization keeps showing up in serving frameworks. vLLM has FP8 KV-cache options. LMDeploy has INT4, INT8, and now TurboQuant support. Everyone is trying to keep more live tokens in memory.

What Google built

Google Research describes TurboQuant as an online vector quantization method. For LLM serving, the relevant use is KV-cache compression.

The method combines two ideas.

PolarQuant rotates the vectors first, then quantizes them. The rotation makes the data easier to compress with less wasted metadata.
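
A minimal sketch of why rotating first can help. This is not PolarQuant itself: it assumes a randomized Hadamard transform as the rotation and a plain 4-bit absmax quantizer, and every function name here is hypothetical. The synthetic vector has one large channel, which is the case where a single scale per vector wastes most of its levels.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


def rotate(x, signs):
    """Random sign flip followed by the Hadamard transform (an orthonormal map)."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Inverse of rotate(): the orthonormal Hadamard transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


def quantize_dequantize(x, bits=4):
    """Symmetric absmax quantizer: one scale per vector, codes in [-levels, levels]."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / levels or 1.0
    return [round(v / scale) * scale for v in x]


def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def compare(dim=64, outlier=50.0, seed=0):
    """Return (error without rotation, error with rotation) for one synthetic vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x[3] = outlier  # a single large channel, the case that hurts absmax scaling
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    plain = quantize_dequantize(x)
    rotated = unrotate(quantize_dequantize(rotate(x, signs)), signs)
    return squared_error(x, plain), squared_error(x, rotated)


if __name__ == "__main__":
    plain_err, rotated_err = compare()
    print(f"squared error without rotation: {plain_err:.2f}")
    print(f"squared error with rotation:    {rotated_err:.2f}")
```

Without the rotation, the outlier sets the scale and the small channels round to zero. After the rotation, the outlier's energy is spread across all coordinates, so the same 4 bits cover a much narrower range.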

QJL, or Quantized Johnson-Lindenstrauss, stores a small residual correction. That matters because attention depends on inner products. A quantizer can look fine under a normal error metric and still bias those inner products in a way the model feels.
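
The inner-product point can be illustrated with a sign-bit Johnson-Lindenstrauss estimator. This is a simplified sketch, not the TurboQuant pipeline: it applies the estimator directly to one vector rather than to a quantization residual, it uses far more projections than a real cache would store, and the function names are hypothetical. It relies on the identity E[sign(g·k)(g·q)] = sqrt(2/π)·⟨q,k⟩/‖k‖ for a standard Gaussian vector g.

```python
import math
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def sign_jl_inner_product(q, k, num_projections=4000, seed=0):
    """Estimate <q, k> while keeping only one sign bit per random projection of k."""
    rng = random.Random(seed)
    k_norm = math.sqrt(dot(k, k))  # stored once alongside the sign bits
    total = 0.0
    for _ in range(num_projections):
        g = [rng.gauss(0.0, 1.0) for _ in k]      # one Gaussian projection direction
        k_bit = 1.0 if dot(g, k) >= 0 else -1.0   # the single stored bit for k
        total += k_bit * dot(g, q)                # q stays in full precision
    # Rescale by sqrt(pi/2) * |k| so the average is an unbiased estimate of <q, k>.
    return math.sqrt(math.pi / 2) * k_norm * total / num_projections


if __name__ == "__main__":
    rng = random.Random(1)
    q = [rng.gauss(0.0, 1.0) for _ in range(16)]
    k = [rng.gauss(0.0, 1.0) for _ in range(16)]
    print(f"exact:    {dot(q, k):.3f}")
    print(f"estimate: {sign_jl_inner_product(q, k):.3f}")
```

The estimate is noisy but centered on the true inner product, which is the property attention scores need. A quantizer that only minimizes reconstruction error gives no such guarantee.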

That is the technical pitch: compress the cache hard, but preserve enough of the attention math that long-context quality does not collapse.

The research numbers

Google's March 2026 post reports tests on long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. The paper also reports runs with open models such as Gemma and Mistral.

The numbers worth keeping in mind:

Google reports at least 6x KV-cache memory reduction in its long-context needle tests.
The paper reports 3-bit KV quantization without training or fine-tuning in the tested setup.
Google reports up to 8x attention-logit speedup for 4-bit TurboQuant versus 32-bit unquantized keys on H100.
The paper reports quality neutrality at 3.5 bits per channel and marginal degradation at 2.5 bits per channel.

Those are research numbers. They are useful, but I would not use them as a customer capacity plan. Hardware, model architecture, context length, prompt mix and serving engine all change the outcome.

The LMDeploy benchmark is the one I would quote first

LMDeploy has a more deployment-shaped benchmark in its KV quantization docs. Their test uses:

H200
Qwen3-30B-A3B-Base
ShareGPT dataset
concurrency 64
5000 requests
quant_policy=0 for baseline
quant_policy=42 for TurboQuant K4V2

The reported table is less flashy than the research headline, which makes it more useful.

LMDeploy reports:

Input throughput: 2368.8 tok/s baseline, 2195.8 tok/s with TurboQuant, down 7.3%.
Output throughput: 2186.7 tok/s baseline, 2027.0 tok/s with TurboQuant, down 7.3%.
Request throughput: 10.74 req/s baseline, 9.96 req/s with TurboQuant, down 7.3%.
Mean end-to-end latency: 5.888s baseline, 6.348s with TurboQuant, up 7.8%.
Mean TTFT: 1.139s baseline, 1.235s with TurboQuant, up 8.4%.
Mean TPOT: 0.024s baseline, 0.026s with TurboQuant, up 8.3%.
Mean ITL: 0.059s in both runs, roughly unchanged.
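
The percentages can be recomputed from the reported values. A small sketch, with the numbers copied from the list above and `pct_change` as a hypothetical helper name:

```python
def pct_change(baseline, candidate):
    """Relative change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100


# (baseline, TurboQuant) pairs as reported in the LMDeploy table quoted above.
REPORTED = {
    "input tok/s": (2368.8, 2195.8),
    "output tok/s": (2186.7, 2027.0),
    "requests/s": (10.74, 9.96),
    "mean e2e latency (s)": (5.888, 6.348),
    "mean TTFT (s)": (1.139, 1.235),
    "mean TPOT (s)": (0.024, 0.026),
}

if __name__ == "__main__":
    for name, (baseline, candidate) in REPORTED.items():
        print(f"{name}: {pct_change(baseline, candidate):+.1f}%")
```

Running the same helper over your own baseline and TurboQuant runs keeps the comparison in the same units as the published table.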

LMDeploy's summary: about 5x KV-cache memory reduction with about 7-8% end-to-end overhead.

That is a real trade. If your service is blocked by cache memory, paying 8% latency to get much more context room may be a win. If your users already complain about latency and VRAM is not tight, it may be a bad swap.
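
The "about 5x" figure is consistent with simple bit arithmetic, under two assumptions that the text above does not confirm: that K4V2 means 4-bit keys and 2-bit values, and that the baseline cache stores 16 bits per value. The helper name is hypothetical, and scales or other per-block metadata are ignored.

```python
def kv_compression_ratio(key_bits, value_bits, baseline_bits=16):
    """Ideal cache compression ratio when keys and values use different bit widths."""
    # Keys and values have the same element count, so average their bit widths.
    return (2 * baseline_bits) / (key_bits + value_bits)


if __name__ == "__main__":
    print(f"K4V2 vs 16-bit baseline: {kv_compression_ratio(4, 2):.2f}x")
```

Any metadata stored next to the quantized values pulls the real ratio below this ideal figure, which would fit a reported number slightly under it.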

Where I would test it

I would look at TurboQuant for long-context private AI endpoints first.

Good candidates:

codebase assistants that read large files or many chunks
legal or medical review workflows with long source material
support agents that keep large ticket history in context
internal search tools with heavy retrieved context
batch jobs where memory limits matter more than first-token latency

I would be slower to use it on short chat endpoints. If the prompts are small and the GPU has plenty of headroom, KV-cache compression can add moving parts without solving a real problem.

Current limits

LMDeploy lists some constraints that matter. TurboQuant is tied to the PyTorch engine in their docs. It does not support MLA. It does not support speculative decoding. The attention head dimension needs to be a power of two. The optional fast_hadamard_transform package is recommended for better performance.

Those details decide whether this is usable on a given stack. They also explain why a benchmark from H200 plus Qwen3 does not automatically tell you what will happen on RTX 4090s, L40S, A100s, or a fine-tuned model.
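
Those constraints are easy to turn into a pre-flight check before any benchmarking. A minimal sketch: `StackInfo` and `turboquant_blockers` are hypothetical names, and the fields must be filled in by hand from your own model config and serving setup, since this does not query LMDeploy.

```python
from dataclasses import dataclass


@dataclass
class StackInfo:
    """Hand-filled description of a deployment (hypothetical structure)."""
    engine: str                      # e.g. "pytorch"
    head_dim: int                    # attention head dimension from the model config
    uses_mla: bool                   # multi-head latent attention
    uses_speculative_decoding: bool


def turboquant_blockers(stack):
    """Return the documented constraints this stack violates; empty means worth testing."""
    blockers = []
    if stack.engine != "pytorch":
        blockers.append("TurboQuant is tied to the PyTorch engine")
    if stack.uses_mla:
        blockers.append("MLA is not supported")
    if stack.uses_speculative_decoding:
        blockers.append("speculative decoding is not supported")
    # A positive integer is a power of two exactly when it has a single set bit.
    if stack.head_dim <= 0 or stack.head_dim & (stack.head_dim - 1):
        blockers.append("head dimension must be a power of two")
    return blockers


if __name__ == "__main__":
    print(turboquant_blockers(StackInfo("pytorch", 128, False, False)))
    print(turboquant_blockers(StackInfo("pytorch", 96, False, True)))
```

An empty list only means the stack passes the documented constraints. It says nothing about latency or quality on your hardware.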

A small test plan

Do not start by arguing about the paper. Start with a baseline.

bash

nvidia-smi
python3 --version
python3 -m venv .venv
. .venv/bin/activate
pip install lmdeploy fast_hadamard_transform

Run a small TurboQuant smoke test:

python

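from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

# quant_policy=42 selects TurboQuant K4V2, matching the LMDeploy benchmark above.
engine_config = PytorchEngineConfig(
    tp=1,                       # single GPU, no tensor parallelism
    cache_max_entry_count=0.8,  # fraction of free GPU memory reserved for the KV cache
    quant_policy=42,
)

pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
# Generation limits go through GenerationConfig rather than bare keyword arguments.
response = pipe.infer(
    "Read this long support thread and list the unresolved questions.",
    gen_config=GenerationConfig(max_new_tokens=128),
)
print(response.text)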

Then run the same prompt without KV quantization:

python

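from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

# quant_policy=0 disables KV-cache quantization; everything else stays identical.
engine_config = PytorchEngineConfig(
    tp=1,                       # single GPU, no tensor parallelism
    cache_max_entry_count=0.8,  # fraction of free GPU memory reserved for the KV cache
    quant_policy=0,
)

pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
# Generation limits go through GenerationConfig rather than bare keyword arguments.
response = pipe.infer(
    "Read this long support thread and list the unresolved questions.",
    gen_config=GenerationConfig(max_new_tokens=128),
)
print(response.text)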

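Watch VRAM during the longest prompt. Because cache_max_entry_count=0.8 reserves a fixed fraction of free GPU memory for the KV cache, peak VRAM may look similar in both runs; in that case the gain shows up as more tokens or more concurrent requests fitting in the same reservation: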

bash

watch -n 1 nvidia-smi
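
If you would rather log the peak than watch the screen, a small poller works. This sketch assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=memory.used` CSV output, which is reported in MiB; the function names are hypothetical.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output):
    """Return per-GPU memory.used values in MiB from nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def poll_peak_vram(duration_s=60.0, interval_s=1.0):
    """Poll nvidia-smi for duration_s seconds and return the highest per-GPU usage seen."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max(peak, max(parse_memory_used(out), default=0))
        time.sleep(interval_s)
    return peak


if __name__ == "__main__":
    # Run this in a second terminal while the inference script is working.
    print(f"peak memory.used: {poll_peak_vram()} MiB")
```

Start the poller before sending the longest prompt and let it run past the end of generation so the peak is not missed.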

Record the same values both times:

text

Baseline run
- Peak VRAM:
- TTFT:
- Tokens/sec:
- Max stable concurrency:
- Quality failures:

TurboQuant run
- Peak VRAM:
- TTFT:
- Tokens/sec:
- Max stable concurrency:
- Quality failures:

Notes
- Watch VRAM during the longest prompt.
- Use the same prompt set and max-token settings.
- Increase concurrency until errors or latency break the target.
- Mark missing facts, citation drift, and bad long-context recall.
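
TTFT and throughput can be filled in with a small timing wrapper around any streaming response. A sketch under assumptions: `time_stream` is a hypothetical helper, it counts streamed chunks rather than tokens, and it expects an iterator such as the one a streaming call like LMDeploy's `stream_infer` is assumed to return.

```python
import time


def time_stream(chunks, clock=time.perf_counter):
    """Consume a stream of chunks and report time to first chunk and overall rate."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock()  # the first chunk arriving marks TTFT
        count += 1
    end = clock()
    elapsed = end - start
    return {
        "ttft_s": None if first is None else first - start,
        "chunks": count,
        "chunks_per_s": count / elapsed if elapsed > 0 else 0.0,
    }


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a model stream: a slow first chunk, then faster ones.
        time.sleep(0.05)
        yield "first"
        for _ in range(4):
            time.sleep(0.01)
            yield "next"

    print(time_stream(fake_stream()))
```

If each chunk carries more than one token, divide the token count reported by the engine by the elapsed time instead of using the chunk rate.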

Use prompts from the product, not generic demo questions. If the product handles clinical notes, test clinical-note prompts. If it reads repositories, test real repository files. If it answers support tickets, use ugly support tickets.

When I would turn it on

I would enable TurboQuant only when the test says the cache is the bottleneck.

Keep it if the memory headroom unlocks longer context or more concurrent users and the latency hit stays inside the service target. Leave it off if the endpoint is latency-bound, quality gets weird on long prompts, or the extra memory is not needed.
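
That rule can be written down so the decision is made from the recorded numbers rather than from impressions. A minimal sketch: the `RunResult` fields mirror the recording template, the names are hypothetical, and using maximum stable concurrency as the capacity signal is an assumption (longer usable context would be an equally valid one). The values in the example are made up, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    ttft_s: float
    tokens_per_s: float
    max_stable_concurrency: int
    quality_failures: int


def keep_turboquant(baseline, candidate, max_ttft_s, max_extra_quality_failures=0):
    """Keep TurboQuant only if it adds capacity while latency and quality stay in bounds."""
    adds_capacity = candidate.max_stable_concurrency > baseline.max_stable_concurrency
    latency_ok = candidate.ttft_s <= max_ttft_s
    extra_failures = candidate.quality_failures - baseline.quality_failures
    quality_ok = extra_failures <= max_extra_quality_failures
    return adds_capacity and latency_ok and quality_ok


if __name__ == "__main__":
    # Made-up numbers purely to exercise the rule.
    baseline = RunResult(ttft_s=1.0, tokens_per_s=2000.0, max_stable_concurrency=16, quality_failures=1)
    candidate = RunResult(ttft_s=1.1, tokens_per_s=1850.0, max_stable_concurrency=48, quality_failures=1)
    print(keep_turboquant(baseline, candidate, max_ttft_s=1.5))
```

Set `max_ttft_s` from the service target before running the tests, so the threshold is not adjusted to fit the result.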

That is the PhotonSpark angle on it. Treat TurboQuant as something to benchmark when the KV cache is the reason the deployment cannot do what the customer needs.

