Introduction to KV Cache in LLMs
The KV cache is a critical component of LLM inference: it stores the key and value tensors produced by the self-attention mechanism so they do not have to be recomputed for every newly generated token. As LLMs scale to longer context windows and serve more concurrent users, the KV cache has become a primary memory bottleneck in production inference systems. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the KV cache alone can occupy up to 180 GB of memory.
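That figure follows directly from the cache's shape. The back-of-the-envelope sketch below is illustrative only: the layer count and hidden size are assumptions roughly matching a 30B-parameter decoder, and the exact number varies by architecture and precision.
num_layers = 48          # assumed decoder layers for a ~30B-parameter model
hidden_size = 7168       # assumed hidden dimension (num_heads * head_dim)
batch_size = 128
seq_len = 1024
bytes_per_elem = 2       # FP16 keys and values

# Keys and values (2x) are cached for every layer, token, and sequence in the batch
kv_cache_bytes = 2 * num_layers * hidden_size * batch_size * seq_len * bytes_per_elem
print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")   # ~180 GB
Estimating KV cache size from model shape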
Eviction-Based KV Cache Compression Techniques
Eviction-based techniques remove less important key-value pairs from the cache to make room for new ones. One popular approach is to keep only a sliding window of the most recent tokens, sized to the available memory budget. This strategy is simple to implement but may lead to performance degradation if the eviction policy is not carefully designed.
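A minimal sketch of the sliding-window strategy is shown below; the function name and tensor layout are illustrative, not a specific framework's API.
import torch

def evict_to_window(k_cache: torch.Tensor, v_cache: torch.Tensor, window: int):
    """Keep only the most recent `window` tokens (sliding-window eviction).

    Assumes caches shaped [batch, num_heads, seq_len, head_dim].
    """
    if k_cache.size(2) <= window:
        return k_cache, v_cache
    # Drop the oldest entries along the sequence dimension
    return k_cache[:, :, -window:, :], v_cache[:, :, -window:, :]
Sliding-window eviction of old KV entries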
💡 Eviction-Based Techniques
Eviction-based techniques are simple to implement but require careful design of the eviction policy to avoid performance degradation.
Quantization-Based KV Cache Compression Techniques
Quantization-based techniques reduce the precision of the key and value tensors to cut memory usage. One popular approach is to store keys and values as 16-bit floats or 8-bit integers instead of 32-bit floating-point numbers. This can yield significant memory savings but may also cause accuracy degradation if not carefully implemented.
import torch

tensor = torch.randn(10, 10)      # FP32 key or value tensor
quantized_tensor = tensor.half()  # cast to FP16, halving memory versus FP32
Quantizing a tensor to FP16 using PyTorch
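For the 8-bit case, a minimal symmetric per-tensor INT8 sketch is shown below; real systems typically use per-channel or per-group scales, so treat this as an illustration rather than a production scheme.
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: INT8 storage is 4x smaller than FP32
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

keys = torch.randn(10, 10)
q_keys, scale = quantize_int8(keys)
max_error = (dequantize_int8(q_keys, scale) - keys).abs().max()
Per-tensor INT8 quantization sketch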
Typical impact: 2x memory reduction, 10% accuracy degradation.

Low-Rank Methods for KV Cache Compression
Low-rank methods approximate the key and value tensors with products of lower-rank matrices to reduce memory usage. One popular approach is to use singular value decomposition (SVD) to factorize the cached tensors and keep only the largest singular values. Like quantization, this can yield significant memory savings, but aggressive rank reduction degrades accuracy.
import torch

tensor = torch.randn(10, 10)                              # key or value matrix to compress
U, S, Vh = torch.linalg.svd(tensor, full_matrices=False)  # factorize; truncating S gives a low-rank approximation
Performing SVD on a tensor using PyTorch
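To realize the memory savings, only the top-r factors are stored rather than the full matrix; a minimal sketch follows, with the rank r chosen purely for illustration.
import torch

def low_rank_factors(x: torch.Tensor, r: int):
    # Replace x with two thin rank-r factors A [m, r] and B [r, n]
    U, S, Vh = torch.linalg.svd(x, full_matrices=False)
    A = U[:, :r] * S[:r]      # fold singular values into the left factor
    B = Vh[:r, :]
    return A, B               # approximate reconstruction: A @ B

x = torch.randn(128, 128)
A, B = low_rank_factors(x, r=16)
# Storage drops from 128*128 to 2*128*16 elements, a 4x reduction
Rank-r factorization of a cached tensor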
Typical impact: 5x memory reduction, 5% accuracy degradation.
KV Cache Compression Techniques Comparison
| Technique | Memory reduction | Accuracy degradation |
|---|---|---|
| Eviction (sliding window) | Depends on window size / memory budget | Depends on eviction policy |
| Quantization (FP16/INT8) | 2x | 10% |
| Low-rank (SVD) | 5x | 5% |
🔑 Key Takeaway
KV cache compression is crucial for optimizing LLM inference and enabling deployment in resource-constrained environments, and the right technique depends on the use case. Eviction is the simplest to implement but requires a carefully designed eviction policy; quantization and low-rank factorization deliver larger memory savings but trade off some accuracy if applied too aggressively.