The KV Cache: Memory Usage in Transformers

Efficient NLP
21 Jul 2023 · 08:33

Summary

TLDR: The video explains why Transformer language models like GPT require so much memory, especially when generating longer text sequences. Without caching, self-attention would have to recompute the keys and values for all previous tokens every time a new token is generated, which is inefficient. The solution is to cache the key and value matrices so they don't have to be recomputed for each new token. This KV cache takes up a large portion of memory since it grows with the sequence length. There is also higher latency when processing the initial prompt, since the KV cache doesn't exist yet. Overall, the KV cache enables efficient autoregressive decoding, but it requires storing key and value matrices whose size grows linearly with sequence length.

Takeaways

  • Transformers require a lot of memory when generating long text; without caching, they would also have to recompute the key and value vectors for all previous tokens on each new token
  • The memory usage comes mostly from the key-value (KV) cache that stores the key and value vectors for previous tokens
  • The key matrix represents the previous context the model should attend to
  • The value matrix also represents the previous context and is applied as a weighted sum after the softmax
  • On each new token, only the query vector plus one new key and value are computed; the rest of the K and V matrices come from the cache
  • Using a KV cache reduces the work per new token to a constant amount instead of a total that grows quadratically with sequence length
  • For a 30B parameter model, the KV cache takes 180GB for a sequence length of 1024 at batch size 128 - 3x the model size!
  • The KV cache dominates memory usage during inference
  • There is higher latency when processing the initial prompt because no KV cache exists yet
  • The KV cache allows lower latency when generating each new token after the prompt

Q & A

  • Why do transformer language models require so much memory when generating long text?

    -As more text is generated, more key and value vectors need to be computed and cached, taking up more and more GPU memory. Without the cache, recomputing them for every new token would also require a quadratic number of matrix-vector multiplications, which is very inefficient.

  • What are the key, value and query vectors in the self-attention mechanism?

    -The query vector represents the current token, the key matrix represents the previous context tokens, and the value matrix also represents the previous context but is applied as a weighted sum after the softmax.

  • How does the KV cache help reduce memory usage and computations?

    -The KV cache stores previously computed key and value matrices so they don't need to be recomputed for every new token. Only the key and value for the new token are computed, significantly reducing computation.

  • Where in the transformer architecture is the KV cache used?

    -The KV cache is used in the self-attention layer. The cache and the current token embedding are passed in, and the new key and value vectors are computed and appended to the cache.

  • What factors contribute to the memory usage of the KV cache?

    -The factors are: a factor of 2 for the two matrices K and V, the precision (bytes per parameter), the number of layers, the embedding dimension per layer, the maximum sequence length (prompt plus generated tokens), and the batch size. A short code sketch after this Q&A section works through the formula.

  • Why does processing the initial prompt have higher latency?

    -For the initial prompt, there is no KV cache yet so the key and value matrices need to be computed for every prompt token. Subsequent tokens have lower latency since only the new token's KV is computed.

  • How large can the KV cache get for a typical transformer model?

    -For a 30 billion parameter model with a sequence length of 1024 and a batch size of 128, the KV cache can be 180GB while the model itself is 60GB, so 3x larger.

  • Why does the KV cache dominate memory usage during inference?

    -During inference, the KV cache holds the previously computed key and value matrices so they don't get recomputed. The model weights are a fixed size, but the cache keeps growing with the sequence length and batch size, so it can easily become the largest consumer of memory.

  • Does the KV cache reduce computational complexity for autoregressive decoding?

    -Yes, without the KV cache, a quadratic number of matrix-vector multiplications would be needed. The KV cache reduces this to a constant amount of work per token, independent of past sequence length.

  • Are there other memory optimizations used with transformers?

    -Yes, other optimizations include using lower-precision data types, gradient checkpointing, and knowledge distillation to compress models. But the KV cache addresses a key scalability issue.
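
For concreteness, here is a minimal Python sketch of the memory formula discussed above. The code is illustrative, not from the video; the function name and defaults are assumptions that reproduce the 30B-parameter example (2 bytes per value, 48 layers, an embedding dimension of about 7,000, sequence length 1024, batch size 128):

    def kv_cache_bytes(precision_bytes, n_layers, d_model, seq_len, batch_size):
        """Estimate KV cache size: 2 matrices (K and V), one entry per layer, token, and batch element."""
        return 2 * precision_bytes * n_layers * d_model * seq_len * batch_size

    # Example from the video: a 30B-parameter model served in 16-bit precision.
    size = kv_cache_bytes(precision_bytes=2, n_layers=48, d_model=7000,
                          seq_len=1024, batch_size=128)
    print(f"KV cache: {size / 1e9:.0f} GB")  # ~176 GB, i.e. roughly 180 GB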

Outlines

00:00

Why Transformers require so much memory

This paragraph explains why Transformer language models require a lot of memory as they generate more text. As more text is generated, more GPU memory is used up to store the key and value matrices representing the previous context; without caching, recomputing those matrices for every new token would make the total computation grow quadratically. A key-value (KV) cache is introduced to avoid this redundant work.

05:03

How the KV cache works

This paragraph provides details on how the KV cache works with the Transformer model during text generation. For each new token, the KV cache stores previous key and value matrices so they don't have to be recomputed. Only the keys and values for the new token are computed. This greatly reduces the computations per token as the sequence grows longer.
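
As an illustration of this step (not code from the video; the array shapes and names are assumptions), here is a minimal NumPy sketch of one KV-cached decoding step for a single attention head:

    import numpy as np

    def attend_with_cache(x, W_q, W_k, W_v, k_cache, v_cache):
        """One decoding step: compute q, k, v for the new token only and reuse the cached K and V."""
        q = x @ W_q                                  # query vector for the current token
        k_new = x @ W_k                              # one new key vector
        v_new = x @ W_v                              # one new value vector
        k_cache = np.vstack([k_cache, k_new])        # append a new row to the cached key matrix
        v_cache = np.vstack([v_cache, v_new])        # append a new row to the cached value matrix
        scores = k_cache @ q / np.sqrt(q.shape[-1])  # dot product between the query and every cached key
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax over all cached positions
        out = weights @ v_cache                      # weighted sum of the cached values
        return out, k_cache, v_cache

Starting from empty caches of shape (0, d_k) and (0, d_v), calling this once per generated token computes only one new key and one new value per step while still attending to the entire previous context.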

Keywords

Transformer

Transformers are a type of neural network architecture used in natural language processing models like GPT. They are able to generate coherent text but have high memory usage when generating longer text sequences, which is a key limitation.

GPU memory

As Transformer models generate more text, they use up more GPU memory to store the internal representations and previous context. Eventually the GPU memory is exhausted, causing the program to crash.

KV cache

The key-value (KV) cache stores the key and value vectors of previous tokens so that each new token only needs to compute its own query, key, and value. This avoids redundant computation and keeps the work per token constant.

key matrix

The key matrix in Transformer attention represents previous context tokens. It interacts with the query vector via dot product to determine relevance for the current token.

value matrix

The value matrix contains representations of previous tokens. It is weighted and summed using the attention distribution to propagate relevant context.

query vector

The query vector represents the current decoded token in Transformer auto-regressive generation. It attends to the key matrix to focus on relevant context.

attention mechanism

The Transformer attention mechanism compares the query vector and key matrix via dot product. The result is normalized with a softmax into a distribution and applied as a weighted sum over the value matrix.
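
In equation form, this is the standard scaled dot-product attention, with q the query vector for the current token, K and V the cached key and value matrices, and d_k the key dimension:

    \mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d_k}}\right) V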

decoding

During Transformer decoding, tokens are generated autoregressively one by one. The KV cache stores the keys and values of previous tokens, so each step only computes them for the latest token while still attending to the full context.

memory usage

Storing the key and value matrices for the full sequence being generated requires a large amount of memory. The KV cache trades this memory for a large reduction in redundant computation, and it dominates memory usage during inference.

latency

Processing the initial prompt has higher latency since no cache exists yet. Subsequent tokens have lower latency by only computing keys and values for the latest token.

Highlights

Transformers require a lot of memory when generating long text

OpenAI charges more for models that can handle longer context

Most memory is used by the key-value (KV) cache

The query vector represents the current token; the key and value matrices represent the previous context

As the sequence grows, existing entries of the K and V matrices do not change; only a new column and row are appended

Without caching, the keys and values for all previous tokens would be recomputed on every step

The KV cache stores previous key and value matrices in memory

Only the self-attention layer interacts with the cache

KV cache dominates memory usage during inference

First token has higher latency since no cache exists yet

Subsequent tokens have lower latency by using the cache

The KV cache memory equation depends on precision, number of layers, embedding dimension, sequence length, and batch size

A 30B parameter model needs 180GB for its KV cache at sequence length 1024 and batch size 128

The KV cache takes 3x as much memory as the model parameters

Difference in latency for first token vs subsequent tokens

Transcripts

00:00

Transformers are used almost everywhere in natural language processing. Models like GPT can write pages of coherent text, something that is really impressive. But one fundamental limitation of Transformer language models is that as you generate more and more text, you use up more and more GPU memory, and eventually you reach a point where your GPU runs out of memory, your program crashes, and you cannot generate any more text.

00:30

My name is Bai. I'm a machine learning engineer with a PhD in natural language processing, and today I will explain why this is: why do Transformer language models require so much more memory when they deal with longer text?

00:45

Actually, if you look at OpenAI's API pricing for GPT, you see something interesting: they charge you twice as much per input token to use the longer-context model. This is one of the economic consequences of the high memory usage when you need to handle large context lengths. Most of the memory usage is taken up by the KV cache, or key-value cache. In this video I explain exactly what this is and why we need it.

01:15

Before I explain the KV cache, let's quickly go over what happens in the self-attention mechanism when we generate a sentence. At the beginning of a Transformer layer, each token corresponds to an embedding vector x. The first thing that happens is that x is multiplied by three different matrices to generate the query, key, and value vectors. These three matrices, denoted W_Q, W_K, and W_V, are learned from data.

01:47

During decoding, these three are not the same size. In fact, the query q is usually a vector, but K and V are matrices. This is how I like to think about it: the query vector represents the new token in this decoder step, and since there is only one token, this is a vector instead of a matrix. The key matrix represents all the previous context that the model should attend to, and the value matrix also represents all the previous context but is applied after the softmax as a weighted sum.

02:25

During the attention mechanism, we first take a dot product between the query vector and the key matrix, then we take a softmax and apply that as a weighted sum over the value matrix. In autoregressive decoding we are generating one word at a time given all of the previous context, so the K and V matrices contain information about the entire sequence, but the query vector only contains information about the last token that we have seen.

02:56

You can think of the dot product between q and K as doing attention between the current token that we care about and all of the previous tokens at the same time. As we generate a sequence one token at a time, the K and V matrices actually don't change very much. This token corresponds to a column of the K matrix and a row of the V matrix, and the crucial thing is that once we've computed the embedding for this word, it's not going to change again, no matter how many more words we generate. But the model still has to do the heavy work of computing the key and value vectors for this word on all subsequent steps. This results in a quadratic number of matrix-vector multiplications, which is going to be really slow.

03:45

As an analogy, imagine if you were a model writing a sentence one word at a time, but for each word you write, you have to reread every word that you've written before and then use that information to generate the next word. Obviously this is extremely inefficient, and it would be much better if you could somehow remember what you wrote as you're writing it.

04:08

Now we're finally ready to explain how the KV cache works. When the model reads a new word, it generates the query vector as before, but we cache the previous values of the key and value matrices, so we no longer have to compute these vectors for the previous context. Instead, we only have to compute one new column for the key matrix and one new row for the value matrix, and then we proceed with the dot product and softmax as usual to compute the scaled dot-product attention.

04:42

By the way, if you like this video so far, please give me a thumbs up to feed the YouTube algorithm, and subscribe to my channel. Now let's talk about how the KV cache fits in with the rest of the Transformer.

04:55

Here we have the self-attention layer, and instead of passing in a whole sequence of embeddings, now we only pass in the previous KV cache and the embedding for the current token. The self-attention layer computes the new key and value vectors for the current token and appends them to the KV cache. We will then need to store these key and value matrices somewhere in the GPU's memory so that we can retrieve them later when we're working on the next token.

05:27

Notice that the only part of the model where the current token interacts with the previous tokens is the self-attention layer. In every other layer, such as the positional embedding, the layer norm, and the feed-forward neural network, there is no interaction between the current token and the previous context. So when we're using the KV cache, we only have to do a constant amount of work for each new token, and this work does not get bigger when the sequence gets longer.

05:55

Now let's look at how much memory it takes to store the KV cache. Here is the equation for the memory usage. First we have a factor of 2, because there are two matrices, K and V, that we need to store. Precision is the number of bytes per parameter; for example, in fp32 there are four bytes per parameter. Layers is the number of layers in the model, and d_model is the dimension of the embeddings in each layer. The sequence length is the length that we need to generate at the end, including all of the prompt tokens and everything that we generate. Finally, batch is the batch size. We multiply these all together to get the total memory usage of the KV cache.

06:44

Let's walk through an example involving a 30-billion-parameter model, which nowadays is considered medium-large. We have to store two matrices, K and V. Typically the precision is 2, because inference is done in 16 bits and not 32. The number of layers in this model is 48, the embedding dimension of this model is around 7,000, and let's say that we cap the max sequence length at 1024 and we use a batch size of 128. If we multiply everything together, we get that the KV cache for this model is 180 gigabytes.
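
As a sanity check on that figure, here is the memory equation from the video written out with the example numbers plugged in; the embedding dimension is only quoted as "around 7,000", so the result is approximate:

    \text{KV cache bytes} = 2 \times \text{precision} \times n_{\text{layers}} \times d_{\text{model}} \times \text{seq\_len} \times \text{batch}

    2 \times 2 \times 48 \times 7000 \times 1024 \times 128 \approx 1.76 \times 10^{11}\ \text{bytes} \approx 180\ \text{GB}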

07:25

And the model itself is 2 bytes times 30 billion parameters, which is 60 gigabytes, so you can see that the KV cache takes up three times as much memory as the model itself. This sort of ratio is pretty typical for inference scenarios, and the KV cache tends to be the dominant factor in memory usage during inference.

07:47

One more thing to be aware of is the difference in latency when processing the prompt versus subsequent tokens. When the model is given the prompt and is deciding the first token to generate, this has higher latency, because there is no KV cache yet, so it has to compute the K and V matrices for every token in the prompt. But after this has been done, each subsequent token will have lower latency, because it only has to compute K and V for one token.

08:16

That's it for the KV cache. If you have any questions, please don't hesitate to leave a comment below, and if you found this content helpful, please like and subscribe to my channel so you can get notified when I make new machine learning videos. It will help me out a lot. Goodbye.