This summary captures our entire discussion about KV cache, pruning strategies, attention mechanisms, and LLM inference.
Score = Element × Input
KV cache element: 0.8
Query input: 0.1
Score: 0.8 × 0.1 = 0.08 (LOW - can prune)
vs.
KV cache element: 0.3
Query input: 0.9
Score: 0.3 × 0.9 = 0.27 (HIGHER - should keep)

Token "queen" input: [0.5, 0.8, 0.2, 0.9, ...]
After transformations:
Q = [0.000001, 1, 3, ...] # "I need person/musician info, not royal info"
K = [100, 2, 4, ...] # "I'm very relevant for royal queries"
V = [1, 2, 3, ...] # "I contain: royal=1, person=2, musician=3"

Input: "System: You are helpful. User: What is the capital of France?"
Prefilling:
- Process all 17 input tokens at once
- Build KV cache: [17 × hidden_size]
- Generate first token: "The"
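
The prefill step above can be sketched roughly as follows. This is a minimal single-layer, single-head illustration in NumPy, not a real model: the embeddings and projection weights are random, `hidden_size = 8` is an arbitrary toy value, and the 17-token count is taken from the example rather than from an actual tokenizer. Real models keep separate K and V caches for every layer, so the `[17 × hidden_size]` shape describes one such cache.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden_size = 17, 8  # 17 is from the example; 8 is an arbitrary toy size

# Hypothetical prompt embeddings and projection weights (random, illustration only).
x = rng.normal(size=(num_tokens, hidden_size))
w_q = rng.normal(size=(hidden_size, hidden_size))
w_k = rng.normal(size=(hidden_size, hidden_size))
w_v = rng.normal(size=(hidden_size, hidden_size))

# Prefill: project all prompt tokens in one batched matmul and keep K and V.
q = x @ w_q
k_cache = x @ w_k
v_cache = x @ w_v

# Causal attention: each token may only attend to itself and earlier tokens.
scores = q @ k_cache.T / np.sqrt(hidden_size)
future = np.triu(np.ones((num_tokens, num_tokens), dtype=bool), k=1)
scores[future] = -np.inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_cache  # the last row would feed the prediction of the first token

print(k_cache.shape, v_cache.shape)  # (17, 8) (17, 8)
```

The point of the sketch is that prefill is one parallel pass over the prompt, and its by-products (`k_cache`, `v_cache`) are what the decoding phase reuses.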
Decoding:
Time 1: Add "The" → Cache: [18 × hidden_size] → Generate "capital"
Time 2: Add "capital" → Cache: [19 × hidden_size] → Generate "of"
Time 3: Add "of" → Cache: [20 × hidden_size] → Generate "France"
...continue until the response is complete

Together, these ideas (score-based KV cache pruning, the roles of Q, K, and V, and the prefill and decode phases) provide the foundation for working with modern LLM inference optimizations and for understanding the trade-offs they make between efficiency and quality.
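
The decoding steps and the `Score = Element × Input` rule from this summary can be sketched together as below. Everything here is an assumption made for illustration: the weights and embeddings are random, `attend` and `prune_scores` are hypothetical helper names, the `tanh` line stands in for sampling and embedding the next token, and the 0.1 threshold is arbitrary. It is a single-head toy, not how any particular library implements caching or pruning.

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Softmax attention of one query vector over all cached keys and values."""
    scores = k_cache @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache

def prune_scores(cache, query):
    """Elementwise Score = |cache element| * |query input| (broadcast over rows)."""
    return np.abs(cache) * np.abs(query)

rng = np.random.default_rng(0)
hidden_size = 8  # arbitrary toy size
w_q, w_k, w_v = (rng.normal(size=(hidden_size, hidden_size)) for _ in range(3))

# Assume prefill has already cached K and V for the 17 prompt tokens.
prompt = rng.normal(size=(17, hidden_size))
k_cache, v_cache = prompt @ w_k, prompt @ w_v

cache_lengths = []
x = rng.normal(size=hidden_size)  # stand-in embedding for the first generated token
for step in range(3):
    # Append only the new token's K and V; earlier rows are reused, not recomputed.
    k_cache = np.vstack([k_cache, x @ w_k])
    v_cache = np.vstack([v_cache, x @ w_v])
    cache_lengths.append(k_cache.shape[0])
    context = attend(x @ w_q, k_cache, v_cache)
    x = np.tanh(context)  # placeholder for "sample the next token and embed it"

print(cache_lengths)  # [18, 19, 20]

# The two cases from the summary: 0.8 * 0.1 and 0.3 * 0.9.
example = prune_scores(np.array([0.8, 0.3]), np.array([0.1, 0.9]))

# Applied to the toy cache: elements scoring below the threshold are pruning candidates.
keep_mask = prune_scores(k_cache, x @ w_q) >= 0.1
```

In `example`, the larger cache element (0.8) gets the lower score because its paired query input is small, which is the point of weighting by the input rather than pruning on magnitude alone.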