Compression Stack

Technique               Reduction
GQA-8                   8x
INT4 quantization       2x
MoD (12.5% active)      8x
StreamingLLM eviction   2x
Total (SOTA)            ~256x
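The techniques compose multiplicatively, so the total reduction is just the product of the per-technique factors. A minimal sketch (factors taken from the table above; the dictionary layout is illustrative):

```python
# Per-technique reduction factors from the compression stack above.
STACK = {
    "GQA-8": 8,               # grouped-query attention, 8x fewer KV heads
    "INT4 quantization": 2,   # factor as stated in the table
    "MoD (12.5% active)": 8,  # mixture-of-depths: 1 / 0.125 = 8x
    "StreamingLLM eviction": 2,
}

def total_reduction(stack):
    """Total compression is the product of the individual factors."""
    total = 1
    for factor in stack.values():
        total *= factor
    return total

print(total_reduction(STACK))  # 8 * 2 * 8 * 2 = 256
```

Multiplying the factors reproduces the ~256x figure; note this assumes the techniques are fully orthogonal, which is the optimistic (SOTA) reading.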

2030 Projection

[Figure: 2030 projection of memory vs. model size, comparing the high bound against SOTA (achievable) and the resulting compression factor]

Mathematical Foundation

High Bound:  M = (T/2) * sqrt(2*N*L)
Chinchilla:  N(t) = N0 * 2^((t - 2020) / 1.8)   (model size doubles every 1.8 years)
Layers:      L ∝ N^0.25
SOTA:        M_SOTA = M / (8*8*2*2) = M / 256   (the four stacked reductions above)

Here N is the parameter count, L the layer count, T the token count, and M the resulting high-bound memory; the SOTA line divides out the full compression stack.
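The four formulas chain directly into a projection: grow N(t), derive L from N, plug both into the high bound, then divide by the stack. A sketch under stated assumptions: the source gives neither N0 nor the proportionality constant for L, so `N0` and `layer_coeff` below are illustrative placeholders, not values from the document.

```python
import math

def projected_memory(t, T, N0=1e11, layer_coeff=0.15):
    """High-bound and SOTA memory at year t, in the source's units.

    N0 and layer_coeff are hypothetical example values: the source only
    states N(t) = N0 * 2^((t-2020)/1.8) and L ∝ N^0.25 without constants.
    """
    N = N0 * 2 ** ((t - 2020) / 1.8)    # Chinchilla-style parameter growth
    L = layer_coeff * N ** 0.25         # layers scale as N^0.25
    M = (T / 2) * math.sqrt(2 * N * L)  # high-bound memory
    return M, M / 256                   # SOTA stack: M / (8*8*2*2)

high, sota = projected_memory(2030, T=128_000)
```

By construction the high-bound and SOTA curves differ by exactly the 256x stack at every year; only the constants shift the absolute scale.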

References

Scaling Laws Meet Model Architecture: inference-efficient LLM scaling analysis.
Mixture-of-Depths: dynamic compute allocation (8x reduction).
Chinchilla Scaling Laws: Hoffmann et al. (2022).