Compression Stack

| Technique | Reduction |
|---|---|
| GQA-8 | 8x |
| INT4 quantization | 2x |
| MoD (12.5% active) | 8x |
| StreamingLLM eviction | 2x |
| **Total SOTA** | **~256x** |
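The total is just the product of the individual factors, since each technique compresses the KV cache along an independent axis. A minimal sketch, using the reduction factors from the table above:

```python
from math import prod

# Per-technique KV-cache reduction factors, as stated in the stack above
factors = {
    "GQA-8": 8,                  # grouped-query attention
    "INT4 quantization": 2,      # table assumes 2x, not the 4x of FP16->INT4
    "MoD (12.5% active)": 8,     # Mixture-of-Depths dynamic compute
    "StreamingLLM eviction": 2,  # sliding-window cache eviction
}

# Stacked compression is multiplicative across independent axes
total = prod(factors.values())
print(total)  # 256
```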
2030 Projection

[Chart: projected series for Model Size, High Bound, SOTA (achievable), and Compression; values not preserved in extraction.]
Mathematical Foundation

- High Bound: M = (T/2) * sqrt(2 * N * L)
- Chinchilla: N(t) = N0 * 2^((t - 2020) / 1.8)
- Layers: L ∝ N^0.25
- SOTA: M / (8 * 8 * 2 * 2) = M / 256
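The four formulas above compose into a single projection: parameter count N grows on a Chinchilla-style doubling schedule, layer count L scales as N^0.25, the high-bound memory M follows from T, N, and L, and the SOTA estimate divides M by the 256x stack. A minimal sketch, where the baseline N0, the sequence length T, and the layer proportionality constant k are assumed placeholder values, not figures from the source:

```python
import math

def chinchilla_params(year, n0=70e9, doubling_years=1.8):
    # N(t) = N0 * 2^((t - 2020) / 1.8); n0 is an assumed 2020 baseline
    return n0 * 2 ** ((year - 2020) / doubling_years)

def num_layers(n, k=1.0):
    # L ∝ N^0.25; k is an assumed proportionality constant
    return k * n ** 0.25

def high_bound_memory(t, n, l):
    # High Bound: M = (T/2) * sqrt(2 * N * L)
    return (t / 2) * math.sqrt(2 * n * l)

def sota_memory(m, compression=8 * 8 * 2 * 2):
    # SOTA: M / (8 * 8 * 2 * 2) = M / 256
    return m / compression

# Example projection for 2030 under the assumed constants
n = chinchilla_params(2030)
l = num_layers(n)
m = high_bound_memory(t=128_000, n=n, l=l)
print(m, sota_memory(m))
```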
References

- Scaling Laws Meet Model Architecture — inference-efficient LLM scaling analysis
- Mixture-of-Depths — dynamic compute allocation (8x reduction)
- Chinchilla Scaling Laws — Hoffmann et al. (2022)