Training Memory Explosion

Watch how GPU memory is consumed by weights, gradients, optimizer states, and activations as training progresses

Training Configuration

Model: Llama-3.2-1B
Parameters: 1.2B
Batch Size: 4
Sequence Length: 1024
Learning Rate: 5e-5
Step: 0 / 100K
Total Memory: 0 GiB (grows toward the breakdown below as the run advances)
GPU Required: 1× H100
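The configuration above is compact enough to state in code. Below is a minimal sketch (the `TrainConfig` dataclass and its field names are hypothetical, not part of the tool) that also derives the tokens processed per step and across the full 100K-step run:

```python
from dataclasses import dataclass

# Hypothetical container for the run configuration shown above.
@dataclass
class TrainConfig:
    model: str = "Llama-3.2-1B"
    n_params: float = 1.2e9
    batch_size: int = 4
    seq_len: int = 1024
    learning_rate: float = 5e-5
    total_steps: int = 100_000

cfg = TrainConfig()
tokens_per_step = cfg.batch_size * cfg.seq_len     # 4 * 1024 = 4,096
total_tokens = tokens_per_step * cfg.total_steps   # ~410M tokens overall
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```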

Interconnect Impact

Throughput: 100%
Step Time: 250 ms
Data Movement: 0.8 GB/s
Efficiency: 95%
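One way to read these four numbers together: efficiency is the fraction of each step spent computing rather than waiting on the interconnect. The sketch below illustrates that model; the compute/communication split and the overlap fraction are illustrative assumptions, while the 250 ms step time and 0.8 GB/s rate come from the panel above.

```python
# Toy step-time model: a step is compute plus the part of communication
# that is not overlapped with it. Constants marked below are assumptions.

def step_efficiency(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Fraction of the step spent on useful compute."""
    exposed_comm_ms = comm_ms * (1.0 - overlap)
    return compute_ms / (compute_ms + exposed_comm_ms)

step_ms = 250.0        # step time, from the panel
rate_gb_s = 0.8        # data movement, from the panel
gb_per_step = rate_gb_s * step_ms / 1000.0
print(f"{gb_per_step:.2f} GB moved per step")        # 0.20 GB

# 95% efficiency falls out of, e.g., 237.5 ms of compute plus 25 ms of
# communication with half of it hidden behind compute (assumed numbers):
print(f"{step_efficiency(237.5, 25.0, 0.5):.0%}")    # 95%
```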

Memory Breakdown

Model Weights: 16 GiB
Gradients: 16 GiB
Optimizer (Adam): 32 GiB
Activations: 8 GiB
Total: 72 GiB (fits on a single 80 GB H100)
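The split above follows the classic full-precision ratio: gradients match the weights, and Adam's two moment buffers take twice the weights. Here is a minimal sketch of that accounting, parameterized by bytes per value so the same function covers fp32 and mixed-precision recipes (the byte counts and the activation note are standard rules of thumb, not values read from the tool):

```python
GiB = 1024 ** 3

def training_memory_gib(n_params: int,
                        weight_bytes: int = 4,   # fp32 weights
                        grad_bytes: int = 4,     # fp32 gradients
                        optim_bytes: int = 8):   # Adam m and v in fp32
    """Static training memory, in GiB, excluding activations."""
    weights = n_params * weight_bytes / GiB
    grads = n_params * grad_bytes / GiB
    optim = n_params * optim_bytes / GiB
    return weights, grads, optim

# Activations come on top and scale roughly with
# batch_size * seq_len * hidden_size * n_layers (checkpointing shrinks this).
w, g, o = training_memory_gib(1_200_000_000)
print(f"weights {w:.1f} GiB, grads {g:.1f} GiB, optimizer {o:.1f} GiB")
# -> weights 4.5 GiB, grads 4.5 GiB, optimizer 8.9 GiB for 1.2B params
```

The 1:1:2 ratio of weights to gradients to optimizer state is what the panel shows; the absolute figures depend on the parameter count and precision the visualizer assumes.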

Training Loss and Training Impact (live panels in the interactive view)

Source: kvcache-view (Training Memory Visualization)