Training Memory Explosion
Watch the GPU memory consumed by weights, gradients, optimizer states, and activations
Training Configuration
Model: Llama-3.2-1B
Parameters: 1.2B
Batch Size: 4
Sequence Length: 1024
Learning Rate: 5e-5
Step: 0 / 100K
Total Memory: 0 GiB
GPU Required: 1× H100
Interconnect Impact
Throughput: 100%
Step Time: 250 ms
Data Movement: 0.8 GB/s
Efficiency: 95%
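A minimal sketch of the relationship behind this panel, under my own assumptions (not the simulator's formula): if communication over the link does not overlap compute, the step time stretches by the transfer time, and efficiency is the fraction of the step spent computing. The compute time, bytes per step, and usable PCIe 5.0 bandwidth below are illustrative values.

```python
# Assumed model: non-overlapped communication added on top of compute time.
def interconnect_impact(compute_ms: float, bytes_per_step: float, link_gb_per_s: float):
    comm_ms = bytes_per_step / (link_gb_per_s * 1e9) * 1e3  # transfer time in ms
    step_ms = compute_ms + comm_ms
    return step_ms, compute_ms / step_ms

# Example with assumed numbers: 250 ms of compute, 0.2 GB moved per step
# (0.8 GB/s over a 250 ms step), ~60 GB/s of usable PCIe 5.0 bandwidth.
step_ms, eff = interconnect_impact(250.0, 0.2e9, 60.0)
print(f"step ≈ {step_ms:.1f} ms, efficiency ≈ {eff:.1%}")
```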
Memory Breakdown
Model Weights: 16 GiB
Gradients: 16 GiB
Optimizer (Adam): 32 GiB
Activations: 8 GiB
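The breakdown follows the standard per-component accounting for Adam/AdamW training: bytes for the weights, bytes for the gradients, two fp32 moment tensors for the optimizer, plus an activation budget that depends on batch size, sequence length, and architecture. The sketch below is my own illustration of that arithmetic, not the visualizer's code; the byte counts and the fixed activation figure are assumptions.

```python
GiB = 1024 ** 3

def memory_breakdown_gib(params: float,
                         weight_bytes: int = 2,       # assumed bf16/fp16 weights
                         grad_bytes: int = 2,         # assumed bf16/fp16 gradients
                         optim_bytes: int = 8,        # Adam: fp32 momentum + variance
                         activations_gib: float = 8.0):  # assumed activation budget
    weights = params * weight_bytes / GiB
    grads = params * grad_bytes / GiB
    optim = params * optim_bytes / GiB
    total = weights + grads + optim + activations_gib
    return {"weights_gib": weights, "gradients_gib": grads,
            "optimizer_gib": optim, "activations_gib": activations_gib,
            "total_gib": total}

# Example: the 1.2B-parameter configuration shown in the panel above.
print(memory_breakdown_gib(1.2e9))
```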
Charts: Training Loss, Training Impact
▶️ Start Training
Speed: 50×
Model: Llama-1B
Batch: 4
Seq: 1024
Optimizer: AdamW
GC: ON (gradient checkpointing)
MP: ON (mixed precision)
ZeRO: OFF (DeepSpeed ZeRO)
FSDP: OFF (fully sharded data parallel)
GA: OFF (gradient accumulation)
GPU: H100 80G
GPUs: 1
Link: PCIe 5.0
DC: None
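A minimal sketch of how these toggles change the per-GPU breakdown, under assumptions of my own rather than the simulator's code: ZeRO-1 shards optimizer state across GPUs, ZeRO-2 also shards gradients, ZeRO-3 (and, roughly, FSDP) also shards weights, and gradient checkpointing keeps only a fraction of stored activations in exchange for recompute. The sharding model and the activation-keep factor are simplifications for illustration.

```python
def apply_toggles(breakdown_gib: dict, n_gpus: int = 1, *,
                  zero_stage: int = 0,            # 0 = OFF; 1/2/3 as in DeepSpeed ZeRO
                  fsdp: bool = False,
                  grad_checkpoint: bool = True,
                  activation_keep: float = 0.25):  # rough assumption for checkpointing
    b = dict(breakdown_gib)
    if fsdp:
        zero_stage = max(zero_stage, 3)            # treat FSDP like ZeRO-3 for this estimate
    if zero_stage >= 1:
        b["optimizer_gib"] /= n_gpus               # shard optimizer states
    if zero_stage >= 2:
        b["gradients_gib"] /= n_gpus               # also shard gradients
    if zero_stage >= 3:
        b["weights_gib"] /= n_gpus                 # also shard weights
    if grad_checkpoint:
        b["activations_gib"] *= activation_keep    # recompute most activations
    b["total_gib"] = (b["weights_gib"] + b["gradients_gib"]
                      + b["optimizer_gib"] + b["activations_gib"])
    return b

# Panel defaults (GC: ON, ZeRO: OFF, FSDP: OFF) on a single GPU:
panel = {"weights_gib": 16, "gradients_gib": 16, "optimizer_gib": 32, "activations_gib": 8}
print(apply_toggles(panel, n_gpus=1))
```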