Zipfian Distribution & Engram Cache Hierarchy

"Natural language N-grams inherently follow a Zipfian distribution, where a small fraction of patterns accounts for the vast majority of memory accesses."

— DeepSeek Engram Paper (2026), citing Chao & Zipf (1950), Piantadosi (2014)

*Estimate: for natural language, s ≈ 1.0 (the paper cites Zipf but doesn't specify the exponent)

From paper: Engram-27B = 5.7B params ÷ 2.5KB ≈ 2.3M entries

*Projection: Paper suggests tiering but doesn't specify ratios

*Projection: Actual ratios depend on available memory

Zipf's Law: f(r) ∝ 1/r^s

Example N-gram Ranks

Key Insight: Common patterns like "the", "of the", "in the" dominate access frequency. Rare N-grams may appear once per million tokens.
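To make the skew concrete, here is a minimal simulation: sample lookups from a Zipf distribution and measure how much traffic the top 1% of ranks absorbs. The exponent (s ≈ 1.01), table size, and draw count are illustrative assumptions, not numbers from the paper.

```python
# Sketch: sample n-gram accesses from a Zipf distribution and measure
# how much traffic the most frequent ranks absorb.
# s, N, and DRAWS are assumptions for illustration only.
import random

S = 1.01          # Zipf exponent (assumed; > 1 so the weights are summable)
N = 100_000       # number of distinct n-gram entries (illustrative)
DRAWS = 100_000   # simulated lookups

# Unnormalized Zipf weights f(r) ∝ 1/r^s
weights = [1.0 / r**S for r in range(1, N + 1)]

random.seed(0)
samples = random.choices(range(1, N + 1), weights=weights, k=DRAWS)

TOP = N // 100    # "hot" set: top 1% of ranks
hits_top = sum(1 for r in samples if r <= TOP)
print(f"top 1% of ranks served {hits_top / DRAWS:.0%} of lookups")
```

Under these assumptions the hot 1% serves well over half of all lookups, which is exactly the property that makes a small fast tier effective.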

Multi-Head Hashing: Parallel Lookups & Collision Mitigation

Engram uses 8 hash heads × 2 N-gram orders = 16 independent parallel lookups per token position per Engram layer. Engram-27B inserts Engram at layers [2, 15], so a full token pass performs 32 lookups (16 × 2 layers).

Why Multiple Hash Heads?

Collision mitigation: With a single hash function, two different N-grams might map to the same slot. With 8 independent hash functions pointing to 8 separate tables, the probability that all 8 collide is near zero.

The model learns to rely on the heads that didn't collide for any given N-gram.
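A minimal sketch of the idea, assuming each head is an independently seeded hash into its own table (the seeding scheme, hash function, and table size here are illustrative, not the paper's):

```python
# Sketch of multi-head hashing: one n-gram is hashed by 8 independently
# seeded functions into 8 separate tables, so a collision in one head
# rarely repeats in the others. blake2b and TABLE_SIZE are assumptions.
import hashlib

NUM_HEADS = 8
TABLE_SIZE = 2**20  # slots per head's table (illustrative)

def head_ids(ngram: tuple) -> list:
    """Return one slot index per hash head for a token-id n-gram."""
    ids = []
    for head in range(NUM_HEADS):
        # Prefixing the head index gives 8 effectively independent hashes.
        data = bytes(f"{head}:{ngram}", "utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        ids.append(int.from_bytes(digest, "little") % TABLE_SIZE)
    return ids

bigram = (101, 2023)        # illustrative token ids
print(head_ids(bigram))     # 8 indices, one per head's table
```

Two different n-grams may share a slot in one head, but sharing all 8 slots requires 8 simultaneous collisions, which is why the model can learn to lean on the uncollided heads.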

Why This Enables Parallelism

All 16 lookups per layer are:

  • Independent — no data dependency between them
  • Deterministic — computed purely from input tokens
  • Fixed size — each returns 80-dim (1280 ÷ 16)
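Because all three properties hold, the 16 lookups collapse into one batched gather. A sketch using the paper's dimensions (1280 total, 16 chunks of 80); the table size and batch size are illustrative:

```python
# Sketch: 16 independent, deterministic, fixed-size lookups reduce to a
# single advanced-indexing gather. TABLE_SIZE and batch are assumptions.
import numpy as np

HEADS, ORDERS = 8, 2
LOOKUPS = HEADS * ORDERS        # 16 per token per Engram layer
CHUNK = 1280 // LOOKUPS         # 80-dim per lookup
TABLE_SIZE = 4096               # illustrative

rng = np.random.default_rng(0)
tables = rng.standard_normal((LOOKUPS, TABLE_SIZE, CHUNK), dtype=np.float32)

batch = 32                      # token positions in flight
ids = rng.integers(0, TABLE_SIZE, size=(batch, LOOKUPS))

# One gather: index head h, slot ids[b, h] -> (batch, 16, 80),
# then concatenate the chunks to (batch, 1280).
chunks = tables[np.arange(LOOKUPS), ids]
out = chunks.reshape(batch, LOOKUPS * CHUNK)
print(out.shape)                # (32, 1280)
```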

Storage Implications *Projection

On GPU: a single batched gather operation
On SSD: 16 parallel io_uring submissions per layer
On tiered storage: some hit GPU, some DRAM, some SSD — all in parallel

Note: The paper only measures DRAM offload over PCIe. SSD tiering and io_uring are projections. A naive per-head SSD read path would need batching, coalescing, and a DRAM cache in front to amortize fine-grained access overhead.

📊 From the Paper

"We set the maximum N-gram size to 3, the number of heads to 8, and the dimension to 1280."

N-gram orders used: {2, 3} (bigrams and trigrams)

Log-Log Plot: The Classic Zipf Signature

Zipfian distributions appear as straight lines on log-log plots. Slope = -s (the Zipf exponent).
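The exponent can be recovered by fitting that straight line: regress log f on log r and negate the slope. A self-contained check on synthetic data with an assumed s = 1.0:

```python
# Sketch: estimate the Zipf exponent s by linear regression in log-log
# space; the slope of log f vs. log r is -s. Data is synthetic.
import math

s_true = 1.0                    # assumed exponent for the synthetic data
ranks = range(1, 1001)
freqs = [1000.0 / r**s_true for r in ranks]

xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
print(f"fitted slope ≈ {slope:.2f}, so s ≈ {-slope:.2f}")
```

On real n-gram counts the fit is noisier at the head and tail, but the straight-line signature is the same.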

Multi-Level Cache Hierarchy

From the paper: "This statistical property motivates a Multi-Level Cache Hierarchy: frequently accessed embeddings can be cached in faster storage tiers (e.g., GPU HBM or Host DRAM), while the long tail of rare patterns resides in slower, high-capacity media (e.g., NVMe SSD). This stratification allows Engram to scale to massive memory capacities with minimal impact on effective latency."
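A minimal sketch of such a hierarchy: walk tiers fastest-first and promote entries on a hit from a slower tier. The LRU policy, tier capacities, and promotion-on-miss behavior are assumptions for illustration; the paper does not specify an eviction policy.

```python
# Sketch of a multi-level lookup: HBM, then DRAM, then the SSD backing
# store. Policy details (LRU, promote-on-miss) are assumptions.
from collections import OrderedDict

class Tier:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> embedding bytes

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)        # LRU touch
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict coldest entry

def lookup(key, hbm: Tier, dram: Tier, ssd: dict):
    """Walk tiers fastest-first; promote on a hit from a slower tier."""
    for tier in (hbm, dram):
        value = tier.get(key)
        if value is not None:
            return value
    value = ssd[key]        # backing store holds every entry
    dram.put(key, value)    # promote so the next access is cheaper
    hbm.put(key, value)
    return value
```

With Zipfian traffic, the hot head of the distribution quickly settles into the small fast tiers while the long tail stays on SSD, which is the stratification the paper describes.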

Access Distribution

Engram: Deterministic Addressing Enables Offload

Why Offload Works for Engram

Deterministic Hash IDs

"Unlike MoE, which relies on runtime hidden states for dynamic routing, Engram's retrieval indices depend solely on the input token sequence."

→ Hash IDs are known before the layer executes

Prefetch-and-Overlap Strategy

"The system can asynchronously retrieve embeddings from abundant host memory via PCIe. To effectively mask communication latency, the Engram module is placed at specific layers within the backbone, leveraging the computation of preceding layers as a buffer."
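The overlap can be sketched with a worker thread standing in for the asynchronous PCIe transfer; all function names, latencies, and IDs below are illustrative, not from the paper.

```python
# Sketch of prefetch-and-overlap: hash IDs depend only on input tokens,
# so retrieval can start before the preceding layers finish computing.
# The thread pool stands in for an async PCIe copy; names are illustrative.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_embeddings(ids):
    time.sleep(0.02)                 # pretend DRAM-over-PCIe latency
    return {i: f"emb[{i}]" for i in ids}

def run_preceding_layers(x):
    time.sleep(0.03)                 # backbone layers compute meanwhile
    return x

with ThreadPoolExecutor(max_workers=1) as pool:
    ids = [17, 42, 99]               # known from tokens alone, before layer 0
    future = pool.submit(fetch_embeddings, ids)   # kick off retrieval early
    hidden = run_preceding_layers("tokens")       # overlapped compute
    embeddings = future.result()     # ready when the Engram layer needs them
    print(sorted(embeddings))
```

Because the fetch and the compute run concurrently, total wall time is roughly max(fetch, compute) rather than their sum, which is how the preceding layers act as a latency buffer.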

Measured Overhead (DRAM over PCIe)

"Offloading a 100B-parameter embedding table incurs a negligible throughput penalty, peaking at only 2.8% on the 8B backbone."

Measured on H800 with 512 sequences: throughput drops from 6315 tok/s to 6140 tok/s. This experiment forces all retrievals over PCIe from host DRAM — no NVMe experiment is shown in the paper.

Bandwidth Reality Check *Projection

Each Engram layer retrieves 16 chunks totaling 1280 values/token. At FP16 that is ~2.56 KB/token/layer. At 6140 tok/s: ~15.7 MB/s per layer, ~31.4 MB/s for two layers.

Even pessimistically assuming every lookup triggers a full 4 KB random read: ~98k IOPS per layer, ~196k IOPS for two. Enterprise SSDs like the Micron 9550 (3.3M 4K random-read IOPS) or Kioxia FL6 (1.5M IOPS, 29 µs latency) handle this with headroom — but the software path (batching, coalescing, avoiding CPU bounce buffers) remains the real engineering challenge.
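The arithmetic behind those figures, reproduced as a worked calculation (FP16 = 2 bytes/value; the 6140 tok/s throughput is the paper's measured number):

```python
# Back-of-envelope check of the bandwidth and IOPS figures above.
values_per_token = 1280                    # per Engram layer
bytes_per_token_layer = values_per_token * 2   # FP16 -> 2560 B ≈ 2.56 KB
tok_per_s = 6140                           # measured throughput (paper)

mb_per_s_layer = bytes_per_token_layer * tok_per_s / 1e6
lookups = 16                               # pessimistic: one 4 KB read each
iops_layer = lookups * tok_per_s

print(f"{mb_per_s_layer:.1f} MB/s per layer, "
      f"{2 * mb_per_s_layer:.1f} MB/s for two layers")
print(f"{iops_layer / 1e3:.0f}k IOPS per layer, "
      f"{2 * iops_layer / 1e3:.0f}k for two")
```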

Access Pattern per Token

Key: All 16 hash IDs per Engram layer are computed from input tokens alone — no dependency on hidden states. Engram-27B uses 2 layers (32 total lookups per token).

This enables prefetching from any storage tier while earlier layers compute.