A plain-English reference guide covering the jargon that shows up every time a new language model drops, from parameter counts to quantization methods.

Contents

01 · Architecture & Model Design — Transformer · Dense Model · Mixture of Experts · Active Parameters · Feed-Forward Network · Layers · Hidden Dimension · Attention Heads
02 · Attention Mechanisms — Multi-Head Attention · Multi-Query Attention · Grouped-Query Attention · KV Cache · Sliding Window Attention · RoPE · RoPE Theta
03 · Sizing, Scale & Counting — Parameters · Embedding Parameters · Non-Embedding […]


If you’ve ever tried to train a machine learning model, or just wondered why your computer fans start screaming when you open too many Chrome tabs, you’ve probably run into the alphabet soup of processors: CPU, GPU, and TPU. They all “process” things, but they do it in fundamentally different ways. Choosing the […]
