The LLM Vocabulary Sheet
A plain-English reference guide covering the jargon that shows up every time a new language model drops, from parameter counts to quantization methods.
Contents
01 · Architecture & Model Design — Transformer · Dense Model · Mixture of Experts · Active Parameters · Feed-Forward Network · Layers · Hidden Dimension · Attention Heads
02 · Attention Mechanisms — Multi-Head Attention · Multi-Query Attention · Grouped-Query Attention · KV Cache · Sliding Window Attention · RoPE · RoPE Theta
03 · Sizing, Scale & Counting — Parameters · Embedding Parameters · Non-Embedding Parameters · Vocabulary Size · Tokenizer · Context Window · Training Tokens · Scaling Laws
04 · Training Process — Pre-Training · Supervised Fine-Tuning · RLHF · DPO · GRPO · Post-Training · Epochs · Loss · Perplexity
05 · Alignment & Model Variants — Base Model vs. Instruct Model · Distillation · LoRA / QLoRA · Reward Model
06 · Inference & Generation — Inference · Prefill vs. Decode · TTFT · Tokens Per Second · Throughput vs. Latency · Temperature · Top-p · Top-k · Speculative Decoding
07 · Precision & Quantization — FP32 / FP16 / BF16 · FP8 · Quantization
08 · Benchmarks & Evaluation — MMLU · HumanEval / MBPP · GSM8K · MATH · HellaSwag · ARC · TruthfulQA · GPQA · Arena ELO · Contamination
09 · Capabilities & Techniques — Multimodal · Chain of Thought · Extended Thinking · Tool Use · RAG · Agentic · System Prompt · Zero/Few/Many-Shot · Structured Output
10 · Infrastructure & Miscellaneous — FLOPS / FLOPs · Tensor / Pipeline Parallelism · Open Weights vs. Open Source · GGUF · Safetensors · Hallucination · Needle in a Haystack · MCP · Tokens vs. Words
01 · Architecture & Model Design
Transformer
The architecture behind virtually all modern LLMs. Introduced in the 2017 paper “Attention Is All You Need,” a Transformer processes text by letting every token attend to the other tokens in the input in parallel (via attention) rather than reading left-to-right one token at a time. Most LLMs today use only the decoder half of the original Transformer, which generates text one token at a time and lets each token attend only to the tokens before it, while models like BERT use only the encoder half.
Think of it like a room full of people who can all hear each other at once, instead of playing a game of telephone in a line.
Dense Model
A model where every parameter is active for every input token. When you run a 70B dense model, all 70 billion parameters are used to process every single token. This is the “standard” architecture: straightforward but computationally expensive at large scales.
Mixture of Experts (MoE)
Also: MoE, Sparse MoE
An architecture where the model contains multiple “expert” sub-networks (typically in the feed-forward layers), but a router network activates only a small subset of them for each token. For example, Mixtral 8×7B has 8 expert blocks per layer but only routes each token to 2 of them. This means the total parameter count is large (for capacity and knowledge), but the active parameter count per token is much smaller (for speed).
A model card might say: “46.7B total parameters, 12.9B active parameters.” That’s the MoE signature.
Imagine a hospital with 20 specialist doctors, but each patient only sees the 2 most relevant specialists. The hospital has enormous total expertise, but each visit is efficient.
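A minimal sketch of the routing step described above, assuming a top-2 router that renormalizes its weights over the selected experts (one common choice, not the only one). The names `route_top_k` and `moe_forward` and the toy experts are hypothetical, chosen for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are a softmax over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    # Only the selected experts run; the others are skipped for this token.
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Eight toy "experts" that just scale their input by 1..8.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
chosen = route_top_k(logits)
output = moe_forward(10.0, experts, logits)
```

Here only 2 of the 8 expert functions are ever called for the token, which is the gap between total and active parameters in miniature.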
Active Parameters
Also: Active params per token
The number of parameters actually used to process each token during inference. In a dense model, active parameters = total parameters. In a MoE model, active parameters are significantly fewer than total because only a few experts fire per token. This number is the better predictor of a model’s inference speed and cost, while total parameters better indicates its total knowledge capacity.
Feed-Forward Network (FFN)
Also: MLP block, FFN layer
Each Transformer layer has two major components: an attention mechanism and a feed-forward network. The FFN is a simple two-layer neural network (with a non-linearity in between) applied independently to each token’s representation. It’s where a huge portion of the model’s “knowledge” is believed to be stored. In MoE models, the FFN layer is what gets replaced by multiple expert networks.
Layers (Depth)
The number of stacked Transformer blocks in the model. Each layer contains an attention mechanism and a feed-forward network. Deeper models (more layers) can learn more complex, abstract representations. A typical LLM might have 32 layers (for a 7B model) up to 80+ layers (for 70B+ models). When a model card says num_hidden_layers: 80, this is what it refers to.
Hidden Dimension (Width)
Also: d_model, hidden size
The size of the vector used to represent each token inside the model. A hidden dimension of 4096 means each token is represented as a list of 4,096 numbers as it flows through the network. Larger hidden dimensions let the model encode more nuance per token. Increasing width (hidden dim) vs. depth (layers) is one of the core architectural trade-offs.
Attention Heads
Also: num_attention_heads, num_heads
Within each attention layer, the computation is split into multiple parallel “heads.” Each head independently learns to focus on different types of relationships. One head might learn to track grammatical structure, another might track coreference, another might focus on nearby words. The outputs of all heads are concatenated and combined. A 7B model might have 32 heads; a 70B model might have 64.
02 · Attention Mechanisms
Multi-Head Attention (MHA)
The original attention design from the Transformer paper. Each attention head has its own set of Query (Q), Key (K), and Value (V) projection matrices. If you have 32 heads, you have 32 independent sets of Q, K, and V. This gives maximum expressiveness but uses the most memory, especially because each head needs its own KV cache during text generation.
Multi-Query Attention (MQA)
A memory-saving variant where all attention heads share a single set of Key and Value projections, while each head still gets its own Query. This dramatically reduces the size of the KV cache (see below) and speeds up inference, but can slightly reduce model quality since the K and V representations are less expressive.
Grouped-Query Attention (GQA)
The middle ground between MHA and MQA. Instead of every head having its own KV (MHA) or all heads sharing one KV (MQA), heads are divided into groups, and each group shares a KV pair. For example, with 32 query heads and 8 KV groups, every 4 query heads share the same Key/Value. This preserves most of MHA’s quality while capturing most of MQA’s speed benefits. GQA has become the de facto standard in modern LLMs. You’ll see it listed on nearly every model card.
MHA = every student takes their own notes. MQA = the entire class shares one set of notes. GQA = students form study groups of 4, each group sharing notes.
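The grouping can be sketched as a simple index mapping from query heads to shared KV heads. The function name `kv_group_for_head` is hypothetical; the 32-head, 8-group defaults mirror the example above.

```python
def kv_group_for_head(q_head, num_q_heads=32, num_kv_heads=8):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    heads_per_group = num_q_heads // num_kv_heads
    return q_head // heads_per_group

# GQA: 32 query heads, 8 KV heads -> groups of 4.
gqa = [kv_group_for_head(h) for h in range(32)]
# MHA is the special case num_kv_heads == num_q_heads (no sharing).
mha = [kv_group_for_head(h, 32, 32) for h in range(32)]
# MQA is the special case num_kv_heads == 1 (everything shared).
mqa = [kv_group_for_head(h, 32, 1) for h in range(32)]
```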
KV Cache
Also: Key-Value Cache
During text generation, the model produces tokens one at a time. Without caching, it would need to re-compute the Key and Value matrices for every previous token at each step. The KV cache stores these computed K and V values so they can be reused. For long sequences and large batches, it is often the biggest consumer of GPU memory after the model weights themselves, and it can grow past them. GQA and MQA directly reduce KV cache size, which is why they matter so much for deployment.
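A rough sizing sketch: the cache holds one key and one value vector per token, per layer, per KV head. The configuration below is hypothetical (loosely shaped like a 70B-class model) and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size; bytes_per_value=2 assumes FP16/BF16."""
    # Factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical config: 80 layers, head_dim 128, 8,192-token sequence.
mha_bytes = kv_cache_bytes(80, num_kv_heads=64, head_dim=128, seq_len=8192)
gqa_bytes = kv_cache_bytes(80, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA: {mha_bytes / 2**30:.1f} GiB, GQA: {gqa_bytes / 2**30:.1f} GiB")
```

Under these assumptions, going from 64 KV heads to 8 cuts the cache by 8×, and the total still scales linearly with sequence length and batch size.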
Sliding Window Attention (SWA)
Instead of letting each token attend to all previous tokens (which scales quadratically with context length), SWA restricts attention to a fixed-size window of recent tokens (e.g., the last 4,096). Information from beyond the window can still propagate across layers: at layer N a token sees only its own window, but at layer N+1 it attends to tokens that have themselves already absorbed earlier context, so the effective reach grows with depth. This makes long-context inference much more memory-efficient.
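The window restriction amounts to an attention mask. A small sketch, with the hypothetical helper `sliding_window_mask` and a window that counts the current token:

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # '#' marks allowed positions, '.' marks masked ones.
    print("".join("#" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so per-token attention cost and cache size stop growing once the sequence is longer than the window.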
RoPE (Rotary Position Embeddings)
Also: Rotary Positional Encoding
A method for encoding the position of each token in the sequence. Unlike older approaches that add a position signal to the input, RoPE applies a rotation to the Q and K vectors in the attention mechanism. The rotation angle depends on the token’s position, so the dot product between any two tokens naturally encodes their relative distance. RoPE has become the standard positional encoding in modern LLMs because it works well and extends to longer contexts without much overhead.
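A small sketch of the rotation, assuming the interleaved-pair convention and a base of 10000 (implementations differ in how they pair dimensions). The function names are illustrative. The final comparison checks the property described above: the dot product depends only on the distance between the two positions.

```python
import math

def rope(vec, position, theta=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative distance (2), different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 12), rope(k, 10))
```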
RoPE Theta (θ)
Also: rope_theta, base frequency
The base frequency parameter for RoPE, often seen in model configs as a large number like 500000 or 1000000 (the original default is 10000). A higher theta lowers the rotation frequencies, stretching the wavelengths so that distant positions stay distinguishable, which is the main trick used to extend a model’s context length beyond what it was originally trained with. When you see a model advertising “extended to 128K context,” they likely increased RoPE theta and did some continued training.
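To see what theta changes, it helps to list the wavelength (in tokens) of each rotating pair. This is a sketch under the standard RoPE frequency schedule; `rope_wavelengths` is a hypothetical helper and the head dimension of 128 is an assumed example value.

```python
import math

def rope_wavelengths(head_dim, theta):
    """Tokens needed for each dimension pair to complete one full rotation."""
    return [2 * math.pi * theta ** (i / head_dim) for i in range(0, head_dim, 2)]

short_ctx = rope_wavelengths(head_dim=128, theta=10_000)
long_ctx = rope_wavelengths(head_dim=128, theta=500_000)

print(f"longest wavelength, theta=10000:  {short_ctx[-1]:,.0f} tokens")
print(f"longest wavelength, theta=500000: {long_ctx[-1]:,.0f} tokens")
```

The fastest pair is unchanged, while the slowest pairs rotate far more slowly with the larger theta, which is what leaves room for longer sequences.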
03 · Sizing, Scale & Counting
Parameters
Also: Weights, params
The individual learnable numbers inside a neural network. When a model is described as “70B,” it has approximately 70 billion parameters. During training, these numbers are adjusted to minimize prediction error. During inference, they are fixed. More parameters generally means more knowledge capacity, but also more compute and memory to run.
Embedding Parameters
The parameters in the model’s input embedding layer (which converts token IDs into vectors) and often the output layer (which converts vectors back to token probabilities). These are large lookup tables, one row per token in the vocabulary. With a vocabulary of 128,000 tokens and a hidden dimension of 4,096, the embedding matrix alone has ~524 million parameters.
Non-Embedding Parameters
Total parameters minus embedding parameters. This figure isolates the “reasoning” part of the model: the attention layers, feed-forward networks, and normalization layers that actually transform and process information. Model releases often report this separately because it’s a better measure of a model’s computational capacity. Two models might have the same total parameter count but different vocabulary sizes, which inflates the embedding count differently.
If total parameters are a library’s worth of knowledge, embedding parameters are the index/catalog, and non-embedding parameters are the actual books.
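The arithmetic is simple enough to check directly. A sketch with hypothetical helper names; whether the output layer shares (ties) the input table varies by model, so it is left as a flag.

```python
def embedding_params(vocab_size, hidden_dim, tied_output=True):
    """Parameters in the input table, plus a separate output table if untied."""
    tables = 1 if tied_output else 2
    return tables * vocab_size * hidden_dim

def non_embedding_params(total_params, vocab_size, hidden_dim, tied_output=True):
    return total_params - embedding_params(vocab_size, hidden_dim, tied_output)

one_table = embedding_params(128_000, 4_096)
two_tables = embedding_params(128_000, 4_096, tied_output=False)
print(f"{one_table:,} parameters in a single embedding matrix")
```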
Vocabulary Size
Also: vocab_size
The total number of unique tokens the model can recognize. Modern LLMs typically have vocabularies of 32K to 200K+ tokens. A larger vocabulary means the model can represent text more efficiently (fewer tokens per sentence, especially for non-English languages and code), but it also increases the embedding layer size and the final output layer, costing more memory.
Tokenizer
Also: BPE, SentencePiece, tiktoken
The algorithm that splits raw text into tokens before the model processes it. Most modern LLMs use Byte-Pair Encoding (BPE), which starts with individual characters and iteratively merges the most frequent pairs into single tokens. The word “understanding” might become two tokens: [“understand”, “ing”]. Common tools include SentencePiece (Google), tiktoken (OpenAI), and the Hugging Face tokenizers library. The tokenizer is trained separately from the model itself and defines the vocabulary.
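The merge loop at the heart of BPE training can be sketched in a few lines. This is a toy character-level version on a made-up word-frequency table, not how SentencePiece or tiktoken are implemented; all names here are illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3, "yes": 1}
merges, words = train_bpe(corpus, num_merges=2)
```

After two merges the frequent suffix "est" has become a single symbol, which is the same mechanism that turns common word fragments into single tokens in a real vocabulary.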
Context Window / Context Length
Also: Max sequence length, ctx_len
The maximum number of tokens the model can process in a single pass, including both the input (your prompt) and the output (the model’s response). A 128K context model can handle roughly 96,000 English words at once. Longer contexts let the model work with entire codebases or book-length documents, but memory and compute scale with context length (linearly for KV cache, quadratically for full attention without optimizations).
Training Tokens
The total number of tokens the model saw during pre-training. This is a core indicator of how much data the model was trained on. Llama 3 was trained on 15 trillion tokens; some newer models exceed this. Training tokens and parameter count together roughly determine a model’s capability (the “scaling laws” describe their optimal ratio).
Scaling Laws
Also: Chinchilla scaling, Kaplan scaling
Empirical rules that predict model performance based on parameter count, dataset size, and compute budget. The Chinchilla paper (2022) showed that many earlier models were over-sized and under-trained: for a given compute budget, you get better results by training a smaller model on more data. “Chinchilla-optimal” means the parameter count and training token count are balanced according to these laws (roughly 20 tokens per parameter). Many modern models are intentionally trained “beyond Chinchilla optimal,” using extra data to make the model cheaper to run at inference time.
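A back-of-the-envelope sketch of the ratio above. The 20-tokens-per-parameter figure comes from the text; the "6 FLOPs per parameter per token" training-compute rule is a widely used approximation that is assumed here, and the 8B / 15T example model is hypothetical.

```python
def chinchilla_optimal_tokens(num_params, tokens_per_param=20):
    return num_params * tokens_per_param

def training_flops(num_params, num_tokens):
    # Assumption: roughly 6 FLOPs per parameter per training token.
    return 6 * num_params * num_tokens

params = 8_000_000_000            # hypothetical 8B-parameter model
tokens = 15_000_000_000_000       # hypothetical 15T training tokens

optimal = chinchilla_optimal_tokens(params)
ratio = tokens / params
print(f"Chinchilla-optimal: {optimal:,} tokens")
print(f"Actual ratio: {ratio:,.0f} tokens per parameter")
print(f"Approx. training compute: {training_flops(params, tokens):.2e} FLOPs")
```

A ratio far above 20 is what "trained beyond Chinchilla optimal" looks like in numbers.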
04 · Training Process
Pre-Training
The first and most expensive phase of building an LLM. The model learns to predict the next token on a massive, diverse corpus of text (web pages, books, code, etc.). This is where the model picks up language, facts, reasoning patterns, and code. Pre-training typically costs millions of dollars and takes weeks to months on thousands of GPUs. It’s the part of the process that only well-funded labs can do.
Supervised Fine-Tuning (SFT)
Also: Instruction tuning
After pre-training, the model is further trained on curated (prompt, response) pairs that demonstrate desirable behavior: following instructions, answering questions helpfully, formatting outputs properly. This turns a raw “text completion engine” into something that actually responds to users. The dataset is much smaller than pre-training data (tens of thousands to millions of examples) but is carefully curated by humans.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where humans rank multiple model outputs for the same prompt, a “reward model” is trained on those rankings, and then the LLM is optimized (using RL algorithms like PPO) to produce outputs the reward model rates highly. RLHF is what makes models polite, safe, helpful, and aligned with human preferences. Without it, models are technically capable but unpleasant to actually talk to.
DPO (Direct Preference Optimization)
An alternative to RLHF that skips the separate reward model entirely. Instead, it directly optimizes the language model using pairs of (preferred response, rejected response). DPO reformulates the RLHF objective so it can be solved with a simple classification-style loss. It’s simpler to implement, more stable to train, and has become increasingly popular as an alternative or complement to RLHF.
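The classification-style loss can be written for a single preference pair as below. This sketch assumes the standard formulation with a frozen reference model and a strength parameter `beta`, neither of which is spelled out above; the inputs are summed log-probabilities of each full response, and the function name is illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Loss for one (preferred, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
prefers_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)
prefers_rejected = dpo_loss(-12.0, -8.0, -10.0, -10.0)
```

The loss falls as the policy raises the preferred response relative to the rejected one, measured against the reference model.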
GRPO (Group Relative Policy Optimization)
A reinforcement learning method (used notably in DeepSeek-R1) that estimates baselines from groups of sampled outputs rather than training a separate critic/value model. For each prompt, the model generates multiple responses, scores them, and uses the group’s average as the baseline. Responses better than average are reinforced; worse ones are penalized. This removes a major source of complexity and instability from standard RL approaches.
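The group baseline can be sketched as follows. Subtracting the group mean follows the description above; dividing by the group's standard deviation is an additional normalization assumed here, and the function name is illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # Assumption: normalize by the group's standard deviation; eps avoids /0.
    return [(r - baseline) / (spread + eps) for r in rewards]

# Four responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group average get a positive advantage and are reinforced; those below get a negative one, with no separate value model involved.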
Post-Training
An umbrella term for everything that happens after pre-training: SFT, RLHF, DPO, tool-use training, safety training, and any other refinement. Model releases increasingly emphasize post-training because it’s where the model’s “personality,” safety behavior, and instruction-following quality are shaped. A model with excellent pre-training but poor post-training will feel clunky to use; great post-training can make a smaller model outperform a larger one.
Epochs
The number of times the model sees the full training dataset. One epoch = one complete pass. Most LLM pre-training runs use roughly 1–4 epochs (sometimes less than 1 for very large datasets). Seeing data too many times leads to memorization rather than generalization. Fine-tuning usually runs for 1–5 epochs on the smaller curated dataset.
Loss / Training Loss
A number that measures how badly the model’s predictions differ from the correct answers during training. For LLMs, this is typically cross-entropy loss, which measures how surprised the model is by the actual next token. Lower loss = better predictions. You’ll see training loss curves in technical reports showing the loss decreasing over time. A sudden spike in the loss curve usually means something went wrong (bad data batch, hardware failure).
Perplexity
A more intuitive way to express loss: it’s the exponential of the loss, 2^(loss) if the loss is measured in bits (log base 2) or e^(loss) if it is measured in nats (natural log, the usual case in training code). Perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options for each next token. Lower perplexity = better model. It’s a common evaluation metric for comparing language model quality on the same test data.
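A small sketch tying loss and perplexity together, using natural logs. The input is the probability the model assigned to each correct next token; the function names are illustrative.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Average negative log-probability (in nats) of the correct tokens."""
    n = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / n

def perplexity(probs_of_correct_token):
    return math.exp(cross_entropy(probs_of_correct_token))

# A model that always gives the right token a 10% chance.
uniform_ten = perplexity([0.1] * 5)
# A more confident model scores lower.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
```

The first case recovers the "choosing among 10 options" reading described above.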
05 · Alignment & Model Variants
Base Model vs. Instruct Model
Also: Foundation model, Chat model
A base model is the raw output of pre-training. It completes text but doesn’t follow instructions naturally. An instruct (or chat) model has been fine-tuned with SFT and RLHF/DPO so it responds to prompts conversationally. When you see names like “Llama-3-70B” (base) vs. “Llama-3-70B-Instruct” (aligned), this is the distinction. Most users interact with instruct models; base models are for researchers and developers who want to fine-tune for specific purposes.
Distillation
Also: Knowledge distillation
Training a smaller “student” model to mimic the behavior of a larger “teacher” model. Instead of training on ground-truth labels, the student learns from the teacher’s output probabilities, which contain richer information: the teacher’s uncertainty and ranking across all options. This is how many capable small models are created. When you see a model labeled with a -distill suffix or a mention of “distilled from [larger model],” this technique was used. Distilled models often perform surprisingly well for their size.
LoRA / QLoRA
Also: Low-Rank Adaptation
LoRA is a parameter-efficient fine-tuning method. Instead of updating all of a model’s billions of parameters, LoRA freezes the original weights and injects small trainable matrices (low-rank decompositions) into specific layers. This reduces the trainable parameters by 100–1000× while achieving comparable results to full fine-tuning. QLoRA goes further by quantizing the frozen base model to 4-bit, making it possible to fine-tune a 70B model on a single GPU.
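The savings come from replacing one large weight update with two thin matrices. A parameter-count sketch for a single weight matrix, with a hypothetical 4096×4096 layer and rank 8 chosen for illustration:

```python
def full_finetune_params(d_in, d_out):
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    # Two low-rank factors: A is (rank x d_in), B is (d_out x rank).
    return rank * (d_in + d_out)

full = full_finetune_params(4096, 4096)
low_rank = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {low_rank:,}  reduction: {full // low_rank}x")
```

For this layer the reduction lands inside the 100–1000× range quoted above; the overall figure for a model depends on the rank and on which layers receive adapters.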
Reward Model (RM)
A separate model trained to predict how humans would rate a given response. It takes a (prompt, response) pair and outputs a scalar score. The reward model is the “judge” in RLHF; the LLM is optimized to maximize the reward model’s score. Building a good reward model requires thousands of human preference comparisons and directly determines the quality of the final aligned model.
06 · Inference & Generation
Inference
The act of running a trained model to produce outputs, as opposed to training, which updates the model’s weights. When you send a message to ChatGPT or Claude and get a response, that’s inference. Most of the cost discussion around LLMs is about inference, because training happens once but inference happens every time someone sends a message.
Prefill vs. Decode
Also: Prompt processing vs. generation
Inference happens in two phases. Prefill processes the entire input prompt in parallel, building up the KV cache. Decode generates output tokens one at a time, each step reading from the KV cache. Prefill is compute-bound (lots of math on the prompt all at once); decode is memory-bandwidth-bound (reading the cache for each token). This is why you might see separate speed metrics for each phase.
TTFT (Time to First Token)
How long it takes from when you submit a prompt to when the model starts streaming its first output token. This is mostly determined by the prefill phase, i.e. how fast the system can process your entire input. For long prompts, TTFT can be noticeably long. It’s the main UX metric for chat applications because users are staring at a blank screen until that first token appears.
Tokens Per Second (TPS)
Also: tok/s, generation speed
The rate at which the model produces output tokens during the decode phase. For scale, 40 tok/s is several times faster than most people read (typical reading speed works out to roughly 5–7 tokens per second), so output at that rate feels fluid. This metric depends heavily on hardware, batch size, quantization level, and model size. It’s the main number users “feel” as generation speed.
Throughput vs. Latency
Latency is how fast one individual request completes. Throughput is how many tokens per second the system can produce across all concurrent requests. Batching multiple requests together increases throughput (more efficient GPU utilization) but can increase per-request latency. API providers constantly balance this tradeoff.
Temperature
A number (typically 0.0 to 2.0) that controls how “random” the model’s output is. At temperature 0, the model always picks the most probable next token (deterministic). At higher temperatures, it’s more willing to pick less-probable tokens, leading to more creative or diverse outputs. Temperature 1.0 is the model’s “natural” distribution; below 1.0 is more focused, above 1.0 is more exploratory.
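Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with made-up logits; the function name is illustrative, and temperature 0 is handled separately in practice by taking the argmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
focused = softmax_with_temperature(logits, 0.5)
natural = softmax_with_temperature(logits, 1.0)
exploratory = softmax_with_temperature(logits, 2.0)
```

Lower temperatures pile probability onto the top token; higher temperatures flatten the distribution.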
Top-p (Nucleus Sampling)
Also: nucleus sampling
An alternative to temperature for controlling randomness. Instead of scaling all probabilities, top-p limits the model to the smallest set of tokens whose combined probability reaches the threshold p. With top-p = 0.9, the model only considers the tokens that make up the top 90% of probability mass, ignoring the unlikely long tail. This prevents the model from picking extremely unlikely tokens while still allowing diversity.
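A sketch of the nucleus cut on a made-up four-token distribution. The helper name `top_p_filter` is hypothetical; it returns the surviving tokens with their probabilities renormalized.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
```

The number of surviving tokens adapts to the distribution: a confident model keeps one or two candidates, an uncertain one keeps many.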
Top-k
A simpler alternative: only consider the top k most probable tokens at each step. Top-k of 50 means the model chooses from its 50 best guesses. Often used in combination with temperature and/or top-p.
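A sketch of top-k sampling with the standard library, on a made-up distribution. The function name and the optional `rng` argument are illustrative.

```python
import random

def sample_top_k(probs, k=50, rng=None):
    """Sample a token index from the k most probable tokens."""
    rng = rng or random.Random()
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices renormalizes the weights for us.
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

rng = random.Random(0)
draws = [sample_top_k([0.5, 0.3, 0.15, 0.05], k=2, rng=rng) for _ in range(200)]
```

Unlike top-p, the cutoff is a fixed count, so the same number of candidates survives whether the model is confident or not.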
Speculative Decoding
Also: Assisted generation
A speed optimization where a small, fast “draft” model generates several candidate tokens quickly, and the large target model verifies them in a single parallel pass. If the draft tokens match what the large model would have produced (which happens surprisingly often), you get multiple tokens for the compute cost of one verification step. If they don’t match, you fall back to the large model’s choice. This can provide 2–3× speedups with no quality loss.
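A toy sketch of one draft-and-verify round, assuming greedy decoding (sampling-based variants use a probabilistic acceptance rule instead). Both "models" are stand-in functions that map a context to the next token, and verification is written as a loop where a real system would use a single batched pass. All names are hypothetical.

```python
def speculative_step(target_next, draft_next, context, num_draft=4):
    """Return the tokens accepted in one round of speculative decoding."""
    # 1. The draft model proposes tokens one after another.
    proposal, ctx = [], list(context)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the verification pass yields one extra token.
    accepted.append(target_next(ctx))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after a 3.
def target(ctx):
    return (ctx[-1] + 1) % 10

def draft(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial_match = speculative_step(target, draft, [0])
full_match = speculative_step(target, draft, [5])
```

In both cases the accepted tokens are exactly what the target model would have generated on its own, which is why the technique does not change output quality.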
07 · Precision & Quantization
Floating-Point Precision: FP32, FP16, BF16
These describe how many bits are used to store each parameter. FP32 (32-bit) is full precision but rarely used for LLM inference (too much memory). FP16 (16-bit, half precision) halves the memory. BF16 (Brain Float 16) also uses 16 bits but allocates more bits to the exponent range, making it more numerically stable for training. BF16 is the default training precision for most modern LLMs.
FP8
An 8-bit floating-point format gaining adoption on newer GPUs (H100, Blackwell). Compared to FP16/BF16, it halves memory and can roughly double throughput, with minimal accuracy loss if applied carefully. Some model releases now specifically call out FP8 inference support as a feature.
Quantization
Also: INT8, INT4, GPTQ, GGUF, AWQ, GGML
The process of reducing the precision of a model’s weights below their training precision to save memory and increase speed. Common levels are INT8 (8-bit integer), INT4 (4-bit), and even lower. The most popular formats and methods you’ll encounter:
GPTQ — a post-training quantization method that calibrates on a small dataset. Fast on GPUs.
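AWQ (Activation-aware Weight Quantization) — uses activation statistics to find the most important weights and rescales them before quantization so they lose less precision. Often reported to give better quality than GPTQ at the same bit-width.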
GGUF — a file format (used by llama.cpp) for CPU and mixed CPU/GPU inference. The community standard for running models locally.
EXL2 — a flexible format allowing mixed bit-widths across layers for optimal quality/size tradeoff.
The general quality ladder: FP16 ≈ BF16 > FP8 > INT8 > INT4, but modern quantization methods have narrowed the gaps significantly. A well-quantized 4-bit 70B model often outperforms a full-precision 13B model, which is why the local-LLM community spends so much time talking about quantization formats.
Like compressing a photo from RAW to JPEG. You lose some imperceptible detail but the file becomes much smaller and faster to load.
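The core idea can be shown with the simplest scheme, symmetric "absmax" rounding to integers. This is a sketch only: GPTQ, AWQ, and the GGUF quantization types are considerably more sophisticated, and the weights below are made up.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def max_error(weights, bits):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.12, -0.5, 0.33, 0.0, 0.49]
err8 = max_error(weights, bits=8)
err4 = max_error(weights, bits=4)
print(f"max error at 8-bit: {err8:.5f}, at 4-bit: {err4:.5f}")
```

Fewer bits means a coarser grid and a larger rounding error per weight, which is the gap that the calibrated methods listed above work to shrink.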
08 · Benchmarks & Evaluation
MMLU (Massive Multitask Language Understanding)
A multiple-choice test spanning 57 subjects, from abstract algebra to world religions. It’s the most commonly reported benchmark, basically a general-purpose IQ test for LLMs. Scores range from 25% (random guessing on 4 options) to 90%+ for frontier models. Variant: MMLU-Pro uses harder questions with 10 choices to better differentiate top models.
HumanEval / MBPP
Coding benchmarks. HumanEval has 164 Python programming problems where the model writes a function that must pass unit tests. MBPP (Mostly Basic Python Problems) is similar but with ~1,000 simpler tasks. The metric is pass@k, the percentage of problems for which at least one of k sampled solutions passes all the tests (usually reported as pass@1). HumanEval+ and MBPP+, from the EvalPlus project, are stricter versions that add many more test cases to catch false positives.
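In practice pass@k is usually estimated by generating n ≥ k samples per problem, counting the c that pass, and applying the combinatorial estimator popularized alongside HumanEval. A sketch for a single problem; the function name is illustrative, and the benchmark score is the average over all problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes), given c of n passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples generated for one problem, 3 of them passed the tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

With k=1 the estimate reduces to the plain fraction of passing samples; larger k gives the model more chances and so a higher score.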
GSM8K
Grade School Math 8K. A set of ~8,500 grade-school-level word problems requiring multi-step arithmetic reasoning. It tests whether models can chain together 2–8 reasoning steps correctly. Most strong models now score 90%+ on this, so it’s become more of a sanity check than a real differentiator.
MATH
A much harder math benchmark with 12,500 problems from high-school competition math (algebra, geometry, number theory, probability, etc.). Problems are graded by difficulty (Level 1–5). This is where models still have significant room to improve. Frontier models score 50–90% depending on the subset.
HellaSwag
A commonsense reasoning test where the model must pick the most plausible continuation of a scenario from 4 options. Originally designed to be hard for models but easy for humans. Most modern LLMs now score 85%+, making it somewhat saturated but still commonly reported.
ARC (AI2 Reasoning Challenge)
A science-question benchmark with an Easy and a Challenge set. The Challenge set contains questions that simple statistical methods fail on, requiring real reasoning. Questions come from grade-school science exams.
TruthfulQA
Tests whether models avoid repeating common misconceptions and falsehoods that are widespread on the internet. For example, “What happens if you crack your knuckles too much?” A truthful answer contradicts the popular myth. This measures a model’s calibration and resistance to reproducing misinformation.
GPQA (Graduate-Level Google-Proof Q&A)
Expert-level multiple-choice questions written by PhD-level domain experts in physics, chemistry, and biology. They are designed to be “Google-proof”: highly skilled non-experts (people with expertise in a different field) score only around 34%, even with web access. This is a ceiling test: it measures how well models handle genuinely hard expert reasoning.
Arena Elo / Chatbot Arena
A crowdsourced evaluation system where real users chat with two anonymous models side by side and pick a winner. The results are aggregated into an Elo-style rating (like chess ratings; Elo is a surname, not an acronym). Many people trust these rankings more than any benchmark because they reflect real-world usefulness rather than narrow test performance. Launched by LMSYS, a research group that started at UC Berkeley, and now operated as LMArena.
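For intuition, here is the classic chess-style rating update applied to one head-to-head vote. This is a sketch of the textbook formula with an assumed K-factor of 32; the leaderboard's actual statistical fitting may differ.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A beats B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1 for an A win, 0 for a B win, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Two evenly rated models; a user prefers model A.
even_match = update_elo(1200, 1200, score_a=1)
# An upset: the lower-rated model wins and gains more points.
upset = update_elo(1000, 1400, score_a=1)
```

Beating a much stronger opponent moves the ratings more than beating an equal one, which is how a few thousand pairwise votes settle into a stable ranking.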
Contamination / Data Leakage
A persistent concern in benchmarking: if the model’s training data included the benchmark questions and answers, its scores are inflated. This is why model reports often discuss contamination analysis, checking whether benchmark data was in the training set. It’s also why new benchmarks keep being created and why “live” benchmarks like Arena matter.
09 · Capabilities & Techniques
Multimodal
A model that can process and/or generate more than just text. Common modalities include images (vision), audio, and video. A “vision-language model” (VLM) can take images as input alongside text. Multimodal models typically use a separate encoder (like a vision transformer) to convert non-text inputs into the same vector space the LLM operates in, then the LLM reasons over the combined representation.
Chain of Thought (CoT)
Also: Reasoning traces, step-by-step reasoning
A technique where the model is prompted (or trained) to show its reasoning step-by-step before giving a final answer. Instead of jumping to “42,” the model writes “First, I’ll calculate X… Then Y… Therefore 42.” This dramatically improves accuracy on math, logic, and complex reasoning tasks. Some newer models are specifically trained with explicit reasoning tokens, sometimes called “thinking” or “reasoning” models.
Extended Thinking / Reasoning Models
Also: o1-style reasoning, "thinking" models
Models specifically trained to use large amounts of internal chain-of-thought reasoning before producing an answer. They allocate extra “thinking tokens,” sometimes thousands, to work through complex problems. The model gets a private scratchpad for reasoning. The tradeoff: much better performance on hard tasks (math, coding, science), but higher latency and token cost per response.
Tool Use / Function Calling
The model’s ability to recognize when it needs an external tool (calculator, web search, code execution, API) and output a structured request to invoke it. Instead of trying to mentally calculate 7,394 × 8,261, the model outputs a function call like calculate(7394 * 8261). The result is passed back to the model, which incorporates it into its response. This is what makes AI agents possible in practice.
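The application side of this loop is ordinary code: parse the model's structured request, run the matching function, and send the result back. A minimal sketch, assuming a hypothetical JSON call format with `name` and `arguments` fields (real provider formats differ) and a hypothetical `multiply` tool.

```python
import json

def multiply(a, b):
    return a * b

# Registry of tools the application is willing to run (hypothetical).
TOOLS = {"multiply": multiply}

def run_tool_call(model_output):
    """Parse a JSON tool request emitted by the model and execute it."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# What the model might emit instead of attempting mental arithmetic.
model_output = '{"name": "multiply", "arguments": {"a": 7394, "b": 8261}}'
tool_result = run_tool_call(model_output)

# The result is wrapped as a message and passed back for the final answer.
tool_message = json.dumps({"role": "tool", "content": str(tool_result)})
```

The registry lookup matters: the model only proposes calls, and the application decides which ones actually run.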
RAG (Retrieval-Augmented Generation)
A system design where the model is given relevant documents (retrieved via search) as context alongside the user’s question. Instead of relying solely on memorized knowledge from training, the model can read fresh, specific information and cite it. RAG reduces hallucinations and allows models to work with proprietary or up-to-date information. RAG is an architecture pattern built around the model, not a model feature itself.
Agentic
An umbrella term for models/systems that can take autonomous multi-step actions to complete a task: plan, use tools, observe results, adjust, and repeat. An “agent” might browse the web, write and run code, manage files, and interact with APIs across multiple turns to solve a problem. “Agentic capabilities” in a model release usually means the model has been specifically trained to maintain goal-tracking, use tools reliably, and recover from errors across long action sequences.
System Prompt
A special set of instructions provided to the model at the start of a conversation that sets its behavior, personality, constraints, and capabilities. Unlike the user’s messages, the system prompt is typically written by the developer and is persistent across the conversation. It’s the primary mechanism for customizing model behavior without fine-tuning.
Zero-Shot / Few-Shot / Many-Shot
Zero-shot: asking the model to do a task with no examples. Few-shot: providing a handful of (input, output) examples in the prompt before the actual question. Many-shot: providing dozens or hundreds of examples (enabled by long context windows). Few-shot prompting is one of the most reliable ways to steer model behavior. Benchmark scores are often reported for both zero-shot and few-shot (e.g., “5-shot MMLU”).
Structured Output
Also: JSON mode, constrained decoding
The model’s ability to produce output in a specific format, typically valid JSON matching a provided schema. Without this, integrating LLMs into software systems is a headache. Some providers implement this via constrained decoding (guaranteeing valid output by restricting which tokens the model can produce at each step), while others rely on training + prompting.
10 · Infrastructure & Miscellaneous
FLOPS / FLOPs
Also: Floating Point Operations Per Second
A measure of computational power. FLOPs (lowercase ‘s’) refers to the total number of floating-point operations for a task (e.g., training a model took 10²⁵ FLOPs). FLOPS (uppercase ‘S’) is operations per second, a measure of hardware speed. An H100 GPU delivers roughly 1 petaFLOPS (10¹⁵) of dense BF16 compute, or about 2 petaFLOPS with structured sparsity. Training compute budgets are often quoted in FLOPs as the universal currency for comparing different training runs.
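The two units connect through time: total FLOPs divided by sustained FLOPS gives seconds. A rough sketch in which the cluster size, per-GPU peak, and 40% utilization are all assumed values for illustration.

```python
def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock training time from a compute budget."""
    sustained = num_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / sustained
    return seconds / 86_400

# Hypothetical run: 1e25 FLOPs on 10,000 GPUs at 1e15 peak FLOPS each,
# sustaining 40% of peak.
days = training_days(1e25, num_gpus=10_000, peak_flops_per_gpu=1e15,
                     utilization=0.4)
print(f"about {days:.0f} days")
```

Real runs rarely sustain peak hardware throughput, so the utilization factor often matters as much as the headline FLOPS figure.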
Tensor Parallelism / Pipeline Parallelism
Strategies for splitting a model across multiple GPUs. Tensor parallelism (TP) splits individual matrix operations across GPUs (each GPU computes part of each layer). Pipeline parallelism (PP) puts different layers on different GPUs (GPU 1 runs layers 1–20, GPU 2 runs layers 21–40). Both are typically combined for training very large models. You’ll see these in infrastructure specs for model releases.
Open Weights vs. Open Source
A distinction the community increasingly emphasizes. Open weights means the model parameters are downloadable and usable, but the training data, code, and full pipeline may not be shared. Open source (in the stricter sense) means the full stack (code, data, training pipeline) is available. Most models colloquially called “open source” (Llama, Mistral, etc.) are technically open-weight releases. The license terms vary widely (some allow commercial use, some don’t).
GGUF
A file format for storing quantized LLM weights, designed for the llama.cpp ecosystem. GGUF files encode the model weights, tokenizer, and metadata in a single file, and support various quantization levels (Q4_K_M, Q5_K_S, Q8_0, etc.). This is the standard format for running models locally on consumer hardware. The naming convention tells you the quantization: Q4 = 4-bit, Q5 = 5-bit, Q8 = 8-bit, with K_M/K_S indicating the specific quantization strategy.
Safetensors
A file format by Hugging Face for storing model weights safely and efficiently. Unlike Python’s pickle format (which can execute arbitrary code on load), safetensors is designed to be safe from code injection attacks, fast to load (supports memory mapping), and framework-agnostic. It’s the default format for models on Hugging Face Hub.
Hallucination
When a model confidently generates information that is factually wrong or completely fabricated: citing papers that don’t exist, inventing historical events, producing plausible-sounding but incorrect technical details. Every technique in this guide, from RAG to reasoning chains to human feedback, helps reduce hallucination to some degree. None of them eliminate it. If someone tells you their model doesn’t hallucinate, they’re selling something.
Context Window Utilization / “Needle in a Haystack”
Having a 128K context window doesn’t mean the model uses all of it equally well. The “Needle in a Haystack” test places a random fact at various positions in a long document and checks if the model can retrieve it. Models often perform worst for facts buried in the middle of long contexts. When a model release shows NIAH results, they’re demonstrating that the model can actually use its full context, not just accept it.
MCP (Model Context Protocol)
An open protocol (introduced by Anthropic) that standardizes how LLMs connect to external tools, data sources, and services. Instead of every app building custom tool integrations, MCP provides a universal interface (think USB-C for AI models). An MCP server exposes capabilities (search a database, read files, call an API), and any MCP-compatible application can connect a model to them without custom integration code.
Tokens vs. Words
A quick conversion note that helps interpret every other metric: 1 token ≈ 0.75 English words (or about 4 characters). So a 128K-token context window ≈ 96,000 words ≈ a 300-page book. Non-English languages and code typically use more tokens per word. When model cards quote speed in “tokens per second” or costs in “dollars per million tokens,” this rough ratio is how you translate it to human-scale quantities.
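The conversions are easy to keep as small helpers. A sketch using the 0.75 words-per-token rule of thumb from above; the price in the example is a made-up figure.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text

def tokens_to_words(num_tokens):
    return num_tokens * WORDS_PER_TOKEN

def words_to_tokens(num_words):
    return num_words / WORDS_PER_TOKEN

def cost_usd(num_tokens, usd_per_million_tokens):
    return num_tokens / 1_000_000 * usd_per_million_tokens

context_words = tokens_to_words(128_000)
essay_tokens = words_to_tokens(1_500)
# Hypothetical price: 3 dollars per million tokens.
bill = cost_usd(2_000_000, usd_per_million_tokens=3.0)
```

For non-English text or code, the ratio should be replaced with a value measured using the model's actual tokenizer.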
Last updated April 2026. Some of these terms will be obsolete by next quarter, and new ones will have replaced them. That’s just how this space works.