LLM parameters: what they are and how they actually work
You’ve seen the numbers. 7B. 70B. 405B.
Everyone talks about parameter counts. But what are they? Why does size matter? And what actually happens when you hit “Generate”?
This post covers the mechanics: what parameters are, where they live in the model architecture, how scaling affects them, and what that means if you’re running or choosing a model.
Table of Contents
- What is a parameter?
- The anatomy: weights, biases, and neurons
- Where parameters live
- The “billion” game: scaling and diminishing returns
- MoE: the modern shift
- Quantization: compressing the weights
- Key analogies
- Practical implications: VRAM, speed, and cost
- Conclusion
What is a parameter?
A parameter is just a number — specifically, a floating-point number the model learns during training.
An LLM doesn’t “know” facts the way a database does. Instead, training compresses patterns from the text it has seen into billions of numbers, organized as large matrices. Those numbers define how the model processes language.
The synthesizer analogy
Imagine a synthesizer with millions of knobs.
- Each knob is a parameter.
- Turning a knob changes the output.
- During training, the model listened to everything and adjusted every knob until it could reproduce any style.
When you write a prompt, you aren’t turning the knobs. You’re playing a melody. The melody causes them to respond in a specific way. The knobs themselves never move during inference — they’re frozen.
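To make the training/inference split concrete, here is a minimal sketch with just two knobs. It is a toy linear model fitted by plain gradient descent, not how an LLM is actually trained; the names `train` and `predict` and the sample data are made up for illustration.

```python
# Toy model with two parameters ("knobs"): w and b.

def predict(w, b, x):
    """Inference: read the knobs, never change them."""
    return w * x + b

def train(data, steps=1000, lr=0.05):
    """Training: nudge both knobs downhill on mean squared error."""
    w, b = 0.0, 0.0  # knobs start at an arbitrary position
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
w, b = train(data)
print(round(w, 3), round(b, 3))        # knobs settle close to 2 and 1
print(round(predict(w, b, 10.0), 3))   # a new input reuses the frozen knobs
```

Only `train` ever writes to `w` and `b`. Once it returns, every call to `predict` is a pure read, which is the sense in which the knobs are frozen at inference.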
The anatomy: weights, biases, and neurons
Weights
Weights determine how much an input matters. A weight with a large magnitude means a strong connection, a weight near zero means a weak one, and the sign decides whether the input pushes the output up or down.
Biases
Biases act as a threshold. They let the model shift its activation up or down regardless of input — a baseline calibration built into each neuron.
Neurons
Neurons are the computational units. Each takes inputs, multiplies them by weights, adds a bias, and applies a non-linear activation function (like ReLU, GELU, or the SiLU used inside SwiGLU).
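A single neuron is small enough to write out in full. This is a sketch using ReLU and hand-picked numbers; the function names and values are illustrative only.

```python
def relu(z):
    """Non-linear activation: pass positives through, clip negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, then the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)

inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]   # three parameters
print(neuron(inputs, weights, bias=1.0))    # 0.5 - 2.0 + 0.75 + 1.0 = 0.25
print(neuron(inputs, weights, bias=-1.0))   # pre-activation is -1.75, ReLU clips it to 0.0
```

The same inputs and weights give a different result once the bias moves, which is the "shift regardless of input" role described above. This neuron has four parameters in total: three weights and one bias.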
The recipe analogy
Think of a layer in an LLM as a chef following a recipe.
- Inputs: ingredients (words/tokens).
- Weights: how much of each ingredient to use.
- Bias: the base stock added regardless of what’s in the dish.
- Activation: the cooking process that transforms raw ingredients into something edible.
A model with more parameters is like a chef who has memorized millions of recipes. More parameters means more capacity for nuanced combinations.
Where parameters live
An LLM isn’t one giant block of numbers. It’s an architecture of layers, and parameters are distributed across all of them.
The embedding layer
Converts tokens into vectors — dense representations in high-dimensional space. The parameters here are lookup tables mapping token IDs to coordinates.
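As a rough sketch, an embedding layer is a table with one row per token ID. The three-word vocabulary and the 4-dimensional vectors below are hypothetical; real models use vocabularies of tens of thousands of tokens and thousands of dimensions.

```python
import random

vocab = {"the": 0, "cat": 1, "sat": 2}   # hypothetical tokenizer: token -> ID
d_model = 4                               # illustrative vector width

random.seed(0)
# One row of d_model learned numbers per token ID (random here, learned in practice).
embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    """Look up each token's row; no arithmetic, just indexing."""
    return [embedding[vocab[t]] for t in tokens]

vectors = embed(["the", "cat", "sat"])
n_params = len(embedding) * d_model
print(len(vectors), len(vectors[0]), n_params)   # 3 tokens, 4 dims each, 12 parameters
```

The parameter count of this layer is simply vocabulary size times vector width, which is why it grows quickly with large vocabularies.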
Attention layers
Determine which parts of the input are relevant to which other parts. Self-attention lets the model weigh relationships between tokens.
Parameters here include:
- Q, K, V matrices: queries, keys, and values used to compute attention scores
- Output projection: compresses attention results back into a vector
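Here is a minimal single-head sketch of how those four matrices are used, in plain Python so every step is visible. It assumes a causal mask (each token only attends to itself and earlier tokens) and uses identity matrices as stand-in weights; real models use many heads and learned values.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head causal self-attention over a sequence of token vectors."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    for i, row in enumerate(scores):
        for j in range(len(row)):
            # Scale by sqrt(d_k); hide future positions from token i.
            row[j] = row[j] / math.sqrt(d_k) if j <= i else float("-inf")
    weights = [softmax(row) for row in scores]
    return matmul(matmul(weights, v), w_o), weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 tokens, 2 dims each
identity = [[1.0, 0.0], [0.0, 1.0]]           # stand-in for learned weights
out, weights = self_attention(x, identity, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])          # each row sums to 1
```

The learned parameters are only `w_q`, `w_k`, `w_v`, and `w_o`. The attention weights themselves are recomputed from scratch for every input.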
Feed-forward networks (MLPs)
The “thinking” layer. Processes information from attention to perform transformations and feature extraction.
This is where roughly 60–70% of total parameters live, distributed across two large weight matrices (expand and compress), or three in gated variants such as SwiGLU (gate, up, and down).
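```python
from torch import nn

# Where parameters sit in a single layer.
# Sizes are illustrative; biases and normalization weights are omitted.
class Layer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)   # attention: queries
        self.k_proj = nn.Linear(d_model, d_model, bias=False)   # attention: keys
        self.v_proj = nn.Linear(d_model, d_model, bias=False)   # attention: values
        self.o_proj = nn.Linear(d_model, d_model, bias=False)   # attention: output projection
        self.ffn_gate = nn.Linear(d_model, d_ff, bias=False)    # feed-forward: gate (expand)
        self.ffn_up = nn.Linear(d_model, d_ff, bias=False)      # feed-forward: up (expand)
        self.ffn_down = nn.Linear(d_ff, d_model, bias=False)    # feed-forward: down (compress)
```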
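The feed-forward share quoted above can be checked with back-of-envelope arithmetic. This sketch assumes a classic layer whose feed-forward width is 4× the model width, and it ignores biases and normalization weights; `layer_params` is a hypothetical helper, not a library function.

```python
def layer_params(d_model, d_ff, gated=False):
    """Rough per-layer weight count: attention vs. feed-forward."""
    attention = 4 * d_model * d_model     # q, k, v, and output projections
    n_ffn_matrices = 3 if gated else 2    # gate + up + down vs. up + down
    ffn = n_ffn_matrices * d_model * d_ff
    return attention, ffn

attn, ffn = layer_params(d_model=4096, d_ff=4 * 4096)
print(attn, ffn)
print(round(ffn / (attn + ffn), 3))       # feed-forward share of the layer
```

With these assumptions the feed-forward block holds two thirds of the layer's weights. Gated variants and grouped-query attention shift the ratio, but the feed-forward block stays the largest part.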
The “billion” game: scaling and diminishing returns
When you see Llama-3.1-8B, the “8B” means 8 billion parameters.
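That "8B" can be reconstructed from the model's shape. The sketch below assumes the configuration values published for Llama-3.1-8B (vocabulary 128,256, width 4,096, 32 layers, feed-forward width 14,336, 32 query heads, 8 key/value heads, untied input and output embeddings); treat them as assumptions and check them against the model card before relying on the result.

```python
# Assumed Llama-3.1-8B shape; verify against the official config.
cfg = dict(vocab=128_256, d_model=4096, n_layers=32,
           d_ff=14_336, n_heads=32, n_kv_heads=8)

def count_params(vocab, d_model, n_layers, d_ff, n_heads, n_kv_heads):
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim                 # grouped-query attention shrinks K and V
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # q, o + k, v
    ffn = 3 * d_model * d_ff                       # gate, up, down
    norms = 2 * d_model                            # two normalization vectors per layer
    embeddings = 2 * vocab * d_model               # input table + separate output head
    total = n_layers * (attn + ffn + norms) + embeddings + d_model  # + final norm
    return {"total": total, "ffn": n_layers * ffn, "embeddings": embeddings}

counts = count_params(**cfg)
print(f"{counts['total']:,}")
print(round(counts["ffn"] / counts["total"], 3))   # feed-forward share of the whole model
```

Under these assumptions the total lands at about 8.03 billion, with roughly 70% of it in the feed-forward matrices.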
Does more mean better?
Generally yes, but not linearly.
- Capacity. More parameters let the model learn more complex patterns. (Maximum context length is set mainly by architecture and training, not by parameter count.)
- Data efficiency. For a fixed dataset, a larger model will generally outperform a smaller one — it has more capacity to organize and connect what it reads.
- The wall. Adding parameters yields diminishing returns if training data quality plateaus. A model trained on poor data learns to compress poor patterns more efficiently.
The library analogy
- Parameters are the model’s reasoning ability, not its library of facts.
- A 7B model is a sharp thinker with decent memory.
- A 70B model is a sharp thinker with more bandwidth to connect ideas.
- Training data is the library. Given the same library, the 70B will generally do more with it.
MoE: the modern shift
Models like Mixtral and Grok moved away from “dense” architectures to Mixture of Experts (MoE).
In a dense model, every parameter processes every token. In MoE, the model has multiple “expert” sub-networks and a gating mechanism. For each token, only a few experts activate.
The consultant team analogy
- Dense model: one super-generalist who handles everything. Gets slow and expensive to scale.
- MoE model: a firm of specialists. A coding question goes to the Python expert and the C++ expert. A poetry question goes to the literature expert.
- You get the capacity of 100 experts but only pay the compute cost of activating a few.
MoE models can have hundreds of billions of total parameters, in some cases trillions, but activate only a fraction of them per token. Mixtral-8x7B, for example, has 46.7B total parameters but activates roughly 12.9B per token, which is why it can run faster than a comparably sized dense model.
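A minimal sketch of top-2 routing looks like this. The eight "experts" are toy functions, and the router scores are passed in directly; in a real model they come from a small learned router applied to each token. All names here are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs."""
    return sum(gate * experts[i](x) for i, gate in route(router_logits, k))

experts = [lambda x, s=s: s * x for s in range(1, 9)]   # 8 toy experts: multiply by 1..8
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]     # hypothetical router scores

print(route(logits))                       # experts 1 and 4, equal gate weights
print(moe_forward(1.0, experts, logits))   # only 2 of the 8 experts ran
print(round(12.9 / 46.7, 3))               # Mixtral's active share per token
```

Mixtral's active share is a bit more than 2 of 8 because only the feed-forward blocks are split into experts; attention and embeddings are shared and run for every token.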
Quantization: compressing the weights
Parameters are usually stored in FP16 or BF16 (16-bit floating point), using 2 bytes each. Quantization compresses them further:
- INT8: 8-bit integers — 1 byte per parameter, ~50% reduction vs FP16
- INT4: 4-bit integers — 0.5 bytes per parameter, ~75% reduction vs FP16
This reduces VRAM usage and speeds up inference, with acceptable quality loss for most use cases. Tools like llama.cpp and its GGUF format have made INT4 quantization practical for running capable models on consumer hardware.
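To see what the compression does to individual weights, here is a naive round-trip using one scale for the whole list (symmetric "absmax" quantization). It is a sketch of the basic idea only; production formats such as GGUF quantize in small blocks with more elaborate schemes. The sample weights are made up.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8, 7 for INT4
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats at inference time."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.01, -0.27]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(bits, q, round(worst, 4))                   # integers stored, worst-case error
```

The rounding error is bounded by half the scale, so halving the bit width makes each step coarser. That is the quality cost traded for the smaller footprint.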
The photo analogy
- FP16: an uncompressed RAW photo. Full detail, large file.
- INT4: a well-optimized JPEG. Much smaller, looks nearly identical.
Modern quantization tools are selective: they keep the weights that matter most at higher precision and compress the rest more aggressively.
Key analogies
| Concept | Analogy |
|---|---|
| Parameters | Frozen knobs on a synthesizer |
| Weights | Gain on a specific signal |
| Bias | Baseline shift or threshold |
| Training | Adjusting knobs until output matches reality |
| Inference | Playing a melody through the fixed knobs |
| Parameters vs. data | Thinker’s ability vs. the library it reads |
| MoE | Specialist firm — only relevant experts activated |
| Quantization | Smart JPEG compression |
Practical implications: VRAM, speed, and cost
VRAM
FP16/BF16 uses 2 bytes per parameter. INT8 uses 1 byte. INT4 uses 0.5 bytes.
Quick calculation: 7B parameters in BF16 = 7 × 10⁹ × 2 bytes ≈ 14 GB
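You also need VRAM for the KV cache (the keys and values stored for every token in your context window). A long context can add 20–50% on top. A 7B model in BF16 therefore fits on a single 24 GB GPU such as an RTX 3090 or 4090, with room for a moderate context. Hardware in the class of an A100 (80GB) or dual A6000s is what a 70B model calls for, and even then usually with INT8 or INT4 quantization.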
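The arithmetic is easy to script. This sketch estimates weight memory at each precision and the KV cache for one sequence. The layer and head counts are assumptions for a typical 7B-class model without grouped-query attention (32 layers, 32 key/value heads, head size 128), and the helper names are hypothetical.

```python
GB = 1e9

def weight_bytes(n_params, bytes_per_param):
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch=1):
    """One key and one value vector per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value * batch

for name, bytes_per_param in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(name, weight_bytes(7e9, bytes_per_param) / GB, "GB of weights")

# Assumed 7B-class shape, 4,096-token context, 16-bit cache.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
print(round(kv / GB, 2), "GB of KV cache")
print(round((weight_bytes(7e9, 2) + kv) / GB, 1), "GB total, before runtime overhead")
```

The cache grows linearly with context length and batch size, so long contexts or many concurrent requests push the total up quickly. Runtime overhead such as activations and framework buffers comes on top.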
Inference speed
Dense models scale predictably: fewer parameters, faster generation. MoE models can run faster than their total parameter count suggests because only a subset of weights activate per token — though the routing mechanism adds some overhead.
Cost
More parameters means higher training cost and typically higher serving cost, unless MoE or quantization brings that down.
Conclusion
Parameters are the compressed form of everything the model learned — patterns, relationships, and reasoning shortcuts encoded as billions of floating-point numbers.
A few things worth keeping in mind when choosing or deploying a model:
- More parameters generally means higher capacity, but only if the training data is there to support it. A large model trained on poor data is still a poor model.
- Architecture matters as much as count. An MoE model with 46.7B total parameters spends roughly the per-token compute of a dense 13B model and can compete with much larger dense models, but it still needs memory for all 46.7B weights.
- Quantization lets you run capable models on consumer hardware with acceptable quality loss. INT4 at 7B is often more practical than FP16 at 3B.
- Parameters and data are separate dimensions. Don’t conflate model size with model knowledge.