LLM parameters: what they are and how they actually work

You’ve seen the numbers. 7B. 70B. 405B.

Everyone talks about parameter counts. But what are they? Why does size matter? And what actually happens when you hit “Generate”?

This post covers the mechanics: what parameters are, where they live in the model architecture, how scaling affects them, and what that means if you’re running or choosing a model.

What is a parameter?
The anatomy: weights, biases, and neurons
Where parameters live
The “billion” game: scaling and diminishing returns
MoE: the modern shift
Quantization: compressing the weights
Key analogies
Practical implications: VRAM, speed, and cost
Conclusion

What is a parameter?

A parameter is just a number — specifically, a floating-point number the model learns during training.

An LLM doesn’t “know” facts the way a database does. Instead, training compresses patterns from the text it has seen into billions of numbers, organized as large matrices. Those numbers define how the model processes language.

The synthesizer analogy
Imagine a synthesizer with billions of knobs.

Each knob is a parameter.
Turning a knob changes the output.
During training, the model listened to everything and adjusted every knob until it could reproduce any style.
When you write a prompt, you aren’t turning the knobs. You’re playing a melody. The melody causes them to respond in a specific way. The knobs themselves never move during inference — they’re frozen.
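To make the "frozen knobs" idea concrete, here is a minimal sketch with a toy model that has just three parameters. The names `TinyModel`, `predict`, and `train_step` are made up for illustration and do not come from any library.

```python
class TinyModel:
    """A toy 'model' with three parameters: two weights and one bias."""

    def __init__(self):
        self.params = [0.5, -1.0, 0.1]  # w1, w2, b

    def predict(self, x1, x2):
        # Inference only reads the parameters; it never writes them.
        w1, w2, b = self.params
        return w1 * x1 + w2 * x2 + b

    def train_step(self, x1, x2, target, lr=0.1):
        # Training nudges each parameter along the gradient of half the squared error.
        error = self.predict(x1, x2) - target
        grads = [error * x1, error * x2, error]
        self.params = [p - lr * g for p, g in zip(self.params, grads)]


model = TinyModel()
before = list(model.params)

model.predict(1.0, 2.0)            # inference: the knobs stay put
unchanged = model.params == before

model.train_step(1.0, 2.0, 3.0)    # training: the knobs move
changed = model.params != before
```

A real LLM does the same thing at a vastly larger scale: generation only calls the equivalent of `predict`.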

The anatomy: weights, biases, and neurons

Weights

Weights determine how strongly an input influences the output. If a connection between two neurons is strong, the weight is large in magnitude (positive to reinforce, negative to suppress). If it’s weak, the weight is near zero.

Biases

Biases act as a threshold. They let the model shift its activation up or down regardless of input — a baseline calibration built into each neuron.

Neurons

Neurons are the computational units. Each takes inputs, multiplies them by weights, adds a bias, and applies a non-linear activation function (like ReLU or SwiGLU).
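The neuron described above fits in a few lines. This is a minimal sketch using ReLU and made-up numbers; the function names are illustrative.

```python
def relu(x):
    # ReLU passes positive values through and clamps negatives to zero.
    return max(0.0, x)


def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus the bias, passed through the activation.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(total)


# 0.8 + 1.0 - 0.3 is positive, so the neuron fires.
strong = neuron([1.0, 2.0], [0.8, 0.5], bias=-0.3)

# 0.1 - 0.8 - 0.3 is negative, so ReLU outputs zero.
silent = neuron([1.0, 2.0], [0.1, -0.4], bias=-0.3)
```

Same inputs, different weights, different behavior: that is all "learning" changes.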

The recipe analogy
Think of a layer in an LLM as a chef following a recipe.

Inputs: ingredients (words/tokens).
Weights: how much of each ingredient to use.
Bias: the base stock added regardless of what’s in the dish.
Activation: the cooking process that transforms raw ingredients into something edible.
A model with more parameters is like a chef who has memorized millions of recipes. More parameters means more capacity for nuanced combinations.

Where parameters live

An LLM isn’t one giant block of numbers. It’s an architecture of layers, and parameters are distributed across all of them.

The embedding layer

Converts tokens into vectors — dense representations in high-dimensional space. The parameters here are lookup tables mapping token IDs to coordinates.
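As a rough sketch of that lookup, assume a toy three-word vocabulary and 4-dimensional vectors (real models use tens of thousands of tokens and thousands of dimensions). The names here are hypothetical.

```python
import random

random.seed(0)

vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary (hypothetical)
d_model = 4                               # vector size per token

# The embedding table is a parameter matrix with one row per token ID.
embedding = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]


def embed(tokens):
    # Look up each token's row; no arithmetic is involved.
    return [embedding[vocab[t]] for t in tokens]


vectors = embed(["the", "cat"])
n_params = len(vocab) * d_model   # rows times columns
```

The parameter count of this layer is simply vocabulary size times vector size.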

Attention layers

Determines which parts of the input are relevant to which other parts. Self-attention lets the model weigh relationships between words.

Parameters here include:

Q, K, V matrices: queries, keys, and values used to compute attention scores
Output projection: mixes the combined attention results back into a single vector per token
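Here is a minimal sketch of how those four matrices are used, written in plain Python so it runs anywhere. It assumes a single attention head and no causal mask, which real LLMs do not; the helper names are illustrative.

```python
import math


def matmul(a, b):
    # Plain-Python matrix multiply: rows of a times columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(x, w_q, w_k, w_v, w_o):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    # Score every token against every other token, scaled by sqrt(d).
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Mix the value vectors by those weights, then apply the output projection.
    return matmul(matmul(weights, v), w_o), weights


identity = [[1.0, 0.0], [0.0, 1.0]]          # stand-in for learned matrices
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three tokens, d_model = 2
out, weights = attention(x, identity, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token attends to each of the others.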

Feed-forward networks (MLPs)

The “thinking” layer. Processes information from attention to perform transformations and feature extraction.

This is where roughly 60–70% of total parameters live, distributed across two large weight matrices (expand and compress), or three in gated designs such as SwiGLU, which add a gate projection.

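# Where parameters sit in a single layer.
# Each attribute holds the shape of one weight matrix (illustrative sizes);
# normalization weights and biases are omitted.
class Layer:
    def __init__(self, d_model=4096, d_ff=11008):
        self.q_proj = (d_model, d_model)    # attention: queries
        self.k_proj = (d_model, d_model)    # attention: keys
        self.v_proj = (d_model, d_model)    # attention: values
        self.o_proj = (d_model, d_model)    # attention: output projection
        self.ffn_gate = (d_model, d_ff)     # feed-forward: gate
        self.ffn_up = (d_model, d_ff)       # feed-forward: expand
        self.ffn_down = (d_ff, d_model)     # feed-forward: compress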
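To see where the "roughly 60–70%" figure comes from, here is a back-of-the-envelope count. The dimensions are assumptions chosen to resemble a 7B-class dense model, and the feed-forward block is assumed to be gated (gate, up, and down projections). Normalization weights are ignored because they are tiny by comparison.

```python
def layer_params(d_model, d_ff):
    attention = 4 * d_model * d_model   # q, k, v, and output projections
    ffn = 3 * d_model * d_ff            # gate, up, and down projections
    return attention, ffn


# Assumed dimensions for a 7B-class dense model (illustrative).
d_model, d_ff, n_layers, vocab_size = 4096, 11008, 32, 32000

attention, ffn = layer_params(d_model, d_ff)
embeddings = 2 * vocab_size * d_model   # input table plus a separate output head
total = n_layers * (attention + ffn) + embeddings

ffn_share = n_layers * ffn / total
print(f"total: {total / 1e9:.2f}B, feed-forward share: {ffn_share:.0%}")
```

With these numbers the total lands near 6.7B, and the feed-forward matrices account for a bit under two-thirds of it.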

The “billion” game: scaling and diminishing returns

When you see Llama-3.1-8B, the “8B” means 8 billion parameters.

Does more mean better?

Generally yes, but not linearly.

Capacity. More parameters let the model learn more complex patterns. (Context length is a separate design choice, not something parameter count determines.)
Data efficiency. For a fixed dataset, a larger model will generally outperform a smaller one — it has more capacity to organize and connect what it reads.
The wall. Adding parameters yields diminishing returns if training data quality plateaus. A model trained on poor data learns to compress poor patterns more efficiently.

The library analogy

Think of parameters as the model’s capacity to absorb and reason over what it reads, not as the library itself.
A 7B model is a sharp thinker with decent memory.
A 70B model is a sharp thinker with more bandwidth to connect ideas.
Training data is the library. Given the same library, the 70B will generally do more with it.

MoE: the modern shift

Models like Mixtral and Grok moved away from “dense” architectures to Mixture of Experts (MoE).

In a dense model, every parameter processes every token. In MoE, the model has multiple “expert” sub-networks and a gating mechanism. For each token, only a few experts activate.
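The gating step can be sketched as top-k routing. This is a simplified illustration, not any particular model's implementation: in a real MoE layer the gate scores come from a learned linear layer applied to the token, while here they are passed in directly, and the "experts" are toy functions.

```python
import math


def top_k_route(gate_logits, k=2):
    # Pick the k experts with the highest gate scores for this token.
    chosen = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the chosen scores with a softmax so they sum to 1.
    exps = [math.exp(gate_logits[i] - gate_logits[chosen[0]]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    # Only the selected experts run; the rest are skipped entirely.
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Four toy "experts": each just scales its input (stand-ins for feed-forward blocks).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
y = moe_forward(10.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Experts 0 and 2 never execute for this token, which is where the compute savings come from.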

The consultant team analogy

Dense model: one super-generalist who handles everything. Gets slow and expensive to scale.
MoE model: a firm of specialists. A coding question goes to the Python expert and the C++ expert. A poetry question goes to the literature expert.
You get the capacity of 100 experts but only pay the compute cost of activating a few.

MoE models can have hundreds of billions of total parameters, or more, while activating only a fraction of them per token. Mixtral-8x7B, for example, has 46.7B total parameters but activates roughly 12.9B per token, which is why it can run faster than a dense model of similar total size.
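The Mixtral figures also explain why "8x7B" is not 56B. A quick sketch, assuming 8 experts with 2 active per token and treating everything outside the experts (attention, embeddings) as shared:

```python
total_b, active_b = 46.7, 12.9   # billions, figures quoted above
n_experts, k = 8, 2              # assumed: 8 experts, 2 active per token

# total  = shared + n_experts * expert
# active = shared + k * expert
# Subtracting the two equations isolates the per-expert size.
expert_b = (total_b - active_b) / (n_experts - k)
shared_b = active_b - k * expert_b

print(f"per-expert: {expert_b:.2f}B, shared: {shared_b:.2f}B")
print(f"active fraction: {active_b / total_b:.0%}")
```

Under these assumptions each expert's weights (summed across all layers) come to roughly 5.6B, with about 1.6B shared, so only a bit over a quarter of the model does work on any given token.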

Quantization: compressing the weights

Parameters are usually stored in FP16 or BF16 (16-bit floating point), using 2 bytes each. Quantization compresses them further:

INT8: 8-bit integers — 1 byte per parameter, ~50% reduction vs FP16
INT4: 4-bit integers — 0.5 bytes per parameter, ~75% reduction vs FP16

This reduces VRAM usage and usually speeds up inference, with acceptable quality loss for most use cases. Tools like llama.cpp and its GGUF format have made 4-bit quantization practical for running capable models on consumer hardware.
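The simplest version of the idea is symmetric "absmax" quantization: store one floating-point scale plus small integers. The sketch below shows that scheme for INT8 only. Real tools use more elaborate variants (for example, separate scales per small block of weights), so treat this as an illustration rather than what any specific tool does.

```python
def quantize_int8(weights):
    # Map the largest magnitude to 127 and round everything else to match.
    # The "or 1.0" avoids dividing by zero when every weight is zero.
    scale = (max(abs(w) for w in weights) or 1.0) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    # Recover approximate floats from the stored integers.
    return [q * scale for q in q_weights]


weights = [0.12, -0.53, 0.91, -0.07, 0.33]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding error is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte, and the round-trip error stays within half a step of the original value.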

The photo analogy

FP16: an uncompressed RAW photo. Full detail, large file.
INT4: a well-optimized JPEG. Much smaller, looks nearly identical.
Modern quantization tools are selective — they preserve important patterns and smooth out the parts that matter less.

Key analogies

Concept	Analogy
Parameters	Frozen knobs on a synthesizer
Weights	Gain on a specific signal
Bias	Baseline shift or threshold
Training	Adjusting knobs until output matches reality
Inference	Playing a melody through the fixed knobs
Parameters vs. data	The thinker vs. the library it reads from
MoE	Specialist firm — only relevant experts activated
Quantization	Smart JPEG compression

Practical implications: VRAM, speed, and cost

VRAM

FP16/BF16 uses 2 bytes per parameter. INT8 uses 1 byte. INT4 uses 0.5 bytes.

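Quick calculation: 7B parameters in BF16 = 7 × 10⁹ × 2 bytes ≈ 14 GB
You also need VRAM for the KV cache (the keys and values stored for every token in your context window). Its size grows with context length, and a long context can add 20–50% on top. A 7B model in BF16 with decent context therefore fits on a single 24 GB GPU such as an RTX 3090 or 4090; quantized to INT4, it fits on cards with 8 GB. Hardware like an A100 (80GB) or dual A6000s only becomes necessary around the 70B scale, where even INT4 weights take about 35 GB.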
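Both numbers are easy to estimate yourself. This sketch assumes a plain multi-head attention layout (32 layers, 32 key/value heads of size 128) and a 16-bit cache; those shapes are illustrative, and models that share key/value heads across queries will have a smaller cache.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(n_params, fmt):
    # Memory needed just to hold the weights.
    return n_params * BYTES_PER_PARAM[fmt]


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # One key vector and one value vector per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GB = 1e9
weights_gb = weight_bytes(7e9, "bf16") / GB
int4_gb = weight_bytes(7e9, "int4") / GB

# Assumed shape: 32 layers, 32 KV heads of size 128, 4096-token context.
cache_gb = kv_cache_bytes(32, 32, 128, 4096) / GB

print(f"BF16 weights: {weights_gb:.1f} GB, INT4 weights: {int4_gb:.1f} GB")
print(f"KV cache at 4096 tokens: {cache_gb:.2f} GB")
```

Under these assumptions the cache adds a little over 2 GB at 4,096 tokens, and it doubles every time the context doubles.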

Inference speed

Dense models scale predictably: fewer parameters, faster generation. MoE models can run faster than their total parameter count suggests because only a subset of weights activate per token — though the routing mechanism adds some overhead.
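One rough way to reason about this, offered as a rule of thumb rather than something measured here: when generating one token at a time, the hardware has to read every active weight from memory, so speed is bounded by memory bandwidth divided by the bytes read per token. The bandwidth figure below is hypothetical, and the estimate ignores routing overhead, batching, and KV cache reads.

```python
def rough_tokens_per_second(active_params, bytes_per_param, bandwidth_bytes_per_s):
    # Upper bound: each generated token requires reading every active weight once.
    return bandwidth_bytes_per_s / (active_params * bytes_per_param)


bandwidth = 1e12   # hypothetical accelerator with 1 TB/s of memory bandwidth

# A dense model with 46.7B parameters vs. an MoE that activates 12.9B per token,
# both stored in 16-bit precision.
dense_speed = rough_tokens_per_second(46.7e9, 2, bandwidth)
moe_speed = rough_tokens_per_second(12.9e9, 2, bandwidth)
speedup = moe_speed / dense_speed
```

By this estimate the MoE model generates a few times faster than a dense model of the same total size, even though both need the same memory to hold their weights.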

Cost

More parameters means higher training cost and typically higher serving cost, unless MoE or quantization brings that down.

Conclusion

Parameters are the compressed form of everything the model learned — patterns, relationships, and reasoning shortcuts encoded as billions of floating-point numbers.

A few things worth keeping in mind when choosing or deploying a model:

More parameters generally means higher capacity, but only if the training data is there to support it. A large model trained on poor data is still a poor model.
Architecture matters as much as count. An MoE model with 46.7B total parameters can outperform a dense 13B model at a similar per-token compute cost, though all of its weights still have to fit in memory.
Quantization lets you run capable models on consumer hardware with acceptable quality loss. INT4 at 7B is often more practical than FP16 at 3B.
Parameters and data are separate dimensions. Don’t conflate model size with model knowledge.

“The wisest algorithm knows the boundaries of its training data.” - Anon

Rushi's

Ctrl+AI+Ship