How to Pick the Right Model to Run on Your Local Machine
You’ve heard the pitch: run AI privately, offline, on your own hardware — no API keys, no usage limits, no data leaving your machine. You open Hugging Face, find a model called Qwen3-30B-A3B-GGUF, download 20GB, try to run it, and your laptop grinds to a halt or produces nothing at all.
The problem isn’t that local AI is hard. It’s that the model ecosystem has a lot of implicit knowledge baked into it — naming conventions, file formats, and hardware math that nobody explained to you. This post will fix that.
A filename like Llama-3.2-8B-Instruct-Q4_K_M.gguf should make sense by the end. You’ll know which hardware specs actually matter, and there’s a tool called llmfit that does most of the matching work for you.
Table Of Contents
- What the Numbers Mean: Parameters
- Decoding Model Names
- File Format and Quantization: The Most Misunderstood Part
- Your Hardware: What Actually Matters
- The Decision Framework: Mapping Hardware to Model
- Tools to Help You Decide
- Trade-offs to Know Before You Start
- Putting It Together: A Quick Checklist
- What to Try First
What the Numbers Mean: Parameters
The first number you’ll see in any model name — 7B, 13B, 70B — is the parameter count, measured in billions. Parameters are the learned weights inside the neural network: numerical values the model adjusts during training to get better at predicting the next word.
More parameters generally means a more capable model, but the relationship isn’t linear. A well-trained 8B model can outperform a poorly-trained 13B model on many tasks. Parameter count is better understood as a proxy for capacity — how much the model can learn — not a direct quality guarantee.
Here’s a rough mental model for current (mid-2026) models:
- 1–3B: Fast, cheap to run, useful for simple tasks like classification, summarizing short text, and basic Q&A. Will struggle with complex reasoning or long multi-step instructions.
- 7–8B: The sweet spot for most use cases on consumer hardware. Models like Llama 3.2 8B and Qwen3 8B punch well above their size class.
- 13–14B: Noticeably better at nuanced reasoning and following complex instructions. Requires a bit more hardware headroom.
- 30–34B: Strong performance across most benchmarks, but starts requiring a dedicated GPU with 24GB+ VRAM for comfortable inference speeds.
- 70B+: Near frontier quality for many tasks. Requires either very high-end consumer hardware (RTX 4090, M3 Max) or CPU-only inference, which is slow.
There’s also a newer architecture category — Mixture of Experts (MoE) — that changes this math. A model labeled Qwen3-30B-A3B has 30B total parameters, but only 3B are active per token (A3B means “3B active”). The rest are dormant experts that activate selectively depending on the input. This means a 30B MoE model can behave like a much larger model in terms of quality while requiring similar VRAM to a 7B dense model. More on this in the hardware section.
Decoding Model Names
Model names follow loose conventions rather than strict standards. Here’s a breakdown of the parts you’ll commonly see:
Family and Base Name
Llama, Qwen, Mistral, Gemma, Phi, DeepSeek — these identify the architecture lineage and who trained the base model. They matter because different families have different strengths. As of mid-2026, Qwen3 and Llama 3.x are strong general-purpose choices; Qwen2.5-Coder and DeepSeek-Coder are particularly good for code; Phi-4 models are designed for efficiency at small sizes.
Version Numbers
Llama-3.2 vs Llama-3.1 — minor version bumps often indicate updated training data, fine-tuning improvements, or context window expansion. Check the release notes rather than assuming newer is always better for your task. Sometimes older versions are better-documented and more compatible with existing tooling.
Instruct vs. Base
This distinction matters more than most people realize.
Base models are trained to predict the next token. They’re good at completion tasks but won’t follow instructions the way you expect a chat assistant to. If you prompt a base model with “Write me a poem about the ocean,” it might respond with “Write me a poem about the sky” — continuing the style, not answering you.
Instruct models (also called Chat models) are fine-tuned with reinforcement learning from human feedback (RLHF) or similar techniques to follow instructions and have conversations. They expect a specific prompt format, which the runtime usually handles for you.
For most beginners, use an Instruct model. Base models are useful for specific research or fine-tuning workflows.
Other Suffixes
- Vision — Multimodal; can process images alongside text
- Coder / Code — Fine-tuned on code; better for programming tasks
- Math — Fine-tuned on mathematical reasoning problems
- v2, v3, Turbo, Mini — Usually indicate capability tiers or speed-optimized variants within a family
File Format and Quantization: The Most Misunderstood Part
Once you know which model you want, you’ll typically choose between a few different files representing the same model. This is where quantization enters the picture.
What GGUF Is
GGUF is a file format for storing model weights, introduced by the llama.cpp project. It’s the format you’ll encounter most often for running models locally on CPU or with partial GPU offloading. An alternative format, GGML, is its predecessor and is now largely deprecated. On Apple Silicon, you’ll also see MLX format files, which are optimized for Apple’s Metal GPU framework and generally offer better performance on Macs.
What Quantization Is
Neural networks are trained with 32-bit or 16-bit floating-point weights — high-precision numbers that take up a lot of memory. Quantization converts those weights to lower precision (8-bit, 4-bit, etc.) at inference time. A quantized model uses significantly less RAM and VRAM, runs faster, and accepts only a small quality penalty.
Here’s a concrete sense of scale: a 7B parameter model in its original F16 (16-bit float) format weighs about 13.5GB. The same model quantized to 4-bit (Q4_K_M) weighs about 4.1GB — 70% smaller — while retaining approximately 92–95% of the full-precision output quality on most benchmarks.
Reading the GGUF Naming Scheme
A filename like Mistral-7B-Instruct-Q4_K_M.gguf breaks down as:
[Model Family]-[Size]-[Type]-[Quantization].[Format]
Mistral 7B Instruct Q4_K_M .gguf
The quantization tag decodes as follows:
The number (Q2, Q3, Q4, Q5, Q6, Q8) is bits per weight on average. Q4 = 4 bits average, Q8 = 8 bits average. Higher is better quality, larger file size.
The _K suffix means “K-quantization” — a smarter compression approach that allocates precision unevenly. Instead of quantizing every weight the same way, K-quants identify layers that are more sensitive to precision loss (typically attention layers and the final output layer) and keep those at higher precision while compressing less-critical weights more aggressively. The result is better quality at the same average bit-width compared to older legacy quants.
The _M, _S, _L suffix after K:
_S(Small) — More aggressive compression, smallest file, slightly lower quality_M(Medium) — The balanced choice. Slightly larger than_S, noticeably better quality_L(Large) — Less common; even better quality preservation, closer to the next quant tier
So Q4_K_M means: average 4 bits per weight, K-quant smart allocation, medium size/quality variant. Q5_K_S means: 5 bits per weight, K-quant, small variant.
Q8_0 is 8-bit quantization — the nearly lossless option. Quality is effectively indistinguishable from F16 on most tasks, but the file is twice the size of Q4. Use this if you have the VRAM and want maximum quality.
For most beginners, Q4_K_M is the right default. If you’re doing work where quality matters a lot (coding, complex reasoning), consider Q5_K_M. Only reach for Q8_0 if you’ve confirmed your hardware has comfortable headroom.
IQ Quants (iQ3_XS, iQ4_NL, etc.)
You’ll also see a newer “importance matrix” quantization family prefixed with iQ or IQ. These use a calibration dataset during quantization to identify which weights matter most, squeezing out slightly better quality at the same bit-width compared to standard K-quants. They take longer to generate and aren’t always available, but if you see an iQ4_XS file alongside the standard Q4_K_M, the IQ version will usually be higher quality at a similar or smaller size.
Your Hardware: What Actually Matters
Running a model locally comes down to three hardware constraints: VRAM, system RAM, and CPU. They interact, but VRAM is where most decisions get made.
VRAM — The Primary Bottleneck
VRAM is the memory on your GPU (graphics card). GPUs are fast at the matrix math LLMs need, and the model has to fit in VRAM to run at GPU speed. If your model is 5GB and your GPU has 8GB VRAM, it fits. If the model is 9GB, it doesn’t — and you fall back to a slower mode (CPU offloading or full CPU inference).
Approximate VRAM needed by model size (assuming Q4_K_M quantization):
| Model Size | VRAM Needed (Q4_K_M) | Example Models |
|---|---|---|
| 3B | ~2.5 GB | Phi-4-mini, Llama 3.2 3B |
| 7–8B | ~4.5–5.5 GB | Llama 3.2 8B, Mistral 7B, Qwen3 8B |
| 13–14B | ~8–10 GB | Qwen2.5 14B, Phi-4 14B |
| 30B (dense) | ~18–20 GB | Qwen2.5 32B |
| 30B (MoE, A3B active) | ~3–4 GB | Qwen3-30B-A3B |
| 70B | ~40–45 GB | Llama 3.3 70B |
The MoE row is not a typo. Qwen3-30B-A3B has 30B total parameters but only 3B active at inference time, so it fits in the same VRAM budget as a 3B dense model while delivering quality closer to a 30B model.
Common GPU VRAM tiers:
- 6GB (RTX 3060, GTX 1660 Super): 7B models at Q4_K_M, or small MoE models
- 8GB (RTX 3070, RTX 4060): 7–8B comfortably, 13B with some compromises
- 12GB (RTX 3080 12GB, RTX 4070): 13B comfortably, some 20B with smaller quants
- 16GB (RTX 4080, M2/M3 Max 16GB): up to ~20B comfortably
- 24GB (RTX 3090, RTX 4090): 30B dense, or 70B with significant CPU offloading
- Apple Silicon unified memory: The entire system RAM is available to the GPU. An M3 Pro with 36GB can run 30B+ models that a 24GB discrete GPU couldn’t.
System RAM — The Overflow Buffer
When a model doesn’t fully fit in VRAM, inference runtimes like llama.cpp and Ollama can offload layers to system RAM, running some layers on CPU and others on GPU. This hybrid mode works, but it’s slower — typically 30–70% of full-GPU speed depending on how much got offloaded.
Keep at least twice your model’s file size available as free RAM. A 5GB model should ideally have 10GB+ free. That headroom covers the model weights plus the KV cache (the working memory for the current conversation context).
On machines with no GPU, you can run models in pure CPU mode. Expect roughly 5–20 tokens/second for a 7B Q4 model on a modern 8-core laptop CPU — usable for chat, slow for generating long documents.
Context Length — The Hidden Cost
Context length is the maximum amount of text a model can “see” at once: your conversation history, attached documents, and the model’s own prior output. Common context windows range from 8K to 128K tokens (a token is roughly 0.75 words).
Longer context costs memory. A model’s base weight size is fixed, but the KV cache — which stores intermediate computations for everything in the current context — grows linearly with context length. Running a 7B model at 128K context requires roughly 6–8GB of additional VRAM on top of the model weights.
If you’re memory-constrained, cap your context. Most chat use cases don’t need more than 8K–16K tokens. Tools like llmfit let you set a --max-context flag to estimate memory at a specific context length rather than the model’s advertised maximum.
Storage and Disk Speed
A 70B Q4 model is ~40GB on disk. Load time from an NVMe SSD is seconds; from a spinning hard drive, it can be minutes. Keep frequently-used models on your fastest drive.
The Decision Framework: Mapping Hardware to Model
Here’s how to work through a model selection, step by step.
- Determine your VRAM (or RAM, if no GPU). On a Mac: Apple menu → About This Mac → Memory. All of it is usable as VRAM. On Linux/Windows with NVIDIA:
nvidia-smiin a terminal. Look forTotalunder memory. No GPU: use your free system RAM as the constraint. - Apply the memory budget. Take your available VRAM and multiply by 0.85 — leave 15% headroom for the OS, KV cache, and other overhead. That’s your usable budget.
- Pick a model size that fits at Q4_K_M. Use the table above. If your budget is 6GB, a 7B model at Q4_K_M (~4.5GB) fits. A 13B model (~8.5GB) doesn’t.
- Consider going up to Q5_K_M if you have extra headroom. If the Q4 model leaves you 2–3GB to spare, the Q5 version is often worth the extra ~1GB for noticeably better output quality, especially for reasoning-heavy tasks.
- Check if there’s a MoE option. For any model size you’re targeting, check if there’s a MoE variant. A
Qwen3-30B-A3Bthat needs only ~4GB VRAM may outperform a dense 7B model while fitting in the same memory budget. - Verify with a tool, not your intuition. Memory math is easy to get wrong. Use
llmfit(covered below) to let hardware detection and scoring do the work.
Tools to Help You Decide
llmfit — Hardware-Aware Model Matching
llmfit (26.8k stars as of mid-2026) is a terminal tool that auto-detects your hardware and scores every model in its database across quality, speed, memory fit, and context dimensions. It’s the fastest way to answer “what can I actually run?”
Install it:
# macOS / Linux (Homebrew)
brew install llmfit
# macOS / Linux (quick install)
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
# Windows (Scoop)
scoop install llmfit
Run it:
llmfit
This launches an interactive TUI showing your detected hardware at the top (CPU, RAM, GPU name, VRAM) and a sortable table of models ranked by composite fit score. Each row shows the model’s estimated tokens/second, best quantization for your hardware, run mode (GPU, CPU+GPU, CPU), memory usage percentage, and use case category.
You can filter by use case from the command line:
# Get the top 5 models for coding, as JSON
llmfit recommend --json --use-case coding --limit 5
# See what hardware a specific model would need
llmfit plan "Qwen/Qwen3-30B-A3B" --context 8192
# Show your detected hardware specs
llmfit system
A few things llmfit does that aren’t obvious from the description: it handles MoE architectures correctly (most calculators don’t), selects quantization dynamically based on what actually fits rather than assuming a fixed quant, and has a Community Leaderboard (b key in the TUI) showing real-world tokens/second from other users on the same hardware — useful for sanity-checking its estimates.
One caveat: llmfit’s database is embedded at compile time, so the model list updates when you upgrade the tool (brew upgrade llmfit), not in real time. Newly released models may not appear until the next release.
Ollama — The Easiest Runtime
Ollama is a runtime that handles model downloading, quantization selection, and serving a local API. You pull a model by name and run it — no manual GGUF downloads required.
ollama pull llama3.2:8b
ollama run llama3.2:8b
Ollama automatically chooses a quantization level based on your hardware and serves an API compatible with OpenAI’s format at localhost:11434. llmfit integrates with Ollama — if Ollama is running, you’ll see which models you already have installed in the TUI, and pressing d on any model will pull it directly.
The downside: Ollama’s built-in quantization selection is less granular than manually choosing a GGUF file. If you want fine-grained control (e.g., specifically Q5_K_M rather than whatever Ollama picks), you may want llama.cpp directly.
LM Studio — The GUI Option
LM Studio provides a desktop GUI for searching, downloading, and running models. It’s the lowest barrier to entry if you’re not comfortable with terminals. It shows VRAM usage estimates in the download interface, which helps you avoid downloading models that won’t fit.
LM Studio also exposes a local API server, and llmfit can detect models you’ve installed through it.
llama.cpp — Maximum Control
llama.cpp is the underlying inference engine that most of the above tools use. Running it directly gives you the most control over quantization, context length, GPU layer offloading, and other parameters — at the cost of more setup. Worth knowing about, but not where most beginners should start.
Trade-offs to Know Before You Start
Quality vs. memory: every step down in quantization (Q8 → Q5 → Q4 → Q3) shrinks the model but costs capability. The drop from Q8 to Q4 is usually acceptable. The drop from Q4 to Q3 is where you may start noticing meaningful degradation on complex tasks. Q2 should be a last resort.
Speed vs. quality at the same memory budget: you could run a 7B model at Q8 (~7GB) or a 13B model at Q4_K_M (~8.5GB) in roughly the same memory. The 13B model will often produce better output, but the 7B Q8 model will be faster. The right call depends on your task and how much latency matters to you.
GPU inference vs. CPU inference: running on a GPU is 5–20x faster than CPU-only for most consumer hardware. If inference speed matters to you (real-time chat, interactive coding), prioritize getting the model into VRAM completely. Partial offloading is a reasonable middle ground, not an afterthought.
Context length ambition vs. memory reality: advertising a 128K context window doesn’t mean running at 128K is practical on your hardware. On a 16GB VRAM GPU, a 7B model at full 128K context may consume nearly all available memory, leaving nothing for other applications. In practice, 16K–32K working context is sufficient for most use cases and much cheaper on memory.
Newer isn’t always better for your hardware: frontier models released in 2026 target high-end hardware. A well-tuned 7B model from 2024 may outperform a newer 3B model in quality while running at similar speed. Check benchmarks that match your use case rather than assuming the latest release is the right choice.
Putting It Together: A Quick Checklist
Before downloading anything:
- Know your VRAM or free RAM. Don’t skip this step.
- Run
llmfitor checkllmfit recommend --json --use-case [your task]. Let it tell you what fits. - Prefer Instruct models unless you specifically know you need a base model.
- Default to Q4_K_M unless you have headroom for Q5_K_M or a specific reason to go lower.
- Check for MoE options —
Qwen3-30B-A3Band similar models can give you much more capability per GB of VRAM than dense models. - Cap context to what you’ll actually use — 8K–16K for most chat use cases.
- Start small and work up — a 7B model you can run fluidly is more useful than a 13B model that chokes your machine.
What to Try First
If you have 8GB VRAM or 16GB unified memory (Apple Silicon M-series): start with llama3.2:8b via Ollama or find the Llama-3.2-8B-Instruct-Q5_K_M.gguf on Hugging Face. This model is well-documented, broadly capable, and runs comfortably on that hardware.
If you have 6GB VRAM: try Qwen3-8B at Q4_K_M (~4.8GB) — it’s slightly larger than 7B but fits, and Qwen3 8B is a strong performer for its size. Alternatively, look at Qwen3-30B-A3B-Q4_K_M if you want to experiment with MoE efficiency.
If you have no GPU: Phi-4-mini (3.8B, Q4_K_M, ~2.5GB) or Llama-3.2-3B-Instruct-Q5_K_M are usable at CPU speeds and won’t take all day to respond.
The best model is the one you can iterate with quickly. Start there.