quantization - Rushi's

Jun 02, 2026

You’ve seen the numbers. 7B. 70B. 405B. Everyone talks about parameter counts. But what are they? Why does size matter? And what actually happens when you hit “Generate”? This post covers the mechanics: what parameters are, where they live in the model architecture, how scaling affects them, and what that means if you’re running or choosing […]

May 31, 2026

You’ve heard the pitch: run AI privately, offline, on your own hardware, with no API keys, no usage limits, and no data leaving your machine. You open Hugging Face, find a model called Qwen3-30B-A3B-GGUF, download 20 GB, try to run it, and your laptop grinds to a halt or produces nothing at all. The problem isn’t that local […]

May 29, 2026

Sharing local LLM models between Ollama and llama.cpp seems like a niche concern until you’ve burned through tens of GB of disk space on duplicate copies of the same model. The two tools use completely different storage formats by default, but you can configure them to share one file. […]

May 27, 2026

Running LLMs locally has become a normal part of how developers work. Two tools dominate this space: llama.cpp and Ollama. They look like competitors, but the relationship is more direct: Ollama is built on top of llama.cpp. This post covers the technical differences, where each performs better, and when to use one versus the other. […]

Apr 05, 2026

A plain-English reference guide covering the jargon that shows up every time a new language model drops, from parameter counts to quantization methods.
Contents
01 · Architecture & Model Design: Transformer · Dense Model · Mixture of Experts · Active Parameters · Feed-Forward Network · Layers · Hidden Dimension · Attention Heads
02 · Attention Mechanisms: Multi-Head Attention · Multi-Query Attention · Grouped-Query Attention · KV Cache · Sliding Window Attention · RoPE · RoPE Theta
03 · Sizing, Scale & Counting: Parameters · Embedding Parameters · Non-Embedding […]

Ctrl+AI+Ship

Tag: quantization

LLM parameters: what they are and how they actually work

How to Pick the Right Model to Run on Your Local Machine

Sharing Local LLM Models Between Ollama and llama.cpp

Ollama vs. llama.cpp: a technical deep dive for developers

The LLM Vocabulary Sheet