Local RAG Without the Storage Tax: A Hands-On Guide to LEANN

You’ve got 100 GB of PDFs, notes, and exported chat logs you’d love to query with natural language. So you reach for the standard RAG playbook: chunk everything, embed it, store the vectors in FAISS or a hosted vector DB. Then you check the index size and it’s 150–700 GB — larger than the data itself. On a laptop, that’s a non-starter.

This is the exact problem LEANN was built to solve. It’s an approximate nearest neighbor (ANN) index out of UC Berkeley’s Sky Computing Lab that keeps the index under ~5% of your raw data size by not storing most of the embeddings at all. Instead, it recomputes them on the fly during search.

This guide explains how that works, then walks through building a fully local, private RAG pipeline over your own documents — no API keys, nothing leaving your machine. It assumes you’re comfortable with Python and have a working mental model of RAG (chunk → embed → retrieve → generate). It does not cover training embedding models or production multi-tenant deployments; LEANN is aimed at personal-scale, single-device search.

A note on versions: everything below was checked against the public LEANN repository and CLI as of June 2026. The project is moving quickly, so pin versions and check the README if a command behaves differently.

The core idea: store the graph, recompute the vectors
What you’ll build
Step 1: Install prerequisites
Step 2: Create an environment and install LEANN
Step 3: Install a local LLM and embedding model with Ollama
Step 4: Build an index over your documents
Step 5: Search and chat with your data
Step 6 (optional): The Python API
Trade-offs, limitations, and when not to reach for LEANN
Troubleshooting the usual suspects
Where to go from here

The core idea: store the graph, recompute the vectors

Most vector databases store one dense embedding per chunk. With a few million chunks at 768 or 1536 dimensions in float32, that’s where the gigabytes go — and the proximity-graph metadata on top of it often doubles the bill.
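To make that arithmetic concrete, here is a back-of-the-envelope calculation. The 5 million chunk count is an assumed stand-in for "a few million", and the function name is illustrative. It counts only the raw float32 vectors, not any graph metadata.

```python
def embedding_storage_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of a dense embedding matrix in decimal gigabytes (float32 = 4 bytes)."""
    return num_chunks * dims * bytes_per_value / 1e9


# Assumed corpus size: 5 million chunks (illustrative, not a LEANN figure).
for dims in (768, 1536):
    size_gb = embedding_storage_gb(5_000_000, dims)
    print(f"{dims} dims: {size_gb:.2f} GB")
# 768 dims: 15.36 GB
# 1536 dims: 30.72 GB
```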

LEANN makes a different trade. It keeps the graph structure that makes search fast, but throws away the stored embedding vectors. When you run a query, it recomputes the embeddings for only the handful of nodes the search actually visits.

Two techniques make this practical, per the project’s paper and documentation:

Graph-based selective recomputation. A graph traversal for a single query touches a small fraction of all nodes. LEANN only computes embeddings for nodes on that search path, not the whole corpus.
High-degree preserving pruning. Real proximity graphs have a few “hub” nodes that most paths route through. LEANN keeps those hubs and prunes redundant edges elsewhere, shrinking the stored graph (held in a compact CSR format) without wrecking reachability.
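The first idea can be sketched with a toy graph. This is not LEANN's implementation: it uses a plain greedy walk instead of HNSW-style beam search, a made-up graph layout, and a fake one-dimensional "embedding" (node i maps to the number i) so the behavior is easy to follow. All names here are hypothetical. The point it illustrates is that only nodes the walk actually inspects ever get embedded.

```python
NUM_NODES = 10_000


def build_toy_graph(n: int) -> dict[int, list[int]]:
    """Link node i to i +/- 1, 2, 4, ... (mod n) so greedy routing needs few hops."""
    graph = {}
    for i in range(n):
        links = set()
        step = 1
        while step < n:
            links.add((i + step) % n)
            links.add((i - step) % n)
            step *= 2
        links.discard(i)
        graph[i] = sorted(links)
    return graph


class CountingEmbedder:
    """Stand-in for an embedding model; remembers which nodes it has embedded."""

    def __init__(self) -> None:
        self.cache: dict[int, float] = {}

    def embed(self, node: int) -> float:
        if node not in self.cache:
            # A real system would run the model on the chunk text here.
            self.cache[node] = float(node)
        return self.cache[node]


def greedy_search(graph, embedder, query: float, entry: int = 0) -> int:
    """Walk to whichever neighbour is closest to the query until none is closer."""
    current = entry
    current_dist = abs(embedder.embed(current) - query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            dist = abs(embedder.embed(neighbour) - query)  # computed on demand
            if dist < best_dist:
                best, best_dist = neighbour, dist
        if best == current:
            return current
        current, current_dist = best, best_dist


graph = build_toy_graph(NUM_NODES)
embedder = CountingEmbedder()
nearest = greedy_search(graph, embedder, query=4123.4)
print(f"nearest node: {nearest}")
print(f"embedded {len(embedder.cache)} of {NUM_NODES} nodes")
```

In this toy, the stored artifact is just `graph`. The embedder's cache starts empty on every query, which is the trade the section describes: less at rest, more work per search.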

The reported result: an index under 5% of the raw data size, with roughly 97% storage savings relative to a conventional vector index on several datasets, while maintaining around 90% top-3 recall and sub-2-second latency on QA benchmarks. Their published storage comparison looks like this:

Dataset	Traditional DB (e.g. FAISS)	LEANN	Savings
DPR (2.1M)	3.8 GB	324 MB	91%
Wiki (60M)	201 GB	6 GB	97%
Chat (400K)	1.8 GB	64 MB	97%
Email (780K)	2.4 GB	79 MB	97%
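The savings column follows from the two size columns. A quick check, assuming 1 GB = 1000 MB and using only the rounded sizes printed in the table, lands within about a percentage point of each reported figure (the small gaps are expected when recomputing from rounded inputs):

```python
# (dataset, traditional index in MB, LEANN index in MB, reported savings %)
ROWS = [
    ("DPR (2.1M)", 3_800, 324, 91),
    ("Wiki (60M)", 201_000, 6_000, 97),
    ("Chat (400K)", 1_800, 64, 97),
    ("Email (780K)", 2_400, 79, 97),
]


def savings_percent(traditional_mb: float, leann_mb: float) -> float:
    """Storage saved relative to the traditional index, as a percentage."""
    return 100 * (1 - leann_mb / traditional_mb)


for name, traditional, leann, reported in ROWS:
    computed = savings_percent(traditional, leann)
    print(f"{name}: computed {computed:.1f}%, reported {reported}%")
# DPR (2.1M): computed 91.5%, reported 91%
# Wiki (60M): computed 97.0%, reported 97%
# Chat (400K): computed 96.4%, reported 97%
# Email (780K): computed 96.7%, reported 97%
```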

Be clear-eyed about what these numbers are and aren’t. They’re the authors’ own benchmarks on their chosen datasets and hardware, comparing against a full-precision FAISS index. The savings are real, but the honest framing of the trade is: you’re paying compute at query time to save storage at rest. Whether that’s a good deal depends entirely on your workload, which is exactly what the trade-offs section below digs into.

What you’ll build

A local RAG pipeline that indexes a folder of documents and answers questions about them, with both the embedding model and the LLM running on your machine via Ollama. The only things touching the network are the initial model downloads.

If you’d rather use a cloud LLM like OpenAI, LEANN supports that too (set OPENAI_API_KEY and pass --llm openai) — but this guide stays on the fully local path, since that’s where the storage and privacy benefits actually matter.

Step 1: Install prerequisites

LEANN uses uv for environment management. Install it first:

curl -LsSf https://astral.sh/uv/install.sh | sh

Expected checkpoint: uv --version prints a version string. If the command isn’t found, restart your shell so the updated PATH takes effect.

On Linux, LEANN’s native components need a couple of system libraries. Install them before the Python package:

sudo apt-get update && sudo apt-get install -y libomp-dev libboost-all-dev

On macOS, the equivalent is:

brew install libomp boost protobuf zeromq pkgconf

A common mistake here is skipping these and hitting a cryptic build or import error later. If import leann fails with a missing-symbol message, an absent system library is the usual cause.

Step 2: Create an environment and install LEANN

Clone the repo (it carries example apps and sample data you can experiment with), then install LEANN from PyPI into a fresh virtual environment:

git clone https://github.com/yichuan-w/LEANN.git leann
cd leann

uv venv
source .venv/bin/activate
uv pip install leann

On a CPU-only Linux box, install the cpu extra instead — uv pip install "leann[cpu]" — to pull the right native build.

Expected checkpoint:

python -c "import leann; print('ok')"

should print ok. If it does, the native library loaded correctly and you’re past the hardest part of setup.

Step 3: Install a local LLM and embedding model with Ollama

For a fully local pipeline you need two models: one to embed text and one to generate answers. Ollama serves both behind an OpenAI-compatible API.

Install Ollama (download the macOS app from ollama.com, or on Linux run curl -fsSL https://ollama.com/install.sh | sh), then pull a small generation model and an embedding model:

# On Linux, start the server if it isn't already running:
ollama serve &

ollama pull llama3.2:1b        # lightweight LLM, fine for consumer hardware
ollama pull nomic-embed-text   # embedding model

The llama3.2:1b model is deliberately small so it runs on modest hardware. If you have the RAM and patience, a larger model (e.g. llama3.1:8b) will give noticeably better answers — this is a quality-vs-speed decision you get to make, not a fixed default.

Expected checkpoint: ollama list shows both models.
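If you want the same checkpoint from a script, one option is to ask the Ollama server directly. The sketch below assumes the server is on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists installed models. The helper names are hypothetical.

```python
import json
import urllib.request

REQUIRED_MODELS = ["llama3.2:1b", "nomic-embed-text"]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required model names that are absent from an /api/tags response."""
    installed = set()
    for model in tags_payload.get("models", []):
        name = model.get("name", "")
        installed.add(name)
        # Models pulled without a tag are listed as "name:latest"; accept the bare name too.
        if name.endswith(":latest"):
            installed.add(name[: -len(":latest")])
    return [name for name in required if name not in installed]


def check_ollama(base_url: str = "http://localhost:11434") -> None:
    """Print which required models, if any, still need an `ollama pull`."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
            payload = json.load(response)
    except OSError as exc:  # connection refused, timeout, and similar
        print(f"Could not reach Ollama at {base_url}: {exc}")
        return
    missing = missing_models(payload, REQUIRED_MODELS)
    print("All models present" if not missing else f"Missing models: {missing}")
```

Call check_ollama() once the server is running.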

Step 4: Build an index over your documents

LEANN ships a command-line interface that handles loading, chunking, embedding, and index construction in one step. Point it at a folder. It auto-detects common formats (PDF, TXT, MD, DOCX, PPTX, and source-code files).

leann build my-docs \
  --docs ./path/to/your/documents \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text

Here my-docs is the index name you’ll refer to later. A few parameters worth knowing because they directly shape quality and size (all have sensible defaults, so you can omit them at first):

--backend-name hnsw (default) maximizes storage savings via full recomputation. diskann is the alternative — it trades a bit more storage for faster, more scalable search using PQ-compressed traversal with reranking. Start with HNSW.
--chunk-size (default 256) and --chunk-overlap control how documents are split. Larger chunks mean fewer, broader vectors; smaller chunks mean more precise retrieval but a bigger graph.
--graph-degree (default 32) and --build-complexity (default 64) govern index quality vs build time. Higher values build a better graph more slowly.
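To build intuition for how chunk size and overlap interact, here is a minimal sliding-window chunker. It is a sketch, not LEANN's splitter: it counts whitespace-separated words (the real unit may differ), and the overlap values are illustrative rather than LEANN defaults.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of chunk_size that share `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


document = [f"w{i}" for i in range(1000)]  # a made-up 1000-word document
for chunk_size, overlap in [(256, 0), (256, 128), (128, 64)]:
    count = len(chunk_words(document, chunk_size, overlap))
    print(f"chunk_size={chunk_size} overlap={overlap}: {count} chunks")
# chunk_size=256 overlap=0: 4 chunks
# chunk_size=256 overlap=128: 7 chunks
# chunk_size=128 overlap=64: 15 chunks
```

More chunks means more graph nodes, so both smaller chunks and larger overlaps grow the index.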

Expected checkpoint: the command finishes and leann list shows my-docs. Building is the most compute-heavy step because every chunk gets embedded once up front; on a large corpus this can take a while, and that’s expected.

Step 5: Search and chat with your data

Quick sanity check that retrieval works:

leann search my-docs "what does the document say about refund policy"

This returns the top matching chunks — useful for confirming your data was indexed and that semantically relevant passages come back. Then ask a real question, which runs full RAG (retrieve, then generate an answer with the local LLM):

leann ask my-docs "Summarize the refund policy in two sentences" \
  --llm ollama \
  --llm-model llama3.2:1b

Drop the question and add --interactive for a continuous Q&A session (type quit to exit):

leann ask my-docs --interactive --llm ollama --llm-model llama3.2:1b

Expected checkpoint: you get a generated answer grounded in your documents. If answers feel off, jump to troubleshooting below before assuming the index is broken.
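To run a batch of questions rather than typing them one at a time, you could wrap the same command in a short script. This is a sketch: it uses only the flags shown above, the helper names are hypothetical, and it assumes the CLI writes the answer to standard output when run non-interactively.

```python
import shutil
import subprocess


def ask_command(index_name: str, question: str,
                llm: str = "ollama", llm_model: str = "llama3.2:1b") -> list[str]:
    """Assemble the argv for one `leann ask` call, mirroring the command above."""
    return ["leann", "ask", index_name, question, "--llm", llm, "--llm-model", llm_model]


def run_questions(index_name: str, questions: list[str]) -> dict[str, str]:
    """Run each question through the CLI and collect whatever it prints (assumed: the answer)."""
    if shutil.which("leann") is None:
        raise RuntimeError("leann CLI not found on PATH; activate the environment from Step 2")
    answers = {}
    for question in questions:
        completed = subprocess.run(
            ask_command(index_name, question),
            capture_output=True, text=True, check=True,
        )
        answers[question] = completed.stdout.strip()
    return answers


# Example (requires the index built in Step 4):
#   run_questions("my-docs", ["Summarize the refund policy in two sentences"])
```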

Step 6 (optional): The Python API

If you’re embedding LEANN into an application rather than driving it from the shell, the library exposes three classes (LeannBuilder, LeannSearcher, and LeannChat):

from pathlib import Path
from leann import LeannBuilder, LeannSearcher, LeannChat

INDEX_PATH = str(Path("./").resolve() / "my_docs.leann")

# Build
builder = LeannBuilder(backend_name="hnsw")
builder.add_text("LEANN recomputes embeddings at query time instead of storing them.")
builder.add_text("High-degree preserving pruning keeps hub nodes and drops redundant edges.")
builder.build_index(INDEX_PATH)

# Search
searcher = LeannSearcher(INDEX_PATH)
results = searcher.search("how does LEANN save storage", top_k=3)
for result in results:  # one entry per retrieved chunk
    print(result)

# Chat (this example uses a local HuggingFace model)
chat = LeannChat(INDEX_PATH, llm_config={"type": "hf", "model": "Qwen/Qwen3-0.6B"})
answer = chat.ask("How does LEANN reduce storage?", top_k=3)
print(answer)

In a real app you’d loop over your own documents with add_text() (after your own chunking) rather than hand-adding strings, and you’d point llm_config at whichever backend you set up — Ollama, HuggingFace, OpenAI, or Anthropic.
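A minimal sketch of that loop is below. It relies only on the add_text() and build_index() calls shown above. The helper names, the paragraph-based chunking, and the 1000-character cap are illustrative choices, not LEANN's own chunker.

```python
from pathlib import Path


def split_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then hard-wrap any paragraph longer than max_chars."""
    pieces = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        for start in range(0, len(paragraph), max_chars):
            pieces.append(paragraph[start:start + max_chars])
    return pieces


def add_folder(builder, folder: str, patterns=("*.txt", "*.md")) -> int:
    """Feed every chunk of every matching file to builder.add_text(); return the chunk count."""
    count = 0
    for pattern in patterns:
        for path in sorted(Path(folder).rglob(pattern)):
            text = path.read_text(encoding="utf-8", errors="ignore")
            for chunk in split_paragraphs(text):
                builder.add_text(chunk)
                count += 1
    return count


# Usage with LEANN (needs the package from Step 2):
#   from leann import LeannBuilder
#   builder = LeannBuilder(backend_name="hnsw")
#   add_folder(builder, "./notes")
#   builder.build_index("./notes.leann")
```

Because add_folder only needs an object with an add_text() method, you can exercise it against a stub before pointing it at a real builder.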

Trade-offs, limitations, and when not to reach for LEANN

LEANN’s whole design is built around one deliberate trade-off.

You pay for storage savings with query-time compute. Recomputing embeddings on the fly means every search runs the embedding model over the nodes on its path. On a machine with a capable GPU this is fast; on a CPU-only laptop with a heavier embedding model, latency climbs. If your access pattern is read-heavy and latency-sensitive — say, a high-QPS service — a traditional index that stores vectors and just does a lookup may serve you far better. LEANN is explicitly optimized for resource-constrained personal devices, not throughput-bound servers.
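Since the cost shows up as latency, it is worth timing searches on your own hardware before committing. The helper below is a generic sketch (the name and the percentile choice are mine): it times any callable, so you can pass it a wrapper around whatever search call you use.

```python
import statistics
import time


def measure_latency(search_fn, queries: list[str], repeats: int = 3) -> dict:
    """Time search_fn over a non-empty list of queries; report median and approximate p95 seconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "n": len(samples),
    }


# Example with the Python API from Step 6:
#   searcher = LeannSearcher(INDEX_PATH)
#   print(measure_latency(lambda q: searcher.search(q, top_k=3), my_queries))
```

The first query often includes model loading, so compare the median and p95 rather than a single run.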

It’s built for personal scale and local use. The design target is one device, one user, your own data. It’s not a distributed vector store with replication, sharding, or concurrent multi-tenant writes. If you need those things, Milvus, Qdrant, Weaviate, or pgvector are the right tools.

The benchmark numbers are the authors’ own. The 97% savings and 90% recall figures come from the project’s evaluation on specific datasets and hardware. They’re a credible signal, not a guarantee for your corpus. Recall in particular depends on your data distribution, chunking, and the search-complexity setting. If accuracy is critical, measure recall on your own queries rather than trusting the headline number — LEANN ships evaluation scripts for exactly this.
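If you want a quick number before setting up the project's scripts, recall at k is simple to compute by hand. The sketch below assumes you have labeled, for a few queries, which chunk IDs count as relevant. The function names and sample data are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(retrieved: dict, relevant: dict, k: int) -> float:
    """Average recall@k over every query that has ground-truth labels."""
    scores = [recall_at_k(retrieved.get(q, []), ids, k) for q, ids in relevant.items()]
    return sum(scores) / len(scores)


# Made-up example: chunk IDs returned by search vs. IDs you judged relevant.
retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "d", "e"]}
relevant = {"q1": ["a", "c"], "q2": ["d", "z"]}
print(mean_recall_at_k(retrieved, relevant, k=3))  # 0.75
```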

Build time is front-loaded. You embed everything once at index-construction time, so building over a large corpus isn’t instant. The payoff is at rest (tiny index) and the cost is up front (a slow-ish build) plus per-query recomputation.

Local answer quality is bounded by your local LLM. A fully private setup with a 1B-parameter model will not match GPT-4-class output. That’s a property of the model you chose to keep things local, not of LEANN’s retrieval — but it’s the most common source of “the answers aren’t great” disappointment, so set expectations accordingly.

When LEANN is a strong fit: personal knowledge bases, searching your own documents/email/chat history, privacy-sensitive data you don’t want leaving the device, and any situation where storage is the binding constraint. When it isn’t: latency-critical high-QPS services, large multi-user deployments, or cases where you have storage to spare and want the simplest possible lookup path.

Troubleshooting the usual suspects

import leann fails right after install. Almost always a missing system dependency — revisit Step 1 (libomp/boost on Linux, the Homebrew equivalents on macOS).
Embedding step is painfully slow. You’re likely on CPU. Use a smaller embedding model, or set LEANN_EMBEDDING_DEVICE=cuda:0 if you have a GPU.
Irrelevant chunks coming back. Tune chunking (--chunk-size, --chunk-overlap) and raise --search-complexity (default 32) to traverse more of the graph for higher recall at some latency cost.
Good chunks, bad answers. That’s a generation problem, not a retrieval one. Try a larger --llm-model or raise --top-k so the LLM sees more context.

Where to go from here

You now have a private RAG system whose index is a small fraction of the data it searches, running entirely on your own hardware. Whether LEANN is the right tool comes down to the trade-off baked into its architecture: are you storage-constrained and willing to spend compute at query time to fix that? If yes (personal devices, private data, big corpora on small disks), it fits well. If your bottleneck is query latency at scale, a conventional vector store is the better call, and that’s not a knock on either.

Concrete next steps: run LEANN’s own evaluation scripts against your data to measure recall on queries you actually care about; compare the HNSW and DiskANN backends on your hardware; and if you live in an editor, look at the LEANN MCP server, which wires this index into agentic coding tools for local semantic code search.

“Don’t try to find the best design in software architecture, instead, strive for the least worst combination of trade-offs.” - Anon
