A developer’s guide to Gemma 4 and Google’s open model play
Google DeepMind released Gemma 4 on April 2, 2026 under Apache 2.0. It’s their fourth-generation open model family, and it runs locally with surprisingly little friction. Here are three ways to get it going, depending on what hardware you have in front of you.
Table of contents
- How to run Gemma 4 locally
- What’s in the model family
- Architecture details
- Benchmarks
- Why this matters practically
- Google’s two-track AI strategy
Option 1: On your phone
- Download Google AI Edge Gallery from the Play Store
- Select Gemma 4 E2B or E4B
- It downloads and runs entirely offline
No account, no API key, no internet needed after the initial download. The E2B and E4B models are small (2.3B and 4.5B effective parameters respectively) but they handle text, image, video, and audio input natively.
Option 2: On your laptop
Install Ollama or LM Studio, then pull the model:
ollama pull gemma-4-26b
This is the 26B MoE variant. 25.2B total parameters, but only 3.8B active during inference. It runs on a MacBook with 16GB RAM and gives you a 256K context window.
Option 3: For production and agentic workflows
- Open Google AI Studio
- Select Gemma 4 31B (the dense flagship, 30.7B parameters, 60 layers)
- Use the function-calling API for agentic workflows, or deploy on Vertex AI for production
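To give a feel for the function-calling workflow, here is a minimal tool declaration in the JSON-schema style that most function-calling APIs accept. The `get_weather` function and its fields are illustrative assumptions, not taken from Gemma 4's documentation; check AI Studio's own schema before relying on this shape.

```python
# Hypothetical tool declaration in the common JSON-schema style.
# The model sees this schema and, when appropriate, emits a structured
# call like {"name": "get_weather", "args": {"city": "Oslo"}} for your
# code to execute and feed back.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```

You register a list of such declarations with the API, then loop: send the user message, execute any tool call the model returns, and append the result to the conversation.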
The 26B MoE variant is the sweet spot for most people. Enough capability to be useful, small enough to run on consumer hardware.
What’s in the model family
Gemma 4 is four models, not one. Each targets a different use case:
| Model | Parameters | Active params | Layers | Context | Modalities |
|---|---|---|---|---|---|
| E2B | 5.1B (w/ embeddings) | 2.3B | 35 | 128K | text, image, audio, video |
| E4B | 8B (w/ embeddings) | 4.5B | 42 | 128K | text, image, audio, video |
| 26B A4B (MoE) | 25.2B | 3.8B | 30 | 256K | text, image, video |
| 31B (Dense) | 30.7B | 30.7B | 60 | 256K | text, image |
The E2B and E4B models use Per-Layer Embeddings (PLE) and are the only variants with native audio support. The 26B MoE and 31B dense models trade audio for longer context and stronger reasoning.
Architecture details
All four models are decoder-only transformers with a hybrid attention design. They alternate between local sliding-window attention and global full-context attention in a fixed ratio (5:1 or 4:1 depending on the variant).
The two layer types are structurally different:
- Sliding (local) layers: head_dim=256, more KV heads, standard RoPE with theta=10K, window sizes of 512-1024 tokens
- Full (global) layers: head_dim=512, fewer KV heads, partial RoPE with theta=1M and partial=0.25, K=V weight sharing
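The RoPE settings above have a concrete effect you can compute. A sketch, assuming the usual rotary-embedding frequency formula and assuming partial RoPE rotates the first quarter of head dimensions (the exact slice is an assumption, not confirmed by the source):

```python
import numpy as np

def rope_inv_freq(head_dim, theta, partial=1.0):
    """Inverse frequencies for rotary embeddings.

    Only the first `partial` fraction of head dims is rotated;
    RoPE pairs dimensions, so we get (rotated_dims // 2) frequencies.
    """
    rot_dims = int(head_dim * partial)
    return 1.0 / (theta ** (np.arange(0, rot_dims, 2) / rot_dims))

# Sliding (local) layers: full RoPE with a short-range-friendly base.
local = rope_inv_freq(head_dim=256, theta=10_000)

# Full (global) layers: only a quarter of dims rotated, with a large
# base so positional phase wraps slowly across a 256K context.
global_ = rope_inv_freq(head_dim=512, theta=1_000_000, partial=0.25)

print(len(local), len(global_))    # 128 vs 64 rotary frequencies
print(local.min() > global_.min())  # global layers reach much lower freqs
```

The larger theta stretches the slowest rotation period, which is what lets the global layers discriminate positions hundreds of thousands of tokens apart.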
The final layer is always global attention, so the model gets a full-context pass before generating output.
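The alternating layout can be sketched as below. The exact placement of global layers within each block is an assumption; only the 5:1 ratio and the global final layer come from the source.

```python
def layer_pattern(n_layers, ratio=5):
    """Alternate sliding-window ('L') and global ('G') attention layers
    at ratio local : 1 global, forcing the final layer to global."""
    pattern = ["L" if (i + 1) % (ratio + 1) else "G" for i in range(n_layers)]
    pattern[-1] = "G"  # guarantee a full-context pass before output
    return pattern

p = layer_pattern(30)  # the 26B A4B variant's depth
print("".join(p))
print(p.count("G"))  # 5 global layers among 30
```

At a 5:1 ratio, only one layer in six pays the quadratic full-attention cost; the rest attend over a fixed window, which is where most of the long-context memory savings come from.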
The MoE design
The 26B A4B variant does something unusual. Each layer runs a dense GeGLU feed-forward network in parallel with a 128-expert MoE layer (top-8 routing). The outputs are summed. So you get always-on dense capacity plus sparse expert specialization in every layer. This is why 3.8B active parameters can punch well above their weight.
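A minimal numpy sketch of that parallel dense-plus-MoE block, with toy dimensions (`d_model=64`, `d_ff=128` are illustrative; only the 128 experts, top-8 routing, GeGLU activation, and summed outputs come from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 128, 128, 8

def geglu(x, w_gate, w_up, w_down):
    """GeGLU feed-forward: GELU-gated linear unit."""
    g = x @ w_gate
    gelu = 0.5 * g * (1 + np.tanh(np.sqrt(2 / np.pi) * (g + 0.044715 * g**3)))
    return (gelu * (x @ w_up)) @ w_down

shapes = [(d_model, d_ff), (d_model, d_ff), (d_ff, d_model)]
dense_w = [rng.standard_normal(s) * 0.02 for s in shapes]          # always-on FFN
expert_w = [[rng.standard_normal(s) * 0.02 for s in shapes]
            for _ in range(n_experts)]                             # 128 expert FFNs
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_block(x):
    """Dense GeGLU in parallel with a top-8-of-128 expert MoE; outputs summed."""
    dense_out = geglu(x, *dense_w)
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                    # pick 8 experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    moe_out = sum(w * geglu(x, *expert_w[e]) for w, e in zip(weights, top))
    return dense_out + moe_out                           # always-on + sparse

out = moe_block(rng.standard_normal(d_model))
print(out.shape)  # (64,)
```

Per token, only the router, the dense FFN, and 8 of the 128 expert FFNs run, which is how total parameters can be ~6.6x the active count.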
At 4-bit quantization, capability loss is under 2.8%, which makes it practical to run quantized on consumer GPUs without much degradation.
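The memory arithmetic behind that claim is easy to check. A back-of-the-envelope calculator for weight storage only (KV cache and activations are extra, and real quantization formats carry some scale-factor overhead this ignores):

```python
def vram_gib(params_billion, bits):
    """Approximate weight memory in GiB at the given bit width."""
    return params_billion * 1e9 * bits / 8 / 2**30

for label, p in [("26B A4B (MoE)", 25.2), ("31B dense", 30.7)]:
    print(f"{label}: fp16 {vram_gib(p, 16):.1f} GiB -> "
          f"4-bit {vram_gib(p, 4):.1f} GiB")
```

At 4 bits the 25.2B-parameter MoE needs roughly 11.7 GiB for weights, which is why it squeezes into a 16GB machine while the fp16 version would not.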
Benchmarks
The numbers are a significant jump from Gemma 3. Here’s what the 31B dense model scores:
Reasoning
- MMLU-Pro: 82.4% (+17.6 pts over Gemma 3)
- GPQA Diamond: 68.7% (+41.9 pts over Gemma 3)
- BigBench Extra Hard: 74.4% (+55.1 pts over Gemma 3)
Math
- AIME 2026: 89.2% (+68.4 pts over Gemma 3)
- GSM8K: 94.1%
Code
- LiveCodeBench: 80.0% (+50.9 pts over Gemma 3)
- HumanEval: 78.6%
- Codeforces Elo: 2150 (Gemma 3 scored 110)
Multilingual
- FLORES-200: 89.4% across 200+ languages
- MGSM: 88.7%
The 26B MoE variant scores an Arena AI rating of 1441 with only 3.8B active parameters. The dense 31B hits 1452. For context, the MoE model is competitive with previous-generation 70B+ open models while using about 60% less VRAM.
A built-in thinking mode enables 4,000+ tokens of step-by-step reasoning, which is what drives those AIME math scores.
Why this matters practically
A few things stand out:
It outperforms larger models at a fraction of the compute. The MoE architecture means you’re getting 70B-class performance from a model that fits in 16GB of RAM. That changes who can run capable models and where.
Multimodal out of the box. The smaller models handle text, image, video, and audio. No separate pipelines, no stitching models together. You feed it an image or audio clip and it reasons over it.
256K context on the larger models. Long document analysis, large codebases, multi-turn agent conversations — the context window is big enough that you’re not constantly fighting truncation.
Apache 2.0 with no usage restrictions. Full commercial use. Fine-tune it, deploy it, ship products with it. No “open but actually not” licensing games.
22% lower latency than Gemma 3. Faster inference on top of better results.
Google’s two-track AI strategy
Nobody is framing this correctly.
Track 1: Gemini — closed, paid, cloud-only.
Track 2: Gemma — open, free, runs anywhere.
Both are built from the same research. Gemma 4 literally uses Gemini 3's architecture. That isn't a coincidence; it's the strategy.
Gemini captures enterprise revenue. Gemma captures the entire open-source ecosystem. Developers build on Gemma, get comfortable with Google’s toolchain (AI Studio, Edge Gallery, Firebase, Vertex), and the ones who outgrow local hardware graduate to Gemini.
Apple did something similar with Swift. Give away the language, sell the hardware and services. Google is doing it with the intelligence itself. The monetization layer is everything around it: compute, hosting, API calls, and a developer ecosystem that defaults to Google when it’s time to scale.