Google DeepMind released Gemma 4 on April 2, 2026 under Apache 2.0. It’s their fourth-generation open model family, and it runs locally with surprisingly little friction. Here are three ways to get it going, depending on what hardware you have in front of you.

Option 1: On your phone

  1. Download Google AI Edge Gallery from the Play Store
  2. Select Gemma 4 E2B or E4B
  3. It downloads and runs entirely offline

No account, no API key, no internet needed after the initial download. The E2B and E4B models are small (2.3B and 4.5B effective parameters respectively) but they handle text, image, video, and audio input natively.

Option 2: On your laptop

Install Ollama or LM Studio, then pull the model:

ollama pull gemma-4-27b

This pulls the 26B A4B MoE variant: 25.2B total parameters, but only 3.8B active during inference. It runs on a MacBook with 16GB of RAM and gives you a 256K context window.
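
Once the model is pulled, you can talk to it through Ollama's local REST API (POST to `http://localhost:11434/api/generate`). A minimal sketch in Python that builds the request body without sending it; the model tag matches the pull command above, and the `num_ctx` value is an illustrative choice, not a requirement:

```python
import json

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "gemma-4-27b",
    "prompt": "Summarize the Gemma 4 model family in one sentence.",
    "stream": False,                # single JSON response instead of a token stream
    "options": {"num_ctx": 8192},   # context window for this call (illustrative)
}
body = json.dumps(payload)

# To actually send it (requires a running Ollama server):
#   curl http://localhost:11434/api/generate -d @- <<< "$body"
```

Setting `"stream": False` is the easy path for scripts; the default streaming mode returns one JSON object per generated token.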

Option 3: For production and agentic workflows

  1. Open Google AI Studio
  2. Select Gemma 4 31B (the dense flagship, 30.7B parameters, 60 layers)
  3. Use the function-calling API for agentic workflows, or deploy on Vertex AI for production
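
Function calling works by handing the model a JSON-Schema description of each tool it may invoke. A hedged sketch of one declaration: the `get_weather` tool and its parameters are invented for illustration and are not part of any shipped Gemma 4 toolset.

```python
# A tool declaration in the JSON-Schema style used by function-calling APIs.
# Everything here (name, fields) is a hypothetical example.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```

The model never executes the function itself; it emits a structured call (name plus arguments), your code runs it, and you feed the result back as the next turn.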

The 26B MoE variant is the sweet spot for most people. Enough capability to be useful, small enough to run on consumer hardware.

What’s in the model family

Gemma 4 is four models, not one. Each targets a different use case:

Model           Parameters            Active params   Layers   Context   Modalities
E2B             5.1B (w/ embeddings)  2.3B            35       128K      text, image, audio, video
E4B             8B (w/ embeddings)    4.5B            42       128K      text, image, audio, video
26B A4B (MoE)   25.2B                 3.8B            30       256K      text, image, video
31B (Dense)     30.7B                 30.7B           60       256K      text, image

The E2B and E4B models use Per-Layer Embeddings (PLE) and are the only variants with native audio support. The 26B MoE and 31B dense models trade audio for longer context and stronger reasoning.

Architecture details

All four models are decoder-only transformers with a hybrid attention design. They alternate between local sliding-window attention and global full-context attention in a fixed ratio (5:1 or 4:1 depending on the variant).

The two layer types are structurally different:

  • Sliding (local) layers: head_dim=256, more KV heads, standard RoPE with theta=10K, window sizes of 512–1024 tokens
  • Full (global) layers: head_dim=512, fewer KV heads, partial RoPE with theta=1M and partial=0.25, K=V weight sharing

The final layer is always global attention, so the model gets a full-context pass before generating output.
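
The interleaving above can be sketched as a simple layer map. This is an illustrative reconstruction of the described ratio, not Google's actual layer assignment:

```python
def layer_pattern(n_layers: int, ratio: int) -> list:
    """Assign 'local' or 'global' to each layer: `ratio` sliding-window
    layers, then one global layer, repeating. The final layer is forced
    to global so generation starts from a full-context pass."""
    pattern = []
    for i in range(n_layers):
        # every (ratio+1)-th layer is global, the rest use sliding-window attention
        pattern.append("global" if (i + 1) % (ratio + 1) == 0 else "local")
    pattern[-1] = "global"  # final layer always attends over the full context
    return pattern

# e.g. the 30-layer MoE variant at a 5:1 ratio:
#   layer_pattern(30, 5) -> 25 local layers, 5 global layers
```

At 5:1, only one layer in six pays the full-context KV-cache cost, which is a large part of how these models keep long-context memory use down.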

The MoE design

The 26B A4B variant does something unusual. Each layer runs a dense GeGLU feed-forward network in parallel with a 128-expert MoE layer (top-8 routing). The outputs are summed. So you get always-on dense capacity plus sparse expert specialization in every layer. This is why 3.8B active parameters can punch well above their weight.
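
The parallel dense-plus-sparse block can be sketched in miniature. The real expert count (128) and routing width (top-8) are kept, but each "expert" and the dense branch are reduced to scalar gains so the routing logic stays visible; none of the constants below come from the actual model weights:

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8
random.seed(0)
# Stand-in experts: a scalar gain per expert instead of a full FFN.
expert_gain = [random.uniform(0.5, 1.5) for _ in range(NUM_EXPERTS)]

def moe_block(x, router_logits):
    dense_out = 0.5 * x  # always-on dense GeGLU branch (stand-in)
    # route to the TOP_K experts with the highest router logits
    top = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    # softmax over just the selected experts' logits
    m = max(router_logits[i] for i in top)
    w = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(w.values())
    sparse_out = sum((w[i] / z) * expert_gain[i] * x for i in top)
    return dense_out + sparse_out  # the two branches are summed
```

Every token pays for the dense branch plus 8 of 128 experts, which is how the active-parameter count stays at 3.8B while total capacity is 25.2B.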

At 4-bit quantization, capability loss is under 2.8%, which makes it practical to run quantized on consumer GPUs without much degradation.
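
The memory arithmetic behind that claim is straightforward; the 10% overhead factor below is my assumption (for embeddings and quantization scales kept at higher precision), not a published figure:

```python
def quantized_size_gb(params_billion: float, bits: int, overhead: float = 1.1) -> float:
    """Rough weight-memory estimate: params * (bits / 8) bytes,
    plus ~10% overhead (assumed, not a published number)."""
    return params_billion * 1e9 * bits / 8 / 1e9 * overhead

# 26B MoE variant at 4-bit: all 25.2B weights must be resident in memory,
# even though only 3.8B are active per token.
print(round(quantized_size_gb(25.2, 4), 1))  # prints 13.9 -> fits in 16GB RAM
```

Note that MoE sparsity saves compute per token, not weight memory: every expert has to be loaded, so the 16GB figure is governed by total parameters.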

Benchmarks

The numbers are a significant jump from Gemma 3. Here’s what the 31B dense model scores:

Reasoning

  • MMLU-Pro: 82.4% (+17.6 pts over Gemma 3)
  • GPQA Diamond: 68.7% (+41.9 pts over Gemma 3)
  • BigBench Extra Hard: 74.4% (+55.1 pts over Gemma 3)

Math

  • AIME 2026: 89.2% (+68.4 pts over Gemma 3)
  • GSM8K: 94.1%

Code

  • LiveCodeBench: 80.0% (+50.9 pts over Gemma 3)
  • HumanEval: 78.6%
  • Codeforces ELO: 2150 (Gemma 3 scored 110)

Multilingual

  • FLORES-200: 89.4% across 200+ languages
  • MGSM: 88.7%

The 26B MoE variant scores an Arena AI rating of 1441 with only 3.8B active parameters. The dense 31B hits 1452. For context, the MoE model is competitive with previous-generation 70B+ open models while using about 60% less VRAM.

A built-in thinking mode enables 4,000+ tokens of step-by-step reasoning, which is what drives those AIME math scores.

Why this matters practically

A few things stand out:

It outperforms larger models at a fraction of the compute. The MoE architecture means you’re getting 70B-class performance from a model that fits in 16GB of RAM. That changes who can run capable models and where.

Multimodal out of the box. The smaller models handle text, image, video, and audio. No separate pipelines, no stitching models together. You feed it an image or audio clip and it reasons over it.

256K context on the larger models. Long document analysis, large codebases, multi-turn agent conversations — the context window is big enough that you’re not constantly fighting truncation.

Apache 2.0 with no usage restrictions. Full commercial use. Fine-tune it, deploy it, ship products with it. No “open but actually not” licensing games.

22% lower latency than Gemma 3. Faster inference on top of better results.

Google’s two-track AI strategy

Here's the framing most coverage misses.

Track 1: Gemini — closed, paid, cloud-only.

Track 2: Gemma — open, free, runs anywhere.

Both are built from the same research; Gemma 4 shares Gemini 3's architecture. That isn't a coincidence, it's the strategy.

Gemini captures enterprise revenue. Gemma captures the entire open-source ecosystem. Developers build on Gemma, get comfortable with Google’s toolchain (AI Studio, Edge Gallery, Firebase, Vertex), and the ones who outgrow local hardware graduate to Gemini.

Apple did something similar with Swift. Give away the language, sell the hardware and services. Google is doing it with the intelligence itself. The monetization layer is everything around it: compute, hosting, API calls, and a developer ecosystem that defaults to Google when it’s time to scale.

“All this knowledge and data at your fingertips, what do you do with it?” – Rushi
