What Is Google Gemma 4?
Google released Gemma 4 on April 2, 2026: a family of four open-weight models built on the same research that powers Gemini 3. Google describes them as "byte for byte, the most capable open models" it has ever released.
The family includes four sizes:
| Model | Effective Parameters | Total Parameters | Context Window | Type |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | Dense |
| E4B | 4.5B | 8B | 128K | Dense |
| 26B A4B | 3.8B active | 26B total | 256K | Mixture of Experts |
| 31B | 31B | 31B | 256K | Dense |
Every model is released under the Apache 2.0 license, a first for Google, which previously shipped Gemma under a custom license. Apache 2.0 means genuinely open: commercial use, modification, and redistribution are all permitted, subject only to standard attribution and notice requirements.
What is new compared to Gemma 3:
- Apache 2.0 license (replacing the restrictive custom Gemma license)
- Four model sizes, from the edge-focused E2B to the 31B flagship
- Native audio processing on E2B and E4B models
- Per-Layer Embeddings (PLE) — a new architecture innovation
- Shared KV Cache for reduced memory usage
- Codeforces ELO jumped from 110 to 2150, a dramatic leap in competitive coding
- AIME 2026 math score jumped from 20.8% to 89.2%
- Up to 4x faster inference and 60% less battery consumption
- Context window increased to 256K on the 26B and 31B models (E2B and E4B stay at 128K, matching Gemma 3)
Benchmarks and Performance
Gemma 4 delivers exceptional benchmark results across reasoning, coding, vision, and multilingual tasks.
Reasoning and Knowledge:
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 4 E4B | Gemma 3 27B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 42.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 19.3% |
| MMMLU (multilingual) | 88.4% | 86.3% | 76.6% | 70.7% |
Coding Performance:
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 4 E4B | Gemma 4 E2B |
|---|---|---|---|---|
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 |
The Codeforces jump from 110 (Gemma 3) to 2150 (Gemma 4 31B) is among the largest generation-over-generation coding gains of any open model family.
Arena Rankings:
- Gemma 4 31B ranks #3 on the LMArena text leaderboard with an estimated 1452 ELO
- Gemma 4 26B A4B ranks #6 with 1441 ELO — remarkable given it only activates 3.8B parameters per inference
Vision Benchmarks:
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 3 27B |
|---|---|---|---|
| MMMU Pro | 76.9% | 73.8% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 46.0% |
Long Context:
- MRCR v2 8-needle at 128K: 66.4% (31B) vs 13.5% (Gemma 3 27B), nearly a 5x improvement in long-context retrieval accuracy
Gemma 4 vs Other Open Models
Here is how Gemma 4 compares to the other leading open models in 2026:
| Feature | Gemma 4 31B | Llama 4 Maverick | Qwen 3 235B | Mistral Small 4 |
|---|---|---|---|---|
| License | Apache 2.0 | Llama 4 Community | Apache 2.0 | Apache 2.0 |
| Total Parameters | 31B | 400B (17B active) | 235B (22B active) | 24B |
| Context Window | 256K | 1M | 128K | 128K |
| Modalities | Text, Image, Video | Text, Image | Text, Image | Text |
| Arena ELO | ~1452 | ~1417 | ~1440 | ~1380 |
| Coding (Codeforces) | 2150 | ~1600 | ~1800 | ~1200 |
| MMLU Pro | 85.2% | ~82% | ~84% | ~75% |
| Edge/Mobile | Yes (E2B/E4B) | No | No | Limited |
| Audio Input | Yes (E2B/E4B) | No | No | No |
Key takeaways:
- Gemma 4 offers the best parameter efficiency: the 26B MoE model matches models with 10-20x its active parameter count
- Llama 4 Maverick has the longest context (1M tokens) but requires vastly more compute
- Gemma 4 is the only model with native audio processing in smaller variants
- Apache 2.0 license makes Gemma 4 and Qwen 3 the most permissive options
- For edge/mobile deployment, Gemma 4 E2B and E4B are unmatched — no other frontier model runs well on phones
How to Download and Run Gemma 4 Locally
Gemma 4 can be run locally using several methods. Here is a step-by-step guide for each.
Using Ollama (Easiest Method)
Ollama is the simplest way to run Gemma 4 locally. You need Ollama v0.20.0 or later.
```bash
# Install or update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the default model (E4B)
ollama run gemma4

# Run specific model sizes
ollama run gemma4:e2b   # Smallest, fastest
ollama run gemma4:e4b   # Best balance
ollama run gemma4:26b   # MoE, surprisingly fast
ollama run gemma4:31b   # Most capable
```

Download sizes: E2B (7.2 GB), E4B (9.6 GB), 26B (18 GB), 31B (20 GB)
Which model to choose:
- E2B — Best for phones, Raspberry Pi, or machines with less than 8GB RAM
- E4B — Recommended for most users. Beats Gemma 3 27B in benchmarks while being 6x smaller
- 26B MoE — Great if you have 12+ GB VRAM. Only 3.8B parameters activate per token, so it is fast
- 31B — Maximum capability. Needs a beefy GPU (16+ GB VRAM)
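Beyond the interactive CLI, Ollama exposes a local REST API on port 11434, so any of the models above can be scripted. Here is a minimal sketch using only the Python standard library; the endpoint and response shape follow Ollama's documented `/api/generate` API, and the `gemma4:e4b` tag assumes you pulled that model as shown above:

```python
import json
import urllib.request

OLLAMA_URL = 'http://localhost:11434/api/generate'

def build_request(model: str, prompt: str) -> dict:
    # stream=False makes Ollama return one complete JSON object
    return {'model': model, 'prompt': prompt, 'stream': False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode('utf-8')
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())['response']

# Example (requires Ollama running locally):
#   print(generate('gemma4:e4b', 'Explain KV caching in one sentence.'))
```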
Running Gemma 4 with Hugging Face and Python
Using Hugging Face Transformers
```python
from transformers import pipeline

# Text generation
pipe = pipeline('text-generation', model='google/gemma-4-e4b-it')
result = pipe('Write a Python function to sort a list:')
print(result[0]['generated_text'])
```

With multimodal input (image + text):
```python
from transformers import pipeline

# The image-text-to-text pipeline handles interleaved image + text input
pipe = pipeline('image-text-to-text', model='google/gemma-4-e4b-it')
messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': 'https://example.com/photo.jpg'},
        {'type': 'text', 'text': 'Describe this image'}
    ]
}]
output = pipe(messages, max_new_tokens=200)
```

With quantization for lower VRAM:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-4-31b-it',
    quantization_config=quant_config,
    device_map='auto'
)
```

Using llama.cpp (GGUF)
```bash
# Download and run directly from Hugging Face
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF

# This starts an OpenAI-compatible API on localhost:8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

GGUF files are available on Hugging Face under ggml-org/gemma-4-*-GGUF and unsloth/gemma-4-*-GGUF repositories with multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0).
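Since llama-server speaks the OpenAI chat-completions protocol, you can also call it from Python instead of curl. A minimal standard-library sketch; the URL and response shape assume the default server settings shown above:

```python
import json
import urllib.request

def chat_payload(messages, max_tokens=256):
    # Request body follows the OpenAI /v1/chat/completions schema
    return {'messages': messages, 'max_tokens': max_tokens}

def chat(messages, url='http://localhost:8080/v1/chat/completions'):
    data = json.dumps(chat_payload(messages)).encode('utf-8')
    req = urllib.request.Request(url, data=data,
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body['choices'][0]['message']['content']

# Example (requires llama-server running locally):
#   print(chat([{'role': 'user', 'content': 'Hello'}]))
```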
Using vLLM (Production Serving)
```python
from vllm import LLM, SamplingParams

llm = LLM(model='google/gemma-4-31b-it')
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(['Write a REST API in Python:'], params)
print(outputs[0].outputs[0].text)
```

vLLM provides high-throughput serving with continuous batching, making it ideal for production deployments.
Using Gemma 4 with OpenClaw and Coding Agents
OpenClaw is an open-source AI coding agent that works with local LLMs. Gemma 4 has native function calling and structured JSON output, making it an excellent backend for agent frameworks.
All Gemma 4 models are verified compatible with:
- OpenClaw — open-source coding agent
- Hermes — agent framework
- Pi — local AI agent
Setting Up Gemma 4 with OpenClaw
1. First, run Gemma 4 locally with Ollama:

```bash
ollama run gemma4:26b
```

2. Point OpenClaw to your local Ollama endpoint:

```json
{
  "model": "gemma4:26b",
  "api_base": "http://localhost:11434/v1",
  "context_length": 256000
}
```

3. OpenClaw can now use Gemma 4 for:
- Code generation and completion
- File editing and refactoring
- Running terminal commands
- Multi-step agentic workflows
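Under the hood, agent frameworks drive these workflows through Gemma 4's function calling. Here is a hedged sketch of registering a tool against Ollama's /api/chat endpoint; the `run_command` tool is a made-up illustration, not part of OpenClaw:

```python
import json
import urllib.request

# A hypothetical tool definition in the standard JSON-schema tools format
RUN_COMMAND_TOOL = {
    'type': 'function',
    'function': {
        'name': 'run_command',
        'description': 'Run a shell command and return its output',
        'parameters': {
            'type': 'object',
            'properties': {
                'command': {'type': 'string', 'description': 'Command to execute'}
            },
            'required': ['command'],
        },
    },
}

def build_chat_request(model, user_message, tools):
    # Tools are passed alongside the messages; the model decides when to call them
    return {
        'model': model,
        'messages': [{'role': 'user', 'content': user_message}],
        'tools': tools,
        'stream': False,
    }

def call_ollama(payload, url='http://localhost:11434/api/chat'):
    data = json.dumps(payload).encode('utf-8')
    req = urllib.request.Request(url, data=data,
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires Ollama running locally):
#   reply = call_ollama(build_chat_request('gemma4:26b', 'List files in /tmp',
#                                          [RUN_COMMAND_TOOL]))
#   print(reply['message'].get('tool_calls'))
```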
Performance Expectations
- Gemma 4 31B is highly capable for coding: a Codeforces ELO of 2150 sits in the Master band (2100+), above most human competitive programmers
- Gemma 4 26B MoE offers the best speed/quality ratio for coding agents since only 3.8B parameters activate per token
- E4B works surprisingly well for simple code tasks and runs fast on consumer hardware
- For complex multi-file refactoring, the 31B model is recommended
Tips for Coding with Gemma 4
- Enable thinking mode for complex problems (Gemma 4 supports chain-of-thought reasoning)
- Use the 256K context window to provide full project context
- The native function calling makes tool use reliable without custom prompting
- For latency-sensitive tasks, the 26B MoE model is faster than the 31B dense model
Hardware Requirements
Here is what you need to run each Gemma 4 model:
| Model | VRAM (Quantized) | VRAM (FP16) | RAM (CPU Only) | Download Size | Best For |
|---|---|---|---|---|---|
| E2B | 4-6 GB | ~10 GB | 8 GB | 7.2 GB | Mobile, Raspberry Pi, low-end laptops |
| E4B | 6-8 GB | ~16 GB | 16 GB | 9.6 GB | Consumer GPUs (RTX 3060+), Apple M1+ |
| 26B MoE | 8-12 GB | ~52 GB | 32 GB | 18 GB | Mid-range GPUs (RTX 3090, RTX 4070 Ti) |
| 31B Dense | 16-20 GB | ~62 GB | 64 GB | 20 GB | High-end GPUs (RTX 4090, A100) |
Recommended GPU per model:
- E2B: Any GPU with 4+ GB VRAM, or CPU-only on 8+ GB RAM
- E4B: NVIDIA RTX 3060 12GB, RTX 4060, Apple M1/M2 with 16GB unified memory
- 26B MoE: NVIDIA RTX 3090 24GB, RTX 4070 Ti Super 16GB, Apple M2 Pro/Max
- 31B Dense: NVIDIA RTX 4090 24GB, A100 40GB, Apple M2 Max/Ultra with 64GB+
Apple Silicon note: Gemma 4 works excellently on Apple Silicon with MLX. The unified memory architecture means M2 Max with 32GB can run the 26B MoE at good speeds. Use the MLX-optimized models from mlx-community on Hugging Face.
Quantization matters: a 4-bit quantized 31B model uses roughly a quarter of the memory of FP16, with minimal quality loss. Start with Q4_K_M for the best balance of quality and memory.
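The memory numbers above follow from simple arithmetic: weight memory is roughly parameters times bits per weight divided by 8, before activation and KV-cache overhead. A quick sanity-check helper (the 4.5 bits/weight figure for Q4_K_M is an approximation):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: params * bits-per-weight / 8 bytes."""
    return round(params_billion * 1e9 * bits_per_weight / 8 / 1e9, 1)

# 31B dense at FP16 vs ~4-bit quantization: roughly a 4x reduction
print(weight_memory_gb(31, 16))   # 62.0 GB at FP16
print(weight_memory_gb(31, 4.5))  # 17.4 GB with a Q4_K_M-class quant
```

This matches the table: FP16 needs a multi-GPU or high-memory setup, while a Q4 quant of the 31B model fits a 24 GB consumer card.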
Use Cases for Developers
Gemma 4 opens up powerful use cases that were previously limited to cloud-only models:
- Local AI coding assistant — Run Gemma 4 with OpenClaw or Hermes for a private, zero-cost alternative to GitHub Copilot or Claude. No API keys, no data leaving your machine, no monthly subscription.
- Code review and refactoring — Feed entire files or PRs into the 256K context window for comprehensive reviews.
- Documentation generation — Generate README files, API docs, JSDoc comments, and architectural documentation from code.
- Test case generation — Describe behavior and let Gemma 4 write unit tests, integration tests, and edge case coverage.
- Summarizing PRs and issues — The long context window can digest entire PR diffs and provide summaries for team standup or review.
- Running in CI/CD pipelines — Use vLLM to serve Gemma 4 in your CI pipeline for automated code review, PR summaries, or commit message generation.
- Building AI-powered features — Embed Gemma 4 in your application for features like smart search, content generation, or data extraction without paying per-token API costs.
- Multimodal applications — Process images, screenshots, and video alongside text. Generate code from UI mockups. Extract data from documents with OCR.
- Edge deployment — Run E2B or E4B on mobile devices, IoT hardware, or embedded systems for offline AI capabilities.
Limitations and Known Issues
While Gemma 4 is impressive, it has limitations developers should be aware of:
- Not a frontier model — For the most demanding tasks (novel research, complex multi-step planning), cloud models like Claude, GPT-4, and Gemini still outperform Gemma 4 31B.
- Audio only on smaller models — Audio input is only available on E2B and E4B, not on the 26B MoE or 31B dense models.
- MoE memory footprint — While the 26B MoE model only activates 3.8B parameters, you still need to load all 26B parameters into memory. The active parameter count helps speed, not total memory usage.
- Chinese open-source competition — As some reviewers note, Chinese models like Qwen 3.5 and DeepSeek V3 compete closely with Gemma 4 on certain benchmarks, especially at larger parameter counts.
- Fine-tuning complexity — While fine-tuning is supported via TRL and Unsloth, MoE models are harder to fine-tune than dense models and require more expertise.
- Hallucination risk — Like all LLMs, Gemma 4 can generate plausible-sounding but incorrect information. Always verify critical outputs, especially in coding and factual tasks.
- Long context performance degrades — While the 256K context window is impressive, performance on retrieval tasks drops at the far end of the context. The 128K needle-in-haystack benchmark shows 66.4% accuracy for 31B — good but not perfect.
Start Building with Gemma 4
Gemma 4 represents a major leap for open-source AI — Apache 2.0 licensing, frontier-class performance at small sizes, and genuine edge deployment capability. Whether you are building a local coding assistant, embedding AI in a mobile app, or serving models in production, Gemma 4 has a variant that fits.
At DevPik, we build tools developers love — all running locally in your browser, just like Gemma 4 runs locally on your machine. Check out our 30+ free developer tools including JSON tools, Code Share, and Regex Tester.




