DiffusionGemma: Google's Open Diffusion LLM (2026)

Published June 12, 2026

DiffusionGemma is Google's open-weight text diffusion model, released on June 10, 2026 under the Apache 2.0 license. Instead of writing one token at a time like a conventional LLM, it generates blocks of tokens in parallel by denoising them, which Google says makes generation up to roughly 4x faster. This post covers what it is, how it differs from a conventional model and from Gemini Diffusion, the speed-versus-quality trade-off, and how to run it. If you are picking a model to run locally, see Best Local LLM 2026 and LM Studio vs Ollama.

What is DiffusionGemma?

A normal LLM is autoregressive: it predicts the next token, appends it, and repeats, strictly left to right. DiffusionGemma is a text diffusion model: it starts from noise and denoises a block of tokens in parallel, refining the whole block over a handful of steps instead of emitting one token at a time. Google's implementation works on blocks of 256 tokens with a small number of denoising steps.
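To make the difference concrete, here is a back-of-the-envelope sketch that counts forward passes for each approach. The 256-token block size comes from the description above; the default of 8 denoising steps per block is purely an assumption for illustration, since Google only says "a small number." The function names are made up for this example.

```python
import math

def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return num_tokens

def block_diffusion_passes(num_tokens: int, block_size: int = 256,
                           steps_per_block: int = 8) -> int:
    """A block diffusion model refines each block over a fixed number of passes.

    steps_per_block=8 is an assumed value, not a published figure.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block

n = 1024
print(autoregressive_passes(n))   # 1024 passes, one per token
print(block_diffusion_passes(n))  # 4 blocks * 8 steps = 32 passes
```

Fewer passes does not translate one-to-one into wall-clock speed, because each diffusion pass processes a whole block rather than a single token. It does show where the parallelism comes from.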

The practical upshot of generating in parallel is speed, and because the model can attend to a whole block at once (bidirectionally), it is well suited to tasks that are not strictly left-to-right, like filling in the middle of code, in-line editing, and some math. It is built on the Gemma 4 family and draws on Google's earlier Gemini Diffusion research.
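One way to picture "attending to a whole block at once" is to compare attention masks. The sketch below builds a standard causal mask and a block-wise bidirectional mask for a toy sequence. Treat the second mask as an assumption: the article only says attention is bidirectional within a block, and how DiffusionGemma handles attention across blocks is not specified here. The tiny block size of 2 is just for readability.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Position i may attend only to positions j <= i (left to right)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def block_bidirectional_mask(n: int, block_size: int) -> list[list[int]]:
    """Position i may attend to every position in its own block and in
    earlier blocks. This layout is an illustrative assumption."""
    return [
        [1 if j // block_size <= i // block_size else 0 for j in range(n)]
        for i in range(n)
    ]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

for row in block_bidirectional_mask(4, block_size=2):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 1]
# [1, 1, 1, 1]
```

In the second mask, the first token of each block can see the token after it. That is what makes infilling and in-line edits a natural fit: a position can be conditioned on text on both sides.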

DiffusionGemma vs Gemini Diffusion

These are two different things and people are already conflating them:

  • Gemini Diffusion is Google's closed, frontier diffusion research model, first shown at Google I/O in May 2025. It is a demo only: the weights are not open and there is nothing to download.
  • DiffusionGemma is the open-weight Gemma variant that just launched (June 10, 2026), Apache 2.0, weights public on Hugging Face. This is the one you can actually run.

If someone says "Google's diffusion model is out," they mean DiffusionGemma.

How fast is it, and what's the catch?

The headline is speed. Google reports DiffusionGemma generating over 1,000 tokens/second on a single H100 and 700+ tokens/second on an RTX 5090, and frames it as up to about 4x faster than the comparable standard Gemma model. Independent reporting put the speedup lower, closer to ~2x against the 12B model, so treat the exact multiplier as approximate.
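To see what those numbers mean in practice, here is a small arithmetic sketch. It takes the reported 1,000 tokens/second H100 figure and divides it by the two quoted speedups to get an implied baseline. Those baselines are derived by division, not measured, and the 2,000-token response length is an arbitrary example.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time for a response at a steady throughput."""
    return num_tokens / tokens_per_second

def implied_baseline_tps(diffusion_tps: float, speedup: float) -> float:
    """Throughput the comparison model would need for the quoted speedup to hold."""
    return diffusion_tps / speedup

h100_tps = 1000.0       # reported figure, rounded down from "over 1,000"
response_tokens = 2000  # arbitrary example length

print(seconds_to_generate(response_tokens, h100_tps))  # 2.0 seconds
for speedup in (4.0, 2.0):
    baseline = implied_baseline_tps(h100_tps, speedup)
    print(speedup, baseline, seconds_to_generate(response_tokens, baseline))
# 4.0 250.0 8.0
# 2.0 500.0 4.0
```

Either way the gap is noticeable for long outputs, but whether you save six seconds or two depends on which multiplier holds for your setup.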

The honest catch, and Google says this itself: DiffusionGemma is below standard Gemma 4 on quality. Google explicitly recommends using standard Gemma 4 "for applications that demand maximum quality." So this is a speed-first, experimental model, not a quality upgrade. That trade is the whole point of it.

Specs

  • Model: google/diffusiongemma-26B-A4B-it
  • Architecture: Mixture-of-Experts, roughly 26B total parameters with about 3.8B active (8 of 128 experts per token)
  • Context: up to 256K tokens
  • Input: multimodal (text, image, video) to text
  • License: Apache 2.0, open weights
  • Status: Google labels it an experimental open model
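The Mixture-of-Experts numbers are easier to reason about with a quick calculation. The sketch below uses only the figures from the spec list; the `top_k_experts` helper is a hypothetical toy router that shows what "8 of 128 experts per token" means, not DiffusionGemma's actual routing code.

```python
total_params_b = 26.0   # total parameters, in billions (from the spec list)
active_params_b = 3.8   # parameters used per token, in billions
experts_total = 128
experts_active = 8

active_fraction = active_params_b / total_params_b
expert_fraction = experts_active / experts_total
print(round(active_fraction, 3))  # 0.146
print(expert_fraction)            # 0.0625

def top_k_experts(router_scores: list[float], k: int = 8) -> list[int]:
    """Toy router: return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

print(top_k_experts([0.1, 0.9, 0.5, 0.7], k=2))  # [1, 3]
```

The active share of parameters (about 15%) is higher than the share of experts used (about 6%). In MoE models generally, that gap comes from components such as attention and embeddings that run for every token regardless of routing.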

How to run DiffusionGemma

The weights are on Hugging Face (google/diffusiongemma-26B-A4B-it). For inference today:

  • Hugging Face Transformers (via a dedicated diffusion generation class), vLLM, and MLX all support it.
  • Community quantized builds already exist (GGUF and NVIDIA NVFP4), which lower the memory bar.
  • llama.cpp support is listed as coming, and once it lands, DiffusionGemma should be runnable through Ollama and LM Studio like other GGUF models. Until then, verify before assuming it loads in those tools.
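As a starting point, here is a minimal sketch for fetching the weights with the `huggingface_hub` library. It assumes the repository id quoted above is correct and publicly accessible, and that you have enough disk space for a 26B-parameter checkpoint. The `download_weights` helper and the local directory name are made up for this example. Loading the model for inference is deliberately left out, since the dedicated generation class is not named here.

```python
REPO_ID = "google/diffusiongemma-26B-A4B-it"  # repository id as given in this post

def download_weights(local_dir: str = "./diffusiongemma") -> str:
    """Download every file in the model repository and return the local path.

    Requires `pip install huggingface_hub`. The import is kept inside the
    function so this file can be read or imported without the package installed.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

# Usage (this is a large download, so run it deliberately):
#   path = download_weights()
#   print(path)
```

If you want one of the community quantized builds instead, swap in that repository's id and check its model card for the runtime it targets.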

Hardware: full precision wants a large GPU (Google references needing more than 60GB), while quantized builds have been run in roughly 18GB of VRAM. One important caveat from Google: on unified-memory Apple Silicon the speed advantage may not show up, because diffusion generation is memory-bandwidth bound there.
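Those memory figures line up with a rough estimate of weight size alone. The sketch below multiplies the 26B total parameter count by the bits stored per parameter. It ignores activations, caches, and runtime overhead, so real usage will be higher, and the bit widths are common choices used for illustration rather than the exact formats of any particular build.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_param / 8

total_params_b = 26.0  # all experts must be resident, not just the active ones

for bits in (16, 8, 4):
    print(bits, weight_memory_gb(total_params_b, bits))
# 16 52.0
# 8 26.0
# 4 13.0
```

At 16 bits the weights come to about 52GB before any overhead, which is consistent with the "more than 60GB" guidance. At 4 bits they come to about 13GB, which fits with quantized builds running in roughly 18GB of VRAM. Only about 3.8B parameters are active per token, but the full 26B still has to sit in memory.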

Should you use it?

  • Yes, if you want raw generation speed, or you do a lot of code infilling, in-line editing, or other non-left-to-right tasks where parallel block generation helps. It is also simply interesting as the first widely-runnable open diffusion LLM.
  • No, if you need maximum answer quality. Google itself points you to standard Gemma 4 for that. DiffusionGemma is experimental and trades quality for speed.

For most people running a model locally, a conventional model like Qwen3 or Gemma 3 is still the better default today (see Best Local LLM 2026). DiffusionGemma is the one to watch and to reach for when speed is the priority.

Frequently asked questions

What is DiffusionGemma?

DiffusionGemma is Google's open-weight text diffusion model, released June 10, 2026 under Apache 2.0. Instead of generating one token at a time, it denoises blocks of tokens in parallel, which Google says makes it up to about 4x faster than the comparable standard Gemma model. It is built on Gemma 4.

How is DiffusionGemma different from a normal LLM?

A normal LLM is autoregressive and writes left to right, one token at a time. DiffusionGemma generates a whole block of tokens at once by denoising it over a few steps. That parallelism is faster and helps with non-linear tasks like code infilling and editing, at some cost to quality.

Is DiffusionGemma the same as Gemini Diffusion?

No. Gemini Diffusion is Google's closed, frontier diffusion research model (demo only, shown at Google I/O 2025). DiffusionGemma is the open-weight Gemma variant released in June 2026 that you can download and run. They are related but separate.

Can I run DiffusionGemma locally?

Yes. The weights are open on Hugging Face and it runs via Transformers, vLLM, and MLX, with quantized GGUF and NVFP4 builds available. llama.cpp support is coming, which would make it usable in Ollama and LM Studio. Full precision needs a large GPU; quantized builds have run in about 18GB of VRAM.

Is DiffusionGemma better than Gemma 4?

No, not on quality. Google itself says DiffusionGemma is below standard Gemma 4 and recommends standard Gemma 4 when you need maximum quality. DiffusionGemma's advantage is speed, not accuracy.
