Google Launches DiffusionGemma: Open AI Model with 4x Faster Text Generation

11 June 2026 · 18:00 · Claude (Anthropic) · claude-sonnet-4-6

Google DeepMind has launched DiffusionGemma, an open-weight AI language model that generates text up to four times faster than comparable models. The 26-billion-parameter Mixture of Experts model uses a diffusion mechanism instead of the traditional autoregressive approach and is immediately available under the Apache 2.0 license.

DiffusionGemma is the newest open model from Google DeepMind. Google announced the release on June 10, 2026, promising text generation up to four times faster than existing Gemma models. The model takes a fundamentally different approach from most language models, and its weights are immediately available to developers worldwide.

What is DiffusionGemma?

DiffusionGemma is Google DeepMind's first open-weight text diffusion model. Traditional large language models generate text autoregressively: one token at a time, with each token conditioned on the ones before it. DiffusionGemma works fundamentally differently and generates entire blocks of text in parallel. This diffusion mechanism, which previously proved its value in image generators such as Stable Diffusion and Imagen, is now applied at large scale to text generation for the first time by one of the world's largest AI labs.
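
The difference in sequential work can be illustrated with a toy count of forward passes. This is a sketch only: the block size and the number of denoising steps are hypothetical values chosen for illustration, not published DiffusionGemma settings, and the function names are invented for this example.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, denoise_steps: int) -> int:
    """A block-wise diffusion model refines all tokens of a block together,
    spending a fixed number of denoising passes on each block."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


tokens = 512  # hypothetical response length
ar = autoregressive_passes(tokens)
# block_size and denoise_steps are assumptions, not DiffusionGemma's real settings
diff = block_diffusion_passes(tokens, block_size=32, denoise_steps=8)
print(f"autoregressive passes: {ar}")
print(f"block diffusion passes: {diff}")
print(f"ratio: {ar / diff:.1f}x fewer sequential passes")
```

The ratio depends entirely on the assumed block size and step count. The sketch shows only why generating blocks in parallel can reduce the number of sequential steps.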

Technically, the model is a 26-billion-parameter Mixture of Experts (MoE) architecture. During inference, however, only 3.8 billion parameters are activated, which makes DiffusionGemma surprisingly efficient for its size.
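
As a quick check of those two figures, the share of parameters active per token can be computed directly. The snippet uses only the numbers quoted above; the variable names are illustrative.

```python
TOTAL_PARAMS = 26e9    # total parameters, as reported in the article
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, as reported

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active share per token: {active_fraction:.1%}")
```

Roughly one seventh of the weights take part in any single forward pass. In a Mixture of Experts model, all weights typically still have to be held in memory, because different tokens are routed to different experts.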

The 4x speed boost: what does that mean in practice?

The most striking feature of DiffusionGemma is the impressive speed gain compared with other Gemma models. In practice the model achieves:

More than 700 tokens per second on an NVIDIA GeForce RTX 5090
More than 1,000 tokens per second on a single NVIDIA H100

For comparison: traditional language models typically achieve 200 to 300 tokens per second on comparable hardware. DiffusionGemma's higher throughput comes from parallel generation: the model does not have to wait for each previous token to be generated before producing the next.
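
To make these throughput figures tangible, the sketch below converts them into waiting time for a single response. The response length of 1,000 tokens is an assumption chosen for illustration, and the reported figures are treated as exact rates although the article gives them as lower bounds or ranges.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time needed to emit num_tokens at a constant throughput."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 1000  # hypothetical response length

# Throughput figures as quoted in the article (tokens per second)
throughputs = {
    "DiffusionGemma on RTX 5090": 700,
    "DiffusionGemma on H100": 1000,
    "traditional model, low end": 200,
    "traditional model, high end": 300,
}

for label, tps in throughputs.items():
    print(f"{label}: {seconds_to_generate(RESPONSE_TOKENS, tps):.2f} s")
```

Under these assumptions, a response that takes several seconds at 200 to 300 tokens per second arrives in roughly one to one and a half seconds.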

This speed comes at a price, however: the overall output quality of DiffusionGemma is lower than that of Gemma 4, Google's current top model. The model is therefore not intended as a direct replacement, but as a specialized option for applications where speed takes priority over maximum accuracy.

Who is DiffusionGemma for?

DiffusionGemma is specifically optimized for local inference with low concurrency. This makes it particularly attractive for:

Developers who want to experiment quickly and build prototypes
Researchers who need to process large volumes of text
Companies with edge applications where cloud latency is a bottleneck
Content creators who want to generate AI-assisted text quickly

For large-scale cloud serving environments with many concurrent requests, DiffusionGemma offers fewer advantages. In such scenarios, traditional autoregressive models already make efficient use of available compute, so the parallel approach yields little additional gain.
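
Why the advantage shrinks with concurrency can be sketched with a toy roofline-style model. Every constant here is hypothetical (weight-loading time, compute time per token, block size, denoising steps), and the model ignores Mixture of Experts routing and many other real-world effects. It is not a measurement of DiffusionGemma.

```python
def tokens_per_second(batch_size, tokens_processed_per_request,
                      tokens_emitted_per_request,
                      weight_load_ms=10.0, compute_ms_per_token=0.05):
    """Toy roofline: one forward pass costs whichever is larger,
    streaming the weights from memory or doing the arithmetic."""
    compute_ms = batch_size * tokens_processed_per_request * compute_ms_per_token
    pass_ms = max(weight_load_ms, compute_ms)
    emitted = batch_size * tokens_emitted_per_request
    return emitted / (pass_ms / 1000.0)


def diffusion_speedup(batch_size, block_size=32, denoise_steps=8):
    """Throughput of the toy diffusion model relative to the toy autoregressive one."""
    # Autoregressive: each pass processes and emits one token per request.
    autoregressive = tokens_per_second(batch_size, 1, 1)
    # Diffusion: each pass processes a whole block, but a block needs
    # denoise_steps passes, so it emits block_size / denoise_steps tokens per pass.
    diffusion = tokens_per_second(batch_size, block_size, block_size / denoise_steps)
    return diffusion / autoregressive


for batch in (1, 8, 64, 256):
    print(f"batch {batch}: relative throughput {diffusion_speedup(batch):.2f}")
```

In this toy model a single request leaves the hardware waiting on memory, so processing a whole block per pass costs almost nothing extra. As the batch grows, the autoregressive model fills the same compute on its own and the relative gain falls. The absolute values are artifacts of the made-up constants and should not be read as predictions.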

Technical specifications and availability

DiffusionGemma is released under the Apache 2.0 license, which makes the model entirely free to use and modify — for both commercial and non-commercial purposes. When quantized, the model fits in 18 GB of VRAM, making it deployable on high-end consumer graphics cards.
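
A back-of-the-envelope estimate shows what the 18 GB figure implies. The article does not state which quantization is used, so the bit widths below are assumptions. The calculation covers weights only and leaves out activations, caches and runtime overhead.

```python
TOTAL_PARAMS = 26e9  # total parameters, as reported in the article
VRAM_BUDGET_GB = 18  # quantized footprint quoted in the article


def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Bit widths are assumptions; the article does not name the quantization scheme.
for bits in (16, 8, 4):
    gb = weight_gigabytes(TOTAL_PARAMS, bits)
    fits = gb <= VRAM_BUDGET_GB
    print(f"{bits}-bit weights: {gb:.0f} GB, fits in {VRAM_BUDGET_GB} GB: {fits}")
```

Of the three assumed widths, only 4-bit weights come in under 18 GB, which suggests the quoted figure refers to a quantization of around 4 bits per weight plus working memory.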

The model has day-one support in the most widely used AI frameworks:

vLLM — for efficient model serving
Hugging Face Transformers — the most widely used library for language models
MLX — for Apple Silicon users
Unsloth — for fine-tuning and optimized inference

NVIDIA has also announced support for DiffusionGemma via the RTX AI Garage infrastructure, which further simplifies local deployment on RTX GPUs for a broad audience of developers.

A new direction for open AI models

The launch of DiffusionGemma fits a broader trend in which large AI companies increasingly opt for open-weight models. By making the model available under Apache 2.0, Google lowers the barrier for developers worldwide to work with advanced AI technology. The choice of a diffusion architecture also stands out: although text diffusion models have long been of theoretical interest, no major AI lab had previously released open weights for such a model at this scale. That makes the release a notable step for Google DeepMind.

Conclusion: speed as a new trump card

With DiffusionGemma, Google shows that innovation in AI is not only about higher quality, but also about efficiency and accessibility. A fourfold speed gain is not a marginal improvement: it can be decisive for applications where response time is crucial. Model quality currently still lags behind Gemma 4, but future versions of DiffusionGemma may gradually close that gap.

For developers and companies looking for a fast, open and locally deployable language model, DiffusionGemma is already a serious option to explore.

Ars Technica