Google's DiffusionGemma AI Model Offers Speed Boost for Local Processing
A new AI model from Google DeepMind, called DiffusionGemma, has been released as part of the Gemma 4 open model family. Unlike other models in this lineup, it doesn't generate text linearly but instead produces entire blocks of text in parallel.
DiffusionGemma's approach resembles that of image generation models, which start with random noise and progressively denoise it into the desired content. Here, the model makes multiple passes over a field of placeholder tokens, filling in the tokens it judges most likely on each pass and refining its estimates for the rest.
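The multi-pass refinement can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in for the model that simply peeks at a known answer and attaches a random confidence score, and `MASK`, `diffusion_decode`, and the `steps` parameter are names invented for the example.

```python
import random

MASK = "_"  # hypothetical placeholder token


def toy_denoiser(tokens, target, rng):
    """Stand-in for a real model (assumption): for every masked slot,
    propose a token together with a made-up confidence score."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def diffusion_decode(target, steps=4, seed=0):
    """Start from all placeholders and commit the most confident
    proposals on each pass until no placeholder remains."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = [list(tokens)]
    while MASK in tokens:
        proposals = toy_denoiser(tokens, target, rng)
        # Keep only the highest-confidence proposals; the rest stay
        # masked and are re-estimated on the next pass.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over lazy dogs".split()
    final, history = diffusion_decode(words, steps=4)
    for n, state in enumerate(history):
        print(n, " ".join(state))
```

Each printed row shows the whole block at once, with tokens appearing in no particular left-to-right order. A real diffusion model would also be free to revise tokens it had already committed, which this toy version does not attempt.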
DiffusionGemma has 26 billion parameters but only activates 3.8 billion during inference, making it suitable for high-end GPUs with limited RAM. In testing on an RTX 5090 GPU, the model produced around 700 tokens per second, while a single Nvidia H100 AI accelerator achieved more than 1,000 tokens per second.
That is roughly a fourfold speedup over similarly sized autoregressive Gemma models. Because the model can generate up to 256 tokens at once, the bottleneck shifts from memory bandwidth to compute.
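A back-of-envelope sketch shows why committing several tokens per pass eases a memory-bandwidth limit. It assumes each pass is limited only by reading model weights from memory and ignores caches, expert routing, and arithmetic cost. The function name, the placeholder figures, and the count of 32 refinement passes per block are all illustrative assumptions, not measurements of DiffusionGemma or any GPU.

```python
def tokens_per_second(bytes_read_per_pass, bandwidth_bytes_per_sec,
                      tokens_committed_per_pass):
    """Throughput ceiling when every pass is limited by memory reads."""
    passes_per_sec = bandwidth_bytes_per_sec / bytes_read_per_pass
    return passes_per_sec * tokens_committed_per_pass


# Placeholder numbers chosen for round arithmetic (assumptions).
BYTES_PER_PASS = 1e10  # weights read from memory on each pass
BANDWIDTH = 1e12       # memory bandwidth in bytes per second

# Autoregressive decoding: one token per pass.
autoregressive = tokens_per_second(BYTES_PER_PASS, BANDWIDTH, 1)

# Block decoding: 256 tokens refined over an assumed 32 passes,
# which averages out to 8 tokens committed per pass.
block_parallel = tokens_per_second(BYTES_PER_PASS, BANDWIDTH, 256 / 32)

if __name__ == "__main__":
    print(f"autoregressive ceiling: {autoregressive:.0f} tokens/s")
    print(f"block-parallel ceiling: {block_parallel:.0f} tokens/s")
```

The memory-bound ceiling scales with the number of tokens committed per pass, so as that number grows, the arithmetic in each pass (which this sketch ignores) eventually becomes the limiting factor instead.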
Google claims that this approach offers measurable benefits for non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing. An animation provided by Google demonstrates how DiffusionGemma was used to solve Sudoku puzzles, a notoriously challenging task for standard autoregressive AI models.
The model's ability to continuously self-correct large sets of tokens makes it easier to tackle complex problems like Sudoku. However, there are some drawbacks to text diffusion, including a higher error rate compared to traditional autoregressive models.
Google has experimented with using diffusion in cloud-based Gemini models but found that the increased error rate and resource waste make it less suitable for this application. Instead, DiffusionGemma is designed for local processing where memory bandwidth limitations can be overcome by parallel processing.
The efficiency gain of DiffusionGemma makes it an attractive option for researchers looking to experiment with new approaches to AI processing. Google has also been working on other techniques such as Multi-Token Prediction (MTP) drafters, which use wasted compute cycles to predict possible tokens and increase speed.
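As a rough illustration of the general draft-then-verify idea behind such drafters (a generic toy, not Google's MTP implementation), a cheap drafter guesses several tokens ahead and the main model keeps the guesses that match its own choices. The names `speculative_decode`, `target_next`, and `draft_next` are hypothetical, and both "models" here are simple functions over integer tokens.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy draft-and-verify loop. Returns the generated sequence and
    the number of verification rounds by the main model."""
    out = list(prompt)
    target_rounds = 0
    while len(out) - len(prompt) < max_new:
        # The cheap drafter guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # The main model checks the guesses. A real system would score
        # all k positions in one batched pass; here it is simulated.
        target_rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the main model's token and stop.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_rounds


def target_next(seq):
    """Toy main model (assumption): counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy drafter (assumption): agrees with the main model except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new=8)
    print(tokens, "verification rounds:", rounds)
```

The output always matches what the main model would have produced on its own, because every accepted token is checked against it. The gain comes from needing fewer verification rounds than tokens whenever the drafter guesses well.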