NVIDIA Optimizes Google DeepMind's DiffusionGemma for Local AI

NVIDIA has optimized Google DeepMind's experimental open model, DiffusionGemma, which is designed for fast text generation. The optimized version runs on NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform, and NVIDIA DGX Spark systems, covering deployments from local PCs to cloud environments.

Unlike traditional large language models (LLMs), which generate text one token (roughly one word) at a time, DiffusionGemma produces multiple tokens in parallel, so whole blocks of text can be output at once. This approach enables low-latency performance for single-user workloads such as interactive chat and on-device assistants.

The model is based on the Gemma 4 26B mixture-of-experts architecture, but it generates text by refining a whole block at once rather than producing it sequentially. Each step denoises up to 256 tokens in parallel, which makes generation a compute-bound workload that plays to the strengths of NVIDIA GPUs.
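To make the contrast with sequential generation concrete, here is a minimal sketch that counts model forward passes under each approach. Only the 256-token block size comes from the description above. The function names and the `steps_per_block` value are hypothetical, chosen for illustration, and do not reflect DiffusionGemma's actual sampling schedule.

```python
import math

BLOCK_SIZE = 256  # tokens denoised in parallel per step (figure from the article)


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model runs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, steps_per_block: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """A block-diffusion model refines each block for a fixed number of steps.

    steps_per_block is an assumed, illustrative parameter; the real number
    of denoising steps is not stated in the article.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    tokens = 1024
    print("autoregressive passes:", autoregressive_passes(tokens))
    print("block-diffusion passes:", block_diffusion_passes(tokens, steps_per_block=16))
```

Under these assumptions, the diffusion approach needs far fewer passes, but each pass processes a full block of tokens, which is why the workload shifts toward being compute-bound.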

NVIDIA Tensor Cores accelerate the dense parallel math that DiffusionGemma's operations require, and the CUDA software stack lets the model run efficiently without bespoke tuning. The result is higher throughput than comparable autoregressive LLMs.

According to benchmarks, DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, roughly four times the throughput of equivalent autoregressive models. The optimized version also achieves faster local inference on NVIDIA DGX Station and runs with improved performance across NVIDIA's full lineup.
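As a rough illustration of what those figures imply, the sketch below works through the arithmetic. It assumes the "roughly four times" claim is exactly 4x and that throughput stays constant over a response. Both are simplifications, and the function names and the 500-token response length are hypothetical.

```python
DIFFUSION_TOKENS_PER_SEC = 1000.0  # reported H100 figure from the article
SPEEDUP = 4.0                      # "roughly four times", treated as exact here


def implied_baseline_tps(diffusion_tps: float, speedup: float) -> float:
    """Throughput an autoregressive model would need for the stated speedup."""
    return diffusion_tps / speedup


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to produce num_tokens at a constant throughput."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    baseline = implied_baseline_tps(DIFFUSION_TOKENS_PER_SEC, SPEEDUP)
    response_tokens = 500  # hypothetical response length
    print("implied baseline tokens/sec:", baseline)
    print("diffusion seconds:", seconds_to_generate(response_tokens, DIFFUSION_TOKENS_PER_SEC))
    print("baseline seconds:", seconds_to_generate(response_tokens, baseline))
```

With these inputs, the implied autoregressive baseline is about 250 tokens/sec, so a 500-token response would take about half a second instead of about two seconds.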

Developers can access DiffusionGemma through several tools, including Hugging Face Transformers, which provides day-zero serving support on GeForce RTX 5090 and DGX Spark systems. Fine-tuning is available via Unsloth and the NVIDIA NeMo framework, along with ready-made playbooks for adapting the model to specific tasks or domains.

Users can try DiffusionGemma on Hugging Face or test it using NVIDIA-hosted application programming interfaces at build.nvidia.com.