Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder
Interfaze released diffusion-gemma-asr-small, an open-source automatic speech recognition model using diffusion-based parallel denoising to transcribe six languages.

- diffusion-gemma-asr-small is the first open-source ASR model to use diffusion-based parallel denoising for multilingual transcription.
- The model relies on a 42M-parameter adapter to integrate with Google's frozen DiffusionGemma, avoiding full model retraining.
- Transcription cost scales with denoising steps rather than transcript length, potentially improving efficiency for long audio.
- Supports six languages with a single adapter, reducing deployment complexity for multilingual ASR systems.
Interfaze has open-sourced diffusion-gemma-asr-small, a multilingual automatic speech recognition (ASR) model that departs from traditional autoregressive approaches by using diffusion-based parallel denoising. The model integrates with Google's frozen DiffusionGemma architecture via a lightweight adapter (~42M parameters), enabling transcription across six languages without language-specific fine-tuning.
Unlike conventional autoregressive ASR systems, where decoding cost grows with transcript length, diffusion-gemma-asr-small's cost is determined by the number of denoising steps, offering potential efficiency gains for long-form audio. Because one adapter covers all six languages, a single model can serve multilingual applications, simplifying deployment and reducing resource overhead.
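To make the scaling argument concrete, here is a minimal sketch that counts sequential decoder passes under two simplified assumptions: an autoregressive decoder runs one pass per emitted token, and a diffusion decoder runs one pass per denoising step regardless of transcript length. The function names and the step count of 16 are illustrative assumptions, not figures from the release.

```python
# Hypothetical step count chosen for illustration; the release does not state one.
ASSUMED_DENOISING_STEPS = 16


def autoregressive_passes(transcript_tokens: int) -> int:
    """Simplified model: one decoder pass per emitted token."""
    return transcript_tokens


def diffusion_passes(denoising_steps: int) -> int:
    """Simplified model: one decoder pass per denoising step.

    Every pass updates all transcript positions at once, so the count
    does not depend on how long the transcript is.
    """
    return denoising_steps


def compare(lengths, steps=ASSUMED_DENOISING_STEPS):
    """Return (tokens, autoregressive passes, diffusion passes) per length."""
    return [(n, autoregressive_passes(n), diffusion_passes(steps)) for n in lengths]


for tokens, ar, diff in compare([50, 500, 5000]):
    print(f"{tokens:>5} tokens: autoregressive={ar:>5} passes, diffusion={diff} passes")
```

Each diffusion pass touches every position, so a single pass is not free; the sketch only shows that the number of sequential passes stays fixed as transcripts grow.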
The release underscores growing interest in diffusion-based methods for speech processing, aligning with broader trends in generative AI where diffusion models are being explored for non-autoregressive sequence generation tasks.
The release provides an open-source approach to ASR built on diffusion models, enabling experimentation with non-autoregressive speech processing.
It offers a multilingual ASR option that could reduce infrastructure costs for long-form audio transcription.
It demonstrates a practical application of diffusion models to speech processing, bridging generative AI and ASR research.
- DiffusionGemma: A diffusion-based language model architecture from Google, repurposed here for ASR via an adapter.
- Parallel denoising decoder: A diffusion mechanism that refines all transcript tokens simultaneously over a fixed number of denoising steps, rather than generating them one at a time, which can improve efficiency on long outputs.
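
As an illustration of parallel denoising, the toy loop below starts from a fully masked transcript and fills in several positions per step. It assumes a masked-token style of discrete diffusion with a confidence-based reveal schedule, which is one common formulation; the source does not state which schedule DiffusionGemma uses, and `toy_denoiser` is a hypothetical stand-in that simply returns the target tokens.

```python
import math

MASK = "<mask>"


def toy_denoiser(tokens, target):
    """Hypothetical stand-in for a real model.

    Proposes a (token, confidence) pair for every position in one call.
    Confidence here is made up: earlier positions score higher.
    """
    n = len(tokens)
    return [(target[i], 1.0 - i / n) for i in range(n)]


def parallel_denoise(target, steps):
    """Unmask a transcript over a fixed number of steps.

    Returns the final tokens and the number of denoiser calls made.
    """
    tokens = [MASK] * len(target)
    calls = 0
    for step in range(1, steps + 1):
        proposals = toy_denoiser(tokens, target)  # one call covers all positions
        calls += 1
        # How many positions should be revealed in total after this step.
        revealed_goal = math.ceil(len(target) * step / steps)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        already_revealed = len(tokens) - len(masked)
        # Reveal the most confident masked positions first.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[: revealed_goal - already_revealed]:
            tokens[i] = proposals[i][0]
    return tokens, calls


transcript = "the cat sat on the mat".split()
decoded, calls = parallel_denoise(transcript, steps=3)
print(decoded, calls)
```

The denoiser is called once per step no matter how many tokens the transcript holds, which is the property the glossary entry describes.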

