NVIDIA Unveils Largest Open-Source European Speech AI Dataset & Models
Nvidia has unveiled a significant advancement in multilingual speech AI, introducing Granary, an expansive open-source speech dataset for European languages, alongside two cutting-edge models: Canary-1b-v2 and Parakeet-tdt-0.6b-v3. This comprehensive release establishes a new benchmark for accessible, high-quality resources in automatic speech recognition (ASR) and automatic speech translation (AST), particularly benefiting European languages that have historically been underrepresented in AI development.
At the core of this initiative is Granary, a massive multilingual dataset developed in collaboration with Carnegie Mellon University and Fondazione Bruno Kessler. This corpus encompasses approximately one million hours of audio, with 650,000 hours dedicated to speech recognition tasks and 350,000 hours for speech translation. Granary covers 25 European languages, including nearly all official EU languages, plus Russian and Ukrainian, with a deliberate focus on those with limited annotated data, such as Croatian, Estonian, and Maltese. A key innovation behind Granary is its pseudo-labeling pipeline, which processes unlabeled public audio using Nvidia NeMo's Speech Data Processor: transcriptions are generated, structured, and quality-filtered automatically, sharply reducing the need for laborious, resource-intensive manual annotation. Because the resulting data is clean and high quality, models trained on Granary converge markedly faster; Nvidia's research indicates that developers can reach target accuracy with roughly half as much Granary data as with competing datasets, which is especially valuable for resource-constrained languages and rapid prototyping.
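Granary's actual pipeline is built on NeMo's Speech Data Processor, but the core idea of pseudo-labeling can be illustrated in a few lines: machine-generated transcriptions are kept only when they pass cheap quality heuristics. The following is a minimal sketch; all names and thresholds are hypothetical, not part of the real pipeline.

```python
# Minimal sketch of a pseudo-label quality filter, loosely inspired by
# pipelines like Granary's. All names and thresholds are hypothetical;
# the real pipeline uses Nvidia NeMo's Speech Data Processor.
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    text: str           # machine-generated transcription
    confidence: float   # model confidence in [0, 1]
    duration_s: float   # audio clip length in seconds


def keep_label(label: PseudoLabel,
               min_confidence: float = 0.9,
               max_chars_per_second: float = 25.0) -> bool:
    """Keep only pseudo-labels that look trustworthy.

    Two cheap heuristics: drop low-confidence hypotheses, and drop
    transcriptions whose character rate is implausibly high for speech
    (a common symptom of hallucinated or repeated text).
    """
    if label.confidence < min_confidence:
        return False
    if label.duration_s <= 0:
        return False
    return len(label.text) / label.duration_s <= max_chars_per_second


labels = [
    PseudoLabel("hello and welcome to the show", 0.97, 2.5),
    PseudoLabel("mumbled something unclear", 0.42, 2.0),  # low confidence
    PseudoLabel("x" * 500, 0.95, 3.0),                    # hallucination-like
]
kept = [lab for lab in labels if keep_label(lab)]
print(len(kept))  # only the first label survives filtering
```

Filtering of this kind is what lets a pseudo-labeled corpus approach the quality of manual annotation at a fraction of the cost.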
Building upon the Granary dataset, Nvidia has introduced Canary-1b-v2, a billion-parameter encoder-decoder model engineered for high-quality transcription and translation between English and 24 other supported European languages. This model doubles the language coverage of its predecessor, delivering state-of-the-art performance comparable to models three times its size while running inference up to ten times faster. Canary-1b-v2 excels at multitask work, robustly handling both ASR and AST, and features automatic punctuation, capitalization, and precise word- and segment-level timestamps, even for translated outputs. Its architecture, combining a FastConformer encoder with a Transformer decoder and a unified vocabulary via a SentencePiece tokenizer, ensures strong performance under noisy conditions and resistance to hallucinated output. Evaluation highlights underscore its accuracy, with a Word Error Rate (WER) of 7.15% on the AMI dataset for ASR and COMET scores of 79.3 for X-to-English and 84.56 for English-to-X in AST. Available under a CC BY 4.0 license and optimized for Nvidia GPU-accelerated systems, Canary-1b-v2 is designed for scalable production use.
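The WER figure quoted above is the standard ASR metric: the word-level edit distance between reference and hypothesis, divided by the reference length. A self-contained sketch (the example sentences are illustrative, not from the AMI benchmark):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)


ref = "the canary model transcribes european speech"
hyp = "the canary model transcribed european speech"
print(f"{word_error_rate(ref, hyp):.4f}")  # 1 substitution over 6 words: 0.1667
```

A WER of 7.15% thus means roughly one word-level error for every 14 reference words.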
Complementing Canary-1b-v2 is Parakeet-tdt-0.6b-v3, a 600-million-parameter multilingual ASR model optimized for high-throughput, large-volume transcription across all 25 supported languages. This model expands the Parakeet family, previously focused on English, to full European coverage. It includes automatic language detection, so it can transcribe input audio without explicit language prompts, and offers real-time processing, transcribing audio segments of up to 24 minutes in a single inference pass. Parakeet-tdt-0.6b-v3 prioritizes low latency, efficient batch processing, and accurate output, complete with word-level timestamps, punctuation, and capitalization, and remains reliable even on complex content such as numbers or lyrics and in challenging audio environments.
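Because a single inference pass covers at most 24 minutes, longer recordings must be split before transcription. A simple boundary calculation shows how many passes a recording needs; this helper is a hypothetical sketch, not part of the Parakeet API:

```python
def chunk_boundaries(total_s: float, max_chunk_s: float = 24 * 60):
    """Split a recording of total_s seconds into (start, end) windows no
    longer than max_chunk_s, so each window fits one inference pass.

    Hypothetical helper: 24 minutes reflects Parakeet-tdt-0.6b-v3's stated
    single-pass limit, but the chunking strategy here is our own.
    """
    if total_s <= 0:
        return []
    boundaries = []
    start = 0.0
    while start < total_s:
        end = min(start + max_chunk_s, total_s)
        boundaries.append((start, end))
        start = end
    return boundaries


# A 90-minute recording needs four passes under a 24-minute cap.
print(len(chunk_boundaries(90 * 60)))  # 4
```

In practice one would split on silence near these boundaries rather than at hard cut points, to avoid slicing through a word.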
Nvidia’s release of the Granary dataset and its accompanying model suite marks a significant step towards democratizing speech AI for Europe. By providing open-source, high-quality resources, these tools empower developers, researchers, and businesses to build inclusive and high-performing applications that support linguistic diversity. The advancements pave the way for scalable development of next-generation multilingual chatbots, sophisticated customer service voice agents, and near-real-time translation services, fostering innovation across a wide array of industries.