Developing powerful AI language models requires massive amounts of training data. So far, English-language datasets and proprietary systems from large technology companies have dominated. An international research team wants to change this: With MOSEL (Massive Open-source compliant Speech data for European Languages), they have compiled an extensive collection of open-source speech data for the 24 official languages of the European Union.
The project aims to advance the development of open AI language models in Europe. The initiative is particularly important because existing language models are often trained on data that is either copyrighted or subject to usage restrictions. MOSEL, by contrast, is a freely accessible and usable resource that enables the development of transparent AI systems open to all.
The data collected comes from 18 different sources, including projects like CommonVoice, LibriSpeech and VoxPopuli. It includes both transcribed speech recordings and unlabeled audio data. The 505,000 hours of transcribed data are particularly valuable.
However, the distribution between languages is very uneven. While there are over 437,000 hours of annotated data for English, there are only a few hours for languages like Maltese or Irish.
To improve the data situation for low-resource languages, the researchers also automatically transcribed an additional 441,000 hours of previously unlabeled audio data. They used OpenAI's Whisper model for this purpose.
The team explains that while automatic transcription is not perfect, it does provide large amounts of training material even for languages with little manually transcribed data. The generated transcripts are published under the Creative Commons CC-BY license, which allows free use with attribution.
The challenges of automatic transcription are particularly evident in the case of Maltese. Here, the Whisper model achieved a word error rate of over 80 percent, meaning that on average roughly four out of every five words were recognized incorrectly.
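To make the metric concrete: word error rate is conventionally computed as the number of word-level substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The following Python sketch illustrates that definition. The function name and the example sentences are invented for illustration and are not taken from the MOSEL project or its evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); real evaluations usually also
    normalize casing and punctuation before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: four of the five reference words are misrecognized.
example_wer = word_error_rate("one two three four five",
                              "one too tree for fife")
```

In this toy example, four of the five reference words are wrong, so `example_wer` is 0.8, which corresponds to a word error rate of 80 percent. Because insertions also count as errors, the rate can exceed 100 percent when the output is much longer than the reference.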
Much work remains to be done for such languages, but the automated transcripts could serve as a starting point for further improvements. The team also plans to collect more data for underrepresented languages.
The entire dataset is freely available on GitHub and is intended to provide researchers and developers with access to extensive language data for European languages.
MOSEL represents an important step towards a European AI landscape that is not dominated by large technology companies. The open and freely accessible nature of the data collection enables small businesses, start-ups and research institutions to develop innovative AI applications for European languages.
The MOSEL data can be used for a variety of applications, including:
- Development of speech recognition systems for European languages
- Improvement of machine translation systems
- Development of language assistants and chatbots for European languages
- Research in the field of language processing and machine learning

The MOSEL initiative is an important step towards a more open and inclusive AI landscape in Europe. The freely available language data will drive the development of innovative AI applications for European languages and help strengthen Europe's digital sovereignty.