MOSEL Dataset Aims to Empower Open Source AI for European Languages

A European Language Model: Researchers Collect 950,000 Hours of Open-Source Speech Data for EU Languages

Developing powerful AI language models requires enormous amounts of training data. Until now, English-language datasets and proprietary systems from large technology companies have dominated. An international research team wants to change this: With MOSEL (Massive Open-source compliant Speech data for European Languages), they have compiled an extensive collection of open-source speech data for the 24 official languages of the European Union.

The MOSEL Initiative: A Step Towards Open AI Language Models

The project aims to advance the development of open AI language models in Europe. The initiative is particularly important because existing language models are often trained on data that is either copyrighted or subject to usage restrictions. MOSEL, by contrast, is a freely accessible and usable resource that enables the development of transparent and accessible AI systems for all.

Composition and Scope of the Data Collection

The collected data comes from 18 different sources, including projects such as CommonVoice, LibriSpeech, and VoxPopuli. It includes both transcribed speech recordings and unlabeled audio data. The 505,000 hours of transcribed data are particularly valuable.

Challenges and Uneven Data Distribution

However, the distribution between languages is very uneven. While there are over 437,000 hours of labeled data for English, there are only a few hours for languages like Maltese or Irish.
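To make the imbalance concrete, the following sketch works through the arithmetic using only the rounded hour counts quoted in this article. Because the English figure is given as "over 437,000", the results are approximate bounds, not exact values from the dataset.

```python
# Rounded figures quoted in the article (hours of labeled speech).
labeled_total_hours = 505_000
english_labeled_hours = 437_000  # article says "over 437,000", so a lower bound

# Share of all labeled data that is English.
english_share = english_labeled_hours / labeled_total_hours

# Labeled hours left over for the remaining 23 official EU languages combined.
other_hours = labeled_total_hours - english_labeled_hours

print(f"English share of labeled data: at least about {english_share:.1%}")
print(f"Labeled hours for the other 23 languages: at most about {other_hours:,}")
```

By these figures, English alone accounts for roughly 86.5 percent of the labeled data, leaving about 68,000 hours to be shared among the other 23 languages.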

AI-Supported Transcription Expands the Database

To improve the data situation for low-resource languages, the researchers also automatically transcribed 441,000 hours of previously unlabeled audio data, using OpenAI's Whisper model.

Limitations of Automatic Transcription

The team explains that while automatic transcription is not perfect, it does provide large amounts of training material even for languages with little manually transcribed data. The generated transcripts are published under the Creative Commons CC-BY license, which allows free use with attribution.

The challenges of automatic transcription are particularly evident in the case of Maltese. Here, the Whisper model achieved a word error rate of over 80 percent, which corresponds to more than four errors for every five words in the reference transcript.
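Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognized text and a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation code used by the MOSEL team. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: four of the five reference words are wrong.
wer = word_error_rate("one two three four five", "one six seven eight nine")
print(f"{wer:.0%}")  # prints 80%
```

Because insertions also count as errors, the rate can exceed 100 percent when the system outputs many more words than the reference contains.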

Future Development and Data Availability

There is still a lot of work to be done for such languages - but the automated transcriptions could serve as a starting point for further improvements. The team also plans to collect additional data for underrepresented languages.

The entire dataset is freely available on GitHub and is intended to provide researchers and developers with access to extensive language data for European languages.

Significance for the European AI Landscape

MOSEL represents an important step towards a European AI landscape that is not dominated by large technology companies. The open and freely accessible nature of the dataset allows small businesses, startups and research institutions to develop innovative AI applications for European languages.

Possible Applications of the MOSEL Data

The MOSEL data can be used for a variety of applications, including:

- Development of speech recognition systems for European languages
- Improvement of machine translation systems
- Development of voice assistants and chatbots for European languages
- Research in the field of speech processing and machine learning

Conclusion

The MOSEL initiative is an important step towards a more open and inclusive AI landscape in Europe. The freely available language data will drive the development of innovative AI applications for European languages and help strengthen Europe's digital sovereignty.

Bibliography

Gaido, M., Papi, S., Bentivogli, L., Brutti, A., Cettolo, M., Gretter, R., Matassoni, M., Nabih, M., & Negri, M. (2024). MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/abs/2410.01036

Schreiner, M. (2024, October 7). Researchers collect 950,000 hours of open source speech data for EU languages. THE DECODER. https://the-decoder.com/researchers-collect-950000-hours-of-open-source-speech-data-for-eu-languages/
