The rapid development of Artificial Intelligence (AI) is increasingly shaping our daily lives and opening up new possibilities in many areas. Language models are among the most exciting developments, but they also present challenges. This article examines current developments in the field of AI language models with a focus on European languages and dialects.
A recent study from Harvard University shows interesting parallels between large language models and crowdsourcing. Instead of relying on the expertise of individuals, AI systems analyze massive amounts of data from the internet to generate the most likely answer to a question. This approach, based on a consensus principle, leads to astonishingly accurate results for general topics.
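The consensus principle can be illustrated with a toy aggregation step. The sketch below is an analogy only, not a description of how a language model computes its output: it assumes we already have a handful of candidate answers and simply picks the most frequent one. The function name `consensus_answer` and the sample answers are hypothetical.

```python
from collections import Counter


def consensus_answer(candidates):
    """Return the most frequent answer and the share of candidates that agree with it."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    # Normalize so that trivially different spellings count as the same answer.
    normalized = [c.strip().lower() for c in candidates]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical crowd answers to "What is the capital of Australia?"
crowd = ["Canberra", "Sydney", "canberra", "Canberra ", "Melbourne"]
answer, agreement = consensus_answer(crowd)
print(answer, agreement)
```

A high agreement share suggests a settled, general question; a low share is a hint that the answers diverge and the result deserves closer scrutiny.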
However, these models reach their limits with complex or controversial questions. A known problem is their tendency toward "hallucination", i.e., the generation of false or misleading information. It is especially evident when AI models cite scientific papers, which they often reproduce incorrectly.
The study authors therefore recommend treating AI-generated content with the same caution as crowdsourced results. While such content can provide valuable insights on general topics, critical examination is essential for specific questions. The quality of the results depends largely on the quantity and quality of the training data.
The Danish company Corti has now introduced its AI platform for the healthcare sector in Germany. The highlight: Corti uses its own AI assistant, which has been specially trained in medical terminology. This "co-pilot" is designed to support doctors in documenting patient consultations and thus reduce administrative effort.
Similar to Microsoft's Copilot or AI assistants from companies like Jameda, Corti's AI system listens in on conversations between doctors and patients and assigns the information to the corresponding categories. This not only facilitates documentation but is also intended to reduce the error rate in clinical documentation.
The Eichsfeld Clinic in Thuringia is one of the first clinics to implement Corti's AI platform. The chief physician of the emergency room, Dušan Trifunović, is enthusiastic about the technology: "While the doctor is talking to the patient, the co-pilot captures, organizes and assigns information to the correct areas such as diagnostics and laboratory." The clinic plans to implement Corti in all departments to optimize workflows and improve medical care.
With MOSEL (Massive Open-source compliant Speech data for European Languages), an international research team has published a large collection of speech data covering all 24 official EU languages. This freely available database contains over 500,000 hours of transcribed speech recordings and 440,000 hours of raw audio data.
However, a closer look reveals an uneven distribution of the data. While there are over 430,000 hours of annotated data available for the English language, there are significantly fewer for languages such as Maltese or Irish. To close this gap, the researchers automatically transcribed the raw audio data using the AI model Whisper from OpenAI.
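The imbalance can be made concrete with a rough back-of-the-envelope calculation. The sketch below uses only the rounded figures quoted in this article (both are stated as "over" a given value, so the results are approximations), and the per-language average is an illustrative simplification that hides the real spread between languages.

```python
# Rounded lower-bound figures quoted in the article, in hours.
TRANSCRIBED_TOTAL_HOURS = 500_000
ENGLISH_TRANSCRIBED_HOURS = 430_000
OTHER_LANGUAGES = 23  # 24 official EU languages minus English

# Share of the transcribed data that is English.
english_share = ENGLISH_TRANSCRIBED_HOURS / TRANSCRIBED_TOTAL_HOURS

# What is left for all other languages, and a naive per-language average.
remaining_hours = TRANSCRIBED_TOTAL_HOURS - ENGLISH_TRANSCRIBED_HOURS
avg_other_hours = remaining_hours / OTHER_LANGUAGES

print(f"English share: {english_share:.0%}")
print(f"Average per other language: {avg_other_hours:,.0f} hours")
```

On these rounded figures, English accounts for roughly 86 percent of the transcribed data, leaving on the order of 3,000 hours on average for each of the other 23 languages.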
With MOSEL, researchers and developers now have a valuable tool at their disposal for representing the linguistic diversity of Europe in AI systems. The freely available data is intended to promote the development of language assistants, translation programs, and other AI applications for European languages.
Developing AI systems that capture the linguistic diversity of Europe in its entirety poses a major challenge. Jan Wolter, Head of Product and Managing Director of Applause EU, a company that tests language assistance systems, emphasizes the importance of training data that also takes dialects and regional characteristics into account.
Factors such as age, gender, and social background influence the way people speak and pose challenges for AI systems. Youth language and the constant evolution of language over time must also be considered. Accents, different word meanings, and regional pronunciation variants further complicate understanding.
To overcome these challenges, AI systems need to be able to capture context and identify the specific language or dialect being spoken. This requires diverse and representative training data, which is often complex and time-consuming to acquire. Only with such data can AI systems be developed that are equally accessible and understandable to all people.
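One simple way to check whether a dataset is representative is to look at how its samples are distributed across speaker groups. The following is a minimal sketch under stated assumptions: the dialect labels, the sample counts, and the 10 percent threshold are all invented for illustration, and `underrepresented` is a hypothetical helper, not part of any tool mentioned in this article.

```python
from collections import Counter


def underrepresented(labels, min_share):
    """Return each group whose share of the samples falls below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        group: n / total
        for group, n in counts.items()
        if n / total < min_share
    }


# Hypothetical dialect labels attached to 100 speech samples.
samples = (
    ["standard"] * 80
    + ["bavarian"] * 12
    + ["low_german"] * 5
    + ["swabian"] * 3
)

# Flag every group that makes up less than 10 percent of the data.
flagged = underrepresented(samples, min_share=0.10)
print(flagged)
```

In practice the same check could be run over age groups, genders, or accents; the groups it flags are the ones for which additional recordings would need to be collected.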
The development of AI language models is progressing rapidly and holds enormous potential for the European economy and society. Taking into account the linguistic diversity of Europe is essential in order to avoid discrimination and to enable all people to access AI-based technologies.
Open datasets such as MOSEL, the development of specialized AI systems such as Corti, and the continuous improvement of speech recognition by companies such as Applause EU are making an important contribution to shaping an inclusive digital future.