The digital world is becoming increasingly multilingual, and building capable language models that reflect this diversity is crucial. Sailor2, a family of state-of-the-art multilingual language models, focuses specifically on the languages of Southeast Asia (SEA) and offers a promising solution for this linguistically diverse region. The models come in 1B, 8B, and 20B parameter sizes to cover a range of use cases.
Sailor2 builds on Qwen2.5 and is continually pre-trained on 500 billion tokens: 400 billion from SEA-specific data and 100 billion replay tokens (re-used data that helps preserve the base model's existing capabilities). This training enables Sailor2 to support 13 SEA languages while maintaining proficiency in Chinese and English. The developers report that the Sailor2-20B model achieves a 50-50 win rate against GPT-4o in most SEA languages.
A distinctive feature of the Sailor2 project is its comprehensive "cookbook" documenting the development process. The cookbook covers five key areas: data curation, pre-training, post-training, model adaptation, and evaluation. The developers hope it will inspire other researchers to build inclusive LLMs for previously underserved languages.
The development of Sailor2 involved a series of technological innovations to ensure optimal performance and efficiency. These include:
Model Expansion: The Qwen2.5 base models were expanded in size to provide additional capacity for the requirements of SEA languages.
Optimized Data Mixing Strategies: The training data was carefully selected and mixed to ensure a balanced representation of the different languages.
Multi-stage Pre-training Protocols: Pre-training proceeds in stages with different data mixtures, improving both model quality and training efficiency.
Advanced Multilingual Post-training: Specific post-training techniques refined the model's abilities in the individual SEA languages.
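The data-mixing idea above can be illustrated with temperature-based sampling, a common technique for balancing multilingual corpora that up-weights low-resource languages. Note that this is a generic sketch with hypothetical token counts, not Sailor2's actual mixing recipe — the cookbook describes the strategy the team actually used.

```python
# Temperature-based sampling: a common way to balance multilingual
# training data. Illustrative sketch only -- NOT Sailor2's actual
# data-mixing recipe (see the Sailor2 cookbook for that).

def mixing_weights(token_counts: dict[str, float],
                   temperature: float = 0.7) -> dict[str, float]:
    """Compute per-language sampling probabilities.

    Raw corpus proportions p_i are re-scaled as p_i ** temperature
    and re-normalized. A temperature below 1 flattens the
    distribution, so low-resource languages are sampled more often
    than their raw share of the data would suggest.
    """
    total = sum(token_counts.values())
    scaled = {lang: (count / total) ** temperature
              for lang, count in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Hypothetical token counts (in billions) for three SEA languages.
counts = {"vi": 100.0, "th": 50.0, "km": 5.0}
weights = mixing_weights(counts, temperature=0.5)
```

With temperature 0.5, Khmer's sampling weight rises well above its raw ~3% share of tokens, while Vietnamese's falls below its raw ~65% share — the balancing effect the list item refers to.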
Sailor2 has the potential to significantly advance language technology in Southeast Asia. The open models and the detailed "cookbook" offer valuable resources for researchers and developers. By focusing on inclusive multilingualism, Sailor2 contributes to bridging the digital divide and enables wider access to information and technologies in the region.
The Apache-2.0 license, under which Sailor2 was released, promotes the open use and further development of the model. This encourages the community to build upon the existing foundation and develop innovative applications for the SEA languages. The combination of powerful models and a transparent development process makes Sailor2 an important contribution to the future of multilingual language processing.