November 22, 2024

UnifiedCrawl Improves LLM Performance in Low-Resource Languages

UnifiedCrawl: Unlocking Common Crawl for Cost-Effective Adaptation of LLMs to Low-Resource Languages

Large language models (LLMs) often perform poorly in low-resource languages because little training data is available for them. This article highlights a method for efficiently collecting text in such languages from the extensive Common Crawl corpus. The approach, UnifiedCrawl, filters and extracts data from Common Crawl with minimal computational overhead, yielding monolingual datasets substantially larger than previously available sources. Fine-tuning multilingual LLMs on this data with efficient adapter methods (QLoRA) improves performance in the target low-resource language while keeping VRAM usage low: experiments show markedly lower language-modeling perplexity and higher few-shot prompting scores. Together with the published source code, this work offers a cost-effective path to better LLMs for low-resource languages on commodity hardware.

The Challenge of Data Scarcity

The ability of LLMs to generate coherent, contextually relevant text rests on training with massive amounts of data, most of which is written in widely used languages. For low-resource languages, sufficiently large training datasets are simply not available, and models struggle to produce meaningful, coherent text in them. Even when an LLM is prompted in a widely used language and asked to respond in a low-resource one, the results are often inadequate.

This limitation stems from the original training data of LLMs, which is heavily skewed toward widely used languages, above all English. Adapting LLMs to low-resource languages is therefore crucial for democratizing access and broadening practical applicability. However, training LLMs is enormously resource-intensive, which makes such adaptation a significant hurdle.

UnifiedCrawl: An Efficient Approach to Data Collection

UnifiedCrawl addresses the challenges of data scarcity and resource intensity. The method enables the efficient and cost-effective collection of language-specific text data from the entire Common Crawl corpus. By optimizing memory, compute, and network consumption in each stage of the data acquisition pipeline, UnifiedCrawl can be run entirely on commodity hardware. Processing the entire Common Crawl dataset is possible within a few days, requiring less than 10 GB of RAM and storage space.
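The sketch below illustrates how such language-specific filtering can be done at the index level rather than over the raw web archives. It is not the authors' published pipeline, but a minimal illustration assuming Common Crawl's columnar URL index (cc-index) with its content_languages column, queried via DuckDB against the public S3 bucket; the crawl label, language code, and output file are placeholders.

```python
# Sketch: filter Common Crawl's columnar URL index for one language using DuckDB.
# Assumes anonymous read access to the public "commoncrawl" S3 bucket and the
# documented content_languages column; crawl label and output path are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

crawl = "CC-MAIN-2024-10"   # placeholder crawl snapshot
lang = "amh"                # placeholder ISO-639-3 code for the target language

# Keep only the pointers (WARC file, offset, length) needed to later fetch the
# matching records via HTTP range requests and extract their text.
# Note: content_languages may hold a comma-separated list for multilingual pages;
# an equality check keeps only pages detected as purely the target language.
con.execute(f"""
    COPY (
        SELECT url, warc_filename, warc_record_offset, warc_record_length
        FROM read_parquet(
            's3://commoncrawl/cc-index/table/cc-main/warc/crawl={crawl}/subset=warc/*.parquet')
        WHERE content_languages = '{lang}'
    ) TO 'index_{lang}.parquet' (FORMAT PARQUET)
""")
```

Because only index columns and record pointers are touched at this stage, memory and network usage stay small compared with downloading full web archives.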

The result is a curated dataset significantly larger than previously available collections for low-resource languages. Multilingual LLMs are then fine-tuned on this data using quantization combined with lightweight low-rank adapters (QLoRA). This technique makes it possible to train very large models on commodity GPUs, increasing the accessibility and affordability of adaptation.
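For illustration, a QLoRA-style setup with the Hugging Face transformers, bitsandbytes, and peft libraries typically looks like the sketch below; the base model, target modules, and hyperparameters are placeholders rather than the paper's exact configuration.

```python
# Sketch of QLoRA-style fine-tuning: 4-bit quantized base model plus low-rank
# adapters. Model name, target modules, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "facebook/xglm-564M"  # placeholder multilingual base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```

Because the frozen base weights are stored in 4-bit and only the adapter parameters receive gradients, the VRAM footprint stays within the reach of a single commodity GPU.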

Experimental Results and Outlook

Experimental results show that fine-tuning on UnifiedCrawl data yields large reductions in language-modeling perplexity and higher few-shot prompting scores, demonstrating that the approach effectively improves LLM performance in low-resource languages.
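For context, perplexity is the exponential of the average per-token negative log-likelihood on held-out text, so lower values are better. A minimal way to compute it for a causal LM is sketched below; the model name and evaluation text are placeholders.

```python
# Sketch: perplexity of a causal LM on a held-out text; lower is better.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/xglm-564M"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "..."                        # held-out text in the target language
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss).item()
print(perplexity)
```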

UnifiedCrawl offers a promising way to improve the performance of LLMs in low-resource languages while minimizing costs and computational effort. Future research could focus on extending the approach to more languages and investigating its combination with other techniques for improving LLM performance. Providing more extensive and high-quality training data for low-resource languages is an important step towards democratizing access to powerful language models and promoting linguistic diversity in the field of Artificial Intelligence.

Sources:
- Tessema, B. M., Kedia, A., & Chung, T.-S. (2024). UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages. arXiv preprint arXiv:2411.14343.
- Joshi, R., Singla, K., Kamath, A., Kalani, R., Paul, R., Vaidya, U., ... & Long, E. (2024). Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus. arXiv preprint arXiv:2410.14815.
- Gurgurov, D., Hartmann, M., & Ostermann, S. (2024). Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters. arXiv preprint arXiv:2407.01406.
- https://www.chatpaper.com/chatpaper/zh-CN/paper/84176
- https://chatpaper.com/chatpaper/ja/paper/84176
- https://www.youtube.com/watch?v=IrIqKRMJCwc
- https://www.researchgate.net/publication/381882630_Adapting_Multilingual_LLMs_to_Low-Resource_Languages_with_Knowledge_Graphs_via_Adapters
- https://paperswithcode.com/datasets?task=language-modelling&mod=texts
- https://github.com/orgs/commoncrawl/repositories