December 9, 2024

APOLLO: A Memory-Efficient Optimizer for Large Language Model Training

Memory-Efficient Training of Large Language Models with APOLLO

Training large language models (LLMs) is notoriously memory-intensive, especially with the widely used AdamW optimizer. Its high memory demand forces practitioners either to use more (or more powerful) GPUs or to reduce batch sizes, which limits the scalability and throughput of training. Various memory-efficient optimizers have therefore been developed to reduce the optimizer's memory footprint.

However, these optimizers face challenges:

- They depend on computationally intensive SVD operations.
- Memory efficiency often comes at the expense of performance compared to AdamW.
- To achieve competitive performance, they still require significant memory.

A new research paper now presents a promising approach: APOLLO (Approximated Gradient Scaling for Memory-Efficient LLM Optimization). The core idea of APOLLO is based on the observation that the learning rate adaptation rule of AdamW can be effectively simplified into a structured learning rate update.
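In rough terms, the shift can be written as follows. The notation below is illustrative and not taken verbatim from the paper: AdamW rescales every gradient entry individually using its moment estimates, whereas a structured update applies one scaling factor per channel (or per tensor) to the raw gradient, so far less state needs to be stored.

```latex
% AdamW: element-wise adaptive learning rate built from the moment
% estimates \hat{m}_t and \hat{v}_t (full-size optimizer states).
\[
  W_{t+1} = W_t - \eta \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]
% Structured alternative: a coarse, channel-wise scaling factor s_t applied
% to the raw gradient G_t, so only enough state to estimate s_t is kept.
\[
  W_{t+1} = W_t - \eta \,\bigl(s_t \odot G_t\bigr)
\]
```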

APOLLO: Functionality and Advantages

APOLLO approximates the channel-wise learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO very tolerant to further memory reductions while achieving comparable pre-training performance. Even the rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with memory costs at the SGD level.
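The following minimal Python sketch illustrates this kind of update for a single weight matrix. It is not the official implementation: the projection rank, the moment hyperparameters, and the channel-wise norm ratio used to form the scaling factor are assumptions chosen to mirror the description above.

```python
# Sketch of an APOLLO-style step (illustrative, not the paper's code):
# track AdamW-like moments only for a randomly projected, low-rank view of
# the gradient, derive a per-channel scaling factor from it, and apply that
# factor to the full-rank gradient.
import numpy as np

def apollo_style_step(W, grad, state, lr=1e-3, rank=4,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    n, m = grad.shape
    if "P" not in state:
        # Fixed random projection to a rank-dimensional space (no SVD needed).
        rng = np.random.default_rng(0)
        state["P"] = rng.standard_normal((rank, n)) / np.sqrt(rank)
        state["m"] = np.zeros((rank, m))   # first-moment estimate (low-rank)
        state["v"] = np.zeros((rank, m))   # second-moment estimate (low-rank)
        state["t"] = 0

    # Optimizer states are rank x m instead of n x m.
    R = state["P"] @ grad
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    R_update = m_hat / (np.sqrt(v_hat) + eps)

    # Channel-wise scaling: how much the adaptive rule would stretch
    # each column of the projected gradient.
    scale = np.linalg.norm(R_update, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Apply the structured (per-channel) learning rate to the raw gradient.
    W -= lr * grad * scale[None, :]
    return W
```

With rank=1, the stored states shrink to two length-m vectors per weight matrix, which corresponds to the APOLLO-Mini-style extreme of tensor-wise scaling.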

Extensive experiments show that the APOLLO series performs on par with or better than AdamW while achieving greater memory savings by virtually eliminating AdamW's optimizer states. These savings offer significant system-level advantages:

- Increased Throughput: Compared to AdamW, throughput can be tripled on an 8xA100-80GB setup by supporting 4x larger batch sizes.
- Improved Model Scalability: Enables pre-training of LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations.
- Pre-Training on Low-End GPUs: Enables pre-training of LLaMA-7B on a single GPU with less than 12 GB of memory when combined with weight quantization.
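As a rough illustration of where the savings come from (illustrative numbers following the sketch above, not figures reported in the paper), compare AdamW's full-size optimizer states with rank-1 low-rank states:

```python
# Back-of-the-envelope arithmetic (illustrative, assumes the state layout of
# the sketch above): AdamW keeps two full-size FP32 states per parameter,
# while a rank-1 scheme keeps two rank x m states per weight matrix.
bytes_per_fp32 = 4

# AdamW states for a 7B-parameter model
n_params = 7e9
adamw_gb = 2 * n_params * bytes_per_fp32 / 1e9
print(f"AdamW optimizer states : ~{adamw_gb:.0f} GB")    # ~56 GB

# Rank-1 states for ~224 weight matrices with 11008 columns (rough layout)
n_matrices, m_cols, rank = 224, 11008, 1
rank1_gb = n_matrices * 2 * rank * m_cols * bytes_per_fp32 / 1e9
print(f"Rank-1 low-rank states : ~{rank1_gb:.2f} GB")    # ~0.02 GB
```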

Outlook and Significance for AI Development

APOLLO effectively addresses the memory issues in training large language models while offering high performance. The results achieved are promising and could significantly facilitate the development and deployment of LLMs, especially in the context of resource-constrained environments. The improved scalability and increased throughput can shorten training times and accelerate the development of new, even larger models. This is an important step towards broader accessibility and application of AI technologies.

Bibliography:
- https://www.reddit.com/r/MachineLearning/comments/16cgukc/rd_hey_lomo_paper_authors_does_sgd_have_optimizer/
- https://openreview.net/forum?id=WwKv20NrsfB
- https://arxiv.org/abs/2009.13586
- https://github.com/XuezheMax/apollo/issues/1
- https://arxiv.org/html/2412.00071v1
- https://openreview.net/pdf?id=rnFOPhTMB0Y
- https://aclanthology.org/2024.lrec-main.122.pdf
- https://www.researchgate.net/publication/372201106_The_performance_analysis_of_Adam_and_SGD_in_image_classification_and_generation_tasks
- https://www.chatpaper.com/chatpaper/zh-CN?id=5&date=1733673600&page=1
- https://github.com/XuezheMax/apollo