The release of GPT-4 in March 2023 marked another milestone in the development of large language models. Yet while GPT-4's performance impressed, details of its training process remained undisclosed for a long time. Leaked information and expert analyses now paint a more detailed picture and hint at the enormous resources required to train such a model.
A widespread myth holds that training a model the size of GPT-4 inevitably requires state-of-the-art hardware. Simulations by Epoch AI, however, show that even graphics cards from 2012, such as the GTX 580, could in principle have handled the training, albeit at a significantly higher price. The decisive factor is hardware utilization: the fraction of a GPU's theoretical peak floating-point throughput (FLOP/s) that a training run actually achieves.
Epoch AI's findings show that, on the same hardware, this utilization tends to drop as model size grows. Newer GPUs such as the H100 can sustain higher utilization as training runs grow larger, while older GPUs such as the V100 lose efficiency more sharply at scale. The simulations suggest that training GPT-4 with 2012-era technology would have been roughly ten times as expensive as with modern hardware.
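To make the role of utilization concrete, here is a minimal back-of-the-envelope sketch (not Epoch AI's actual methodology) of how the GPU-hours, and hence the cost, of a fixed compute budget scale inversely with the utilization achieved on a given GPU. All figures in the example, the ~2e25 FLOP budget, the A100 peak throughput, the utilization levels, and the $1-per-GPU-hour price, are illustrative assumptions.

```python
def gpu_hours_and_cost(total_flop, peak_flop_per_s, utilization, price_per_gpu_hour):
    """Back-of-the-envelope: GPU-hours needed to deliver a fixed compute budget,
    and the resulting rental cost, given peak throughput and achieved utilization."""
    gpu_seconds = total_flop / (peak_flop_per_s * utilization)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours, gpu_hours * price_per_gpu_hour

# Illustrative only: a ~2e25 FLOP budget (within the reported 1e25-1e26 range)
# on an A100-class GPU (~312e12 FLOP/s peak BF16) at an assumed $1 per GPU-hour,
# evaluated at two hypothetical utilization levels.
for util in (0.35, 0.15):
    hours, cost = gpu_hours_and_cost(2e25, 312e12, util, 1.0)
    print(f"utilization {util:.0%}: {hours:,.0f} GPU-hours, ~${cost:,.0f}")
```

Halving the achieved utilization roughly doubles the GPU-hours and therefore the cost, which is why older hardware whose utilization degrades at scale becomes disproportionately expensive.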
Estimates suggest that training GPT-4 required between 1e25 and 1e26 floating-point operations (FLOP) in total. Reports indicate that the training ran on around 25,000 A100 GPUs for 90 to 100 days. Assuming a cloud price of about 1 US dollar per A100 per hour, the total cost of this training run comes to roughly 63 million US dollars.
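The headline figure can be sanity-checked with simple multiplication. The sketch below only illustrates the reported numbers; it is not an official cost breakdown.

```python
# Rough reconstruction of the reported cost estimate (illustrative, not official figures).
num_gpus = 25_000          # reported number of A100 GPUs
days = 95                  # midpoint of the reported 90-100 day training window
price_per_gpu_hour = 1.00  # assumed cloud price in US dollars

gpu_hours = num_gpus * days * 24
total_cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ~${total_cost / 1e6:.0f} million")
# 57,000,000 GPU-hours -> ~$57 million, in the same ballpark as the ~$63 million estimate.
```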
The Mixture-of-Experts (MoE) architecture, in which different "experts" specialize in different aspects of the data, improves training efficiency but also adds complexity. During inference, i.e. when the trained model is applied, only the experts relevant to a given input are activated while the rest stay idle. This lowers hardware utilization and pushes inference costs to roughly three times those of a smaller model such as Davinci.
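To illustrate the mechanism, here is a minimal sketch of top-k expert routing as it is used in MoE layers in general. The layer width, the 16 experts, and the top-2 routing are illustrative assumptions, not confirmed details of GPT-4.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=2):
    """Minimal Mixture-of-Experts layer: route each token to its top_k experts
    and combine their outputs, weighted by softmaxed router scores.
    x: (tokens, d_model); expert_weights: (n_experts, d_model, d_model)."""
    scores = x @ router_weights                        # (tokens, n_experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]      # indices of the top_k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = scores[t, top[t]]
        gate = np.exp(sel - sel.max())
        gate /= gate.sum()                             # softmax over the selected experts only
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ expert_weights[e])   # only top_k experts do any work per token
    return out

# Toy usage: 4 tokens, model width 8, 16 experts, 2 active per token.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
experts = rng.normal(size=(16, 8, 8))
router = rng.normal(size=(8, 16))
print(moe_layer(x, experts, router).shape)  # (4, 8): each token touched only 2 of 16 experts
```

Because only a small subset of the experts processes any given token, most of the model's parameters sit idle at inference time, which is precisely the utilization problem described above.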
GPT-4 was trained on a massive dataset of approximately 13 trillion tokens: two epochs over the text data and four epochs over the code data. The batch size, i.e. the number of tokens processed per training step, was gradually ramped up to 60 million over the course of training. Because each token is routed to only a subset of the experts, the MoE architecture reduces the effective batch size per expert to about 7.5 million tokens.
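The gap between the global and the per-expert batch size follows directly from the routing: if each token activates only a fraction of the experts, each expert sees only that fraction of the tokens. The sketch below reproduces the figures, assuming for illustration the widely reported configuration of 16 experts with 2 active per token, which is consistent with the 60 million to 7.5 million ratio; the step count ignores the batch-size ramp-up.

```python
# Illustrative arithmetic based on the figures above; the expert counts are assumptions.
global_batch_tokens = 60_000_000   # tokens per training step after ramp-up
n_experts = 16                     # assumed total number of experts
experts_per_token = 2              # assumed number of experts activated per token

per_expert_batch = global_batch_tokens * experts_per_token / n_experts
print(f"effective batch per expert: {per_expert_batch:,.0f} tokens")  # 7,500,000

total_training_tokens = 13e12      # ~13 trillion tokens seen during training
steps = total_training_tokens / global_batch_tokens
print(f"approx. optimizer steps at full batch size: {steps:,.0f}")    # ~216,667
```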
Scaling AI models comes at enormous cost, and how future performance gains can be achieved without budgets spiraling remains an open question. Developing more efficient algorithms, tapping new data sources, and improving hardware utilization are central challenges for the future of AI research.
The development of GPT-4 illustrates that progress in AI cannot be achieved solely through ever-larger models. Optimizing the training process, efficient use of resources, and consideration of ethical aspects are equally important to responsibly harness the potential of artificial intelligence.
Bibliography:
https://www.reddit.com/r/singularity/comments/1bi8rme/jensen_huang_just_gave_us_some_numbers_for_the/
https://medium.com/codex/gpt-4-will-be-500x-smaller-than-people-think-here-is-why-3556816f8ff2
https://www.itaintboring.com/ai/i-got-a-lot-of-things-wrong-about-ai-along-the-way-its-time-to-start-putting-things-straight/
https://openai.com/index/ai-and-compute/
https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m
https://semianalysis.com/2024/09/04/multi-datacenter-training-openais/
https://forum.effectivealtruism.org/posts/bL3riEPKqZKjdHmFg/when-will-we-spend-enough-to-train-transformative-ai
https://maxluo.me/the-future-of-ai-is-expensive
https://en.wikipedia.org/wiki/Generative_pre-trained_transformer
https://www.ikangai.com/the-secrets-of-gpt-4-leaked/