Scaling compute at inference time is a crucial lever for the performance of large language models (LLMs), especially on complex tasks such as action planning for language assistants. A central question is how to achieve the best results with a limited inference compute budget. This article surveys several approaches to optimizing test-time computation and discusses their advantages and disadvantages.
A promising approach is to use the LLM itself as a world model. For example, GPT-4 can be prompted to predict the effect of an action on a webpage before that action is executed. This not only improves performance but also helps with safety and efficiency: by simulating candidate actions inside the world model, the language assistant can make more informed decisions, discard unpromising plans, and avoid risky or irreversible steps.
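As a rough illustration, the sketch below shows what such a world-model query could look like. The prompt format, the state representation, and the `llm` callable are all illustrative assumptions, not a documented API of GPT-4 or of the cited work.

```python
# Hypothetical sketch: using an LLM as a world model for a web agent.
# `llm` stands in for any chat-completion callable (e.g., a GPT-4 client).
from typing import Callable

def predict_next_state(llm: Callable[[str], str], page_state: str, action: str) -> str:
    """Ask the LLM to simulate what the page would look like after `action`."""
    prompt = (
        "You are a world model for a web browser.\n"
        f"Current page (simplified accessibility tree):\n{page_state}\n\n"
        f"Proposed action: {action}\n"
        "Describe the most likely state of the page after this action."
    )
    return llm(prompt)

def looks_safe(llm: Callable[[str], str], predicted_state: str) -> bool:
    """Crude safety check on the simulated outcome before acting for real."""
    verdict = llm(
        "Does the following predicted page state describe an irreversible or harmful "
        f"outcome (e.g., a completed purchase or a deletion)? Answer YES or NO.\n{predicted_state}"
    )
    return verdict.strip().upper().startswith("NO")
```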
Optimizing test-time computation can be divided into two main mechanisms: modifying the proposal distribution (the Proposer) and selecting the best completion (the Verifier). The Proposer generates candidate responses, for example through parallel sampling or sequential revisions, while the Verifier scores the candidates and selects the best answer.
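The split can be written down as two small interfaces. The names below are illustrative, not terminology from the cited paper; they simply make explicit what each component consumes and produces.

```python
# Abstract sketch of the two mechanisms as minimal interfaces.
from typing import List, Protocol

class Proposer(Protocol):
    def __call__(self, question: str, n: int) -> List[str]:
        """Draw n candidate completions (e.g., parallel samples or sequential revisions)."""
        ...

class Verifier(Protocol):
    def __call__(self, question: str, answer: str) -> float:
        """Score a candidate answer; higher means more likely to be correct."""
        ...
```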
A simple method for scaling test-time compute is best-of-N sampling: the LLM generates N outputs in parallel, and the candidate rated highest by a verifier is selected. This baseline can, however, be improved by more sophisticated strategies.
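A minimal best-of-N sketch, assuming `proposer` and `verifier` follow the interfaces above; both are hypothetical stand-ins for a real LLM sampling call and a reward model.

```python
from typing import Callable, List

def best_of_n(
    proposer: Callable[[str, int], List[str]],
    verifier: Callable[[str, str], float],
    question: str,
    n: int = 16,
) -> str:
    candidates = proposer(question, n)               # N independent samples, e.g. at temperature > 0
    scores = [verifier(question, c) for c in candidates]
    return candidates[scores.index(max(scores))]     # keep the highest-rated candidate
```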
An alternative to best-of-N sampling is iterative self-refinement, in which the LLM is prompted to revise its initial response step by step. This sequential approach can be more efficient than parallel sampling on simpler problems, where the model's first answer is already roughly correct and only needs polishing. Iterative self-refinement lets the model detect and correct its own mistakes and gradually improve its response.
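A sketch of sequential refinement under a fixed budget; the revision prompt and the `llm` callable are illustrative assumptions, not the exact setup of the cited paper.

```python
from typing import Callable

def self_refine(llm: Callable[[str], str], question: str, steps: int = 3) -> str:
    answer = llm(question)                           # initial attempt
    for _ in range(steps):
        revision_prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Identify any mistakes in the previous answer and write an improved answer."
        )
        answer = llm(revision_prompt)                # each revision conditions on the last attempt
    return answer
```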
For more complex problems that require searching over different solution approaches, process-based reward models (PRMs) combined with search algorithms can be more effective. A PRM scores the individual steps of a response, and a search algorithm such as beam search or tree search can use these step-level scores to explore the solution space efficiently. This allows the LLM to pursue several strategies and commit to the most promising one.
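The sketch below shows one way PRM-guided beam search over reasoning steps could look. The `extend` and `prm_score` callables, the step format, and the default widths are assumptions made for illustration.

```python
from typing import Callable, List

def prm_beam_search(
    extend: Callable[[str, List[str], int], List[str]],   # propose candidate next steps from the LLM
    prm_score: Callable[[str, List[str]], float],          # process reward for a partial solution
    question: str,
    beam_width: int = 4,
    expansions: int = 4,
    max_steps: int = 8,
) -> List[str]:
    beams: List[List[str]] = [[]]                # each beam is the list of steps taken so far
    for _ in range(max_steps):
        candidates: List[List[str]] = []
        for steps in beams:
            for next_step in extend(question, steps, expansions):
                candidates.append(steps + [next_step])
        # keep only the partial solutions the PRM rates highest
        candidates.sort(key=lambda s: prm_score(question, s), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]                              # best-scoring chain of steps
```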
The effectiveness of these approaches depends heavily on the difficulty of the problem. This motivates an adaptive, "compute-optimal" scaling strategy in which the approach is chosen per problem. The difficulty of a question can be estimated from the model's own perspective, for example from its success rate on a handful of cheap samples, and the remaining budget is then spent on the strategy that works best in that difficulty regime. Allocating compute adaptively in this way can significantly improve the efficiency of test-time scaling.
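An illustrative routing sketch: estimate difficulty from a quick pass rate, then pick a strategy. The thresholds and the `quick_pass_rate` helper are invented for this example and are not values from the cited paper.

```python
from typing import Callable

def choose_strategy(quick_pass_rate: Callable[[str], float], question: str) -> str:
    p = quick_pass_rate(question)      # e.g., fraction of a few cheap samples judged correct
    if p > 0.7:
        return "self_refinement"       # easy: the first answer is usually close, so revise it
    if p > 0.2:
        return "best_of_n"             # medium: parallel sampling plus a verifier
    return "prm_tree_search"           # hard: search over intermediate steps with a PRM
```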
It is also important to understand to what extent test-time compute can substitute for additional pretraining. Comparing a smaller model given extra inference compute with a larger model used without it shows that, for simpler questions and a limited inference budget, scaling test-time compute can be the better trade-off. For more complex questions, or when a larger inference workload is expected, pretraining a larger model can be more efficient.
Optimizing inference processes in LLMs is a complex and dynamic field of research. The choice of the optimal strategy depends on various factors, including the difficulty of the problem, the available compute budget, and the specific requirements of the application. By combining advanced techniques such as iterative self-refinement, process-based reward models, and compute-optimal scaling, the performance of LLMs can be significantly enhanced while optimizing computational costs.
Bibliography:
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters. arXiv preprint arXiv:2408.03314.
Vetterle, J. (2024, October 20). Scaling LLM Test Time Compute [Blog post]. https://www.jonvet.com/blog/llm-test-time-compute