Artificial intelligence (AI) has made remarkable progress in reasoning in recent years. Large language models (LLMs) demonstrate impressive capabilities in solving mathematical and programming problems. Despite these advances, they still struggle with more complex tasks, such as combinatorics problems from the International Mathematical Olympiad (IMO), puzzles from the Abstraction and Reasoning Corpus (ARC), and questions from "Humanity's Last Exam" (HLE).
A new research approach based on diverse inference and verification promises to overcome these hurdles. It combines multiple models and methods at test time to improve the accuracy and efficiency of reasoning. A core element is the automatic verification of candidate solutions: for IMO problems, correctness is checked with the Lean proof assistant, while for ARC puzzles, verification is done by executing code against the puzzle's examples.
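The test-time loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical stand-ins, and in the actual pipeline the "methods" would be LLM calls and the verifier would be Lean or code execution.

```python
# Hypothetical sketch of diverse inference with verification.
# Each "method" stands in for one model or inference strategy;
# "verify" stands in for Lean proof checking or code execution.

def method_a(problem):
    # Stand-in for one inference method (e.g., one model or prompt style).
    return problem * 2

def method_b(problem):
    # Stand-in for a second, different method.
    return problem + problem

def verify(problem, candidate):
    # Stand-in verifier; in the paper this role is played by Lean
    # (for IMO proofs) or by running candidate code on ARC examples.
    return candidate == problem * 2

def diverse_inference(problem, methods, verify):
    """Run diverse methods at test time; return the first verified answer."""
    for method in methods:
        candidate = method(problem)
        if verify(problem, candidate):
            return candidate
    return None  # no candidate survived verification

answer = diverse_inference(3, [method_a, method_b], verify)
```

The key design point is that verification, not generation, decides which answer is accepted: adding more diverse methods can only help, because unverified candidates are discarded.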
The approach shows promising results. Accuracy on IMO combinatorics problems increased from 33.3% to 77.8%, and on HLE questions from 8% to 37%. It also solved 80% of the ARC puzzles that 948 humans could not solve, and 26.5% of the ARC puzzles that even powerful LLMs like o3 could not. These results underscore the potential of diverse inference and verification to expand the boundaries of machine reasoning.
The researchers are also investigating test-time simulations, reinforcement learning, and meta-learning with inference feedback to improve the models' generalization ability. By adapting graph representations and varying prompts, code, and datasets, the system can be tuned to new and unseen problems. This adaptive approach lets the models learn from experience and improve their performance over time.
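One simple form of test-time adaptation with feedback can be sketched as follows. The example is purely illustrative and assumes a generic setup (all names are hypothetical): several variants of a solver are tried, each is scored against the known examples of a task (as ARC tasks provide solved training pairs), and the best-scoring variant is kept.

```python
# Illustrative sketch of test-time adaptation via feedback (not the
# paper's implementation): score each variant on known input/output
# pairs and keep the one that matches the most of them.

def adapt(task_examples, variants, solve):
    """Return the variant whose solver output matches the most examples."""
    def score(variant):
        return sum(1 for x, y in task_examples if solve(variant, x) == y)
    return max(variants, key=score)

# Toy usage: variants are multipliers; the "solver" applies one to the input.
examples = [(1, 3), (2, 6)]  # input/output pairs with known answers
best = adapt(examples, [2, 3, 4], lambda v, x: v * x)
```

In the real setting, "variants" would be different prompts, code transformations, or augmented datasets, and the feedback signal would come from verification rather than exact matching.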
Diverse inference and verification is characterized by reliability, robustness, and scalability. Combining different models and methods makes it possible to leverage the strengths of each approach while compensating for its weaknesses, yielding a system that delivers good results even on complex and demanding tasks.
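When no automatic verifier is available, one common way (sketched here as an assumption, not a claim about the paper's exact aggregation rule) to let an ensemble compensate for individual weaknesses is majority voting over the candidate answers:

```python
# Illustrative majority-vote aggregation over diverse candidate answers.
# Candidates that produced no answer (None) are ignored.
from collections import Counter

def majority_vote(answers):
    """Return the most common non-None answer, or None if there is none."""
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None

vote = majority_vote([42, 42, 7, None])
```

A single weak method can be outvoted as long as the majority of methods agree, which is one concrete sense in which diversity buys robustness.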
In the spirit of reproducible research, the authors plan to make their approach publicly available after publication. This will allow other researchers to verify the results, further develop the approach, and apply it to new areas of application. The public availability of the approach contributes to the transparency and progress of AI research.
Diverse inference and verification represents a promising approach for advanced reasoning. By combining different models and methods, verifying solutions, and applying learning techniques, complex tasks that were previously inaccessible to AI systems can be solved. This approach has the potential to advance the development of AI systems and open up new possibilities in various application areas.
Bibliography:
- https://huggingface.co/papers/2502.09955
- https://arxiv.org/abs/2502.09955
- https://arxiv.org/abs/2501.11651
- https://arxiv.org/abs/2410.05318
- https://huggingface.co/akhaliq/activity/all
- https://openreview.net/forum?id=ZsP3YbYeE9
- https://github.com/zchuz/CoT-Reasoning-Survey
- https://www.pnas.org/doi/10.1073/pnas.0403723101
- https://aclanthology.org/2024.naacl-long.52.pdf
- https://github.com/dair-ai/ML-Papers-of-the-Week
- https://www.researchgate.net/publication/362969232_Diversity-driven_automated_formal_verification