OpenAI has developed a new benchmark called MLE-bench to evaluate how well AI agents can develop machine learning solutions. The benchmark includes 75 Kaggle competitions and aims to measure the progress of autonomous AI systems in the field of ML engineering.
MLE-bench is built around two core ideas: selecting challenging tasks that are representative of current ML engineering work, and comparing agent results against human performance drawn from the public Kaggle leaderboards.
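As a rough illustration of that human comparison, the sketch below places an agent's score on a competition's final leaderboard and assigns a Kaggle-style medal. The percentile thresholds and function name are simplified assumptions for illustration; MLE-bench's actual grading logic may differ.

```python
import random

def medal_for_score(agent_score: float, leaderboard: list[float],
                    higher_is_better: bool = True) -> str | None:
    """Place an agent's score on a competition's final leaderboard and assign
    a Kaggle-style medal. Thresholds here are simplified assumptions
    (gold: top 1%, silver: top 5%, bronze: top 10%)."""
    # Rank = 1 + number of human entries that strictly beat the agent.
    if higher_is_better:
        rank = 1 + sum(1 for s in leaderboard if s > agent_score)
    else:
        rank = 1 + sum(1 for s in leaderboard if s < agent_score)
    percentile = rank / len(leaderboard)
    if percentile <= 0.01:
        return "gold"
    if percentile <= 0.05:
        return "silver"
    if percentile <= 0.10:
        return "bronze"
    return None

# Example: an agent score of 0.92 against 1,000 simulated human scores.
random.seed(0)
humans = [random.gauss(0.80, 0.05) for _ in range(1000)]
print(medal_for_score(0.92, humans))  # medal depends on where 0.92 lands
```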
The 75 competitions cover various areas, including Natural Language Processing, Computer Vision, and Signal Processing. Many tasks have real-world applications, such as predicting the degradation of COVID-19 mRNA vaccines or decoding ancient scrolls.
OpenAI tested various AI models and agent frameworks on MLE-bench. The best performer was the o1-preview model combined with the AIDE framework, which achieved at least a bronze medal in 16.9% of the competitions, ahead of Anthropic's Claude 3.5 Sonnet.
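Agent frameworks such as AIDE typically wrap the model in an iterate-and-evaluate loop: draft a solution, run it, score it on held-out data, and refine the best candidate. The following is a hypothetical, heavily simplified version of such a loop; `propose_solution` and `evaluate` are stand-ins, not the AIDE API.

```python
import random

def propose_solution(task: str, feedback: str | None) -> str:
    """Stand-in for an LLM call that drafts or revises training code.
    A real agent would prompt a model such as o1-preview here."""
    return f"# candidate solution for {task!r}, revised using: {feedback}"

def evaluate(solution: str) -> float:
    """Stand-in for executing the candidate and scoring it on a validation split."""
    return random.random()  # toy validation score in [0, 1]

def agent_loop(task: str, steps: int = 5) -> tuple[str, float]:
    """Greedy refinement: keep the best-scoring candidate seen so far."""
    best_code, best_score, feedback = "", float("-inf"), None
    for _ in range(steps):
        code = propose_solution(task, feedback)
        score = evaluate(code)
        if score > best_score:
            best_code, best_score = code, score
        feedback = f"last score {score:.3f}; best so far {best_score:.3f}"
    return best_code, best_score

print(agent_loop("predict mRNA degradation")[1])
```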
The researchers also investigated how different scaling strategies affect agent performance. Allowing more attempts per competition significantly improved success rates: with 8 attempts, o1-preview's medal rate doubled to 34.1%. Longer runtimes also helped, with GPT-4o's medal rate rising from 8.7% to 11.8% when the time limit was extended from 24 to 100 hours. Additional GPU power, however, had little impact on performance.
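The effect of extra attempts is usually reported as a pass@k-style metric. A common, unbiased way to estimate it from n recorded attempts of which c succeeded is shown below; this is a sketch of the standard estimator, and the MLE-bench paper may compute its medal rates differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts, drawn from n total attempts with c successes, succeeds.
    Here 'success' would mean earning at least a bronze medal."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: if 3 of 16 recorded attempts on a competition
# earned a medal, the estimated medal rate with 8 attempts is:
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```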
When creating MLE-bench, OpenAI faced challenges such as potential data contamination, since the Kaggle competitions and their winning solutions are publicly available. To counteract this, the company used a plagiarism detector to compare the agents' submissions with top Kaggle solutions and ran experiments to check the impact of contamination.
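A minimal sketch of the idea behind such a check: compare each agent submission against publicly available reference solutions and flag near-duplicates for review. This uses Python's standard-library difflib purely for illustration; it is not the plagiarism detector OpenAI used.

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Crude text-level similarity in [0, 1] between two source files."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

def flag_suspicious(submission: str, reference_solutions: list[str],
                    threshold: float = 0.8) -> list[tuple[int, float]]:
    """Return (index, score) pairs for reference solutions the submission
    closely matches, signalling possible copying of a public solution."""
    hits = []
    for i, ref in enumerate(reference_solutions):
        score = similarity(submission, ref)
        if score >= threshold:
            hits.append((i, score))
    return hits

# Example: an agent submission nearly identical to a public top solution
# would be flagged for manual review.
public = ["import xgboost as xgb\nmodel = xgb.XGBRegressor()\nmodel.fit(X, y)"]
agent = "import xgboost as xgb\nmodel = xgb.XGBRegressor()\nmodel.fit(X, y)\n"
print(flag_suspicious(agent, public))
```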
OpenAI acknowledges that MLE-bench does not cover all aspects of AI research and development. The benchmark focuses on tasks with clear problem statements, clean datasets, and simple evaluation metrics. Challenges in the real world are often less clearly defined.
Despite these limitations, OpenAI sees MLE-bench as a valuable tool for evaluating core competencies in ML engineering. These include preparing large multimodal datasets, managing lengthy training processes, and debugging underperforming models.
The MLE-bench benchmark is available on GitHub.
o1-preview is an advanced AI model developed by OpenAI. It is a "Large Reasoning Model" (LRM), characterized by its ability to solve complex problems using "Chain-of-Thought" (CoT) reasoning. CoT lets the model generate and evaluate intermediate steps while solving a problem, similar to how a human would work through it, which leads to greater accuracy and problem-solving ability than traditional language models.
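For contrast, conventional chat models are often nudged into chain-of-thought explicitly via the prompt, whereas reasoning models like o1-preview generate these intermediate steps internally. A minimal sketch using the OpenAI Python SDK; the model names and prompt are illustrative, and access to o1-preview depends on the API account.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Explicit chain-of-thought prompting for a conventional chat model.
cot_response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": f"{question}\nThink step by step, then state the final answer.",
    }],
)
print(cot_response.choices[0].message.content)

# A reasoning model performs comparable intermediate reasoning internally,
# so the prompt can simply state the problem.
reasoning_response = client.chat.completions.create(
    model="o1-preview",  # illustrative; availability depends on API access
    messages=[{"role": "user", "content": question}],
)
print(reasoning_response.choices[0].message.content)
```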
o1-preview has achieved impressive results on various benchmarks, demonstrating strong capabilities in coding, math, and science, and even surpassing human experts on some tasks. It is still under development, however, and has limitations: it is slower and more expensive to use than other models and does not yet support all the features offered by models such as GPT-4o.
MLE-bench is a promising new benchmark for evaluating the capabilities of AI agents in the field of ML engineering. Initial results show that OpenAI's o1-preview is a leader in this area, but also that there is still much room for improvement. MLE-bench will help drive the development of autonomous AI systems that can solve complex problems in the real world.