A new study has uncovered significant weaknesses in the reasoning abilities of AI language models, particularly in smaller, more cost-effective models. These models struggle with chained mathematical problems at the elementary school level.
Researchers from the Mila Institute, Google DeepMind, and Microsoft Research investigated how well various AI language models could solve chained grade-school math word problems. They developed a test called "Compositional GSM," which combines two problems from the GSM8K dataset so that the answer to the first problem becomes a variable in the second.
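The sketch below illustrates how such a composed item could be assembled from two GSM8K-style problems. The field names, placeholder convention, and example problems are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: chaining two GSM8K-style problems into one
# "compositional" item. Field names and the placeholder convention
# are assumptions for demonstration, not the study's code.

def compose(problem1: dict, problem2_template: dict) -> dict:
    """Build a two-step item: the answer to problem 1 becomes
    the value of a variable X referenced in problem 2."""
    question = (
        f"Q1: {problem1['question']}\n"
        f"Let X be the answer to Q1.\n"
        f"Q2: {problem2_template['question_with_X']}"
    )
    # The model only sees `question`; the gold answer to Q2 is obtained
    # by substituting problem 1's answer for X.
    final_answer = problem2_template["answer_fn"](problem1["answer"])
    return {"question": question, "answer": final_answer}


p1 = {
    "question": "A baker makes 4 trays of 6 muffins each. How many muffins is that?",
    "answer": 24,
}
p2 = {
    "question_with_X": "A school buys X muffins and shares them equally among "
                       "8 classes. How many muffins does each class get?",
    "answer_fn": lambda x: x // 8,
}

item = compose(p1, p2)
print(item["question"])
print("Expected final answer:", item["answer"])  # 3
```

A model only gets full credit on such an item if it carries the intermediate result through correctly, which is exactly the chaining step the benchmark is designed to probe.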
The results show that many models performed significantly worse than expected on these more complex reasoning tasks. This "reasoning gap" is particularly pronounced in smaller, cheaper models, and even in those specialized for mathematics.
"Our results demonstrate a significant reasoning gap in most LLMs, i.e., a performance difference between solving the composed pairs and solving each question independently," explain the authors, led by Arian Hosseini of the Mila Institute.
While smaller models often perform similarly to larger models on standard math tests like GSM8K, they exhibit a reasoning gap two to twelve times larger on the new Compositional GSM test. For example, GPT-4o mini lags far behind GPT-4o on the new test despite being almost on par on the original benchmark. Similar patterns were observed in other model families such as Gemini and Llama 3.
The researchers suspect that smaller models, while able to recognize superficial patterns in common tasks, struggle to apply this knowledge in new contexts. Current training methods for these models may be too focused on optimizing for standard benchmarks at the expense of general reasoning ability.
Even specialized mathematics models showed weaknesses. For instance, Qwen2.5-Math-7B-IT achieves an accuracy of over 80% on difficult high-school-level problems but solves less than 60% of the chained elementary-school problems correctly.
The study also investigated the impact of instruction tuning, a method for refining language models. For small models, instruction tuning significantly improved performance on the original GSM8K test but only marginally improved it on Compositional GSM. Larger models did not show this discrepancy, suggesting fundamental differences in how smaller models learn and generalize.
The study is not entirely up-to-date, as OpenAI's new reasoning-optimized o1 model was not tested. A recent planning benchmark showed that o1, while much better at planning, still makes serious errors.
A mathematics professor recently demonstrated that while o1 could complete a mathematical proof that other LLMs had previously failed at, a human solved the problem faster and more elegantly. Google's Gemini models are also said to perform better on math problems after recent updates.
The researchers emphasize that current evaluation methods have obscured these systematic differences, potentially leading to an overestimation of the capabilities of smaller models. They call for a reassessment of development strategies for cost-effective AI systems and question whether these models have inherent limitations in complex reasoning and generalization. This could have significant implications for their practical applications.
The results also challenge recent claims about efficiency gains in AI. While some argue that language models have become more efficient rather than more capable, and that scaling these efficient models could lead to significant performance improvements, this study suggests otherwise.
The authors emphasize that their goal was not to create yet another benchmark. Instead, they consider their work a case study that provides deeper insights into the workings and limitations of current AI systems. By chaining tasks, they test whether models can flexibly apply and combine learned knowledge, a capability they see as separating true understanding from superficial pattern recognition.
The researchers hope that their methodology can also be applied to other areas and benchmarks to obtain a more comprehensive picture of AI capabilities. This approach could reveal hidden weaknesses in AI systems that might go unnoticed in simpler, isolated tests.
The study adds to the existing evidence for logical weaknesses in language models. Previous research has shown that LLMs struggle with basic logical inferences and simple planning puzzles, despite achieving high scores on popular logic and math benchmarks.