A new study from University College London (UCL) shows that large language models (LLMs) can predict the outcomes of neuroscience studies more accurately than human experts. The research, published in Nature Human Behaviour, highlights the potential of LLMs to accelerate scientific discovery and optimize research planning.
The study compared the prediction accuracy of 15 different LLMs with that of 171 human experts in the field of neuroscience. The LLMs achieved an average accuracy of 81.4 percent, while the human experts averaged only 63.4 percent. Even the top-performing experts, the best 20 percent, reached only 66.2 percent accuracy. Notably, the study used older, open-source AI models rather than the latest versions from companies like Anthropic, Meta, or OpenAI, which suggests that current models such as GPT-4 or Gemini might perform even better on these tasks.
For the study, the researchers developed a new evaluation tool called "BrainBench." BrainBench consists of pairs of abstracts from neuroscience studies: one abstract in each pair is the original, while the other has been modified to present a plausible but incorrect result. Both the LLMs and the human experts were tasked with selecting the correct (i.e., original) abstract from the two options.
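In practice, a two-alternative task of this kind can be scored automatically by asking which version of an abstract the model finds less surprising. The following is a minimal sketch of such a perplexity-based comparison; the model choice and helper functions are illustrative, not the study's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model works for this sketch; the study's models differ.
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean per-token negative log-likelihood under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def choose_original(abstract_a: str, abstract_b: str) -> str:
    """Pick the version the model finds less surprising (lower perplexity)."""
    return abstract_a if perplexity(abstract_a) < perplexity(abstract_b) else abstract_b
```

Scoring by perplexity has the advantage of requiring no task-specific training: the model simply assigns a likelihood to each candidate abstract, and the more plausible one wins.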
The researchers additionally fine-tuned an existing LLM (Mistral 7B) on neuroscience literature to create a specialized model called "BrainGPT." BrainGPT raised prediction accuracy further, to 86 percent, demonstrating that adapting LLMs to specific scientific domains can improve their performance.
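As a rough illustration of what such domain adaptation can look like, the sketch below fine-tunes a Mistral 7B base model with LoRA adapters, a parameter-efficient approach. The dataset file, hyperparameters, and output path are placeholders, not the settings used for BrainGPT.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Wrap the base model with low-rank adapters; only these small matrices train.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical corpus of neuroscience papers with a "text" column.
corpus = load_dataset("json", data_files="neuroscience_corpus.jsonl")["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="braingpt-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter weights are updated, this kind of fine-tuning is feasible on a single GPU, which is part of what makes domain-specialized models like BrainGPT practical to build.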
The AI systems demonstrated superior performance across all tested areas of neuroscience. They performed particularly well when integrating information across entire abstracts, linking methodology and background with the results. The researchers also ensured that the AI wasn't simply memorizing answers, running dedicated checks for whether the models had already seen the test items during training. The results suggest that AI models process scientific articles much as humans do, forming general patterns and frameworks rather than memorizing details.
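One common way to run such a memorization check, borrowed from the training-data-extraction literature, is to compare a model's perplexity on a passage against a model-free measure of the passage's complexity, such as its zlib-compressed size: memorized text looks unusually easy to the model relative to its raw complexity. The sketch below reuses the perplexity() helper from the scoring example; the score and threshold are illustrative and not necessarily the paper's exact procedure.

```python
import math
import zlib

def zlib_entropy(text: str) -> float:
    """Compressed size in bits: a crude, model-free complexity estimate."""
    return 8.0 * len(zlib.compress(text.encode("utf-8")))

def memorization_score(text: str) -> float:
    """Model log-perplexity relative to generic compressibility.
    Unusually low values suggest the model finds the text easier than its
    raw complexity warrants, which can indicate memorization."""
    return math.log(perplexity(text)) / zlib_entropy(text)

def looks_memorized(text: str, threshold: float = 1e-4) -> bool:
    # The threshold is a placeholder; in practice it would be calibrated
    # against passages known to be absent from the training data.
    return memorization_score(text) < threshold
```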
The findings point to a significant shift in how future scientific research could be planned and executed. LLMs could help researchers estimate the likelihood of different outcomes for planned experiments, improving decision-making in experimental design. However, the researchers also note potential drawbacks. Scientists might be tempted to skip studies where AI predictions deviate from their hypotheses, even though unexpected results often lead to important breakthroughs. They also caution that results the AI predicts with high certainty might be dismissed as obvious or uninteresting.
The ability of LLMs to accurately predict neuroscience research outcomes opens new possibilities for scientific discovery. While further research is necessary to understand the long-term implications of this technology, the results suggest that LLMs could play a valuable role in accelerating scientific progress.