A new study sheds interesting light on how language models evaluate each other. Researchers have found that advanced language models tend to rate other models more favorably when those models make mistakes similar to their own. This phenomenon could have far-reaching consequences for the development and deployment of AI systems.
To measure the similarity of errors between language models, the researchers developed a new tool called CAPA (Chance Adjusted Probabilistic Agreement). CAPA goes beyond mere accuracy and analyzes the extent to which models exhibit the same error patterns. This is particularly relevant when AI systems are used to evaluate and control other AI systems, as is the case with Mindverse and its customized AI solutions.
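To make the idea concrete, the following sketch computes a simplified, Cohen's-kappa-style chance-adjusted agreement between two models on the same test items. It only illustrates the principle behind CAPA: the published metric additionally works with the models' output probabilities, and the function name and interface here are assumptions, not the authors' implementation.

```python
import numpy as np

def chance_adjusted_agreement(preds_a, preds_b, labels):
    """Simplified chance-adjusted error agreement between two models.

    Observed agreement is the fraction of items where both models are
    right or both are wrong; expected agreement assumes independent
    errors given each model's accuracy. This is a sketch in the spirit
    of CAPA, not the exact published formula.
    """
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    correct_a = preds_a == labels
    correct_b = preds_b == labels

    # Observed agreement: both correct or both wrong on the same item.
    obs = np.mean(correct_a == correct_b)

    # Expected agreement if the two models erred independently,
    # given only their individual accuracies.
    acc_a, acc_b = correct_a.mean(), correct_b.mean()
    exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    # Chance-adjusted score: around 0 means no more agreement than the
    # accuracies alone predict, near 1 means largely identical errors.
    return (obs - exp) / (1 - exp + 1e-12)
```

In this simplified form, a value close to zero indicates that two models agree no more often than their individual accuracies would suggest, while values close to one point to largely shared error patterns.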
The results of the study show that language models acting as "reviewers" give better ratings to models with similar error patterns, even when actual performance is taken into account. This tendency is reminiscent of "affinity bias" in human recruiting, where applicants who are similar to the interviewer are preferred. In the context of AI systems, this preference could make it difficult to develop objective evaluation criteria.
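One way to probe such an affinity effect is to regress the judge's scores on both the judged models' actual accuracy and their error similarity to the judge. The sketch below uses purely illustrative placeholder numbers and is a hypothetical analysis, not the study's exact procedure.

```python
import numpy as np

# Hypothetical per-model data: ground-truth accuracy, CAPA-style similarity
# to the judge model, and the score the judge assigned. Real values would
# come from an evaluation run; these arrays are placeholders.
accuracy    = np.array([0.62, 0.71, 0.55, 0.80, 0.68])
similarity  = np.array([0.30, 0.45, 0.20, 0.55, 0.50])
judge_score = np.array([0.58, 0.70, 0.50, 0.82, 0.72])

# Regress judge scores on accuracy and similarity together. A clearly
# positive similarity coefficient would indicate that the judge favours
# models with error patterns like its own, beyond what accuracy explains.
X = np.column_stack([np.ones_like(accuracy), accuracy, similarity])
coef, *_ = np.linalg.lstsq(X, judge_score, rcond=None)
print(f"intercept={coef[0]:.3f}, accuracy={coef[1]:.3f}, similarity={coef[2]:.3f}")
```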
Interestingly, the study also found that stronger models learn more from weaker models when their error patterns differ significantly. This finding suggests that different models hold complementary knowledge. This aspect is particularly relevant for "weak-to-strong" training approaches, in which stronger models learn from data generated by weaker models.
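A rough, hypothetical proxy for this complementary knowledge is the share of items a weaker model gets right that the stronger model gets wrong. The sketch below assumes simple per-item correctness arrays and is not taken from the study itself.

```python
import numpy as np

def complementary_knowledge(strong_correct, weak_correct):
    """Fraction of items the weak model solves but the strong model misses.

    A rough proxy for how much new signal weak-to-strong training could
    transfer: if the two models' error patterns largely overlap, this
    number is small and there is little for the stronger model to learn.
    """
    strong_correct = np.asarray(strong_correct, dtype=bool)
    weak_correct = np.asarray(weak_correct, dtype=bool)
    return np.mean(weak_correct & ~strong_correct)
```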
The analysis of over 130 language models revealed a worrying pattern: The more powerful the models become, the more similar their errors become. This trend raises security concerns, especially as AI systems increasingly take responsibility for evaluating and controlling other AI systems. Shared blind spots and error modes could impair the reliability and safety of AI systems. Considering these findings is crucial, especially for complex applications like chatbots, voicebots, AI search engines, and knowledge systems, which Mindverse develops.
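In a simplified setting, this kind of trend could be checked by computing the chance-adjusted agreement for every model pair and correlating it with the pair's average accuracy. The sketch below reuses the simplified agreement formula from above; it is an assumption about how such an analysis might look, not a reproduction of the study's method.

```python
import numpy as np
from itertools import combinations

def similarity_vs_capability(correctness):
    """Relate each model pair's average accuracy to its error similarity.

    `correctness` maps model name -> boolean array of per-item correctness.
    Returns the correlation between a pair's mean accuracy and its
    chance-adjusted agreement (same simplified formula as above).
    """
    pair_acc, pair_sim = [], []
    for (_, a), (_, b) in combinations(correctness.items(), 2):
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        acc_a, acc_b = a.mean(), b.mean()
        obs = np.mean(a == b)
        exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
        pair_acc.append((acc_a + acc_b) / 2)
        pair_sim.append((obs - exp) / (1 - exp + 1e-12))
    # A positive correlation would mean that more capable pairs tend to
    # share more of their error patterns.
    return np.corrcoef(pair_acc, pair_sim)[0, 1]
```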
The researchers emphasize the importance of considering both the similarity of the models and the diversity of their errors. Further research is needed to extend the CAPA metric to the evaluation of free-text responses and the reasoning abilities of large language models. The development of robust and secure AI systems requires a deep understanding of error patterns and their impact on the interaction between different AI systems.
The findings of this study are relevant for the entire AI industry and especially for companies like Mindverse, which specialize in the development and implementation of AI solutions. Taking the phenomena described here into account can help improve the reliability, safety, and effectiveness of AI systems.