The performance of large language models (LLMs) keeps improving, but objectively assessing that progress is often difficult. How do you measure the "quality" of a response? Traditional benchmarks with fixed reference solutions reach their limits here, as they often fail to capture nuance and context. An approach that is gaining traction in the AI community is to use LLMs themselves as evaluators – the so-called "LLM-as-a-Judge" method.
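To make the idea concrete, here is a minimal sketch of single-answer grading with an LLM judge. It assumes the official `openai` Python client and an API key in the environment; the judge prompt and the 1–5 scale are illustrative choices, not a standard shared by the platforms discussed below.

```python
# Minimal LLM-as-a-Judge sketch: one LLM grades another model's answer.
# Assumes the official openai Python client and an API key in the environment;
# the judge prompt and the 1-5 scale are illustrative, not a fixed standard.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale from 1 (poor) to 5 (excellent) for correctness,
helpfulness and clarity. First give a short justification, then output the
score on a new line as 'Score: <n>'.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> str:
    """Return the judge model's free-text verdict (justification + score)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading makes verdicts easier to compare
    )
    return response.choices[0].message.content

print(judge("What is the capital of Australia?",
            "The capital of Australia is Canberra."))
```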
An innovative platform for evaluating LLMs as evaluators is Judge Arena. Similar to LMSYS's Chatbot Arena, where users vote directly on the quality of LLM-generated texts, Judge Arena focuses on evaluating the evaluation itself. Users are shown two assessments of the same response, produced by different LLMs and each including a justification and a score. They then vote on which assessment best matches their own judgment.
The platform integrates a variety of models, both open-source and proprietary, including:
- OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
- Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
- Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
- Google (Gemma 2 9B / 27B)
- Mistral (Instruct v0.3 7B, Instruct v0.1 7B)

The results are displayed on a public leaderboard, based on an Elo ranking system and updated hourly. Initial results show that open-source models can compete with proprietary models and that smaller models, such as Qwen 2.5 7B, achieve surprisingly strong performance.
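The exact rating parameters behind the leaderboard are not spelled out above, but the underlying Elo mechanics can be sketched generically. The K-factor of 32 and the starting rating of 1000 below are assumptions for illustration; Judge Arena's actual configuration may differ.

```python
# Generic Elo update, as commonly used for arena-style leaderboards.
# The K-factor of 32 and the starting rating of 1000 are assumptions for
# illustration; Judge Arena's exact parameters may differ.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b + k * ((1 - s_a) - (1 - e_a))

# Example: two judge models start at 1000; model A wins a user vote.
print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```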
Besides Judge Arena, other benchmarks assess LLMs in their role as evaluators. JudgeBench, for example, tests models on 350 questions from the areas of knowledge, reasoning, mathematics, and coding. The candidate answers are compared pairwise, with the presentation order swapped, to minimize positional bias, and inconsistent verdicts are counted as errors.
JudgeBench emphasizes the importance of:
- Pairwise evaluation to avoid positional bias
- An evaluation hierarchy: following instructions, checking facts, assessing style
- Prioritizing facts and logic over style

Results from JudgeBench show, among other things, that fine-tuning can significantly improve a model's performance as a judge and that there is a strong correlation between a model's ability to solve problems and its ability to evaluate solutions.
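A minimal sketch of such an order-swapped pairwise comparison is shown below; `ask_judge` is a hypothetical callable standing in for any judge model, not a real API.

```python
# Sketch of an order-swapped pairwise comparison to detect positional bias.
# `ask_judge` is a hypothetical callable that returns "A" or "B" for whichever
# of the two presented answers the judge prefers.
from typing import Callable, Literal

Verdict = Literal["A", "B"]

def consistent_preference(
    ask_judge: Callable[[str, str, str], Verdict],
    question: str,
    answer_1: str,
    answer_2: str,
) -> str:
    """Query the judge twice with the answers in both orders.

    Returns "answer_1" or "answer_2" if both orderings agree, and
    "inconsistent" otherwise (which JudgeBench-style setups count as an error).
    """
    first = ask_judge(question, answer_1, answer_2)   # answer_1 shown as "A"
    second = ask_judge(question, answer_2, answer_1)  # answer_1 shown as "B"
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "inconsistent"
```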
MixEval takes a different approach. This benchmark combines existing benchmarks with real user queries from the web to bridge the gap between academic tests and practice. MixEval achieves a high correlation (0.96) with the Chatbot Arena ranking while being significantly more cost-effective and faster to execute.
The advantages of MixEval:
- High correlation with Chatbot Arena
- Cost-effective
- Dynamic data updates
- Comprehensive query distribution
- Unbiased evaluation

MixEval exists in two versions: a standard version and a more challenging "Hard" version, which better differentiates stronger models.
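How closely a benchmark tracks Chatbot Arena can be quantified with a rank correlation between the two score lists. The sketch below uses Spearman correlation via `scipy` on made-up numbers purely for illustration; whether MixEval's reported 0.96 is computed exactly this way is not stated above.

```python
# Sketch: quantifying a benchmark's agreement with Chatbot Arena rankings.
# The model names, benchmark scores, and Elo ratings below are made up.
from scipy.stats import spearmanr

benchmark_scores = {"model_a": 78.2, "model_b": 71.5, "model_c": 64.9, "model_d": 58.3}
arena_elo        = {"model_a": 1251, "model_b": 1210, "model_c": 1155, "model_d": 1098}

models = sorted(benchmark_scores)
rho, p_value = spearmanr(
    [benchmark_scores[m] for m in models],
    [arena_elo[m] for m in models],
)
print(f"Rank correlation with Chatbot Arena: {rho:.2f} (p={p_value:.3f})")
```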
Choosing the right evaluation method is crucial for the development and deployment of LLMs. Leaderboards like Judge Arena offer useful guidance when selecting a suitable judge model. However, it is important to adapt the chosen model to the specific requirements of the use case and, if necessary, to guide it with few-shot examples. Combining different benchmarks and methods provides the most comprehensive picture of the strengths and weaknesses of LLMs as evaluators and contributes to building more capable and reliable AI systems.
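One way to adapt a judge to a concrete use case, as mentioned above, is to embed a few graded reference examples directly in the evaluation prompt. The template below is a hypothetical sketch; the example answers, scores, and rubric are invented for illustration.

```python
# Hypothetical few-shot judge prompt: calibration examples are embedded in the
# prompt so the judge's scoring follows the use case's own rubric.
FEW_SHOT_EXAMPLES = [
    {"answer": "Paris is the capital of France.", "score": 5,
     "reason": "Correct, concise, directly answers the question."},
    {"answer": "France is a country in Europe.", "score": 2,
     "reason": "On topic but does not answer the question."},
]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a judge prompt with in-context scoring examples."""
    lines = ["You are a strict evaluator. Score answers from 1 to 5.",
             "Calibration examples:"]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"- Answer: {ex['answer']}\n  Score: {ex['score']} ({ex['reason']})")
    lines.append(f"Now evaluate:\nQuestion: {question}\nAnswer: {answer}\nScore:")
    return "\n".join(lines)

print(build_judge_prompt("What is the capital of France?", "It is Paris."))
```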