April 22, 2025

LeetCodeDataset: A New Benchmark for Code-Generating LLMs

The development and evaluation of Large Language Models (LLMs) that generate code present several challenges for research. Two central problems are the lack of benchmarks that adequately test models' logical reasoning in the context of code generation, and the absence of self-contained, contamination-free training resources. A new dataset called LeetCodeDataset addresses these challenges and provides a robust foundation for evaluating and efficiently training code LLMs.

Structure and Advantages of the LeetCodeDataset

The LeetCodeDataset is based on Python programming tasks from the online platform LeetCode. Through careful selection of problems with rich metadata, broad topic coverage, and over 100 test cases per task, the dataset enables a comprehensive evaluation of model performance. A particular advantage of the LeetCodeDataset lies in the temporal division of the data into problems released before and after July 2024. This split allows for contamination-free testing: models are trained only on data published before the cutoff and then evaluated on tasks created afterwards. This prevents models from achieving unrealistically high scores through accidental memorization of test data.
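
The following is a minimal sketch of how such a temporal split and test-case check might look in practice. The field names ("release_date"), the entry-point name ("solve"), and the test-case layout are illustrative assumptions, not the dataset's published schema:

```python
from datetime import date

CUTOFF = date(2024, 7, 1)  # cutoff between training and evaluation pools

def split_by_release_date(problems):
    """Split problems into a training pool (released before the cutoff)
    and a contamination-free evaluation pool (released after it)."""
    train, evaluation = [], []
    for problem in problems:
        released = date.fromisoformat(problem["release_date"])  # assumed field name
        (train if released < CUTOFF else evaluation).append(problem)
    return train, evaluation

def passes_all_tests(candidate_code, test_cases):
    """Run a candidate solution against every test case of a problem.
    Each test case is assumed to be a pair (input_args, expected_output)."""
    namespace = {}
    exec(candidate_code, namespace)   # defines the candidate's entry point
    solve = namespace["solve"]        # assumed entry-point name
    return all(solve(*args) == expected for args, expected in test_cases)
```

Because every evaluation problem postdates the training cutoff, a model cannot have memorized its reference solution from the training portion of the same corpus.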

Another important aspect of the LeetCodeDataset is its support for efficient supervised fine-tuning (SFT). Experiments show that training on a comparatively small set of 2,600 model-generated solutions achieves performance comparable to models trained on 110,000 examples. This gain in training efficiency is particularly valuable for resource-intensive LLMs.
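
As a rough illustration, verified model-generated solutions can be paired with their problem statements to form instruction-style SFT records. The prompt template and the "description" field name below are assumptions, not the dataset's official format:

```python
def build_sft_records(problems, solutions):
    """Pair each problem statement with a verified model-generated solution
    to form a small supervised fine-tuning set (~2,600 records in the reported setup)."""
    records = []
    for problem, solution in zip(problems, solutions):
        records.append({
            "prompt": (
                "Solve the following LeetCode problem in Python.\n\n"
                f"{problem['description']}"   # assumed field name
            ),
            "completion": solution,           # model-generated code that passed the tests
        })
    return records
```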

Focus on Logical Reasoning

The LeetCodeDataset places special emphasis on evaluating the ability of LLMs to reason logically in the context of code generation. Initial tests showed that models specifically trained for logical reasoning achieved significantly better results than models without this specialization. This underscores the importance of benchmarks like the LeetCodeDataset that are explicitly designed to evaluate this ability.

Availability and Outlook

The LeetCodeDataset and the associated evaluation framework are publicly available to researchers and developers on platforms such as Hugging Face and GitHub. This open access is intended to promote the further development of code LLMs and to make research results easier to compare. The LeetCodeDataset represents an important step towards more robust and efficient training and evaluation methods for code-generating LLMs and helps to further exploit the potential of this technology.
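
For reference, the dataset can presumably be pulled directly from the Hugging Face Hub with the datasets library; the repository name is taken from the project's public links, while the split name used here is an assumption:

```python
from datasets import load_dataset

# Repository name from the public Hugging Face listing; available splits and
# configurations may differ from the default assumed here.
dataset = load_dataset("newfacade/LeetCodeDataset", split="train")

print(dataset[0].keys())  # inspect the available fields before building a pipeline
```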
