Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
The paper "**Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks**" presents an approach to evaluating and improving the coding capabilities of LLMs. The research describes a *systematic framework* for automatically generating high-quality programming benchmarks and for producing reliable judgments of model outputs. The authors show how **controlled benchmark generation** can yield diverse, realistic coding challenges that target specific programming concepts and edge cases.

For evaluation, the study uses a **multi-model consensus approach**, in which several LLMs independently judge each code solution and their verdicts are aggregated, giving more reliable judgments than a single-model evaluator (a minimal sketch of this idea follows below). The framework also includes *automated test case generation*, *solution verification*, and *difficulty calibration* components.

The authors report improvements in benchmark quality and evaluation reliability compared to traditional methods. The work addresses key challenges in assessing LLM coding abilities and offers practical tools for developing more robust code-generation models.
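
To make the multi-model consensus idea concrete, here is a minimal sketch, assuming a simple setup where each judge is a callable that scores a candidate solution in [0, 1] and the final verdict is a thresholded mean. The `Judge` type, `consensus_judge` function, and threshold rule are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# A "judge" maps (task description, candidate code) to a score in [0, 1].
# In a real setup each judge would wrap a different LLM backend; keeping the
# interface abstract lets the aggregation logic run standalone.
Judge = Callable[[str, str], float]


@dataclass
class ConsensusVerdict:
    scores: List[float]   # individual judge scores
    mean_score: float     # aggregated score
    accepted: bool        # final pass/fail decision


def consensus_judge(task: str, candidate: str,
                    judges: List[Judge],
                    threshold: float = 0.5) -> ConsensusVerdict:
    """Aggregate several independent judges into one verdict.

    Accepting only when the mean score clears the threshold damps the
    noise of any single judge (hypothetical aggregation rule).
    """
    scores = [judge(task, candidate) for judge in judges]
    agg = mean(scores)
    return ConsensusVerdict(scores=scores, mean_score=agg, accepted=agg >= threshold)


if __name__ == "__main__":
    # Stand-in judges for demonstration; real judges would call LLM APIs.
    def strict_judge(task: str, code: str) -> float:
        return 1.0 if "return" in code else 0.0

    def lenient_judge(task: str, code: str) -> float:
        return 0.8 if "def " in code else 0.2

    verdict = consensus_judge(
        task="Write a function that adds two integers.",
        candidate="def add(a, b):\n    return a + b",
        judges=[strict_judge, lenient_judge],
    )
    print(verdict)
```

Treating each judge as a plain callable keeps the aggregation logic easy to test in isolation; swapping in real LLM-backed judges or a different aggregation rule (e.g., majority vote over pass/fail labels) would not change the surrounding structure.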