Benchmark

Anti-cheating has long been a critical consideration when designing the rules for leaderboards, but this remains unexplored in the context of LLM benchmarks [1].

Introduction

There are many well-known LLM benchmarks, such as AlpacaEval 2.0 [2], Arena-Hard-Auto [3, 4], and MT-Bench [4]. They are widely used in the research community to evaluate the performance of LLMs. ...

References

[1] X. Zheng, T. Pang, C. Du, Q. Liu, J. Jiang, M. Lin. Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates. (2024)
[2] Y. Dubois, B. Galambosi, P. Liang, T. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. (2024)
[3] T. Li, W. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. Gonzalez, I. Stoica. From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. (2024)
[4] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. Gonzalez, I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. (2023)