Paper Reading: Cheating Popular LLM Benchmarks

Anti-cheating has long been a critical consideration when designing the rules for leaderboards, but this remains unexplored in the context of LLM benchmarks. (Zheng et al., 2024, http://arxiv.org/abs/2410.07137)

Introduction

There are many well-known LLM benchmarks, such as AlpacaEval 2.0 (Dubois et al., 2024, https://doi.org/10.48550/arXiv.2404.04475), Arena-Hard-Auto (Li et al., 2024, https://doi.org/10.48550/arXiv.2406.11939; Zheng et al., 2023, https://doi.org/10.48550/arXiv.2306.05685), and MT-Bench (Zheng et al., 2023). They are widely used in the research community to evaluate the performance of LLMs. ...

October 11, 2024 · 8 min · 1663 words · Benhao Huang