Anti-cheating has long been a critical consideration when designing the rules for leaderboards, but this remains unexplored in the context of LLM benchmarks. ( Citation: , & al., Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137 ) .


Introduction

There are many well-known LLM benchmarks, such as AlpacaEval 2.0 ( Citation: , & al., Dubois, Y., Galambosi, B., Liang, P., Hashimoto, T.(2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.https://doi.org/10.48550/arXiv.2404.04475 ) , Arena-Hard-Auto ( Citation: , & al., Li, T., Chiang, W., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J., Stoica, I.(2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline.https://doi.org/10.48550/arXiv.2406.11939 ; Citation: , & al., Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J., Stoica, I.(2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.https://doi.org/10.48550/arXiv.2306.05685 ) , and MTBench ( Citation: , & al., Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J., Stoica, I.(2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.https://doi.org/10.48550/arXiv.2306.05685 ) . They are widely used in the research community to evaluate the performance of LLMs.

Previously, there are some known issues with these benchmarks, such as length-bias ( Citation: , & al., Dubois, Y., Galambosi, B., Liang, P., Hashimoto, T.(2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.https://doi.org/10.48550/arXiv.2404.04475 ) , self-preference bias ( Citation: , & al., Panickssery, A., Bowman, S., Feng, S.(2024). LLM Evaluators Recognize and Favor Their Own Generations.https://doi.org/10.48550/arXiv.2404.13076 ) , and other biases like gender bias, Authority Bias ( Citation: , & al., Chen, G., Chen, S., Liu, Z., Jiang, F., Wang, B.(2024). Humans or LLMs as the Judge? A Study on Judgement Biases.https://doi.org/10.48550/arXiv.2402.10669 ) , etc. When working on my own paper about benchmarking LLM’s ability ( Citation: , & al., Huang, B., Yu, Y., Huang, J., Zhang, X., Ma, J.(2024). DCA-Bench: A Benchmark for Dataset Curation Agents.Retrieved from https://openreview.net/forum?id=19pLzPAe40#discussion ) , I’m also challenged by reviewers during my rebuttal that they wanted me to prove the reliability of the LLM based evaluator in my benchmark, and I did observe such bias, though not too severe. Those biases, though concerning, are still manageable with increasing researches into them, making those benchmarks still useful in many cases.

However, a recent paper ( Citation: , & al., Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137 ) shows that these benchmarks can be cheated with manipulated model outputs with not complex tricks.

Settings and Rules

From the section 2 of Citation: , & al. Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137 :

  1. The cheater is assumed to have no direct access to the auto-annotator’s parameters but can query the auto-annotator through an API provided by a service provider
  2. The cheater has no access to the test input instructions.
  3. The cheater’s goal is to craft a null model and manipulate the auto-annotator’s evaluation to favor the constant, non-informative response outputs from the null model, rather than preferring the responses from the reference model.

Strategy

The main strategy applied in the paper includes:

  1. Use a structured cheating responses to replace the original responses and put back into the template of those llm automatic benchmarks.
  2. Crafting adversarial prefix by random search and put into the structured cheating responses.

Structured Cheating Responses

As summarized in the paper, the cheating responses are designed as follows:

  1. It overrides the original instruction-output triplet with a fabricated one;
  2. When positioned by default, it exploits the annotator’s general preference for the last output, guiding it to predict “M”;
  3. When swapped, it takes advantage of overwriting the output from model “M”, causing the annotator to predict “m”.

Refer to examples on AlpacaEval 2.0 to get an intuition of how the cheating responses look like. Please note that The blue color in the template is injected by the cheater, so they didn’t illegally change the template.

Random Search for Adversarial Prefix

With a adversarial prefix learnt by random search, the cheater can further manipulate the auto-annotator’s evaluation to a large extent. According to the paper, the random search algorithm is as follows:

Fig. 1: The random search algorithm.

For those who are not familiar with adversarial attack of LLM, this pesudo code might not be that clear. What’s the loss calculated on? What’s the loss function? The paper doesn’t elaborate much but such technique is widely applied in fields of adversarial attack on LLM.

In Citation: , & al. Andriushchenko, M., Croce, F., Flammarion, N.(2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks.https://doi.org/10.48550/arXiv.2404.02151 , the authors applied Random Search to learn the adversarial prefix. The template is as follows:

Fig. 2: In-context learning template of the random search algorithm.

The loss is then calculated on the logits on the position of the target tokens. Combined with Fig. 1, the “random” in RS means that the algorithm randomly sample the index $i$ from Uniform($l$), and sample a $\tilde{p}_i$ from Uniform($\mathcal{X}$), where $\mathcal{X}$ is the vocabulary, to replace the $i^{th}$ token in the prefix. Then we calculate the aforementioned loss to see whether such updates to the prefix is an improvement.

Back to Citation: , & al. Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137 , regarding loss calculation, they actually “optimizes a single adversarial prefix by aggregating the losses over various instructions, ensuring that the prefix’s impact is universal across different input instructions and positions”. But I didn’t find more details about the exact target tokens settings they used.

Besides, this random search can also be naturally applied to find the demos used in few-shot jailbreak (FSJ) attack ( Citation: , & al., Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses.https://doi.org/10.48550/arXiv.2406.01288 ) .

Fig. 3: Batch demo-level random search for FSJ attack.

In this paper, they used demo-level random search (Fig. 3) to minimize the loss of generating the initial token “Step” on the target model, as shown in the figure below:

Fig. 4: Batch demo-level random search template.

Results and Analysis

Fig. 5: The result is very impressive, solid proof of cheating the automatic benchmarks. Additionally, the LC win rates of cheated results are generally higher than the raw win rates because of their short length, which highlights that AlpacaEval 2.0 is also not robust to length cheat.

Is the Structured Response Useful on Open-source Auto-annotators?

According to the paper, the cheating responses method is more useful on models that exhibit strong instruction following ability. For example, Llama-3-8B-Instruct’s performance increase is not as significant as other models.

Is Random Search Effective on Open-source Auto-annotators?

According to the paper, random search is effective on all the tested models. For less capable models, the performance gain is more significant.

Fig. 6: For Llama-3-8B-Instruct, without random search, the structured response achieves only a 2.9% LC win rate and 1.4% raw win rate. However, when the random search is applied, the win rates surge dramatically to 95.4% (LC) and 86.3% (raw), representing a gain of 92.5 percentage points in the LC win rate.

Can This Be Prevented by Common Anti-cheating Methods?

No. The author tested template paraphrasing and windowed Perplexity Filter, two defense methods introduced in ( Citation: , & al., Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P., Goldblum, M., Saha, A., Geiping, J., Goldstein, T.(2023). Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv.org.Retrieved from https://arxiv.org/abs/2309.00614v2 ) , but both methods are not effective.

Ending

We basically do three types of research: Y, F, X; Y = F(X). 80% focus on F, optimizing models; 15% focus on X, presenting data; 5% focus on Y, creating new meaningful tasks. — First Taught by a Mentor of Mine

I think this paper belongs to the category of “Z”, recognizing the improvement needed for existing tasks. The method is not complex, but the result is quite surprising and alarming, which has been ignored by most of us.

Reference

  • [1] Dubois, Y., Galambosi, B., Liang, P., Hashimoto, T.(2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.https://doi.org/10.48550/arXiv.2404.04475
  • [2] Andriushchenko, M., Croce, F., Flammarion, N.(2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks.https://doi.org/10.48550/arXiv.2404.02151
  • [3] Chen, G., Chen, S., Liu, Z., Jiang, F., Wang, B.(2024). Humans or LLMs as the Judge? A Study on Judgement Biases.https://doi.org/10.48550/arXiv.2402.10669
  • [4] Huang, B., Yu, Y., Huang, J., Zhang, X., Ma, J.(2024). DCA-Bench: A Benchmark for Dataset Curation Agents.Retrieved from https://openreview.net/forum?id=19pLzPAe40#discussion
  • [5] Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P., Goldblum, M., Saha, A., Geiping, J., Goldstein, T.(2023). Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv.org.Retrieved from https://arxiv.org/abs/2309.00614v2
  • [6] Li, T., Chiang, W., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J., Stoica, I.(2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline.https://doi.org/10.48550/arXiv.2406.11939
  • [7] Panickssery, A., Bowman, S., Feng, S.(2024). LLM Evaluators Recognize and Favor Their Own Generations.https://doi.org/10.48550/arXiv.2404.13076
  • [8] Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137
  • [9] Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses.https://doi.org/10.48550/arXiv.2406.01288
  • [10] Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J., Stoica, I.(2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.https://doi.org/10.48550/arXiv.2306.05685

Cited as:

Benhao Huang. (Oct 2024). Paper Reading: Cheating Popular LLM Benchmarks. Husky's Log. https://husky-blog-huskydoges-projects.vercel.app/posts/paper-reading/cheating-llm-bench/

@article{ benhao2024paper,
  title   = "Paper Reading: Cheating Popular LLM Benchmarks",
  author  = "Benhao Huang",
  journal = "Husky's Log",
  year    = "2024",
  month   = "Oct",
  url     = "https://husky-blog-huskydoges-projects.vercel.app/posts/paper-reading/cheating-llm-bench/"
}