Anti-cheating has long been a critical consideration when designing the rules for leaderboards, but this remains unexplored in the context of LLM benchmarks. ( Citation: Zheng
, & al., Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137 ) .2024
Introduction
There are many well-known LLM benchmarks, such as AlpacaEval 2.0
(
Citation: Dubois
Previously, there are some known issues with these benchmarks, such as length-bias
(
Citation: Dubois
However, a recent paper
(
Citation: Zheng
Settings and Rules
From the section 2 of
Citation: Zheng
- The cheater is assumed to have no direct access to the auto-annotatorâs parameters but can query the auto-annotator through an API provided by a service provider
- The cheater has no access to the test input instructions.
- The cheaterâs goal is to craft a null model and manipulate the auto-annotatorâs evaluation to favor the constant, non-informative response outputs from the null model, rather than preferring the responses from the reference model.
Strategy
The main strategy applied in the paper includes:
- Use a structured cheating responses to replace the original responses and put back into the template of those llm automatic benchmarks.
- Crafting adversarial prefix by random search and put into the structured cheating responses.
Structured Cheating Responses
As summarized in the paper, the cheating responses are designed as follows:
- It overrides the original instruction-output triplet with a fabricated one;
- When positioned by default, it exploits the annotatorâs general preference for the last output, guiding it to predict âMâ;
- When swapped, it takes advantage of overwriting the output from model âMâ, causing the annotator to predict âmâ.
Refer to examples on AlpacaEval 2.0 to get an intuition of how the cheating responses look like. Please note that The blue color in the template is injected by the cheater, so they didn’t illegally change the template.
Random Search for Adversarial Prefix
With a adversarial prefix learnt by random search, the cheater can further manipulate the auto-annotator’s evaluation to a large extent. According to the paper, the random search algorithm is as follows:

For those who are not familiar with adversarial attack of LLM, this pesudo code might not be that clear. What’s the loss calculated on? What’s the loss function? The paper doesn’t elaborate much but such technique is widely applied in fields of adversarial attack on LLM.
In
Citation: Andriushchenko

The loss is then calculated on the logits on the position of the target tokens. Combined with Fig. 1, the “random” in RS means that the algorithm randomly sample the index $i$ from Uniform($l$), and sample a $\tilde{p}_i$ from Uniform($\mathcal{X}$), where $\mathcal{X}$ is the vocabulary, to replace the $i^{th}$ token in the prefix. Then we calculate the aforementioned loss to see whether such updates to the prefix is an improvement.
Back to
Citation: Zheng
Besides, this random search can also be naturally applied to find the demos used in few-shot jailbreak (FSJ) attack
(
Citation: Zheng

In this paper, they used demo-level random search (Fig. 3) to minimize the loss of generating the initial token âStepâ on the target model, as shown in the figure below:

Results and Analysis

Is the Structured Response Useful on Open-source Auto-annotators?
According to the paper, the cheating responses method is more useful on models that exhibit strong instruction following ability. For example, Llama-3-8B-Instruct’s performance increase is not as significant as other models.
Is Random Search Effective on Open-source Auto-annotators?
According to the paper, random search is effective on all the tested models. For less capable models, the performance gain is more significant.

Can This Be Prevented by Common Anti-cheating Methods?
No. The author tested template paraphrasing and windowed Perplexity Filter, two defense methods introduced in
(
Citation: Jain
Ending
We basically do three types of research: Y, F, X; Y = F(X). 80% focus on F, optimizing models; 15% focus on X, presenting data; 5% focus on Y, creating new meaningful tasks. â First Taught by a Mentor of Mine
I think this paper belongs to the category of “Z”, recognizing the improvement needed for existing tasks. The method is not complex, but the result is quite surprising and alarming, which has been ignored by most of us.
Reference
- [1] Dubois, Y., Galambosi, B., Liang, P., Hashimoto, T.(2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.https://doi.org/10.48550/arXiv.2404.04475
- [2] Andriushchenko, M., Croce, F., Flammarion, N.(2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks.https://doi.org/10.48550/arXiv.2404.02151
- [3] Chen, G., Chen, S., Liu, Z., Jiang, F., Wang, B.(2024). Humans or LLMs as the Judge? A Study on Judgement Biases.https://doi.org/10.48550/arXiv.2402.10669
- [4] Huang, B., Yu, Y., Huang, J., Zhang, X., Ma, J.(2024). DCA-Bench: A Benchmark for Dataset Curation Agents.Retrieved from https://openreview.net/forum?id=19pLzPAe40#discussion
- [5] Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P., Goldblum, M., Saha, A., Geiping, J., Goldstein, T.(2023). Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv.org.Retrieved from https://arxiv.org/abs/2309.00614v2
- [6] Li, T., Chiang, W., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J., Stoica, I.(2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline.https://doi.org/10.48550/arXiv.2406.11939
- [7] Panickssery, A., Bowman, S., Feng, S.(2024). LLM Evaluators Recognize and Favor Their Own Generations.https://doi.org/10.48550/arXiv.2404.13076
- [8] Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates.Retrieved from http://arxiv.org/abs/2410.07137
- [9] Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.(2024). Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses.https://doi.org/10.48550/arXiv.2406.01288
- [10] Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J., Stoica, I.(2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.https://doi.org/10.48550/arXiv.2306.05685
Cited as:
Benhao Huang. (Oct 2024). Paper Reading: Cheating Popular LLM Benchmarks. Husky's Log. https://husky-blog-huskydoges-projects.vercel.app/posts/paper-reading/cheating-llm-bench/
@article{ benhao2024paper,
title = "Paper Reading: Cheating Popular LLM Benchmarks",
author = "Benhao Huang",
journal = "Husky's Log",
year = "2024",
month = "Oct",
url = "https://husky-blog-huskydoges-projects.vercel.app/posts/paper-reading/cheating-llm-bench/"
}