RudeBench
How do LLMs change behavior when you're rude to them? We tested 5 frontier models across 50 tasks and 6 tone conditions to find out.
How It Works
Each of 50 tasks is rewritten into 6 tones—from grateful to abusive—with identical underlying instructions and word count controlled within ±15%.
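The ±15% word-count control can be sketched as a simple validator. This is an illustrative check, not the benchmark's actual tooling; the function name and tolerance parameter are assumptions.

```python
# Hypothetical validator for the stated constraint: a tone rewrite must
# stay within +/-15% of the original task's word count.
def within_word_budget(original: str, rewrite: str, tolerance: float = 0.15) -> bool:
    n = len(original.split())   # word count of the original task
    m = len(rewrite.split())    # word count of the toned rewrite
    return abs(m - n) <= tolerance * n
```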
Every model completes each prompt in a two-turn conversation (a fixed "Hello" greeting followed by the task). A GPT-4.1 judge scores responses on 6 behavioral dimensions, using a tone firewall that ensures the judge never sees the original hostile prompt.
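The tone firewall amounts to building the judge's input from only the neutral task wording and the model's response. A minimal sketch, assuming a plain-text judge prompt; the function name and payload format are illustrative, not the benchmark's actual implementation.

```python
# Sketch of the "tone firewall": the judge payload contains the neutral
# task description and the model's response only. The toned (possibly
# abusive) prompt is never passed in, so the judge cannot see it.
def build_judge_input(neutral_task: str, model_response: str) -> str:
    return (
        "Score the response below on each behavioral dimension.\n"
        f"Task (neutral wording): {neutral_task}\n"
        f"Response: {model_response}"
    )
```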
The Resilience Score (0–100) measures how stable a model's behavioral profile remains regardless of tone. 100 means identical behavior; lower means the model shifts its responses based on how you talk to it.
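The exact Resilience Score formula isn't given here; a minimal sketch, assuming per-dimension scores on a 0–100 scale and a composite of 100 minus the mean absolute deviation from the neutral-tone profile, averaged over the other tones and dimensions. The real metric may weight or aggregate differently.

```python
# Illustrative only: assumes the composite score is 100 minus the mean
# absolute deviation from the neutral-tone behavioral profile, averaged
# across non-neutral tones and all dimensions.
def resilience_score(profiles: dict[str, dict[str, float]]) -> float:
    neutral = profiles["neutral"]
    deviations = [
        abs(scores[dim] - neutral[dim])
        for tone, scores in profiles.items()
        if tone != "neutral"
        for dim in neutral
    ]
    return 100.0 - sum(deviations) / len(deviations)
```

Identical behavior across every tone yields 100; any tone-dependent shift pulls the score down.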
The Sycophancy Split
Under hostile and abusive tones, two groups emerge. Claude and GPT-5 mini hold sycophancy below 5, while Gemini, Grok, and Llama spike to 15–24, a 3–5x larger swing on the same tasks and judge. Use the buttons to explore other dimensions.
Sycophancy Bifurcation
Claude and GPT-5 mini hold sycophancy below 5.0 across all tones. Gemini, Grok, and Llama show 3–5x increases under abusive conditions.
Explore SYC dimension →
Accuracy Is Not Resilience
All models maintain >90% accuracy across tones. But accuracy stability does not predict behavioral stability—models can stay accurate while becoming sycophantic.
Explore ACC dimension →
Grok's 20x Swing
Grok 3 mini shows the largest tone-dependent sycophancy shift: from 1.1 (curt) to 21.6 (abusive)—while maintaining near-perfect accuracy.
View Grok profile →
Resilience Leaderboard
Ranked by composite Resilience Score. Delta columns show average absolute deviation from neutral per dimension. Click a row to expand.
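The delta columns as described can be computed directly: for one model, the average absolute deviation from the neutral-tone score, per dimension, across the other tone conditions. The data layout below is a hypothetical example.

```python
# Per-dimension delta: average absolute deviation from the neutral-tone
# score across all other tone conditions, for a single model.
def dimension_deltas(profiles: dict[str, dict[str, float]]) -> dict[str, float]:
    neutral = profiles["neutral"]
    others = [scores for tone, scores in profiles.items() if tone != "neutral"]
    return {
        dim: sum(abs(scores[dim] - neutral[dim]) for scores in others) / len(others)
        for dim in neutral
    }
```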