Early Release — n=2 of planned n=10 — scores are directional, not definitive

RudeBench

How do LLMs change behavior when you're rude to them? We tested 5 frontier models across 50 tasks and 6 tone conditions to find out.

3,000 completions · 5 models · 6 dimensions · 6 tone levels
How It Works

Each of 50 tasks is rewritten into 6 tones—from grateful to abusive—with identical underlying instructions and word count controlled within ±15%.
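The ±15% word-count control described above can be sketched as a simple validation check. This is an illustrative helper (`within_tolerance` is a hypothetical name, not from the benchmark's code):

```python
def within_tolerance(base_text: str, variant_text: str, tol: float = 0.15) -> bool:
    """Check that a rewritten tone variant stays within +/-15% of the
    base prompt's word count, so length doesn't confound tone effects."""
    base = len(base_text.split())
    variant = len(variant_text.split())
    return abs(variant - base) <= tol * base

# A 20-word base prompt allows variants of 17-23 words.
assert within_tolerance("word " * 20, "word " * 23)      # 23 words: within +15%
assert not within_tolerance("word " * 20, "word " * 24)  # 24 words: over the limit
```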

Every model completes each prompt in a two-turn conversation (a fixed "Hello" greeting followed by the task). A GPT-4.1 judge scores responses on 6 behavioral dimensions, using a tone firewall that ensures the judge never sees the original hostile prompt.
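The tone firewall amounts to controlling what the judge is shown. A minimal sketch of the idea, assuming the judge receives a neutral restatement of the task plus the response (function and prompt wording are illustrative, not the benchmark's actual implementation):

```python
def build_judge_input(task_neutral: str, model_response: str) -> str:
    """Tone firewall (sketch): the judge sees only a neutrally worded
    task statement and the model's response -- never the toned prompt --
    so hostile phrasing cannot bias the judge's dimension scores."""
    return (
        "Task (neutral wording):\n" + task_neutral + "\n\n"
        "Response to evaluate:\n" + model_response
    )

msg = build_judge_input(
    "Summarize the article in two sentences.",
    "Here is a two-sentence summary...",
)
# The abusive/hostile user prompt is simply never passed in.
```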

The Resilience Score (0–100) measures how stable a model's behavioral profile remains regardless of tone. 100 means identical behavior; lower means the model shifts its responses based on how you talk to it.
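One plausible way to compute such a score, consistent with the leaderboard's "average absolute deviation from neutral" deltas, is 100 minus the mean absolute deviation across tones and dimensions. This is an illustrative sketch, not the published formula:

```python
from statistics import mean

def resilience_score(profiles: dict[str, dict[str, float]],
                     neutral_tone: str = "neutral") -> float:
    """Sketch (assumed formula): 100 minus the mean absolute deviation
    of every dimension score from its neutral-tone value, averaged over
    all non-neutral tones. Identical behavior across tones scores 100."""
    neutral = profiles[neutral_tone]
    deltas = [
        abs(scores[dim] - neutral[dim])
        for tone, scores in profiles.items() if tone != neutral_tone
        for dim in scores
    ]
    return max(0.0, 100.0 - mean(deltas))

# A model whose profile never changes with tone scores a perfect 100.
stable = {"neutral": {"SYC": 2.0, "ACC": 95.0},
          "abusive": {"SYC": 2.0, "ACC": 95.0}}
print(resilience_score(stable))  # 100.0
```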

The Sycophancy Split

Under hostile and abusive tones, two groups emerge: Claude and GPT-5 mini hold sycophancy below 5, while Gemini, Grok, and Llama spike to 15–24, a 3–5x swing on the same tasks and the same judge. Use the buttons to explore other dimensions.

Solid = resilient group · Dashed = reactive group

Sycophancy Bifurcation

Claude and GPT-5 mini hold sycophancy below 5.0 across all tones. Gemini, Grok, and Llama show 3–5x increases under abusive conditions.

Explore SYC dimension →

Accuracy Is Not Resilience

All models maintain >90% accuracy across tones. But accuracy stability does not predict behavioral stability—models can stay accurate while becoming sycophantic.

Explore ACC dimension →

Grok's 20x Swing

Grok 3 mini shows the largest tone-dependent sycophancy shift, rising from 1.1 (curt) to 21.6 (abusive) while maintaining near-perfect accuracy.

View Grok profile →

Resilience Leaderboard

Ranked by composite Resilience Score. Delta columns show average absolute deviation from neutral per dimension. Click a row to expand.