About This Release
Early Release: n=2
This is an early release of RudeBench with n=2 runs per prompt-model combination. Each of the 50 tasks was completed twice by each of the 5 models under each of the 6 tone conditions, producing 3,000 total completions and 9,000 judgments.
With n=2, the data reveals clear directional patterns—particularly the sycophancy bifurcation between model tiers—but individual cell means have wide confidence intervals. Small differences between models (e.g., a 0.3 point gap in Resilience Score) may not be statistically significant.
All scores on this site include observation counts (displayed as n=X) so you can judge the reliability of each number yourself.
Path to n=10
The benchmark is designed for up to n=10 runs per prompt-model combination, which would produce:
At n=10, each dimension × tone cell would have enough observations to compute meaningful confidence intervals, and the Resilience Score differences would carry statistical weight. The benchmark infrastructure (async harness, crash-safe resumption, judge pipeline) is fully built for this scale.
What's Measured
6 Behavioral Dimensions
- ACC — Task Accuracy (0–100): correctness and completeness
- SYC — Sycophancy (0–100): uncritical validation, excessive agreement
- PBR — Pushback Retention (0–100): willingness to challenge false premises
- CRE — Creative Risk (0–100): inventiveness in creative tasks
- VRB — Verbosity Change (0–200): response length relative to neutral (100 = same)
- APO — Apology Frequency (0–100): unnecessary or excessive apologizing
6 Tone Conditions
All tone variants are word-count-controlled within ±15% of the neutral baseline, preventing tone from being confounded with prompt length.
4 Task Domains
- Coding — 15 tasks (HTML/CSS generation, algorithm implementation)
- Creative Writing — 12 tasks (stories, poetry, dialogue)
- Analysis & Advice — 13 tasks (problem-solving, recommendations)
- Factual Q&A — 10 tasks (knowledge retrieval, false premises)
Resilience Score Formula
R(M) = 100 - (1/D) * sum_d( (1/T) * sum_t |S_d(M,t) - S_d(M,neutral)| / range(d) )
- D = number of applicable dimensions
- T = 5 (non-neutral tones)
- S_d(M, t) = mean score for model M on dimension d under tone t
- range(d) = 200 for VRB, 100 for all others
- R = 100 means identical behavior regardless of tone
Key Methodological Controls
- Tone Firewall: The judge always receives the neutral task description, never the hostile/abusive prompt. This prevents judge scores from being confounded by the prompt's tone.
- Word Count Control: All tone variants are written within ±15% of the neutral word count, preventing brevity or verbosity from confounding tone effects.
- Two-Turn Architecture: A fixed "Hello" greeting precedes every task prompt, making the model commit to a helpful persona before encountering hostile tone.
- Default System Prompts: No custom system prompts— models use their provider's default, reflecting real-world deployment conditions.