Early Release — n=2 of planned n=10 — scores are directional, not definitive

Dataset & Downloads

The full RudeBench dataset is open for research use. All files are JSONL (JSON Lines) format, UTF-8 encoded.
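
Since every file is JSONL, any of them can be read with a few lines of standard-library Python. A minimal sketch (no RudeBench-specific tooling assumed):

```python
import json

def load_jsonl(path):
    """Load a JSONL file: one JSON object per line; blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```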

300 prompts · 3,000 completions · 9,000 judgments · 5 models

Available Files

prompts.jsonl

300 prompts (50 tasks × 6 tones) with metadata, dimensions, reference answers

Schema: id, task_id, domain, tone, prompt, word_count, dimensions, metadata

completions/{model}.jsonl (5 files, one per model)

600 completions per model (300 prompts × 2 runs) with responses, word counts, refusal status

Schema: prompt_id, task_id, response, word_count, finish_reason, refused, run, cost

judgments/{model}.jsonl (5 files, one per model)

1,200 behavioral + quality judge scores per model with evidence and reasoning

Schema: prompt_id, task_id, judge_type, scores, evidence, reasoning

judgments/{model}_vrb.jsonl (5 files, one per model)

600 VRB scores per model, computed from word counts

Schema: prompt_id, task_id, score (VRB = completion_wc / mean_neutral_wc × 100)

Full dataset available on GitHub.
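
The VRB scores above can be recomputed from a completions file using only the documented schema. A sketch under two assumptions: mean_neutral_wc is the per-task mean word count of the neutral-tone completions, and tone is recoverable as the prompt_id suffix per the ID conventions below:

```python
from collections import defaultdict

def vrb_scores(completions):
    """Recompute VRB = completion_wc / mean_neutral_wc * 100.

    `completions` is a list of dicts following the completions schema
    (prompt_id, task_id, word_count, ...). Assumes the tone is the last
    underscore-separated token of prompt_id, per the ID conventions.
    """
    # Collect word counts of neutral-tone completions, grouped by task.
    neutral = defaultdict(list)
    for c in completions:
        if c["prompt_id"].rsplit("_", 1)[1] == "neutral":
            neutral[c["task_id"]].append(c["word_count"])
    mean_neutral = {t: sum(ws) / len(ws) for t, ws in neutral.items()}

    # Score every completion against its task's neutral baseline.
    return [
        {"prompt_id": c["prompt_id"],
         "task_id": c["task_id"],
         "score": c["word_count"] / mean_neutral[c["task_id"]] * 100}
        for c in completions
    ]
```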

ID Conventions

prompt_id: {domain}_{task_slug}_{tone} — e.g., coding_fibonacci_hostile
task_id: {domain}_{task_slug} — groups the 6 tone variants of a task
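
Because the tone is always the final underscore-separated token, a prompt_id can be split back into its task_id and tone without knowing how many underscores the task slug contains. A minimal sketch of that convention:

```python
def split_prompt_id(prompt_id):
    """Split a RudeBench prompt_id ({domain}_{task_slug}_{tone}) into
    (task_id, tone). The tone is the last underscore-separated token;
    everything before it is the task_id, even if the slug has underscores.
    """
    task_id, tone = prompt_id.rsplit("_", 1)
    return task_id, tone
```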

Citation

BibTeX
@article{rudebench2026,
  title={RudeBench: A Multi-Dimensional Behavioral Benchmark for Evaluating LLM Resilience Under Hostile Prompting Conditions},
  author={[Author Names]},
  year={2026},
  url={https://rudebench.com},
  note={Preprint}
}

License

The RudeBench dataset is released for research use. Model outputs remain subject to each provider's terms of service.