The LLM-as-a-Judge (LLMaJ) paradigm leverages large language models to evaluate responses, whether human- or model-generated, offering a scalable alternative to manual annotation. As the use of LLM judges expands, it becomes increasingly important to rigorously assess their reliability on complex, real-world tasks.
Debate evaluation presents a uniquely demanding challenge, requiring nuanced judgments of logic, persuasiveness, coherence, and tone in long-form argumentative texts. In this work, we introduce a new benchmark task that tests the limits of LLM judges: evaluating speeches that argue for or against controversial topics.
We ask: how well do LLMs perform in this setting, and how do their evaluations compare to those of human judges?
To support this benchmark, we build on a dataset of over 600 debate speeches, each annotated by multiple human raters, providing a rich foundation for evaluation.
Finally, we explore how speeches generated by LLMs compare in quality to those written by humans.