Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

1The Hebrew University of Jerusalem, 2IBM Research, 3Allen Institute for AI
2025

Abstract

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

Challenging LLM-as-a-Judge Systems

The LLM-as-a-Judge (LLMaJ) paradigm leverages large language models to evaluate responses, whether written by humans or generated by models, offering a scalable alternative to manual annotation. As their use expands, it becomes increasingly important to rigorously assess their reliability on complex, real-world tasks.

Debate evaluation presents a uniquely demanding challenge, requiring nuanced judgments of logic, persuasiveness, coherence, and tone in long-form argumentative texts. In this work, we introduce a new benchmark task that tests the limits of LLM judges: evaluating speeches that argue for or against controversial topics.

We ask: How well do LLMs perform in this setting—and how do their evaluations compare to those of human judges?

To support this benchmark, we build on a dataset of over 600 debate speeches, each annotated by multiple human raters, providing a rich foundation for evaluation.

Finally, we explore how speeches generated by LLMs compare in quality to those written by humans.

Diagram illustrating the evaluation framework for LLM judges on debate tasks.

How Do LLM Judges Perform on Debate Speech Evaluation?

Human vs. LLM Judges

Agreement between human and LLM judges: Kappa and Kendall’s Tau metrics

We assess the alignment between LLMs and human judges using two metrics: inter-rater agreement via Cohen’s Kappa (A) and rank correlation with average human scores via Kendall’s Tau (B). Both reveal a clear trend: performance improves with model scale, with a notable jump at 7B parameters. Notably, some models—such as Qwen-72B—achieve or even surpass human-level agreement.
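To make the two metrics concrete, the sketch below computes them with off-the-shelf routines. It is a minimal illustration, not the paper's evaluation code: the 1-5 rating scale, the quadratic kappa weighting, and the toy scores are all assumptions made for this example.

# Minimal sketch of the two agreement measures, assuming integer ratings on a
# 1-5 scale and quadratic kappa weighting (assumptions made for this example,
# not details taken from the paper).
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

llm_scores    = np.array([3, 4, 2, 5, 3, 4])        # hypothetical LLM ratings
human_rater   = np.array([3, 5, 2, 4, 3, 4])        # one human rater's ratings
human_ratings = np.array([[3, 4], [5, 4], [2, 3],   # ratings from several human
                          [4, 5], [3, 3], [4, 4]])  # raters, one row per speech

# (A) Inter-rater agreement: Cohen's Kappa between the LLM and a human rater.
kappa = cohen_kappa_score(llm_scores, human_rater, weights="quadratic")

# (B) Rank correlation: Kendall's Tau between LLM scores and the mean human
# score of each speech.
tau, p_value = kendalltau(llm_scores, human_ratings.mean(axis=1))

print(f"Cohen's kappa: {kappa:.2f}, Kendall's tau: {tau:.2f} (p={p_value:.3f})")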

Enhancing LLM Judges with Better Prompting

Effect of prompting methods on LLM-human agreement

Prompting strategies significantly influence judgment quality. We find that chain-of-thought (CoT) prompting may improve agreement with human annotators, particularly for stronger models.
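As an illustration of what such prompting variants might look like, the snippet below contrasts a direct-scoring prompt with a chain-of-thought variant. The wording and the 1-to-5 scale are placeholders chosen for this sketch; they are not the prompts used in the paper.

def direct_prompt(topic: str, speech: str) -> str:
    # Ask for a score only, with no intermediate reasoning.
    return (
        f"You are judging an opening debate speech on the topic: {topic}\n\n"
        f"Speech:\n{speech}\n\n"
        "Rate the overall quality of the speech on a scale of 1 to 5. "
        "Answer with the number only."
    )

def cot_prompt(topic: str, speech: str) -> str:
    # Ask the judge to reason about the speech before committing to a score.
    return (
        f"You are judging an opening debate speech on the topic: {topic}\n\n"
        f"Speech:\n{speech}\n\n"
        "First, briefly assess the strength and relevance of the arguments, "
        "the coherence and organization of the speech, and its style and tone. "
        "Then, on a new line, give a final rating from 1 to 5 in the form "
        "'Score: <number>'."
    )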

How Do LLM Judgments Diverge from Those of Humans?

Score Distribution Across Speeches

We find that stronger LLMs (with more than 7B parameters) tend to assign lower absolute scores to speeches compared to humans. This is surprising given our earlier results, which showed that these models closely match human rankings at the instance level.

These findings suggest that while LLMs may rank speeches similarly to humans, their score distributions are not well calibrated—highlighting a key divergence in how scores are assigned across the full range.
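A toy numeric example (synthetic numbers, not the paper's data) shows how these two observations can coexist: a judge can reproduce the human ranking exactly while shifting every score downward.

import numpy as np
from scipy.stats import kendalltau

human_means = np.array([3.8, 4.2, 2.9, 4.6, 3.3])   # mean human score per speech
llm_scores  = np.array([3.0, 3.5, 2.0, 4.0, 2.5])   # same ordering, shifted down

tau, _ = kendalltau(human_means, llm_scores)
print(f"Kendall's tau: {tau:.2f}")                   # 1.00 -> identical ranking
print(f"Mean score offset: {(llm_scores - human_means).mean():+.2f}")  # about -0.76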

Distribution of LLM vs. human scores across speeches

Speech Source Analysis

LLM and human score comparison across speech sources
Ratings of speeches from different sources, with human scores shown in the rightmost panels. Gray text indicates the Pearson correlation between mean LLM and human scores across sources. Panel (A) shows results for models from the Qwen family; panel (B) compares the strongest models from each family.

As LLMs become more capable, their source-level rankings align more closely with human judgments. However, stronger models still tend to assign lower absolute scores to synthetic (model-generated) speeches.

How Do Speeches by State-of-the-Art LLMs Compare to Human Speeches?

We find that strong judge models rate speeches generated by GPT-4.1 higher than those written by expert human debaters.

These results demonstrate the rapid progress of models in recent years. More importantly, the ability of these models to argue controversial topics more effectively than human experts raises concerns about their safe deployment and potential misuse.


Analyzing Judges' Reasoning

Key point distribution in Llama-3.3-70B's explanations
Distribution of the top three pro and con key points in Llama-3.3-70B’s chain-of-thought explanations, grouped by speech source. Other (pro) and Other (con) refer to less frequent key points.

To better understand how LLM judges justify their ratings, we apply Key Point Analysis to the chain-of-thought explanations of Llama-3.3-70B, chosen as one of our top-performing judge models.

We find that the distribution of positive key points aligns with overall source-level ratings. Some key points reflect broad, holistic qualities such as persuasiveness, coherence, and relevance (e.g., "Overall, the speech is a coherent opening"), while others focus on specific argument-level content (e.g., "The speech provides strong arguments").
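For readers unfamiliar with the technique, the sketch below gives a rough, simplified approximation of the matching step: explanation sentences are assigned to the closest key point via sentence embeddings. This is only an illustration under our own assumptions (the candidate key points, the embedding model, and the similarity threshold are all made up here); it is not the Key Point Analysis pipeline used in the paper.

from collections import Counter
from sentence_transformers import SentenceTransformer, util

# Hypothetical candidate key points, loosely inspired by the examples above.
KEY_POINTS = [
    "Overall, the speech is a coherent opening",   # pro
    "The speech provides strong arguments",        # pro
    "The speech lacks concrete evidence",          # con
]

model = SentenceTransformer("all-MiniLM-L6-v2")
kp_emb = model.encode(KEY_POINTS, convert_to_tensor=True)

def key_point_counts(explanation_sentences, threshold=0.5):
    """Count how often each key point is matched by the judge's explanation."""
    sent_emb = model.encode(explanation_sentences, convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, kp_emb)      # shape: (n_sentences, n_key_points)
    counts = Counter()
    for row in sims:
        best = int(row.argmax())
        if float(row[best]) >= threshold:
            counts[KEY_POINTS[best]] += 1
        else:
            counts["Other"] += 1               # no key point is a close match
    return counts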

BibTeX

@misc{sternlicht2025debatableintelligencebenchmarkingllm,
      title={Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation},
      author={Noy Sternlicht and Ariel Gera and Roy Bar-Haim and Tom Hope and Noam Slonim},
      year={2025},
      eprint={2506.05062},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.05062},
}