LLM-As-a-Judge Evaluations

Mitigating the Agreeableness Bias in LLM Judge Evaluations

Can we reliably use LLMs to judge how good a new LLM is?

About This Project

Every few weeks a new Large Language Model drops, and developers face the same question: should we switch? Human evaluation is the gold standard, but it does not scale. The popular alternative of using one LLM to judge another (“LLM-as-a-judge”) turns out to have a serious blind spot.

We ran a large-scale empirical study using 14 state-of-the-art LLMs, each evaluating feedback generated by all 14 models on 366 buggy high-school Python programs. The results were striking: LLMs are very good at recognising correct outputs (True Positive Rate above 96%), but remarkably bad at catching incorrect or hallucinated ones (True Negative Rate below 25%). In other words, they are over-agreeable: they almost always accept an output as fine, even when it is wrong.
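To make the class-imbalance effect concrete, here is a minimal sketch with made-up confusion-matrix counts (chosen only to match the aggregate rates above), showing how a judge with a TNR of 25% can still post high overall accuracy when most outputs are valid:

```python
# Illustrative sketch (hypothetical counts): high accuracy can coexist
# with a very low True Negative Rate under class imbalance.

def rates(tp, fn, tn, fp):
    """Return (TPR, TNR, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                       # recall on valid outputs
    tnr = tn / (tn + fp)                       # recall on invalid outputs
    acc = (tp + tn) / (tp + fn + tn + fp)      # overall accuracy
    return tpr, tnr, acc

# Suppose 90% of 1000 outputs are valid and the judge behaves like the
# aggregate numbers above: TPR ~ 0.96, TNR ~ 0.25.
tpr, tnr, acc = rates(tp=864, fn=36, tn=25, fp=75)
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}  accuracy={acc:.2f}")
# → TPR=0.96  TNR=0.25  accuracy=0.89
```

Despite missing three quarters of the invalid outputs, the judge looks 89% "accurate", which is exactly the rubber-stamping effect discussed next.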

This matters because in most real datasets the fraction of invalid outputs is small, so the high overall “accuracy” masks the fact that the judge is essentially rubber-stamping everything. We showed that standard majority voting across an ensemble of judges helps only marginally. Instead, we proposed a minority-veto strategy: if even a small number of judges flag an output, treat it as suspect – which proved far more robust, even when validator data was noisy or incomplete.
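The two ensemble rules can be sketched as follows. The vote representation is a simplification of the actual pipeline, and the 4-of-14 veto threshold mirrors the setting used in our study:

```python
def majority_vote(votes):
    """Flag an output as invalid only if a strict majority of judges flag it."""
    flags = sum(1 for v in votes if v == "invalid")
    return "invalid" if flags > len(votes) / 2 else "valid"

def minority_veto(votes, threshold=4):
    """Flag an output as invalid if at least `threshold` judges flag it
    (the 4-of-14 setting from our experiments)."""
    flags = sum(1 for v in votes if v == "invalid")
    return "invalid" if flags >= threshold else "valid"

votes = ["invalid"] * 5 + ["valid"] * 9   # 5 of 14 judges object
print(majority_vote(votes))   # → valid   (the agreeable majority wins)
print(minority_veto(votes))   # → invalid (a small dissenting bloc is enough)
```

Because judges rarely say "invalid" at all, a dissenting minority carries real signal, which is why the veto rule outperforms the majority rule here.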

For cases demanding even higher precision, we developed a regression-based framework that explicitly models each validator’s bias (its TPR and TNR) using just a handful of human-annotated datasets for calibration. With only five such datasets, this approach brought our maximum absolute error down to 1.2% – a 2× improvement over the best 14-model ensemble. The entire pipeline, dataset, and code are publicly available.
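The core idea of modeling validator bias can be illustrated with a closed-form single-validator special case: if a validator's TPR and TNR are known from calibration, the observed agreement rate can be inverted to recover the true validity rate (a Rogan–Gladen-style correction). This is a simplified stand-in for intuition, not the paper's actual regression model:

```python
def corrected_validity_rate(observed_rate, tpr, tnr):
    """Invert o = TPR*q + (1 - TNR)*(1 - q) to recover the true validity
    rate q, given a validator's calibrated TPR and TNR.
    (Single-validator sketch; the full framework fits these biases by
    regression across validators and datasets.)"""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        raise ValueError("validator is no better than chance; cannot correct")
    return (observed_rate + tnr - 1.0) / denom

# A judge with TPR=0.96, TNR=0.25 reports 85% of a new model's outputs as valid.
q = corrected_validity_rate(0.85, tpr=0.96, tnr=0.25)
print(f"estimated true validity rate: {q:.3f}")
# → estimated true validity rate: 0.476
```

Note how sharply the raw 85% agreement rate deflates once the validator's agreeableness is accounted for.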

Research Questions

  1. How reliable are LLMs at judging each other’s outputs, especially on open-ended tasks where multiple answers can be correct?
  2. Can ensemble methods like majority voting overcome the individual biases of LLM validators, or do we need something fundamentally different?
  3. If we have a small amount of human-labeled ground truth, can we explicitly model and correct for each validator’s bias to get precise estimates of any new model’s quality?

Methods

  1. LLM-as-a-Judge Evaluation
  2. Ensemble Methods (Majority Vote, Minority Veto)
  3. Regression-Based Bias Correction
  4. Empirical Benchmarking (14 LLMs)

Framework

  • Regression-Based Bias Correction Framework

Key Findings

  1. LLMs are over-agreeable: across all 14 models tested, validators correctly identified valid feedback over 96% of the time but caught invalid or hallucinated feedback less than 25% of the time. This agreeableness bias makes them unreliable for automated evaluation.
  2. High Elo rankings do not predict good judgement. Even a flagship model like Gemini 2.5-Pro, which excels as a generator, struggled as a validator: its best-in-class TNR of 53.5% came at the cost of the lowest TPR (83.8%) among all models.
  3. Standard majority voting reduces the worst-case error from 17.6% to 14.8%, but the method isn’t robust: roughly 9.7% of validator outputs were malformed or missing, and majority voting is highly sensitive to these failures.
  4. Our minority-veto strategy, flagging an output as invalid if at least 4 of 14 judges say so, cut the maximum error to 2.8% and proved robust to missing data, unlike majority consensus.
  5. The regression-based bias correction framework, calibrated on just five human-annotated generator datasets (~200 person-hours of annotation), reduced the maximum absolute error to 1.2% – a 2× improvement over the best ensemble and resilient to noisy or incomplete validation data.

Publication: Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

Abstract:

New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of having to decide if they should switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate >96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.

Fig. 6 from the paper. Maximum absolute error drops sharply as the regression model is given more calibration data, outperforming all ensemble methods even at s=1

Fig. 1b from the paper. Elo ratings vs. True Negative Rate. High Elo does not predict good validation capability

Fig. 4 from the paper. Most LLM validators cluster in the bottom-right (high TPR, low TNR), showing they almost always agree with the generator regardless of correctness

Datasets

LLM Judge Calibration Dataset:

366 buggy Python programs, feedback from 14 LLMs, validation judgments from all 14 models, and human annotations for 6 generators.

Code & Tools

llm-judge-calibration:

Full code and data for the regression-based bias correction framework and minority-veto strategy.

Team Members: Umair Z. Ahmed, Shubham Sahai

Collaborator: Suryaansh Jain (Summer Intern 2025, affiliation: UMass Amherst)

Lead PI: Ben Leong Wing Lup

Conference: arXiv Preprint 2025