LLM-as-Judge: Evaluation Guide

This guide covers how to design LLM-as-judge evaluations in QuickCompare. You’ll learn how to choose scales, write effective rubrics, and select judge models.

Introduction

What is LLM-as-Judge?

LLM-as-judge is an evaluation method where a powerful language model acts as a judge to assess the quality of other model outputs. Think of it as replacing manual review with a consistent, automated reviewer that applies your rubric at scale.

Why Use It?

Traditional metrics like BLEU or ROUGE don’t fully capture quality for many tasks. LLM-as-judge is particularly useful for:

  • Open-ended generation tasks - creative writing, explanations, summaries
  • Complex reasoning assessment - multi-step problems, nuanced arguments
  • Semantic similarity - when meaning matters more than exact wording
  • Subjective quality - clarity, coherence, usefulness

A key advantage: you don’t always need reference answers. Unlike traditional metrics that require gold standards, LLM-as-judge can evaluate quality directly for creative writing, instruction following, and other open-ended tasks. When you do have references, judges can assess semantic equivalence rather than just word overlap.

When you need to evaluate hundreds or thousands of outputs against nuanced criteria, LLM-as-judge provides consistency that manual review can’t match at scale.

Key Concepts

Prediction

The output generated by the model you’re evaluating. This is what the judge will assess.

Example: Your chatbot responds with “Paris is the capital of France” - this is the prediction being evaluated.

Reference

The correct answer or gold standard you’re comparing against. Reference answers are valuable when you have them - they let the judge assess factual accuracy and semantic equivalence against a known correct answer.

Not all tasks require references, though. For creative writing, instruction following, or general quality assessment, the judge can evaluate directly based on your rubric criteria.

Example: Your dataset says the correct answer is “Paris” - this is the reference the judge uses to evaluate accuracy.

Score and Normalisation

A numerical measure of quality, where higher scores are always better.

All scores are automatically normalised to 0.0-1.0 where 1.0 is perfect.

Examples:

  • Score of 3 on 1-5 scale → 0.6 (partially correct)
  • Score of 4 on 1-5 scale → 0.8 (mostly correct)
  • Score of 5 on 1-5 scale → 1.0 (completely correct)
  • Score of 1 on 0-1 scale → 1.0 (correct, binary scale)

This normalisation lets you compare evaluations across different rubrics and scales.
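
For intuition, the examples above are consistent with dividing the raw score by the scale maximum. The sketch below assumes that mapping; QuickCompare's exact formula may differ.

def normalise(raw_score: float, scale_max: float) -> float:
    # Illustrative sketch: map a raw judge score to 0.0-1.0 by dividing by the
    # scale maximum. Consistent with the examples above, not necessarily
    # QuickCompare's exact implementation.
    return raw_score / scale_max

print(normalise(3, 5))  # 0.6 - partially correct on a 1-5 scale
print(normalise(5, 5))  # 1.0 - completely correct
print(normalise(1, 1))  # 1.0 - correct on a binary 0-1 scale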

Scales

Why Scale Choice Matters

The scale you choose shapes how judges think about evaluation. Different scales are suited for different tasks.

Binary (0-1)

When to use:

  • Questions with clear right/wrong answers
  • Classification tasks
  • Exact match evaluation

Pros:

  • Simple and unambiguous
  • Fast to evaluate
  • Easy to interpret

Cons:

  • Loses information about how close a wrong answer was
  • No middle ground for “partially correct”

3-Point (1-3)

When to use:

  • Want a middle ground between binary and granular
  • “Wrong / Partially Correct / Correct” distinctions

Pros:

  • More information than binary
  • Still relatively unambiguous
  • Quick evaluation

Cons:

  • Limited granularity
  • Middle category can be ambiguous

5-Point (1-5)

The most common choice for LLM judges and the recommended scale for most evaluation tasks.

When to use:

  • Need granular accuracy assessment
  • Factual correctness evaluation
  • Quality assessment with nuance

Pros:

  • Granular enough to capture meaningful differences
  • Familiar format (this is a Likert scale, commonly used in surveys and product reviews)
  • Good balance between precision and reliability

Cons:

  • More complex than binary
  • Adjacent scores can blur together without clear descriptors

7-Point or Higher

Generally not recommended.

Why avoid:

  • Models struggle to reliably distinguish between adjacent points
  • A score of 6 vs. 7 on a 10-point scale often comes down to random variation
  • Added complexity doesn’t improve evaluation quality

Rare exceptions:

  • Matching an existing evaluation standard that requires it
  • Very specific use case with well-defined distinctions

Templates

Built-in Templates

QuickCompare provides several optimised templates for common evaluation tasks. These are well-tested starting points.

Template Overview

Template               Scale  Use Case
accuracy               1-5    Evaluate factual correctness and semantic equivalence
binary                 0-1    Binary correct/incorrect classification
quality                1-5    Reference-based quality evaluation
no_reference_quality   1-5    Quality assessment without reference answers
hallucination          1-5    Detect unsupported claims (requires context)

Accuracy (1-5 Scale)

Evaluates how well a prediction matches the reference answer.

Scale:

  • 1: Completely incorrect or irrelevant
  • 2: Mostly incorrect with minor correct elements
  • 3: Partially correct but missing key information
  • 4: Mostly correct with minor errors or omissions
  • 5: Completely correct and accurate

Use for: Factual Q&A, answer accuracy, semantic equivalence

Binary (0-1 Scale)

Evaluates whether a prediction matches the reference.

Scale:

  • 0: Incorrect (prediction does not match reference)
  • 1: Correct (prediction matches reference)

Use for: Binary classification, exact match evaluation, yes/no questions

Quality (1-5 Scale)

Evaluates overall quality considering accuracy, completeness, relevance, and clarity against a reference.

Scale:

  • 1: Poor (incorrect or incomplete)
  • 2: Fair (partially correct with issues)
  • 3: Good (captures main points with minor gaps)
  • 4: Very Good (accurate and complete)
  • 5: Excellent (matches or exceeds reference)

Use for: Summaries, explanations with reference text

No Reference Quality (1-5 Scale)

Assesses quality without a reference answer, focusing on clarity, coherence, and usefulness.

Scale:

  • 1: Unusable (incomprehensible or off-topic)
  • 2: Poor (major issues in clarity/relevance)
  • 3: Fair (understandable but limited)
  • 4: Good (clear and relevant)
  • 5: Excellent (comprehensive and well-structured)

Use for: Creative writing, open-ended generation, general quality assessment

Hallucination (1-5 Scale)

Detects hallucinations by checking if predictions contain information not supported by the provided context.

Scale:

  • 5: No hallucination (all claims from context)
  • 4: Minimal (minor inferences, no false facts)
  • 3: Moderate (mix of supported/unsupported)
  • 2: Significant (multiple fabrications)
  • 1: Severe (mostly fabricated/contradicts context)

Use for: RAG evaluation, context grounding, factual consistency

Note: The rubric is listed from 5 down to 1, but higher is still better - 5 means no hallucination, consistent with the higher-scores-are-better convention.

Custom Templates

When to Use Custom Templates

Create custom templates when:

  • Evaluation depends on metadata - Difficulty level, category, topic from your dataset
  • Need to include context - Original question, prompt, or additional information
  • Specialised content - Code, SQL, maths with specific requirements
  • Adjust grading dynamically - Easier questions graded more strictly

Example: You want to include question difficulty in the evaluation so the judge can be stricter with easy questions.

Template Variables

Use Jinja2 syntax to reference data in your custom templates:

  • {{prediction}} - The model output being evaluated (always available)
  • {{reference}} - The ground truth answer (optional, only include if your dataset has reference answers)
  • {{context.column_name}} - Any column from your input dataset (replace column_name with the actual column name, e.g., {{context.question}})

Example template:

Difficulty: {{context.difficulty}}
Question: {{context.question}}

Student Answer: {{prediction}}
Correct Answer: {{reference}}

Rate from 1-5, being stricter for easy questions:
1 - Completely wrong
2 - Mostly wrong
3 - Partially correct
4 - Mostly correct
5 - Exactly right

Rating:
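
The placeholders above are standard Jinja2, so you can render a template against a sample row yourself to see exactly what the judge receives. A minimal sketch - the row values and the shortened template here are hypothetical:

from jinja2 import Template

# Hypothetical sample row; the keys mirror the template variables above.
row = {
    "prediction": "Paris is the capital of France",
    "reference": "Paris",
    "context": {"difficulty": "easy", "question": "What is the capital of France?"},
}

template_text = """\
Difficulty: {{context.difficulty}}
Question: {{context.question}}

Student Answer: {{prediction}}
Correct Answer: {{reference}}

Rate from 1-5.
Rating:"""

print(Template(template_text).render(**row))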

Response Format Instruction

Critical: Your template must end by instructing the judge to respond with only the score. End your template with “Rating:” or “Score:” to make this explicit.

Without it, the judge may reply with explanations instead of a numeric score that QuickCompare can parse. All the examples above end with “Rating:” for this reason.
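
To see why, consider a simple score parser along these lines (a hypothetical sketch, not QuickCompare's actual code): it can only recover a score when the judge replies with a bare number.

import re

def parse_score(judge_response: str) -> int | None:
    # Hypothetical sketch: extract the first standalone integer from a judge reply.
    match = re.search(r"\b\d+\b", judge_response)
    return int(match.group()) if match else None

print(parse_score("4"))                          # 4 - clean, parseable reply
print(parse_score("The answer looks correct."))  # None - explanation, no score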

Writing Effective Rubrics

A rubric defines what each score means. This is where evaluation quality is won or lost.

Principle 1: Higher Numbers = Better Quality

Always design your scale so that higher numbers represent better quality.

  • Maximum score (e.g., 5/5) = Best possible answer
  • Minimum score (e.g., 1/5) = Worst possible answer

Correct:

5 - Completely correct: Perfect answer
4 - Mostly correct: Minor errors only
3 - Partially correct: Some right, some wrong
2 - Mostly incorrect: Few correct elements
1 - Completely incorrect: Wrong or irrelevant

Wrong (don’t do this):

1 - Completely correct  ← Don't invert the scale
5 - Completely incorrect

Principle 2: Be Explicit and Concrete

Vague descriptors lead to inconsistent scoring.

Vague:

3 - Okay: The prediction is somewhat correct

Explicit:

3 - Partially correct: Contains some correct information but has
significant gaps or errors in key points

Principle 3: Anchor to Observable Features

Reference things the judge can verify in the text. Avoid speculation about intent.

Poor (requires speculation):

4 - The model probably understood the question

Better (observable):

4 - All key points from the reference are present and accurate,
but minor details are missing

Principle 4: Use Parallel Structure

Each descriptor should follow the same template so judges can compare them easily.

Good example:

1 - Completely incorrect: Factually wrong, irrelevant, or does not address the question
2 - Mostly incorrect: Addresses the topic but contains major errors or misses most key points
3 - Partially correct: Contains some correct information but has significant gaps or errors
4 - Mostly correct: All key points present and accurate, but minor details missing
5 - Completely correct: Fully accurate, semantically equivalent, all details correct

Notice the pattern: [Overall judgement]: [Specific criteria]

Principle 5: Define Boundary Cases

Use threshold language to help judges distinguish between adjacent scores:

  • “major errors” vs. “minor errors”
  • “most key points” vs. “all key points”
  • “significant gaps” vs. “trivial differences”

Example: A prediction with major errors gets a 2, while one with minor errors gets a 4.

Principle 6: Semantic vs. Literal Matching

For factual accuracy tasks, specify that you care about meaning, not word-for-word matches:

Focus on semantic meaning rather than exact wording. Minor phrasing
differences are acceptable if the meaning is preserved.

Principle 7: Keep It Simple

Don’t evaluate too many dimensions at once. Each additional dimension makes scoring harder and less reliable.

Too complex: “Rate on accuracy, completeness, clarity, conciseness, creativity, and technical correctness”

Better: Pick the 1-2 dimensions that matter most, or create separate evaluations for different aspects.

Common Custom Template Patterns

Pattern 1: Factual Accuracy with Context

Include the original question for better evaluation:

Question: {{context.question}}
Student Answer: {{prediction}}
Correct Answer: {{reference}}

Rate factual accuracy from 1 to 5:
1 - Completely incorrect: Factually wrong or doesn't address the question
2 - Mostly incorrect: Major errors or misses most key points
3 - Partially correct: Some correct information but significant gaps
4 - Mostly correct: All key points accurate, only minor details missing
5 - Completely correct: Fully accurate and semantically equivalent

Focus on meaning, not exact wording.

Rating:

Pattern 2: Difficulty-Adjusted Grading

Adjust expectations based on metadata:

Difficulty Level: {{context.difficulty}}
Question: {{context.question}}

Student Answer: {{prediction}}
Correct Answer: {{reference}}

Rate from 1-5, being stricter for easy questions:
1 - Completely incorrect
2 - Mostly incorrect with minor correct elements
3 - Partially correct but missing key information
4 - Mostly correct with minor errors
5 - Completely correct

Rating:

Pattern 3: Code Evaluation

Evaluate multiple aspects of generated code:

Reference Code:
{{reference}}

Generated Code:
{{prediction}}

Rate the code quality from 1 to 5:
1 - Incorrect: Doesn't work or has syntax errors
2 - Poor: Works but has major bugs or logical errors
3 - Adequate: Works correctly but inefficient approach
4 - Good: Correct and reasonably efficient, minor style issues
5 - Excellent: Correct, optimal, and well-formatted

Rating:

Pattern 4: Few-Shot Examples

Provide examples to calibrate the judge:

Here are scoring examples:

Example 1:
Prediction: "Paris is in France"
Reference: "Paris is the capital of France"
Rating: 3
Why: Partially correct - Paris is in France, but misses that it's the capital.

Example 2:
Prediction: "Paris is the capital of France"
Reference: "Paris is the capital of France"
Rating: 5
Why: Completely correct and semantically equivalent.

Now evaluate:
Prediction: {{prediction}}
Reference: {{reference}}

Rate from 1-5 following the examples above.

Rating:

Use when: Judges are inconsistent, scoring is subjective, or boundaries are unclear.

Choosing a Model

Which Models Work Best

Not all language models make good judges. You need models with strong reasoning capabilities.

Recommended judge models:

  • GPT-4o-mini (OpenAI) - Best balance of cost and quality for most tasks
  • GPT-4o (OpenAI) - Stronger reasoning for complex evaluations
  • Claude Sonnet 4.5 (Anthropic) - Strong performance, similar cost to GPT-4o
  • Claude Opus 4.5 (Anthropic) - Highest quality, highest cost

Cost vs. Quality Tradeoffs

Use stronger models (GPT-4o, Claude Sonnet 4.5) when:

  • Evaluating subjective qualities like creativity or coherence
  • Working with complex, multi-dimensional rubrics
  • Making high-stakes evaluations where accuracy matters most
  • Dealing with ambiguous cases that need careful reasoning

Use lighter models (GPT-4o-mini) when:

  • Evaluating factual accuracy with clear right/wrong answers
  • Working with simple binary or 3-point scales
  • Running preliminary evaluations before fine-tuning
  • Budget constraints matter and you can accept some noise

GPT-4o-mini is recommended as the default for most evaluation tasks. The performance gap is smaller for straightforward factual evaluation and larger for nuanced quality judgements.

Temperature

Always use temperature 0.0 for evaluations.

Temperature controls randomness in the judge’s responses. At temperature 0.0, you get the most consistent scoring possible - the same input will usually get the same score. Higher temperatures introduce more random variation, making the same prediction score differently across runs and making results harder to interpret.
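
The same principle holds whether you set temperature in QuickCompare or call a judge model directly. A minimal sketch of a direct call, assuming the OpenAI Python SDK and a hypothetical prompt:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

judge_prompt = "Prediction: Paris\nReference: Paris\n\nRate from 1-5.\nRating:"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.0,  # deterministic-as-possible scoring
)
print(response.choices[0].message.content)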

Testing and Iteration

Test your evaluation on a small sample (5-10 examples) before running at scale:

Check the prompts: Use the template preview to make sure the prompts are clear and readable.

Review the scores: Do the judge’s scores make sense? If everything scores 3-4, your rubric isn’t discriminating well. If everything is 1 or 5, it might be too binary.
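
A quick way to check this on a small sample is to tally the score distribution, as in this sketch (the scores here are hypothetical):

from collections import Counter

# Hypothetical judge scores from a 10-example test run on a 1-5 scale.
sample_scores = [3, 4, 3, 4, 3, 3, 4, 3, 4, 3]

distribution = Counter(sample_scores)
for score in range(1, 6):
    print(f"{score}: {'#' * distribution[score]}")
# Everything clustered on 3-4 suggests the rubric isn't discriminating well.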

Refine as needed: If scores seem off, make your rubric descriptors clearer and more concrete. If adjacent scores seem identical, make the boundaries more distinct.

Evaluation design is iterative - refine based on what you see.