LLM-as-Judge: Evaluation Guide
This guide covers how to design LLM-as-judge evaluations in QuickCompare. You’ll learn how to choose scales, write effective rubrics, and select judge models.
Introduction
What is LLM-as-Judge?
LLM-as-judge is an evaluation method where a powerful language model acts as a judge to assess the quality of other model outputs. Think of it as replacing manual review with a consistent, automated reviewer that applies your rubric at scale.
Why Use It?
Traditional metrics like BLEU or ROUGE don’t fully capture quality for many tasks. LLM-as-judge is particularly useful for:
- Open-ended generation tasks - creative writing, explanations, summaries
- Complex reasoning assessment - multi-step problems, nuanced arguments
- Semantic similarity - when meaning matters more than exact wording
- Subjective quality - clarity, coherence, usefulness
A key advantage: you don’t always need reference answers. Unlike traditional metrics that require gold standards, LLM-as-judge can evaluate quality directly for creative writing, instruction following, and other open-ended tasks. When you do have references, judges can assess semantic equivalence rather than just word overlap.
When you need to evaluate hundreds or thousands of outputs against nuanced criteria, LLM-as-judge provides consistency that manual review can’t match at scale.
Key Concepts
Prediction
The output generated by the model you’re evaluating. This is what the judge will assess.
Example: Your chatbot responds with “Paris is the capital of France” - this is the prediction being evaluated.
Reference
The correct answer or gold standard you’re comparing against. Reference answers are valuable when you have them - they let the judge assess factual accuracy and semantic equivalence against a known correct answer.
Not all tasks require references, though. For creative writing, instruction following, or general quality assessment, the judge can evaluate directly based on your rubric criteria.
Example: Your dataset says the correct answer is “Paris” - this is the reference the judge uses to evaluate accuracy.
Score and Normalisation
A numerical measure of quality, where higher scores are always better.
All scores are automatically normalised to 0.0-1.0 where 1.0 is perfect.
Examples:
- Score of 3 on 1-5 scale → 0.6 (partially correct)
- Score of 4 on 1-5 scale → 0.8 (mostly correct)
- Score of 5 on 1-5 scale → 1.0 (completely correct)
- Score of 1 on 0-1 scale → 1.0 (correct, binary scale)
This normalisation lets you compare evaluations across different rubrics and scales.
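If it helps to see the arithmetic, here is a minimal sketch of the normalisation implied by the examples above. It assumes the raw score is simply divided by the scale maximum (which matches the worked examples); the function name is hypothetical, not QuickCompare’s actual code.

```python
def normalise_score(raw_score: float, scale_max: float) -> float:
    """Hypothetical helper: map a raw judge score onto 0.0-1.0.

    Dividing by the scale maximum reproduces the worked examples above:
    3/5 -> 0.6, 4/5 -> 0.8, 5/5 -> 1.0, 1/1 -> 1.0.
    """
    return raw_score / scale_max

print(normalise_score(3, 5))  # 0.6 - partially correct on a 1-5 scale
print(normalise_score(1, 1))  # 1.0 - correct on a binary scale
```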
Scales
Why Scale Choice Matters
The scale you choose shapes how judges think about evaluation. Different scales are suited for different tasks.
Binary (0-1)
When to use:
- Questions with clear right/wrong answers
- Classification tasks
- Exact match evaluation
Pros:
- Simple and unambiguous
- Fast to evaluate
- Easy to interpret
Cons:
- Loses information about how close wrong answers were
- No middle ground for “partially correct”
3-Point (1-3)
When to use:
- You want a middle ground between binary and a more granular scale
- “Wrong / Partially Correct / Correct” distinctions
Pros:
- More information than binary
- Still relatively unambiguous
- Quick evaluation
Cons:
- Limited granularity
- Middle category can be ambiguous
5-Point (1-5) - Recommended
Most common choice for LLM judges. This is the recommended scale for most evaluation tasks.
When to use:
- Need granular accuracy assessment
- Factual correctness evaluation
- Quality assessment with nuance
Pros:
- Granular enough to capture meaningful differences
- Familiar format (a Likert-style scale, like those used in surveys and product reviews)
- Good balance between precision and reliability
Cons:
- More complex than binary
- Adjacent scores can blur together without clear descriptors
7-Point or Higher
Generally not recommended.
Why avoid:
- Models struggle to reliably distinguish between adjacent points
- A score of 6 vs. 7 on a 10-point scale often comes down to random variation
- Added complexity doesn’t improve evaluation quality
Rare exceptions:
- Matching an existing evaluation standard that requires it
- Very specific use case with well-defined distinctions
Templates
Built-in Templates
QuickCompare provides several optimised templates for common evaluation tasks. These are well-tested starting points.
Template Overview
| Template | Scale | Use Case |
|---|---|---|
| accuracy | 1-5 | Evaluate factual correctness and semantic equivalence |
| binary | 0-1 | Binary correct/incorrect classification |
| quality | 1-5 | Reference-based quality evaluation |
| no_reference_quality | 1-5 | Quality assessment without reference answers |
| hallucination | 1-5 | Detect unsupported claims (requires context) |
Accuracy (1-5 Scale)
Evaluates how well a prediction matches the reference answer.
Scale:
- 1: Completely incorrect or irrelevant
- 2: Mostly incorrect with minor correct elements
- 3: Partially correct but missing key information
- 4: Mostly correct with minor errors or omissions
- 5: Completely correct and accurate
Use for: Factual Q&A, answer accuracy, semantic equivalence
Binary (0-1 Scale)
Evaluates whether a prediction matches the reference.
Scale:
- 0: Incorrect (prediction does not match reference)
- 1: Correct (prediction matches reference)
Use for: Binary classification, exact match evaluation, yes/no questions
Quality (1-5 Scale)
Evaluates overall quality considering accuracy, completeness, relevance, and clarity against a reference.
Scale:
- 1: Poor (incorrect or incomplete)
- 2: Fair (partially correct with issues)
- 3: Good (captures main points with minor gaps)
- 4: Very Good (accurate and complete)
- 5: Excellent (matches or exceeds reference)
Use for: Summaries, explanations with reference text
No Reference Quality (1-5 Scale)
Assesses quality without a reference answer, focusing on clarity, coherence, and usefulness.
Scale:
- 1: Unusable (incomprehensible or off-topic)
- 2: Poor (major issues in clarity/relevance)
- 3: Fair (understandable but limited)
- 4: Good (clear and relevant)
- 5: Excellent (comprehensive and well-structured)
Use for: Creative writing, open-ended generation, general quality assessment
Hallucination (1-5 Scale)
Detects hallucinations by checking if predictions contain information not supported by the provided context.
Scale:
- 5: No hallucination (all claims from context)
- 4: Minimal (minor inferences, no false facts)
- 3: Moderate (mix of supported/unsupported)
- 2: Significant (multiple fabrications)
- 1: Severe (mostly fabricated/contradicts context)
Use for: RAG evaluation, context grounding, factual consistency
Note: The descriptors above are listed from best (5) to worst (1), but the scale still follows the higher-is-better convention - 5 means no hallucination, which is the ideal outcome.
Custom Templates
When to Use Custom Templates
Create custom templates when:
- Evaluation depends on metadata - Difficulty level, category, or topic from your dataset
- You need to include context - The original question, prompt, or additional information
- You're evaluating specialised content - Code, SQL, or maths with specific requirements
- You want to adjust grading dynamically - For example, grading easier questions more strictly
Example: You want to include question difficulty in the evaluation so the judge can be stricter with easy questions.
Template Variables
Use Jinja2 syntax to reference data in your custom templates:
- {{prediction}} - The model output being evaluated (always available)
- {{reference}} - The ground truth answer (optional; only include if your dataset has reference answers)
- {{context.column_name}} - Any column from your input dataset (replace column_name with the actual column name, e.g., {{context.question}})
Example template:
Difficulty: {{context.difficulty}}
Question: {{context.question}}
Student Answer: {{prediction}}
Correct Answer: {{reference}}
Rate from 1-5, being stricter for easy questions:
1 - Completely wrong
2 - Mostly wrong
3 - Partially correct
4 - Mostly correct
5 - Exactly right
Rating:
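If you want to sanity-check a custom template outside QuickCompare, you can render it locally with the Jinja2 library. The sketch below only illustrates how the variables are substituted; the sample values and dataset columns (difficulty, question) are invented, and this is not QuickCompare’s internal rendering code.

```python
from jinja2 import Template

# A shortened version of the example template above
template_text = (
    "Difficulty: {{context.difficulty}}\n"
    "Question: {{context.question}}\n"
    "Student Answer: {{prediction}}\n"
    "Correct Answer: {{reference}}\n"
    "Rate from 1-5, being stricter for easy questions.\n"
    "Rating:"
)

# Sample values, invented for illustration only
prompt = Template(template_text).render(
    prediction="Paris is in France",
    reference="Paris is the capital of France",
    context={"difficulty": "easy", "question": "What is the capital of France?"},
)
print(prompt)  # the exact text the judge model would receive
```

Note that Jinja2 resolves {{context.difficulty}} against a plain dictionary because attribute lookups fall back to item lookups.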
Response Format Instruction
Critical: Your template must end by instructing the judge to respond with only the score. End your template with “Rating:” or “Score:” to make this explicit.
Without this, the judge may respond with explanations instead of a parseable numeric score that QuickCompare can process. The example above and the patterns later in this guide all end with “Rating:” for this reason.
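To see why the trailing “Rating:” matters, here is a hedged sketch of the kind of parsing an evaluation harness has to do on the judge’s reply. The function is illustrative, not QuickCompare’s actual parser.

```python
import re

def extract_score(judge_reply: str) -> int | None:
    """Illustrative parser: pull the first integer out of the judge's reply."""
    match = re.search(r"-?\d+", judge_reply)
    return int(match.group()) if match else None

print(extract_score("4"))                          # 4 - ideal, bare score
print(extract_score("Rating: 4"))                  # 4 - still recoverable
print(extract_score("The answer looks correct."))  # None - unparseable prose
```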
Writing Effective Rubrics
A rubric defines what each score means. This is where evaluation quality is won or lost.
Principle 1: Higher Numbers = Better Quality
Always design your scale so that higher numbers represent better quality.
- Maximum score (e.g., 5/5) = Best possible answer
- Minimum score (e.g., 1/5) = Worst possible answer
Correct:
5 - Completely correct: Perfect answer
4 - Mostly correct: Minor errors only
3 - Partially correct: Some right, some wrong
2 - Mostly incorrect: Few correct elements
1 - Completely incorrect: Wrong or irrelevant
Wrong (don’t do this):
1 - Completely correct ← Don't invert the scale
5 - Completely incorrect
Principle 2: Be Explicit and Concrete
Vague descriptors lead to inconsistent scoring.
Vague:
3 - Okay: The prediction is somewhat correct
Explicit:
3 - Partially correct: Contains some correct information but has
significant gaps or errors in key points
Principle 3: Anchor to Observable Features
Reference things the judge can verify in the text. Avoid speculation about intent.
Poor (requires speculation):
4 - The model probably understood the question
Better (observable):
4 - All key points from the reference are present and accurate,
but minor details are missing
Principle 4: Use Parallel Structure
Each descriptor should follow the same template so judges can compare them easily.
Good example:
1 - Completely incorrect: Factually wrong, irrelevant, or does not address the question
2 - Mostly incorrect: Addresses the topic but contains major errors or misses most key points
3 - Partially correct: Contains some correct information but has significant gaps or errors
4 - Mostly correct: All key points present and accurate, but minor details missing
5 - Completely correct: Fully accurate, semantically equivalent, all details correct
Notice the pattern: [Overall judgement]: [Specific criteria]
Principle 5: Define Boundary Cases
Use threshold language to help judges distinguish between adjacent scores:
- “major errors” vs. “minor errors”
- “most key points” vs. “all key points”
- “significant gaps” vs. “trivial differences”
Example: A prediction with major errors gets a 2, while one with minor errors gets a 4.
Principle 6: Semantic vs. Literal Matching
For factual accuracy tasks, specify that you care about meaning, not word-for-word matches:
Focus on semantic meaning rather than exact wording. Minor phrasing
differences are acceptable if the meaning is preserved.
Principle 7: Keep It Simple
Don’t evaluate too many dimensions at once. Each additional dimension makes scoring harder and less reliable.
Too complex: “Rate on accuracy, completeness, clarity, conciseness, creativity, and technical correctness”
Better: Pick the 1-2 dimensions that matter most, or create separate evaluations for different aspects.
Common Custom Template Patterns
Pattern 1: Factual Accuracy with Context
Include the original question for better evaluation:
Question: {{context.question}}
Student Answer: {{prediction}}
Correct Answer: {{reference}}
Rate factual accuracy from 1 to 5:
1 - Completely incorrect: Factually wrong or doesn't address the question
2 - Mostly incorrect: Major errors or misses most key points
3 - Partially correct: Some correct information but significant gaps
4 - Mostly correct: All key points accurate, only minor details missing
5 - Completely correct: Fully accurate and semantically equivalent
Focus on meaning, not exact wording.
Rating:
Pattern 2: Difficulty-Adjusted Grading
Adjust expectations based on metadata:
Difficulty Level: {{context.difficulty}}
Question: {{context.question}}
Student Answer: {{prediction}}
Correct Answer: {{reference}}
Rate from 1-5, being stricter for easy questions:
1 - Completely incorrect
2 - Mostly incorrect with minor correct elements
3 - Partially correct but missing key information
4 - Mostly correct with minor errors
5 - Completely correct
Rating:
Pattern 3: Code Evaluation
Evaluate multiple aspects of generated code:
Reference Code:
{{reference}}
Generated Code:
{{prediction}}
Rate the code quality from 1 to 5:
1 - Incorrect: Doesn't work or has syntax errors
2 - Poor: Works but has major bugs or logical errors
3 - Adequate: Works correctly but inefficient approach
4 - Good: Correct and reasonably efficient, minor style issues
5 - Excellent: Correct, optimal, and well-formatted
Rating:
Pattern 4: Few-Shot Examples
Provide examples to calibrate the judge:
Here are scoring examples:
Example 1:
Prediction: "Paris is in France"
Reference: "Paris is the capital of France"
Rating: 3
Why: Partially correct - Paris is in France, but misses that it's the capital.
Example 2:
Prediction: "Paris is the capital of France"
Reference: "Paris is the capital of France"
Rating: 5
Why: Completely correct and semantically equivalent.
Now evaluate:
Prediction: {{prediction}}
Reference: {{reference}}
Rate from 1-5 following the examples above.
Rating:
Use when: Judges are inconsistent, scoring is subjective, or boundaries are unclear.
Choosing a Model
Which Models Work Best
Not all language models make good judges. You need models with strong reasoning capabilities.
Recommended judge models:
- GPT-4o-mini (OpenAI) - Best balance of cost and quality for most tasks
- GPT-4o (OpenAI) - Stronger reasoning for complex evaluations
- Claude Sonnet 4.5 (Anthropic) - Strong performance, similar cost to GPT-4o
- Claude Opus 4.5 (Anthropic) - Highest quality, highest cost
Cost vs. Quality Tradeoffs
Use stronger models (GPT-4o, Claude Sonnet 4.5) when:
- Evaluating subjective qualities like creativity or coherence
- Working with complex, multi-dimensional rubrics
- Making high-stakes evaluations where accuracy matters most
- Dealing with ambiguous cases that need careful reasoning
Use lighter models (GPT-4o-mini) when:
- Evaluating factual accuracy with clear right/wrong answers
- Working with simple binary or 3-point scales
- Running preliminary evaluations before fine-tuning
- Budget constraints matter and you can accept some noise
GPT-4o-mini is recommended as the default for most evaluation tasks. The gap between lighter and stronger judge models is smaller for straightforward factual evaluation and larger for nuanced quality judgements.
Temperature
Always use temperature 0.0 for evaluations.
Temperature controls randomness in the judge’s responses. At temperature 0.0, you get the most consistent scoring possible - the same input will usually get the same score. Higher temperatures introduce more random variation, making the same prediction score differently across runs and making results harder to interpret.
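As a concrete illustration, a judge call with the OpenAI Python SDK might look like the sketch below, with the temperature pinned to 0.0. The prompt and model choice are placeholders; how QuickCompare actually wires up your provider may differ.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder judge prompt; in practice this comes from your template
judge_prompt = (
    "Prediction: Paris is in France\n"
    "Reference: Paris is the capital of France\n"
    "Rate accuracy from 1-5.\n"
    "Rating:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # recommended default judge
    temperature=0.0,       # most consistent scoring possible
    messages=[{"role": "user", "content": judge_prompt}],
)
print(response.choices[0].message.content)  # expected to be a bare score, e.g. "3"
```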
Testing and Iteration
Test your evaluation on a small sample (5-10 examples) before running at scale:
Check the prompts: Use the template preview to make sure the prompts are clear and readable.
Review the scores: Do the judge’s scores make sense? If everything scores 3-4, your rubric isn’t discriminating well. If everything is 1 or 5, it might be too binary.
Refine as needed: If scores seem off, make your rubric descriptors clearer and more concrete. If adjacent scores seem identical, make the boundaries more distinct.
Evaluation design is iterative - refine based on what you see.
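One quick way to check whether your rubric is discriminating on a pilot sample is to look at the spread of scores. A minimal sketch, with invented sample scores:

```python
from collections import Counter

sample_scores = [4, 3, 4, 4, 3, 4, 3, 4]  # hypothetical scores from an 8-example pilot run

distribution = Counter(sample_scores)
print(distribution)  # Counter({4: 5, 3: 3}) - everything clusters at 3-4

if set(distribution) <= {3, 4}:
    print("Rubric may not be discriminating well - sharpen the score descriptors.")
```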