LLM as judge
Correctness
The Correctness Evaluator is a specialized tool that assesses the accuracy of a response by comparing it against an expected or ground-truth response. It uses an LLM (Large Language Model) as a judge to rate how closely the actual response matches the expected one.
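At its core, the judge receives both responses in a single prompt and is asked to grade the match. The template below is a minimal sketch of what such a prompt might look like; the exact wording and placeholder names (`expected_response`, `actual_response`) are illustrative assumptions, not the evaluator's actual template.

```python
# Hypothetical judge prompt; the evaluator's real template may differ.
JUDGE_PROMPT = """You are grading an answer for correctness.

Expected (ground-truth) response:
{expected_response}

Actual response:
{actual_response}

Compare the actual response to the expected response and rate how well they
match on a scale from 1 (direct contradiction) to 5 (fully equivalent),
then briefly explain your score.
"""

# Filled in per example, e.g.:
# JUDGE_PROMPT.format(expected_response=..., actual_response=...)
```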
Purpose
The Correctness Evaluator is particularly useful when you need to:
- Verify the factual accuracy of responses
- Ensure responses align with expected answers
- Detect contradictions or misinformation
- Evaluate the semantic similarity between responses
How It Works
The evaluator uses a 5-point scale to rate responses (a sketch showing how this rubric can be applied follows the list):
- Score 1 (Direct Contradiction): The actual response directly contradicts the expected response
- Score 2 (Partial Contradiction): The actual response shares some facts with the expected response but also directly contradicts it in places
- Score 3 (Similar but Not Equivalent): The actual response does not contradict the expected response, but it is not equivalent to it either
- Score 4 (Partial Equivalence): Some of the information is equivalent, but not all of it
- Score 5 (Fully Equivalent): The actual and expected responses are fully equivalent
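The sketch below shows one way this rubric might drive the evaluation: the rubric is embedded in the judge prompt, the judge is asked to reply in JSON, and the reply is parsed into a score and a list of explanations. The function name `evaluate_correctness`, the `judge` callable, and the JSON reply format are assumptions for illustration, not the evaluator's actual implementation.

```python
import json
from typing import Callable, List, Tuple

# The 5-point rubric from above, as it could appear in the judge prompt.
RUBRIC = """\
1 (Direct Contradiction): the actual response directly contradicts the expected one.
2 (Partial Contradiction): some facts match, but there are direct contradictions.
3 (Similar but Not Equivalent): not contradictory, but not equivalent either.
4 (Partial Equivalence): some of the information is equivalent, but not all.
5 (Fully Equivalent): the two responses are fully equivalent."""


def evaluate_correctness(
    actual: str,
    expected: str,
    judge: Callable[[str], str],  # any function that sends a prompt to an LLM and returns its text reply
) -> Tuple[int, List[str]]:
    """Ask an LLM judge to score `actual` against `expected` on the 1-5 rubric."""
    prompt = (
        "You are grading an answer for correctness.\n\n"
        f"Expected response:\n{expected}\n\n"
        f"Actual response:\n{actual}\n\n"
        "Rate the actual response using this rubric:\n"
        f"{RUBRIC}\n\n"
        'Reply with JSON only: {"score": <1-5>, "explanations": ["..."]}'
    )
    reply = judge(prompt)
    parsed = json.loads(reply)  # assumes the judge honours the JSON-only instruction
    return int(parsed["score"]), list(parsed["explanations"])
```

Keeping the LLM call behind a plain `judge` callable keeps the rubric and parsing logic independent of any particular model provider.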
Usage Example
The evaluator returns a tuple (unpacked in the sketch after this list) containing:
- A score (1-5) indicating the level of correctness
- A list of explanations for the given score
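A minimal usage sketch, assuming the `evaluate_correctness` helper from the section above and the OpenAI Python client as the judge backend; the model name and example strings are placeholders.

```python
from openai import OpenAI  # any chat-capable LLM client could serve as the judge

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def openai_judge(prompt: str) -> str:
    """Send the judge prompt to a chat model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# `evaluate_correctness` is the sketch from "How It Works" above.
score, explanations = evaluate_correctness(
    actual="The Eiffel Tower was completed in 1889 and stands in Paris.",
    expected="The Eiffel Tower, finished in 1889, is located in Paris, France.",
    judge=openai_judge,
)

print(score)         # e.g. 5, meaning the responses are fully equivalent
print(explanations)  # the judge's reasons for the score
```

Swapping in a different judge only requires another function with the same `(prompt: str) -> str` signature.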
When to Use
Use the Correctness Evaluator when you need to:
- Validate factual accuracy in QA systems
- Check response quality in chatbots
- Ensure consistency in information retrieval systems
- Evaluate the reliability of AI-generated content
- Test the accuracy of automated responses