The Correctness Evaluator assesses the accuracy of a response by comparing it against an expected or ground-truth response. It uses an LLM (Large Language Model) as a judge to determine how closely the actual response matches the expected one.

Purpose

The Correctness Evaluator is particularly useful when you need to:

  • Verify the factual accuracy of responses
  • Ensure responses align with expected answers
  • Detect contradictions or misinformation
  • Evaluate the semantic similarity between responses

How It Works

The evaluator uses a 5-point scale to rate responses (a sketch of acting on these scores follows the list):

  • Score: 1 (Direct Contradiction): The actual response directly contradicts the expected response
  • Score: 2 (Partial Contradiction): The actual response shares some facts with the expected response but also directly contradicts it in places
  • Score: 3 (Similar but Not Equivalent): The actual response does not contradict the expected response, but it is not equivalent either
  • Score: 4 (Partial Equivalence): Some of the information is equivalent, but not all of it
  • Score: 5 (Fully Equivalent): The actual and expected responses are equivalent
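
Because the score is a plain integer from 1 to 5, downstream code can map it to a label or gate on a minimum value. The mapping and threshold below are a minimal illustrative sketch, not part of the trusttest API; adjust them to your own quality bar.

# Illustrative only: one way to act on the 1-5 correctness score.
SCORE_LABELS = {
    1: "Direct Contradiction",
    2: "Partial Contradiction",
    3: "Similar but Not Equivalent",
    4: "Partial Equivalence",
    5: "Fully Equivalent",
}


def is_correct_enough(score: int, threshold: int = 4) -> bool:
    """Treat scores at or above the (assumed) threshold as passing."""
    return score >= threshold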

Usage Example

import asyncio

from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluators import CorrectnessEvaluator


async def evaluate():
    evaluator = CorrectnessEvaluator()
    result = await evaluator.evaluate(
        # The actual response being judged.
        response="Vic is the capital of Osona.",
        context=ExpectedResponseContext(
            # The ground-truth answer it is compared against.
            expected_response="The capital of Osona is Vic."
        ),
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(evaluate())

The evaluator returns a tuple containing the following (see the sketch after this list):

  • A score (1-5) indicating the level of correctness
  • A list of explanations for the given score
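
Assuming the result unpacks as described above (a score followed by its explanations), a caller could gate on the score like this. The unpacking, the threshold, and the check() function name are illustrative assumptions based on this description, not a guaranteed trusttest interface.

import asyncio

from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluators import CorrectnessEvaluator


async def check():
    evaluator = CorrectnessEvaluator()
    result = await evaluator.evaluate(
        response="Vic is the capital of Osona.",
        context=ExpectedResponseContext(
            expected_response="The capital of Osona is Vic."
        ),
    )

    # Assumed unpacking: the documentation describes the result as a
    # score plus a list of explanations.
    score, explanations = result
    if score >= 4:  # illustrative threshold, not part of trusttest
        print("Accepted with score", score)
    else:
        print("Rejected with score", score)
        for reason in explanations:
            print("-", reason)


if __name__ == "__main__":
    asyncio.run(check())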

When to Use

Use the Correctness Evaluator when you need to:

  • Validate factual accuracy in QA systems
  • Check response quality in chatbots
  • Ensure consistency in information retrieval systems
  • Evaluate the reliability of AI-generated content
  • Test the accuracy of automated responses