The BLEU (Bilingual Evaluation Understudy) Evaluator assesses the quality of generated text by comparing it against a reference text. It uses n-gram precision to measure how closely the generated text matches the reference.

Purpose

The BLEU Evaluator is particularly useful when you need to:

  • Measure the similarity between generated and reference text
  • Evaluate machine translation quality
  • Assess text generation quality
  • Compare different text generation models
  • Set quality thresholds for text generation

How It Works

The evaluator calculates a BLEU score between 0 and 1 (or 0-100 when expressed as a percentage), where:

  • Score 0: the generated text shares no matching n-grams with the reference
  • Score 1: the generated text matches the reference exactly

The score is calculated using the following settings (a computation sketch follows the list):

  • N-gram precision (default: 1-gram)
  • Smoothing method (default: method1)
  • Customizable weights for different n-gram orders
  • Configurable threshold (default: 0.7)
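
The internals of BleuEvaluator are not shown here, but the same kind of score can be reproduced with NLTK's sentence_bleu. The sketch below is an assumption about the underlying metric rather than the library's actual code, and uses four n-gram orders with uniform weights and method1 smoothing:

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Tokenize the reference and the candidate; plain whitespace splitting keeps the sketch simple.
reference = "Paris is the capital of France .".split()
candidate = "The capital of France is Paris .".split()

# Uniform weights over four n-gram orders; method1 smoothing keeps the score from
# collapsing to 0 when a higher-order n-gram has no match.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

print(f"BLEU: {score:.3f}")  # value in [0, 1]; multiply by 100 for a percentage
print("Passes 0.7 threshold:", score >= 0.7)

Raising the number of n-gram orders makes the comparison stricter, since longer word sequences must match; the threshold then decides whether the evaluation counts as a pass.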

Usage Example

import asyncio

from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluators import BleuEvaluator


async def evaluate():
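    # Configure 4-gram BLEU with method1 smoothing and a pass threshold of 0.7.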
    evaluator = BleuEvaluator(
        threshold=0.7,
        n_grams=4,
        smoothing_method="method1"
    )
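    # Score the generated response against the expected reference text.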
    result = await evaluator.evaluate(
        response="The capital of France is Paris.",
        context=ExpectedResponseContext(
            expected_response="Paris is the capital of France."
        )
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(evaluate())

The evaluator returns a tuple containing (a short usage sketch follows the list):

  • A score (0-100) indicating the BLEU score percentage
  • A list of explanations including the BLEU score, n-gram configuration, and threshold comparison
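
For example, assuming the tuple unpacks as the score followed by the explanation list (an assumed order based on the description above), the evaluate() coroutine in the usage example could gate on the result like this:

    score, explanations = result  # assumed unpacking order, per the list above
    if score >= 70:  # the 0.7 threshold expressed as a percentage
        print("PASS")
    else:
        print("FAIL")
    for explanation in explanations:
        print(f"- {explanation}")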

When to Use

Use the BLEU Evaluator when you need to:

  • Evaluate machine translation systems
  • Assess text generation quality
  • Compare different text generation models
  • Set quality thresholds for automated text generation
  • Measure similarity between generated and reference text
  • Evaluate the performance of language models