The BLEU (Bilingual Evaluation Understudy) Evaluator assesses the quality of generated text by comparing it against a reference text. It uses n-gram precision to measure how closely the generated text matches the reference.

Purpose

The BLEU Evaluator is particularly useful when you need to:

  • Measure the similarity between generated and reference text
  • Evaluate machine translation quality
  • Assess text generation quality
  • Compare different text generation models
  • Set quality thresholds for text generation

How It Works

The evaluator calculates a BLEU score between 0 and 1 (or 0-100 when expressed as a percentage), where:

  • Score 0: the generated text shares no matching n-grams with the reference
  • Score 1: the generated text matches the reference exactly

The score is calculated using the following settings (a computation sketch follows the list):

  • N-gram precision (default: 1-gram)
  • Smoothing method (default: method1)
  • Customizable weights for different n-gram orders
  • Configurable threshold (default: 0.7)
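
The internals of BleuEvaluator are not shown here, but the same kind of score can be reproduced with NLTK's sentence_bleu. The sketch below is an assumption about the underlying metric rather than the library's actual code, and uses four n-gram orders with uniform weights and method1 smoothing:

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Tokenize the reference and the candidate; plain whitespace splitting keeps the sketch simple.
reference = "Paris is the capital of France .".split()
candidate = "The capital of France is Paris .".split()

# Uniform weights over four n-gram orders; method1 smoothing keeps the score from
# collapsing to 0 when a higher-order n-gram has no match.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

print(f"BLEU: {score:.3f}")  # value in [0, 1]; multiply by 100 for a percentage
print("Passes 0.7 threshold:", score >= 0.7)

Raising the number of n-gram orders makes the comparison stricter, since longer word sequences must match; the threshold then decides whether the evaluation counts as a pass.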

Usage Example

import asyncio

from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluators import BleuEvaluator


async def evaluate():
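    # Configure 4-gram BLEU with method1 smoothing and a pass threshold of 0.7.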
    evaluator = BleuEvaluator(
        threshold=0.7,
        n_grams=4,
        smoothing_method="method1"
    )
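    # Score the generated response against the expected reference text.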
    result = await evaluator.evaluate(
        response="The capital of France is Paris.",
        context=ExpectedResponseContext(
            expected_response="Paris is the capital of France."
        )
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(evaluate())

The evaluator returns a tuple containing (a short usage sketch follows the list):

  • A score (0-100) indicating the BLEU score percentage
  • A list of explanations including the BLEU score, n-gram configuration, and threshold comparison
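
For example, assuming the tuple unpacks as the score followed by the explanation list (an assumed order based on the description above), the evaluate() coroutine in the usage example could gate on the result like this:

    score, explanations = result  # assumed unpacking order, per the list above
    if score >= 70:  # the 0.7 threshold expressed as a percentage
        print("PASS")
    else:
        print("FAIL")
    for explanation in explanations:
        print(f"- {explanation}")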

When to Use

Use the BLEU Evaluator when you need to:

  • Evaluate machine translation systems
  • Assess text generation quality
  • Compare different text generation models
  • Set quality thresholds for automated text generation
  • Measure similarity between generated and reference text
  • Evaluate the performance of language models