In this guide, we will see how to create and configure a custom evaluator using the CustomEvaluator class. This lets you easily define your own LLM-as-a-judge for specific use cases.

Custom evaluators are particularly useful when you need to evaluate specific aspects of LLM responses that aren’t covered by the built-in evaluators, or when you need a specialized scoring system for your use case.

Creating a Custom Evaluator

The CustomEvaluator class allows you to define your own evaluation criteria with a custom scoring system. Here’s how to create one:

from trusttest.evaluators import CustomEvaluator

evaluator = CustomEvaluator(
    name="Trip Plan Accuracy",
    description="Validates that the trip plan matches the user's request and is logically consistent.",
    instructions="""
    **Instruction:** Evaluate the accuracy and completeness of the trip plan in the actual response against the user's prompt/question. Verify the following:
    1. **Flights:** Ensure the flight details are correct and match the user's request.
    2. **Itinerary:** Confirm that the daily activities and destinations align with the user's intended trip.
    3. **Logical Consistency:** Check that the trip plan is feasible.
    4. **Relevance:** Ensure the trip plan is relevant to the user's request.

    Deduct points for:
    - Missing or incorrect flight details.
    - Inaccurate or irrelevant activities.
    - Infeasible or illogical trip plans.
    - Lack of alignment with the user's request.
    """,
    threshold=3,
    score_range=(1, 5),
    scores=[
        {
            "score": 1,
            "description": "The trip plan is entirely incorrect or irrelevant to the user's request.",
        },
        {
            "score": 2,
            "description": "The trip plan is mostly incorrect or irrelevant.",
        },
        {
            "score": 3,
            "description": "The trip plan includes some correct elements but has multiple inaccuracies.",
        },
        {
            "score": 4,
            "description": "The trip plan matches the user's request with only minor inaccuracies.",
        },
        {
            "score": 5,
            "description": "The trip plan exactly matches the user's request with no inaccuracies.",
        },
    ],
)
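
With this configuration, the judge assigns each response a score from 1 to 5, and any response scoring 3 or above passes the evaluation.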

Custom Evaluator Parameters

  • name: A short, descriptive name for your evaluator
  • description: A description of what the evaluator checks
  • instructions: The instructions the LLM judge follows when scoring a response
  • threshold: The minimum score a response must receive to pass the evaluation
  • score_range: A (min, max) tuple defining the range of possible scores
  • scores: A list of score definitions, each pairing a numeric score with a description of what it means
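
To see how these parameters fit together on a different scale, here is a minimal sketch of a binary pass/fail evaluator. It uses only the constructor arguments documented above; the politeness criterion itself is purely illustrative:

from trusttest.evaluators import CustomEvaluator

# A minimal pass/fail judge on a binary 0-1 scale.
politeness_evaluator = CustomEvaluator(
    name="Politeness Check",
    description="Checks that the response is polite and professional in tone.",
    instructions="Score 1 if the response is polite and professional, otherwise score 0.",
    threshold=1,  # only a score of 1 passes
    score_range=(0, 1),
    scores=[
        {"score": 0, "description": "The response is rude or unprofessional."},
        {"score": 1, "description": "The response is polite and professional."},
    ],
)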

Using the Custom Evaluator

Once you’ve created your custom evaluator, you can use it in an evaluation scenario just like any other evaluator. The example below probes DummyEndpoint, a stand-in model from trusttest.models.testing, so it runs without a live endpoint:

from trusttest.dataset_builder import Dataset
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.models.testing import DummyEndpoint
from trusttest.probes import DatasetProbe

scenario = EvaluationScenario(
    description="This is a test scenario",
    name="Test Scenario",
    evaluator_suite=EvaluatorSuite(
        evaluators=[evaluator],
        criteria="all_fail",
    ),
)

dataset = Dataset.from_json(path="data/qa_dataset.json")
probe = DatasetProbe(
    model=DummyEndpoint(),
    dataset=dataset
)

test_set = probe.get_test_set()
results = scenario.evaluate(test_set)
results.display_summary()
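
If you don't already have a dataset file on disk, you could generate one first. The field names below are hypothetical — Dataset.from_json defines its own expected schema, so check the TrustTest dataset documentation and adjust accordingly:

import json
import os

# Hypothetical record shape -- the actual schema expected by
# Dataset.from_json may differ; verify against the TrustTest docs.
examples = [
    {
        "question": "Plan a 3-day trip to Rome with flights from London.",
        "expected_response": "A feasible 3-day Rome itinerary with matching flight details.",
    }
]

os.makedirs("data", exist_ok=True)
with open("data/qa_dataset.json", "w") as f:
    json.dump(examples, f, indent=2)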

Saving and Loading Custom Evaluators

Saving custom evaluators is currently supported only for NeuralTrust-type clients.

You can save your custom evaluator scenarios to the TrustTest platform for later use:

import trusttest
import os
from dotenv import load_dotenv

load_dotenv()

client = trusttest.client(
    type="neuraltrust",
    token=os.getenv("NEURALTRUST_TOKEN")
)

# Save the scenario
client.save_evaluation_scenario(scenario)

# Save the test set
client.save_evaluation_scenario_test_set(scenario.id, test_set)

# Save the evaluation results
client.save_evaluation_scenario_run(results)

# Load the scenario later
loaded_scenario = client.get_evaluation_scenario(scenario.id)
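
The client reads its API token from the environment via python-dotenv, so the .env file loaded above only needs the variable referenced in the os.getenv call:

NEURALTRUST_TOKEN=your-api-token-here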

Complete Example

import os
from dotenv import load_dotenv

import trusttest
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluators import CustomEvaluator
from trusttest.models.testing import DummyEndpoint
from trusttest.probes import DatasetProbe
from trusttest.dataset_builder import Dataset

load_dotenv()

client = trusttest.client(
    type="neuraltrust",
    token=os.getenv("NEURALTRUST_TOKEN")
)

dataset_path = "data/qa_dataset.json"
dataset = Dataset.from_json(path=dataset_path)
probe = DatasetProbe(
    model=DummyEndpoint(),
    dataset=dataset
)

evaluator = CustomEvaluator(
    name="Trip Plan Accuracy",
    description="Validates that the trip plan matches the user's request and is logically consistent.",
    instructions="""
    **Instruction:** Evaluate the accuracy and completeness of the trip plan in the actual response against the user's prompt/question. Verify the following:
    1. **Flights:** Ensure the flight details are correct and match the user's request.
    2. **Itinerary:** Confirm that the daily activities and destinations align with the user's intended trip.
    3. **Logical Consistency:** Check that the trip plan is feasible.
    4. **Relevance:** Ensure the trip plan is relevant to the user's request.
    """,
    threshold=3,
    score_range=(1, 5),
    scores=[
        {
            "score": 1,
            "description": "The trip plan is entirely incorrect or irrelevant to the user's request.",
        },
        {
            "score": 2,
            "description": "The trip plan is mostly incorrect or irrelevant.",
        },
        {
            "score": 3,
            "description": "The trip plan includes some correct elements but has multiple inaccuracies.",
        },
        {
            "score": 4,
            "description": "The trip plan matches the user's request with only minor inaccuracies.",
        },
        {
            "score": 5,
            "description": "The trip plan exactly matches the user's request with no inaccuracies.",
        },
    ],
)

scenario = EvaluationScenario(
    description="This is a test scenario",
    name="Test Scenario",
    evaluator_suite=EvaluatorSuite(
        evaluators=[evaluator],
        criteria="all_fail",
    ),
)

test_set = probe.get_test_set()
results = scenario.evaluate(test_set)

# Save to TrustTest platform
client.save_evaluation_scenario(scenario)
client.save_evaluation_scenario_test_set(scenario.id, test_set)
client.save_evaluation_scenario_run(results)

# Load and run again
loaded_scenario = client.get_evaluation_scenario(scenario.id)
results = loaded_scenario.evaluate(test_set)
results.display_summary()