Evaluation Context
The evaluation system in TrustTest is built around three main concepts: `EvaluationScenarios`, `EvaluatorSuites`, and `EvaluationContext`.
This architecture allows for flexible and comprehensive testing of language model responses through multiple evaluation criteria and formats.
Evaluation Scenarios
An Evaluation Scenario represents a specific test case or situation where we want to evaluate a language model’s response. Each scenario consists of:
- A descriptive name
- A detailed description of the test case
- An Evaluator Suite that defines how the response should be evaluated
Scenarios provide a structured way to define test cases and their expected outcomes, making it easier to maintain and understand the testing requirements.
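For illustration, a scenario might be defined roughly like this. The class and field names below are a sketch, not necessarily the exact TrustTest API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a scenario definition; names are illustrative.
@dataclass
class EvaluationScenario:
    name: str
    description: str
    evaluator_suite: object  # an EvaluatorSuite, covered below

scenario = EvaluationScenario(
    name="support-bot-refund-policy",
    description="The model must answer refund questions in the user's "
                "language and cite only well-formed URLs.",
    evaluator_suite=None,  # a suite; see the Evaluator Suites section
)
```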
Evaluation Context
Each evaluator receives the model response and the evaluation context.
The EvaluationContext system provides a structured way to define what data an evaluator needs to perform its assessment. This is implemented through different context types that specify the required information for each evaluation scenario.
Context Types
The system defines several context types; two types can be mixed in the same suite when one is a child class of the other (see the sketch after the list):
- `QuestionContext`: Contains the question being asked to the language model
- `ExpectedResponseContext`: Contains the expected response and, optionally, the original question. Since it inherits from `QuestionContext`, it can be used in the same suite as `QuestionContext`.
- `ObjectiveContext`: Contains descriptions of what counts as a correct and an incorrect response. Since it does not inherit from `QuestionContext`, it cannot be mixed with `QuestionContext` or `ExpectedResponseContext`.
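A minimal sketch of this hierarchy, assuming plain dataclasses (field names are illustrative, not the exact TrustTest definitions):

```python
from dataclasses import dataclass

@dataclass
class QuestionContext:
    question: str  # the question asked to the language model

@dataclass
class ExpectedResponseContext(QuestionContext):
    expected_response: str = ""
    # `question` is inherited from QuestionContext, so this type is valid
    # anywhere a QuestionContext is accepted. Per the docs, the original
    # question is optional here; a real implementation might express that
    # with a default value.

@dataclass
class ObjectiveContext:
    # Deliberately NOT a QuestionContext subclass, so it cannot be mixed
    # with the question-based contexts.
    correct_description: str
    incorrect_description: str
```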
Usage in Evaluators
Each evaluator can specify which context type it requires to perform its evaluation. This ensures that:
- Evaluators have all the necessary information to make their assessment
- The evaluation system can validate that required context is provided
- Different evaluators can work with different types of context data
- Context requirements are clearly documented and type-safe
For example, an evaluator that checks if a response matches an expected answer would require an `ExpectedResponseContext`, while an evaluator that checks for specific keywords might only need a `QuestionContext`.
This context system makes the evaluation process more robust and maintainable by clearly defining the data requirements for each type of evaluation.
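Building on the sketch above, an evaluator could declare its required context type roughly like this (the `context_type` attribute and `evaluate` method are assumptions for the sketch, not TrustTest's actual interface):

```python
class ExactMatchEvaluator:
    """Fails unless the response exactly matches the expected answer."""
    context_type = ExpectedResponseContext  # needs an expected response

    def evaluate(self, response: str, context: ExpectedResponseContext) -> bool:
        return response.strip() == context.expected_response.strip()


class KeywordEvaluator:
    """Passes only if every required keyword appears in the response."""
    context_type = QuestionContext  # only needs the question

    def __init__(self, keywords: list[str]) -> None:
        self.keywords = keywords

    def evaluate(self, response: str, context: QuestionContext) -> bool:
        return all(k.lower() in response.lower() for k in self.keywords)
```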
Mixing Context Types
You can mix context types in the same suite if one is a child class of the other. For example:
- `ExpectedResponseContext` is a child class of `QuestionContext`, so they can be used together in the same suite
- `ExpectedResponseContext` is not a child of `ObjectiveContext`, so they cannot be used together
With an IDE that has static type checking configured (such as mypy or Pylance), you will be able to see whether a context is valid for use in a suite.
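As a concrete illustration of what the type checker catches, assume a hypothetical run function that accepts the common base context:

```python
def run_suite(context: QuestionContext) -> None:
    """Stand-in for running a suite whose evaluators need a QuestionContext."""

ok = ExpectedResponseContext(question="What is 2 + 2?", expected_response="4")
run_suite(ok)   # fine: ExpectedResponseContext IS a QuestionContext

obj = ObjectiveContext(
    correct_description="Answers with the number 4",
    incorrect_description="Anything else",
)
run_suite(obj)  # mypy/Pylance error: ObjectiveContext is not a QuestionContext
```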
Evaluator Suites
An Evaluator Suite is a collection of individual evaluators that work together to assess a response. The suite provides a way to:
- Combine multiple evaluation criteria
- Define how the results should be aggregated
- Make a final decision about the response’s quality
Suite Criteria
The suite supports different criteria for determining overall failure (see the sketch after the list):
- `any_fail`: The response fails if any evaluator fails (default)
- `all_fail`: The response only fails if all evaluators fail
- `one_fail`: The response fails if exactly one evaluator fails
- `percentage_fail`: The response fails if a certain percentage of evaluators fail
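A minimal sketch of how these criteria could aggregate individual pass/fail results; the function and its `threshold` parameter are illustrative, and TrustTest's actual configuration may differ:

```python
def suite_fails(failed: list[bool], criteria: str = "any_fail",
                threshold: float = 0.5) -> bool:
    """True if the suite as a whole fails; failed[i] is True when
    evaluator i failed the response."""
    if criteria == "any_fail":
        return any(failed)
    if criteria == "all_fail":
        return all(failed)
    if criteria == "one_fail":
        return sum(failed) == 1
    if criteria == "percentage_fail":
        return sum(failed) / len(failed) >= threshold
    raise ValueError(f"unknown criteria: {criteria!r}")

assert suite_fails([False, True, False]) is True              # any_fail
assert suite_fails([False, True, False], "all_fail") is False
assert suite_fails([True, False], "percentage_fail") is True  # 50% >= 0.5
```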
Combining Evaluators
The power of Evaluator Suites comes from their ability to combine different types of evaluators to tackle different aspects of the response. This combination allows for:
- Comprehensive Testing: Different aspects of the response can be evaluated simultaneously
- Flexible Requirements: Different scenarios can have different evaluation criteria
- Graded Assessment: Some aspects can be more important than others
- Defense in Depth: Multiple evaluators can catch different types of failures
Example
This example shows how different evaluators can work together to provide a comprehensive assessment of a response's quality: here, we check both that any URL in the response is valid and that the response language matches the question language.
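A hedged sketch of such a suite, reusing the illustrative context and evaluator style from earlier. `UrlEvaluator` and `LanguageMatchEvaluator` are hypothetical names, and language detection is delegated to the third-party langdetect package:

```python
import re
from urllib.parse import urlparse

from langdetect import detect  # pip install langdetect


class UrlEvaluator:
    """Fails if the response contains a URL without a valid host."""
    context_type = QuestionContext

    def evaluate(self, response: str, context: QuestionContext) -> bool:
        urls = re.findall(r"https?://\S+", response)
        return all(urlparse(u).netloc for u in urls)


class LanguageMatchEvaluator:
    """Fails if the response language differs from the question language."""
    context_type = QuestionContext

    def evaluate(self, response: str, context: QuestionContext) -> bool:
        return detect(response) == detect(context.question)


context = QuestionContext(question="¿Dónde está la documentación?")
response = "La documentación está en https://docs.example.com"
evaluators = [UrlEvaluator(), LanguageMatchEvaluator()]

# any_fail semantics: the response fails if either evaluator fails.
print("suite fails:", any(not e.evaluate(response, context) for e in evaluators))
```

Because both evaluators only require a `QuestionContext`, they can share the same suite and, under the default `any_fail` criterion, either check alone can fail the response.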