Quickstart
To start using TrustTest, install the package in your Python environment:
uv add trusttest
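If you prefer pip over uv, the equivalent command should be the following (this assumes the package is published on PyPI under the same name):
pip install trusttest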
For this quickstart, we are going to run a basic functional test against a dummy API and save the test locally.
TrustTest includes a set of dummy models so you can easily try out the library.
from trusttest.models.testing import DummyEndpoint
model = DummyEndpoint()
response = model.respond("Hello, how are you?")
print(response)
This dummy model simply has a fixed set of responses for a fixed set of inputs. For any other input, it returns "I don't know the answer to that question."
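For example, asking something outside that fixed set should return the fallback answer (the question below is just an arbitrary unrecognized input):
# Any input the dummy model does not recognize falls back to the default reply.
fallback = model.respond("What is the weather like today?")
print(fallback)  # expected: "I don't know the answer to that question."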
When our model is ready, we can choose the probe that will generate the test cases to evaluate the model.
In this case, we are going to use DatasetProbe to generate test cases from a dataset.
from trusttest.dataset_builder import Dataset, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.models.testing import DummyEndpoint
from trusttest.probes.dataset import DatasetProbe
model = DummyEndpoint()
probe = DatasetProbe(
    model=model,
    dataset=Dataset(
        [
            [
                DatasetItem(
                    question="What is Python?",
                    context=ExpectedResponseContext(
                        expected_response="Python is a high-level, interpreted programming language."
                    ),
                )
            ],
            [
                DatasetItem(
                    question="What is the capital of France?",
                    context=ExpectedResponseContext(
                        expected_response="The capital of France is Paris."
                    ),
                )
            ],
        ]
    ),
)
test_set = probe.get_test_set()
The generated test_set has two test cases. A test case is a set of questions and model responses, plus other metadata used for evaluation.
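If you want a quick sanity check of what the probe produced before evaluating, printing the object is enough for this quickstart (the printed representation depends on the library version):
# Quick look at the generated test cases.
print(test_set)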
When our test_set is ready, we can define which evaluation metrics and criteria we want to use to evaluate the model.
from trusttest.dataset_builder import Dataset, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluators import (
BleuEvaluator,
ExpectedLanguageEvaluator,
)
from trusttest.models.testing import DummyEndpoint
from trusttest.probes.dataset import DatasetProbe
model = DummyEndpoint()
probe = DatasetProbe(...)
test_set = probe.get_test_set()
scenario = EvaluationScenario(
    name="Quickstart Functional Test",
    description="Functional test example.",
    evaluator_suite=EvaluatorSuite(
        evaluators=[
            BleuEvaluator(threshold=0.3),
            ExpectedLanguageEvaluator(expected_language="en"),
        ],
        criteria="any_fail",
    ),
)
In this Evaluation Scenario we are using the BleuEvaluator and the ExpectedLanguageEvaluator, with the any_fail criteria. This means that if any of the evaluators fails, the scenario fails.
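As an illustration only (the single-evaluator setup and the 0.7 threshold below are arbitrary, not recommended values), you could make the suite stricter or more lenient by changing its evaluators:
# A BLEU-only suite with a higher similarity threshold, reusing the same criteria.
strict_suite = EvaluatorSuite(
    evaluators=[BleuEvaluator(threshold=0.7)],
    criteria="any_fail",
)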
Now that we have defined our model and the way to evaluate it, we are ready to get the evaluation results.
# ...
results = scenario.evaluate(test_set)
results.display()
results.display_summary()
If everything is working as expected, the results should be displayed in the console.
And that's it! 🎉 You have just created your first functional test with TrustTest. Go to the next section to see everything that is possible with TrustTest. Here is the full example:
from trusttest.dataset_builder import Dataset, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluators import (
BleuEvaluator,
ExpectedLanguageEvaluator,
)
from trusttest.models.testing import DummyEndpoint
from trusttest.probes.dataset import DatasetProbe
model = DummyEndpoint()
probe = DatasetProbe(
    model=model,
    dataset=Dataset(
        [
            [
                DatasetItem(
                    question="What is Python?",
                    context=ExpectedResponseContext(
                        expected_response="Python is a high-level, interpreted programming language."
                    ),
                )
            ],
            [
                DatasetItem(
                    question="What is the capital of France?",
                    context=ExpectedResponseContext(
                        expected_response="The capital of France is Paris."
                    ),
                )
            ],
        ]
    ),
)
test_set = probe.get_test_set()
scenario = EvaluationScenario(
    name="Quickstart Functional Test",
    description="Functional test example.",
    evaluator_suite=EvaluatorSuite(
        evaluators=[
            BleuEvaluator(threshold=0.3),
            ExpectedLanguageEvaluator(expected_language="en"),
        ],
        criteria="any_fail",
    ),
)
results = scenario.evaluate(test_set)
results.display()
results.display_summary()