Step 1: Evaluation Target
Define the model that we are going to evaluate.
Step 2: Probe
Define the probe that will generate the test cases.
Use the DatasetProbe to generate test cases from a dataset.
The resulting test_set has two test cases. A test case is a set of questions and model responses, along with other metadata used for evaluation.
Step 3: Evaluation Scenario
Define the evaluation metrics and criteria.
With the test_set read, we can define which evaluation metrics and criteria to use to evaluate the target.
This example uses the BleuEvaluator and the ExpectedLanguageEvaluator, with the any_fail criteria, to evaluate the target.
So if any of the evaluators fails, the scenario fails.
Step 4: Run the Scenario
Evaluate the test set.
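To make the any_fail behaviour concrete, here is a minimal, library-independent sketch of what running a scenario over a test set might look like. It does not use the library's real API: SketchTestCase, non_empty, contains_expected, any_fail (as a plain function), and run_scenario are all hypothetical names introduced for illustration, and the toy evaluators stand in for real ones such as a BLEU or language check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for a test case: question, response, metadata."""
    question: str
    response: str
    metadata: Dict[str, str] = field(default_factory=dict)


# In this sketch an evaluator is any function returning True (pass) or False (fail).
Evaluator = Callable[[SketchTestCase], bool]


def non_empty(case: SketchTestCase) -> bool:
    """Toy evaluator: passes when the response contains non-whitespace text."""
    return bool(case.response.strip())


def contains_expected(case: SketchTestCase) -> bool:
    """Toy evaluator: passes when the expected answer appears in the response."""
    return case.metadata.get("expected", "") in case.response


def any_fail(results: List[bool]) -> bool:
    """Assumed semantics: the scenario passes only if no evaluator failed."""
    return all(results)


def run_scenario(test_set: List[SketchTestCase],
                 evaluators: List[Evaluator]) -> bool:
    """Apply every evaluator to every test case and combine with any_fail."""
    results = [evaluate(case) for case in test_set for evaluate in evaluators]
    return any_fail(results)


if __name__ == "__main__":
    test_set = [
        SketchTestCase("What is 2 + 2?", "The answer is 4.", {"expected": "4"}),
        SketchTestCase("Capital of France?", "", {"expected": "Paris"}),
    ]
    # The second case has an empty response, so at least one evaluator fails.
    print("scenario passed:", run_scenario(test_set, [non_empty, contains_expected]))
```

Combining results with all() means a single failing evaluator on a single test case is enough to fail the whole scenario, which matches the description of any_fail above.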
Complete Example
Full Python code for the quickstart.