In this guide, we'll see how to create and configure a custom evaluator using the `CustomEvaluator` class. This lets you easily define your own LLM-as-a-judge for specific use cases.
Custom evaluators are particularly useful when you need to evaluate specific aspects of LLM responses that aren’t covered by the built-in evaluators, or when you need a specialized scoring system for your use case.
The CustomEvaluator class allows you to define your own evaluation criteria with a custom scoring system. Here’s how to create one:
```python
from trusttest.evaluators import CustomEvaluator

evaluator = CustomEvaluator(
    name="Trip Plan Accuracy",
    description="Validates that the trip plan matches the user's request and is logically consistent.",
    instructions="""
    **Instruction:**
    Evaluate the accuracy and completeness of the trip plan in the actual
    response against the user's prompt/question.

    Verify the following:
    1. **Flights:** Ensure the flight details are correct and match the user's request.
    2. **Itinerary:** Confirm that the daily activities and destinations align with the user's intended trip.
    3. **Logical Consistency:** Check that the trip plan is feasible.
    4. **Relevance:** Ensure the trip plan is relevant to the user's request.

    Deduct points for:
    - Missing or incorrect flight details.
    - Inaccurate or irrelevant activities.
    - Infeasible or illogical trip plans.
    - Lack of alignment with the user's request.
    """,
    threshold=3,
    score_range=(1, 5),
    scores=[
        {
            "score": 1,
            "description": "The trip plan is entirely incorrect or irrelevant to the user's request.",
        },
        {
            "score": 2,
            "description": "The trip plan is mostly incorrect or irrelevant.",
        },
        {
            "score": 3,
            "description": "The trip plan includes some correct elements but has multiple inaccuracies.",
        },
        {
            "score": 4,
            "description": "The trip plan matches the user's request with only minor inaccuracies.",
        },
        {
            "score": 5,
            "description": "The trip plan exactly matches the user's request with no inaccuracies.",
        },
    ],
)
```
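A few notes on the parameters: `threshold` sets the passing cutoff, so with `threshold=3` and `score_range=(1, 5)` a response presumably passes when the judge scores it 3 or higher, and the `scores` list gives the judge an explicit rubric for each score, which tends to make LLM-as-a-judge grading more consistent than free-form instructions alone.

Once defined, the evaluator plugs in like any built-in one. Here is a minimal sketch of the wiring, using the same `EvaluatorSuite` and `EvaluationScenario` classes as the complete example further below (the scenario name and description here are placeholders):

```python
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite

# Wrap the custom evaluator in a suite; "all_fail" is the criteria
# value used in the complete example below.
scenario = EvaluationScenario(
    name="Trip Plan Scenario",
    description="Scores trip plans with the custom judge.",
    evaluator_suite=EvaluatorSuite(
        evaluators=[evaluator],
        criteria="all_fail",
    ),
)
```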
Saving custom evaluators is currently only supported for NeuralTrust-type clients.
You can save your custom evaluator scenarios to the TrustTest platform for later use:
```python
import trusttest
import os
from dotenv import load_dotenv

load_dotenv()

client = trusttest.client(
    type="neuraltrust",
    token=os.getenv("NEURALTRUST_TOKEN")
)

# Save the scenario
client.save_evaluation_scenario(scenario)

# Save the test set
client.save_evaluation_scenario_test_set(scenario.id, test_set)

# Save the evaluation results
client.save_evaluation_scenario_run(results)

# Load the scenario later
loaded_scenario = client.get_evaluation_scenario(scenario.id)
```
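This snippet assumes `scenario`, `test_set`, and `results` already exist; the complete example below shows how they are created. It also assumes the client can read your API token from the environment: with `python-dotenv`, a `.env` file in your project root containing a line like `NEURALTRUST_TOKEN=your-api-token` (the variable name the code reads) is enough.

Putting it all together, the following end-to-end example builds a test set from a dataset probe, defines the custom evaluator, runs the evaluation, and saves everything to the platform: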
```python
import os
from dotenv import load_dotenv

import trusttest
from trusttest.evaluation_scenarios import EvaluationScenario
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluators import CustomEvaluator
from trusttest.targets.testing import DummyTarget
from trusttest.probes import DatasetProbe
from trusttest.dataset_builder import Dataset

load_dotenv()

client = trusttest.client(
    type="neuraltrust",
    token=os.getenv("NEURALTRUST_TOKEN")
)

dataset_path = "data/qa_dataset.json"
dataset = Dataset.from_json(path=dataset_path)

probe = DatasetProbe(
    target=DummyTarget(),
    dataset=dataset
)

evaluator = CustomEvaluator(
    name="Trip Plan Accuracy",
    description="Validates that the trip plan matches the user's request and is logically consistent.",
    instructions="""
    **Instruction:**
    Evaluate the accuracy and completeness of the trip plan in the actual
    response against the user's prompt/question.

    Verify the following:
    1. **Flights:** Ensure the flight details are correct and match the user's request.
    2. **Itinerary:** Confirm that the daily activities and destinations align with the user's intended trip.
    3. **Logical Consistency:** Check that the trip plan is feasible.
    4. **Relevance:** Ensure the trip plan is relevant to the user's request.
    """,
    threshold=3,
    score_range=(1, 5),
    scores=[
        {
            "score": 1,
            "description": "The trip plan is entirely incorrect or irrelevant to the user's request.",
        },
        {
            "score": 2,
            "description": "The trip plan is mostly incorrect or irrelevant.",
        },
        {
            "score": 3,
            "description": "The trip plan includes some correct elements but has multiple inaccuracies.",
        },
        {
            "score": 4,
            "description": "The trip plan matches the user's request with only minor inaccuracies.",
        },
        {
            "score": 5,
            "description": "The trip plan exactly matches the user's request with no inaccuracies.",
        },
    ],
)

scenario = EvaluationScenario(
    description="This is a test scenario",
    name="Test Scenario",
    evaluator_suite=EvaluatorSuite(
        evaluators=[evaluator],
        criteria="all_fail",
    ),
)

test_set = probe.get_test_set()
results = scenario.evaluate(test_set)

# Save to TrustTest platform
client.save_evaluation_scenario(scenario)
client.save_evaluation_scenario_test_set(scenario.id, test_set)
client.save_evaluation_scenario_run(results)

# Load and run again
loaded_scenario = client.get_evaluation_scenario(scenario.id)
results = loaded_scenario.evaluate(test_set)
results.display_summary()
```
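After the run, `results.display_summary()` prints a summary of the evaluation. Reloading the scenario with `get_evaluation_scenario` and evaluating again, as the last lines do, verifies that the saved scenario round-trips through the platform.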