Use existing datasets to create functional tests for your AI model. This approach is ideal when you have curated Q&A pairs, golden datasets, or historical test cases.

Overview

Dataset-based functional testing allows you to:
  • Use curated test cases: Leverage carefully crafted Q&A pairs
  • Ensure reproducibility: Same tests across runs
  • Import existing datasets: Use your organization’s test data
  • Track regressions: Compare results over time

Supported Formats

Format  | Description                     | Best For
------- | ------------------------------- | ------------------------
YAML    | Human-readable, easy to edit    | Small to medium datasets
JSON    | Structured, programmatic access | API-generated datasets
Parquet | Efficient storage, large scale  | Large datasets

Dataset Structure

Basic Structure

Each test case consists of:
  • question: The input to send to the model
  • context: Expected response or evaluation criteria

# functional_tests.yaml
- - question: "What is the capital of France?"
    context:
      expected_response: "The capital of France is Paris."

- - question: "How do I reset my password?"
    context:
      expected_response: "To reset your password, go to Settings > Security > Reset Password."

With Evaluation Criteria

- - question: "Explain machine learning in simple terms"
    context:
      expected_response: "Machine learning is a type of AI where computers learn from data."
      evaluation_criteria: "Should mention learning from data, avoid technical jargon"

Code Example

Loading from YAML

from trusttest.probes.dataset import DatasetProbe
from trusttest.dataset_builder.base import Dataset
from trusttest.targets.http import HttpTarget, PayloadConfig
from trusttest.evaluators.llm_judges import CorrectnessEvaluator
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluation_scenarios import EvaluationScenario

# Configure target
target = HttpTarget(
    url="https://your-model-endpoint.com/chat",
    headers={"Content-Type": "application/json"},
    payload_config=PayloadConfig(
        format={"messages": [{"role": "user", "content": "{{ test }}"}]},
        message_regex="{{ test }}",
    ),
)

# Load dataset
dataset = Dataset.from_yaml("functional_tests.yaml")

# Create probe
probe = DatasetProbe(target=target, dataset=dataset)

# Generate test set
test_set = probe.get_test_set()

# Evaluate
evaluator = CorrectnessEvaluator()
suite = EvaluatorSuite(evaluators=[evaluator])
scenario = EvaluationScenario(evaluator_suite=suite)

results = scenario.evaluate(test_set)
results.display_summary()

Loading from JSON

dataset = Dataset.from_json("functional_tests.json")
probe = DatasetProbe(target=target, dataset=dataset)
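
If you are unsure of the exact layout from_json expects, one way to get a valid file is to build a small dataset with the programmatic API (see Creating Datasets Programmatically below), save it with to_json, and inspect or extend the result. This sketch uses only calls shown elsewhere on this page.

from trusttest.dataset_builder.base import Dataset, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext

# Build a one-item dataset in code and save it to see the expected JSON layout.
seed = Dataset(items=[
    [DatasetItem(
        question="What is the capital of France?",
        context=ExpectedResponseContext(
            expected_response="The capital of France is Paris."
        ),
    )],
])
seed.to_json("functional_tests.json")

# Round-trip: load it back like any other JSON dataset.
dataset = Dataset.from_json("functional_tests.json")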

Loading from Parquet

dataset = Dataset.from_parquet("functional_tests.parquet")
probe = DatasetProbe(target=target, dataset=dataset)
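
Parquet files are normally generated programmatically rather than written by hand. The snippet below is a minimal sketch using pandas; it assumes the Parquet file carries the same fields as the YAML example (a question column plus a nested context with expected_response). The exact schema from_parquet expects is not documented on this page, so verify the assumption against your version of the library, for example by comparing with a dataset you know loads correctly.

import pandas as pd  # writing Parquet requires pyarrow or fastparquet

from trusttest.dataset_builder.base import Dataset

# Assumed layout: one row per test case, context stored as a nested dict.
rows = [
    {
        "question": "What is the capital of France?",
        "context": {"expected_response": "The capital of France is Paris."},
    },
]
pd.DataFrame(rows).to_parquet("functional_tests.parquet")

dataset = Dataset.from_parquet("functional_tests.parquet")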

Creating Datasets Programmatically

from trusttest.dataset_builder.base import Dataset, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext

# Create test cases
items = [
    [DatasetItem(
        question="What are your business hours?",
        context=ExpectedResponseContext(
            expected_response="We are open Monday to Friday, 9 AM to 5 PM."
        ),
    )],
    [DatasetItem(
        question="How can I contact support?",
        context=ExpectedResponseContext(
            expected_response="You can reach support at [email protected] or call 1-800-SUPPORT."
        ),
    )],
]

dataset = Dataset(items=items)

# Save for later use
dataset.to_yaml("my_tests.yaml")
dataset.to_json("my_tests.json")

Combining Multiple Datasets

# Load multiple datasets
general_tests = Dataset.from_yaml("general_tests.yaml")
edge_cases = Dataset.from_yaml("edge_cases.yaml")
regression_tests = Dataset.from_yaml("regression_tests.yaml")

# Combine
combined_items = (
    general_tests.items + 
    edge_cases.items + 
    regression_tests.items
)

combined_dataset = Dataset(items=combined_items)
probe = DatasetProbe(target=target, dataset=combined_dataset)

Evaluation Options

Exact Match

from trusttest.evaluators.heuristics import EqualsEvaluator

evaluator = EqualsEvaluator()  # Exact string match

Semantic Similarity

from trusttest.evaluators.llm_judges import CorrectnessEvaluator

evaluator = CorrectnessEvaluator()  # LLM judges semantic correctness

BLEU Score

from trusttest.evaluators.heuristics import BleuEvaluator

evaluator = BleuEvaluator(threshold=0.7)  # BLEU score threshold
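
Combining Evaluators

Evaluators can also be combined so each test case is checked in more than one way. The sketch below reuses the EvaluatorSuite and EvaluationScenario flow from the Loading from YAML example; it assumes EvaluatorSuite accepts several evaluators in the same evaluators list, which the earlier example suggests but does not show.

from trusttest.evaluators.heuristics import BleuEvaluator
from trusttest.evaluators.llm_judges import CorrectnessEvaluator
from trusttest.evaluator_suite import EvaluatorSuite
from trusttest.evaluation_scenarios import EvaluationScenario

# Combine a heuristic check with an LLM judge in one suite
# (assumes EvaluatorSuite accepts multiple evaluators at once).
suite = EvaluatorSuite(evaluators=[
    BleuEvaluator(threshold=0.7),
    CorrectnessEvaluator(),
])

scenario = EvaluationScenario(evaluator_suite=suite)
results = scenario.evaluate(test_set)  # test_set from probe.get_test_set()
results.display_summary()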

Best Practices

  1. Diverse test cases: Include various question types and topics
  2. Clear expectations: Write unambiguous expected responses
  3. Edge cases: Include boundary conditions and unusual inputs
  4. Regular updates: Add new test cases as you discover issues
  5. Version control: Track dataset changes alongside code