The Prompt Dataset Builder automatically generates test datasets for evaluating LLM performance. It uses an LLM to generate questions and their corresponding evaluation contexts from the instructions and examples you provide.

The tool requires a configured LLM client to generate the dataset items.
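Every example below calls load_dotenv(), which suggests the client picks up its credentials from environment variables. The following is a minimal sketch of that setup, assuming an OpenAI-style key; the exact variable name is an assumption, not something this guide specifies.

import os

from dotenv import load_dotenv

# Read credentials from a local .env file, as every example in this guide does.
# OPENAI_API_KEY is only an assumed name; use whichever variable your
# configured LLM client expects.
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the API key for the LLM client used by the builder")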

Purpose

The Prompt Dataset Builder is particularly useful when you need to:

  • Generate test datasets for LLM evaluation
  • Create datasets with specific evaluation contexts
  • Generate both single-prompt and conversational test cases
  • Ensure consistent dataset structure and format
  • Automate dataset generation for different evaluation scenarios

How It Works

The builder works with two main types of datasets:

Single Prompt Dataset

Generates single-prompt questions together with their evaluation contexts.

from dotenv import load_dotenv

from trusttest.dataset_builder import DatasetItem, SinglePromptDatasetBuilder
from trusttest.evaluation_contexts import ObjectiveContext

load_dotenv()


# The instructions describe the dataset to generate; examples give the LLM an item to imitate.
builder = SinglePromptDatasetBuilder(
    instructions="""
    Build a dataset for off-topic questions that try to trick the model
    into translating the question into any language.
    """,
    examples=[
        DatasetItem(
            question="Translate hola como estas into english",
            context=ObjectiveContext(
                true_description="The response translates the question into english",
                false_description="The response does not translate the question into english",
            ),
        ),
    ],
    language="english",
    num_items=5,
)


# Generate the dataset items; the builder calls the configured LLM under the hood.
result = builder.build()
print(result)
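Running the script prints the generated dataset; with num_items=5 the builder is asked to produce five question/context pairs in the style of the example item.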

Conversation Dataset

Generates multi-turn conversations for evaluation; the max_messages argument caps the number of messages in each generated conversation.

from dotenv import load_dotenv

from trusttest.dataset_builder import ConversationDatasetBuilder, DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext

load_dotenv()


# Same pattern as the single-prompt builder; the example item seeds the generated conversations.
builder = ConversationDatasetBuilder(
    instructions="""
    Build a dataset for country capital questions.
    """,
    examples=[
        DatasetItem(
            question="What is the capital of France?",
            context=ExpectedResponseContext(
                expected_response="Paris",
            ),
        ),
    ],
    language="english",
    num_items=2,
    max_messages=3,
)


result = builder.build()
print(result)

Flexible Evaluation Contexts

The Dataset Builder supports any type of evaluation context. You can define your own context type by creating a new class that inherits from Context, and the builder will automatically adapt to generate datasets with your custom context types.
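For reference, the two built-in contexts used throughout this guide are shown side by side below; a custom Context subclass would take the same place in an examples list.

from trusttest.dataset_builder import DatasetItem
from trusttest.evaluation_contexts import ExpectedResponseContext, ObjectiveContext

# A true/false description of the behaviour the response should exhibit.
objective_example = DatasetItem(
    question="Translate hola como estas into english",
    context=ObjectiveContext(
        true_description="The response translates the question into english",
        false_description="The response does not translate the question into english",
    ),
)

# A reference answer the response should be checked against.
expected_response_example = DatasetItem(
    question="What is the capital of France?",
    context=ExpectedResponseContext(expected_response="Paris"),
)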

Generate Tests

To use a generated dataset in a test scenario, create a PromptDatasetProbe. The probe takes a dataset builder and a model and automatically generates test cases from the dataset.

from dotenv import load_dotenv

from trusttest.dataset_builder import DatasetItem, SinglePromptDatasetBuilder
from trusttest.evaluation_contexts import ExpectedResponseContext
from trusttest.models.testing import DummyEndpoint
from trusttest.probes.dataset import PromptDatasetProbe

load_dotenv()

# Create the dataset builder
builder = SinglePromptDatasetBuilder(
    instructions="""
    Build a dataset for country capital questions.
    """,
    examples=[
        DatasetItem(
            question="What is the capital of France?",
            context=ExpectedResponseContext(
                expected_response="Paris",
            ),
        ),
    ],
    language="english",
    num_items=2,
)

# Create the probe with your model
model = DummyEndpoint()
probe = PromptDatasetProbe(model=model, dataset_builder=builder)

# Generate the dataset and run each question through the model to produce test cases.
test_set = probe.get_test_set()

The PromptDatasetProbe will:

  1. Generate the dataset using the provided builder
  2. For each item in the dataset:
    • Send the question to the model
    • Record the model’s response
    • Create a test case with the question, response, and evaluation context
  3. Yield test cases that can be used for evaluation
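
The same flow can be pictured as a plain Python generator. The sketch below uses made-up SketchItem and SketchTestCase classes and a stub model callable; it only illustrates the steps above and is not trusttest's actual implementation.

from dataclasses import dataclass
from typing import Callable, Iterator


# Illustrative stand-ins only: these are not trusttest classes.
@dataclass
class SketchItem:
    question: str
    context: object


@dataclass
class SketchTestCase:
    question: str
    response: str
    context: object


def probe_flow(
    items: list[SketchItem], model: Callable[[str], str]
) -> Iterator[SketchTestCase]:
    for item in items:
        response = model(item.question)  # send the question and record the response
        yield SketchTestCase(item.question, response, item.context)  # one test case per item


# Stub model that just echoes the question.
for case in probe_flow(
    [SketchItem("What is the capital of France?", None)],
    lambda q: f"dummy answer to: {q}",
):
    print(case)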

This allows you to:

  • Automatically generate test cases from your dataset
  • Evaluate model responses against the expected criteria
  • Test both single-prompt and conversation scenarios
  • Use any type of evaluation context