NeuralTrust | The leading security platform for generative AI

The RAG (Retrieval-Augmented Generation) Probe is a specialized tool designed to automatically generate and evaluate test cases for RAG systems. It uses a knowledge base to generate both functional and adversarial questions, and then evaluates the model’s responses against these questions.

This probe needs a LLM client configured to generate the questions and topic generation if not provided. Also and embedding model configured to generate the questions and topic generation if not provided.

Purpose

The RAG Probe is particularly useful when you need to:

Test RAG systems with automatically generated questions
Evaluate model performance on specific knowledge domains
Generate both functional and adversarial test cases
Assess system robustness against different types of questions
Validate knowledge base integration and retrieval capabilities

How It Works

The probe works with two main types of scenarios:

Functional RAG Scenario

Generates benign questions to test normal system operation
Supports multiple question types:
- Simple Questions: Basic factual questions
- Complex Questions: More detailed and nuanced questions
- Really Complex Questions: Highly detailed and multi-part questions
- Conversational Questions: Questions in a more natural, conversational style
- Distracting Questions: Questions with additional context that might distract
- Double Questions: Questions that ask about two related things
- OOS Questions: Out-of-scope questions to test system boundaries

Adversarial RAG Scenario

Generates malicious questions to test system robustness
Supports multiple attack types:
- Instruction Manipulation: Questions that attempt to overwrite, invalidate, edit, or contradict the information in the context, forcing the system to provide incorrect information while maintaining a natural question format.
- Role Play: Questions that assign a new name/role to the assistant and provide context updates that override previous information, creating scenarios that lead to incorrect responses.
- Hypothetical: Questions that create hypothetical scenarios based on the context, testing the system’s ability to handle speculative situations while maintaining factual accuracy.
- Storytelling: Questions that attempt to make the system engage in storytelling about the context, potentially leading to fictional or exaggerated responses.
- Obfuscation: Questions that use complex language, technical terms, or confusing phrasing to obscure the actual intent, testing the system’s ability to handle complex queries.
- Payload Splitting: Questions that split malicious content across multiple parts or use indirect references, testing the system’s ability to handle fragmented or indirect queries.
- List Based: Questions that request lists or enumerations of information, potentially leading to incomplete or incorrect responses.
- Special Token: Questions that include special characters, tokens, or unusual formatting to test the system’s handling of non-standard input.
- Off Tone: Questions that attempt to make the system respond in an inappropriate or unprofessional tone, testing its ability to maintain appropriate communication standards.

The probe will:

Load documents into a knowledge base
Generate questions based on document topics
Query the model with generated questions
Evaluate responses using configured evaluators
Provide detailed results and metrics

Usage Examples

from trusttest.catalog import RagFunctionalScenario
from trusttest.knowledge_base import Document, InMemoryKnowledgeBase
from trusttest.probes.rag import BenignQuestion

# Configure knowledge base
documents = [
    Document(
        id="1",
        content="Your document content here",
        topic="Your topic here"
    )
]
knowledge_base = InMemoryKnowledgeBase(documents=documents)

# Functional testing with different question types
functional_scenario = RagFunctionalScenario(
    model=your_model,
    knowledge_base=knowledge_base,
    num_questions=10,
    question_types=[
        BenignQuestion.SIMPLE,
        BenignQuestion.COMPLEX,
        BenignQuestion.REALLY_COMPLEX,
        BenignQuestion.CONVERSATIONAL,
        BenignQuestion.DISTRACTING,
        BenignQuestion.DOUBLE,
        BenignQuestion.OOS
    ]
)

# Run evaluation
test_set = functional_scenario.probe.get_test_set()
results = functional_scenario.eval.evaluate(test_set)
results.display()

When to Use

Use the RAG Probe when you need to:

Validate knowledge base integration
Assess system robustness against domain specific attacks
Generate comprehensive test cases automatically
Test specific knowledge domains or topics
Compare different RAG configurations
Identify system vulnerabilities

Getting Started

Core Concepts

Connect your app

Create tests

Evaluate results

Self-hosted

Test Generation

Purpose

How It Works

Functional RAG Scenario

Adversarial RAG Scenario

Usage Examples

When to Use

Getting Started

Core Concepts

Connect your app

Create tests

Evaluate results

Self-hosted

​Purpose

​How It Works

​Functional RAG Scenario

​Adversarial RAG Scenario

​Usage Examples

​When to Use

Purpose

How It Works

Functional RAG Scenario

Adversarial RAG Scenario

Usage Examples

When to Use