# Overview

`KnowledgeBase` is a collection of documents used to generate test cases for a specific domain or task. It is usually referred to as the vector database in retrieval-augmented generation (RAG).

These documents are grouped by topic. If no topics are defined, the `KnowledgeBase` generates them automatically.

## Connectors

### AzureKnowledgeBase

Leverages **Azure Cognitive Search** and is best suited for:

* Cloud-based document indexing and storage
* Full-text search with advanced filtering and ranking
* Integration with Microsoft’s AI-powered search stack

### Neo4jKnowledgeBase

Built on **Neo4j**, this connector excels at:

* Handling complex document relationships
* Graph-based querying and clustering
* Constructing dynamic knowledge graphs

### PostgresKnowledgeBase

Leverages **Postgres** with **pgvector** and is best suited for:

* Full-text search with advanced filtering and ranking
* Graph-based querying and clustering

### InMemoryKnowledgeBase

A minimal, no-dependency implementation designed for:

* Prototyping and local testing
* Lightweight, quick-start environments
* Small-scale document classification

## Topic Creation Process

The topic creation pipeline groups unlabeled documents into coherent topics using embeddings, dimensionality reduction, clustering, and LLM-based summarization.

<Note>This process is triggered when no predefined (seed) topics are provided.</Note>

1. **Document Retrieval**
   * Pulls all documents from Azure Cognitive Search using the mapped `id` and `content` fields.
   * Filters out empty or whitespace-only content.

2. **Embedding Generation**
   * Applies an embedding model to each document’s content (truncated to `3 * max_tokens`).
   * Produces high-dimensional semantic vectors for clustering.

3. **Dimensionality Reduction**
   * Uses **UMAP** to reduce embedding vectors to a lower-dimensional space for clustering.
   * Parameters such as `n_neighbors`, `n_components`, and initialization strategy are tuned based on document count.

4. **Topic Clustering**
   * Runs **HDBSCAN** over the reduced vectors to group documents into topic clusters.
   * Documents labeled as noise or outliers (`label = -1`) are discarded.

5. **LLM-based Topic Naming**
   * For each valid topic cluster, generates a name using a language model target.
   * Uses up to `max_docs` samples per topic and truncates each sample to `max_doc_length`.

6. **Return Structure**
   * Returns:
     * A dictionary mapping topic names to associated documents.
     * A flat list of all topic names.
     * A flat list of all processed documents.
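
The bookkeeping around the pipeline (filtering, truncation, noise removal, and the return structure) can be sketched in plain Python. This is a minimal illustration, not the library's implementation. All function names here are hypothetical. The cluster labels are hard-coded stand-ins for what HDBSCAN would produce over UMAP-reduced embeddings, and `name_topic` is a stub in place of the LLM call. The sketch also assumes that `3 * max_tokens` and `max_doc_length` are measured in characters, and that the "processed documents" list contains only documents that ended up in a topic.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label HDBSCAN assigns to noise/outlier points


def filter_documents(docs):
    """Step 1: drop documents whose content is empty or whitespace-only."""
    return [d for d in docs if d.get("content") and d["content"].strip()]


def truncate_for_embedding(text, max_tokens):
    """Step 2: truncate content to 3 * max_tokens (assumed to be characters)."""
    return text[: 3 * max_tokens]


def build_topics(docs, labels, name_topic, max_docs=5, max_doc_length=200):
    """Steps 4-6: group by cluster label, drop noise, name clusters, build outputs."""
    clusters = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label == NOISE_LABEL:
            continue  # noise and outliers are discarded
        clusters[label].append(doc)

    topic_to_docs = {}
    for label in sorted(clusters):
        # Use up to max_docs samples, each truncated to max_doc_length.
        samples = [d["content"][:max_doc_length] for d in clusters[label][:max_docs]]
        name = name_topic(samples)
        # Merge clusters that happen to receive the same name.
        topic_to_docs.setdefault(name, []).extend(clusters[label])

    topic_names = list(topic_to_docs)
    processed = [d for group in topic_to_docs.values() for d in group]
    return topic_to_docs, topic_names, processed


# Usage with stand-in data: labels would normally come from the clustering step.
docs = filter_documents([
    {"id": "1", "content": "Reset your password from the login page."},
    {"id": "2", "content": "   "},
    {"id": "3", "content": "Invoices are emailed monthly."},
    {"id": "4", "content": "Password rules: 12+ characters."},
    {"id": "5", "content": "Unrelated note."},
])
labels = [0, 1, 0, -1]  # one label per remaining document


def fake_namer(samples):
    """Stub for the LLM call: uses the first word of the first sample."""
    return samples[0].split()[0]


topics, names, processed = build_topics(docs, labels, fake_namer)
```

In this run the whitespace-only document is filtered out before clustering, and the document labeled `-1` is left out of every topic.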