KnowledgeBase is a collection of documents used to generate test cases for a specific domain or task. It is usually referred to as the vector database in retrieval-augmented generation (RAG).

These documents are grouped by topic. If no topics are defined, the KnowledgeBase generates them automatically.
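The fallback between predefined and generated topics can be sketched as below. This is a minimal illustration, not the library's API: the function `resolve_topics`, the `seed_topics` and `generate_topics` parameters, and the `{"id", "content"}` document shape are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]  # assumed shape: {"id": ..., "content": ...}
TopicMap = Dict[str, List[Document]]


def resolve_topics(
    documents: List[Document],
    seed_topics: Optional[TopicMap] = None,
    generate_topics: Optional[Callable[[List[Document]], TopicMap]] = None,
) -> TopicMap:
    """Return the seed topics when given, otherwise generate topics from the documents."""
    if seed_topics:
        return seed_topics
    if generate_topics is None:
        raise ValueError("No seed topics and no topic generator were provided.")
    return generate_topics(documents)


docs = [{"id": "1", "content": "How to reset a password"}]

# Predefined topics are used as-is.
seeded = resolve_topics(docs, seed_topics={"Accounts": docs})

# Without seed topics, the (hypothetical) generator is called instead.
generated = resolve_topics(docs, generate_topics=lambda ds: {"Auto topic": ds})
```

Passing the generator as a callable keeps the decision logic independent of how topics are actually produced.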

Connectors

AzureKnowledgeBase

Leverages Azure Cognitive Search and is best suited for:

  • Cloud-based document indexing and storage
  • Full-text search with advanced filtering and ranking
  • Integration with Microsoft’s AI-powered search stack

Neo4jKnowledgeBase

Built on Neo4j, this connector excels at:

  • Handling complex document relationships
  • Graph-based querying and clustering
  • Constructing dynamic knowledge graphs

PostgresKnowledgeBase

Leverages Postgres with pgvector and is best suited for:

  • Full-text search with advanced filtering and ranking
  • Graph-based querying and clustering

InMemoryKnowledgeBase

A minimal, no-dependency implementation designed for:

  • Prototyping and local testing
  • Lightweight, quick-start environments
  • Small-scale document classification
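For a sense of what a dependency-free store can look like during prototyping, here is a small sketch. It is not the InMemoryKnowledgeBase class itself: `TinyInMemoryStore`, its `add` and `documents` methods, and the id/content record shape are hypothetical and exist only for this example.

```python
from typing import Dict, List


class TinyInMemoryStore:
    """Hypothetical stand-in for an in-memory knowledge base (not the library's class)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, doc_id: str, content: str) -> bool:
        """Store a document; skip empty or whitespace-only content."""
        if not content or not content.strip():
            return False
        self._docs[doc_id] = content
        return True

    def documents(self) -> List[Dict[str, str]]:
        """Return all stored documents as id/content records."""
        return [{"id": i, "content": c} for i, c in self._docs.items()]


store = TinyInMemoryStore()
added = store.add("faq-1", "How do I change my billing address?")
skipped = store.add("faq-2", "   ")  # whitespace-only content is not stored
```

Because everything lives in a plain dictionary, a store like this only makes sense for small document sets and local experiments.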

Topic Creation Process

The topic creation pipeline groups unlabeled documents into coherent topics using embeddings, dimensionality reduction, clustering, and LLM-based summarization.

This process is triggered when no predefined (seed) topics are provided. It runs in six steps:
  1. Document Retrieval

    • Pulls all documents from Azure Cognitive Search using the mapped id and content fields.
    • Filters out empty or whitespace-only content.
  2. Embedding Generation

    • Applies an embedding model to each document’s content (truncated to 3 * max_tokens).
    • Produces high-dimensional semantic vectors for clustering.
  3. Dimensionality Reduction

    • Uses UMAP to reduce embedding vectors to a lower-dimensional space for clustering.
    • Parameters such as n_neighbors, n_components, and initialization strategy are tuned based on document count.
  4. Topic Clustering

    • Runs HDBSCAN over the reduced vectors to group documents into topic clusters.
    • Documents that HDBSCAN marks as noise or outliers (label = -1) are discarded.
  5. LLM-based Topic Naming

    • For each valid topic cluster, generates a name using a language model.
    • Uses up to max_docs samples per topic and truncates each sample to max_doc_length.
  6. Return Structure

    • Returns:
      • A dictionary mapping topic names to associated documents.
      • A flat list of all topic names.
      • A flat list of all processed documents.
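
The six steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the actual implementation: `create_topics` and its parameter names and defaults are hypothetical, the embedding, reduction, clustering, and naming stages are passed in as callables, the `3 * max_tokens` limit is interpreted as a character budget, clusters that receive the same name are merged, and the processed list is assumed to include documents later discarded as noise. In a real setup the `reduce_dims` callable would typically wrap `umap.UMAP(...).fit_transform` and `cluster` would wrap `hdbscan.HDBSCAN(...).fit_predict`.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Document = Dict[str, str]  # assumed shape: {"id": ..., "content": ...}
Vector = Sequence[float]


def create_topics(
    documents: List[Document],
    embed: Callable[[str], Vector],
    reduce_dims: Callable[[List[Vector]], List[Vector]],
    cluster: Callable[[List[Vector]], List[int]],
    name_topic: Callable[[List[str]], str],
    max_tokens: int = 512,
    max_docs: int = 5,
    max_doc_length: int = 300,
) -> Tuple[Dict[str, List[Document]], List[str], List[Document]]:
    # 1. Keep only documents with non-empty, non-whitespace content.
    docs = [d for d in documents if (d.get("content") or "").strip()]

    # 2. Embed each document, truncating content to 3 * max_tokens
    #    (treated here as a character budget, which is an assumption).
    vectors = [embed(d["content"][: 3 * max_tokens]) for d in docs]

    # 3. Reduce dimensionality (UMAP in the described pipeline).
    reduced = reduce_dims(vectors)

    # 4. Cluster the reduced vectors (HDBSCAN in the described pipeline)
    #    and drop noise points, which carry the label -1.
    labels = cluster(reduced)
    clusters: Dict[int, List[Document]] = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label != -1:
            clusters[label].append(doc)

    # 5. Name each cluster from up to max_docs samples, each truncated
    #    to max_doc_length characters.
    topics: Dict[str, List[Document]] = {}
    for label in sorted(clusters):
        members = clusters[label]
        samples = [d["content"][:max_doc_length] for d in members[:max_docs]]
        topics.setdefault(name_topic(samples), []).extend(members)

    # 6. Return the topic mapping, the topic names, and the processed documents.
    return topics, list(topics), docs


# Usage with stub stages, so the flow can be followed without any model or library.
docs = [
    {"id": "1", "content": "refund policy for orders"},
    {"id": "2", "content": "   "},
    {"id": "3", "content": "refund window and returns"},
    {"id": "4", "content": "gpu driver install"},
]
topics, names, processed = create_topics(
    docs,
    embed=lambda text: [float(len(text)), float("refund" in text)],
    reduce_dims=lambda vecs: vecs,
    cluster=lambda vecs: [0 if v[1] else -1 for v in vecs],
    name_topic=lambda samples: "Refunds",
)
```

Injecting each stage as a callable keeps the control flow visible and lets the stages be swapped for stubs in tests. With the stubs above, document 2 is filtered out as whitespace-only, document 4 is treated as noise, and documents 1 and 3 end up under a single topic.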