Overview
KnowledgeBase is a collection of documents used to generate test cases for a specific domain or task. It is usually referred to as the vector database in retrieval-augmented generation (RAG) setups.
These documents are grouped by topic. If no topics are defined, the KnowledgeBase generates them automatically.
Connectors
AzureKnowledgeBase
Leverages Azure Cognitive Search and is best suited for:
- Cloud-based document indexing and storage
- Full-text search with advanced filtering and ranking
- Integration with Microsoft’s AI-powered search stack
Neo4jKnowledgeBase
Built on Neo4j, this connector excels at:
- Handling complex document relationships
- Graph-based querying and clustering
- Constructing dynamic knowledge graphs
PostgresKnowledgeBase
Leverages Postgres with pgvector and is best suited for:
- Full-text search with advanced filtering and ranking
- Graph-based querying and clustering
InMemoryKnowledgeBase
A minimal, no-dependency implementation designed for:
- Prototyping and local testing
- Lightweight, quick-start environments
- Small-scale document classification
Topic Creation Process
The topic creation pipeline groups unlabeled documents into coherent topics using embeddings, dimensionality reduction, clustering, and LLM-based summarization.
- Document Retrieval
  - Pulls all documents from Azure Cognitive Search using the mapped id and content fields.
  - Filters out empty or whitespace-only content.
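As a rough illustration of the retrieval step, the sketch below maps raw search records onto id/content pairs and drops empty or whitespace-only entries. The record layout, the field_map argument, and the function name retrieve_documents are assumptions made for this example, not the connector's actual API; in practice the records would come from the search index rather than an in-memory list.

```python
from typing import Iterable


def retrieve_documents(records: Iterable[dict], field_map: dict) -> list[dict]:
    """Normalize raw records to {"id", "content"} and drop empty content.

    `field_map` (hypothetical) maps the logical names "id" and "content"
    to the field names used by the underlying index.
    """
    id_field = field_map["id"]
    content_field = field_map["content"]
    documents = []
    for record in records:
        content = record.get(content_field)
        # Skip missing, empty, or whitespace-only content.
        if not isinstance(content, str) or not content.strip():
            continue
        documents.append({"id": record[id_field], "content": content})
    return documents


# Example records as they might come back from a search index.
raw_records = [
    {"doc_key": "1", "body": "Refund policy for annual plans."},
    {"doc_key": "2", "body": "   "},
    {"doc_key": "3", "body": ""},
    {"doc_key": "4", "body": "How to reset a password."},
]
docs = retrieve_documents(raw_records, {"id": "doc_key", "content": "body"})
```

Here only records 1 and 4 survive, since the other two carry no usable content.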
- Embedding Generation
  - Applies an embedding model to each document’s content (truncated to 3 * max_tokens).
  - Produces high-dimensional semantic vectors for clustering.
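The truncation rule can be sketched as follows. This example assumes the 3 * max_tokens limit is counted in characters (a cheap proxy for a token limit), which the text above does not state explicitly. The embed_fn parameter and the toy_embed function are hypothetical stand-ins for a real embedding model.

```python
from typing import Callable, Sequence


def embed_documents(
    contents: Sequence[str],
    embed_fn: Callable[[str], list[float]],
    max_tokens: int,
) -> list[list[float]]:
    """Embed each document after truncating it to 3 * max_tokens characters.

    Assumption: the limit is applied in characters. `embed_fn` stands in
    for a real embedding model.
    """
    max_chars = 3 * max_tokens
    return [embed_fn(text[:max_chars]) for text in contents]


def toy_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: text length and vowel count."""
    vowels = sum(text.lower().count(v) for v in "aeiou")
    return [float(len(text)), float(vowels)]


vectors = embed_documents(["short text", "x" * 1000], toy_embed, max_tokens=100)
```

The second document is cut from 1000 to 300 characters before it reaches the embedding function, while the first is passed through unchanged.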
- Dimensionality Reduction
  - Uses UMAP to reduce embedding vectors to a lower-dimensional space for clustering.
  - Parameters such as n_neighbors, n_components, and the initialization strategy are tuned based on document count.
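One way to tune these parameters by document count is shown below. The thresholds and the function name choose_umap_params are illustrative assumptions, not the values the library actually uses; only the parameter names n_neighbors, n_components, and init correspond to real UMAP options.

```python
def choose_umap_params(n_docs: int) -> dict:
    """Pick UMAP settings from the document count (illustrative heuristic).

    All thresholds here are assumptions made for this sketch.
    """
    if n_docs < 3:
        raise ValueError("need at least 3 documents to reduce and cluster")
    # Keep n_neighbors below the number of points, with a floor of 2.
    n_neighbors = max(2, min(15, n_docs - 1))
    # Keep the target dimension small relative to the number of points.
    n_components = max(2, min(5, n_docs - 2))
    # Spectral initialization can be unreliable on very small datasets,
    # so this sketch falls back to random initialization there.
    init = "random" if n_docs < 20 else "spectral"
    return {"n_neighbors": n_neighbors, "n_components": n_components, "init": init}


small = choose_umap_params(10)
large = choose_umap_params(5000)
```

Assuming the umap-learn package is installed, the resulting dictionary could be passed on as keyword arguments, for example umap.UMAP(**small).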
- Topic Clustering
  - Runs HDBSCAN over the reduced vectors to group documents into topic clusters.
  - Noise and outliers are discarded (label = -1).
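The noise-filtering rule can be illustrated without running a clusterer. The sketch below takes a list of cluster labels, such as the labels_ array of a fitted HDBSCAN model, and groups documents by label while dropping every document labeled -1. The function name group_by_cluster and the document layout are assumptions for this example.

```python
from collections import defaultdict


def group_by_cluster(documents: list[dict], labels: list[int]) -> dict[int, list[dict]]:
    """Group documents by cluster label, dropping noise points (label -1)."""
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")
    clusters = defaultdict(list)
    for doc, label in zip(documents, labels):
        if label == -1:  # HDBSCAN marks noise and outliers with -1
            continue
        clusters[label].append(doc)
    return dict(clusters)


documents = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
labels = [0, -1, 1, 0]  # e.g. the labels_ array of a fitted clusterer
clusters = group_by_cluster(documents, labels)
```

Document "b" is treated as noise and does not appear in any topic cluster.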
- LLM-based Topic Naming
  - For each valid topic cluster, generates a name using a language model.
  - Uses up to max_docs samples per topic and truncates each sample to max_doc_length.
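The sampling and truncation limits might be applied roughly as in the sketch below. The prompt wording, the default values, the choice to take the first max_docs documents rather than a random sample, and the llm callable are all assumptions made for illustration; fake_llm is a stand-in for a real model call.

```python
from typing import Callable


def name_topic(
    contents: list[str],
    llm: Callable[[str], str],
    max_docs: int = 5,
    max_doc_length: int = 200,
) -> str:
    """Ask a language model for a topic name based on a few truncated samples."""
    # Take at most max_docs samples and cut each to max_doc_length characters.
    samples = [text[:max_doc_length] for text in contents[:max_docs]]
    prompt = "Give a short topic name for these documents:\n" + "\n".join(
        f"- {sample}" for sample in samples
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: reports how many samples it saw."""
    n_samples = prompt.count("\n- ")
    return f" Topic with {n_samples} samples "


name = name_topic(["doc"] * 10, fake_llm, max_docs=3)
```

Capping both the number of samples and their length keeps the prompt size bounded regardless of how large a cluster is.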
- Return Structure
  - Returns:
    - A dictionary mapping topic names to associated documents.
    - A flat list of all topic names.
    - A flat list of all processed documents.
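The shape of the three return values can be pictured as below. This is a sketch only: the function name build_result is hypothetical, and it assumes that "processed documents" means the documents assigned to a topic, which the description above leaves open.

```python
def build_result(
    named_clusters: dict[str, list[dict]],
) -> tuple[dict[str, list[dict]], list[str], list[dict]]:
    """Return (topic -> documents, topic names, flat list of documents)."""
    topic_names = list(named_clusters)
    # Assumption: the flat list contains only documents that belong to a topic.
    all_documents = [doc for docs in named_clusters.values() for doc in docs]
    return named_clusters, topic_names, all_documents


topics_to_docs, topic_names, all_documents = build_result(
    {
        "Billing": [{"id": "1"}, {"id": "4"}],
        "Account access": [{"id": "2"}],
    }
)
```

A caller can unpack the three values directly, as in the last statement, and use the dictionary for per-topic work and the flat lists for iteration.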