What is a Knowledge Base?
A Knowledge Base is a collection of documents that have been:

- Ingested: Uploaded and processed
- Chunked: Split into meaningful segments
- Embedded: Converted to vector representations
- Indexed: Stored for fast semantic search
How Knowledge Bases Work
Architecture
1. Ingestion
- Upload files (PDF, Word, Excel, images, etc.)
- Extract text and images
- Process multi-modal content
- Preserve metadata (source, page numbers, etc.)

2. Chunking
- Split documents into smaller segments
- Maintain context within chunks
- Configurable chunk size and overlap
- Smart splitting at sentence/paragraph boundaries

3. Embedding and Indexing
- Convert text chunks to vector representations
- High-dimensional vectors capture semantic meaning
- Multiple embedding models available
- Vectors stored in PostgreSQL with pgvector

4. Retrieval
- Query with natural language
- Find relevant chunks by vector similarity
- Hybrid search combines semantic + keyword matching
- Reranking improves precision

5. Generation
- Retrieved chunks provide context for LLMs
- AI generates answers grounded in your documents
- Citations include source and page numbers
- Reduces hallucination with factual grounding
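Taken together, these stages form a simple pipeline. The sketch below is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, and a plain Python list stands in for the pgvector store.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts. A real system would call
    # an embedding model and store the vectors in pgvector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingest, "embed", and index two chunks.
chunks = [
    "Returns are accepted within 30 days of purchase.",
    "Our offices are closed on public holidays.",
]
index = [(c, embed(c)) for c in chunks]

# Retrieve the top-k chunks most similar to the query embedding.
def retrieve(query, top_k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:top_k]]

retrieve("Within how many days are returns accepted?")
```

Note that the query shares only a few surface words with the matching chunk; with real embeddings, even paraphrases with no word overlap would match.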
Key Features
Semantic Search
Find information by meaning, not just keywords:

- Query: “What’s our refund policy?”
- Matches: “Returns accepted within 30 days…”, “Money-back guarantee…”, “Refund process…”
Multi-Document Querying
Query across hundreds or thousands of documents simultaneously:

- Entire document libraries
- Product catalogs
- Policy collections
- Research papers
- Knowledge repositories
Folder Organization
Organize documents in folders for targeted search:

- Sales: Customer-facing documents
- Internal: HR policies, procedures
- Product: Technical documentation
- Legal: Contracts, agreements
Multi-Modal Content
Process documents with both text and images:

- PDFs with images: Extract text and interpret charts/diagrams
- Scanned documents: OCR for text extraction
- Presentations: Process slides with visual content
- Reports: Handle mixed text/image content
Metadata Preservation
Track important document metadata:

- Source filename and path
- Page numbers for citations
- Upload date and version
- Folder location
- Custom metadata fields
Supported Document Types
Noxus Knowledge Bases support a wide range of formats:

Text Documents
- PDF: Full text extraction, image processing, page-level tracking
- Word (DOCX/DOC): Text and formatting preservation
- Plain Text (TXT, MD): Direct ingestion
- HTML: Web pages and formatted content
Spreadsheets
- Excel (XLSX/XLS): Per-sheet processing, table extraction
- CSV: Tabular data ingestion
- Google Sheets: Via integration
Presentations
- PowerPoint (PPTX/PPT): Slide text extraction
- Google Slides: Via integration
Images
- PNG, JPG, WEBP: OCR and vision-based processing
- Scanned PDFs: Automatic OCR or vision model processing
Archives
- ZIP files: Extract and process all contents
Retrieval Strategies
Knowledge Bases offer multiple search strategies:

Semantic Search (Vector Search)
How it works: Compares the query embedding to document chunk embeddings

Best for:
- Natural language queries
- Conceptual similarity
- Finding related information
- Cross-lingual search (with multilingual embeddings)
Keyword Search (BM25)
How it works: Traditional keyword matching with ranking

Best for:
- Exact term searches
- Technical terminology
- IDs, codes, specific names
- Precise matching
Hybrid Search
How it works: Combines semantic + keyword search with weighted scoring

Best for:
- Most general-purpose queries
- Balance between precision and recall
- Handles both conceptual and exact matches
Reranking
How it works: Second-stage ranking with a ColBERT model for precision

Best for:
- High-precision requirements
- Reducing false positives
- When top-k accuracy matters most
Retrieval Configuration
Fine-tune retrieval behavior:

Top-K (Number of Chunks)
- Default: 5-10 chunks
- More chunks = more context but slower, costlier
- Fewer chunks = faster but may miss information
Similarity Threshold
- Filter out low-relevance chunks
- Range: 0.0 (all results) to 1.0 (exact match only)
- Typical: 0.5-0.7 for good balance
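A threshold filter can be sketched in a few lines; the result shape and scores below are illustrative assumptions, not the actual Noxus API:

```python
# Keep only chunks whose similarity clears the threshold.
# The dict shape and the scores are illustrative, not the real API.
def filter_by_threshold(results, threshold=0.6):
    return [r for r in results if r["score"] >= threshold]

results = [
    {"text": "Returns accepted within 30 days", "score": 0.82},
    {"text": "Company picnic schedule", "score": 0.31},
]
filter_by_threshold(results)  # keeps only the first chunk
```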
Chunk Size
- Smaller chunks (256-512 tokens): More precise, more chunks needed
- Larger chunks (512-1024 tokens): More context per chunk, fewer retrievals
Chunk Overlap
- Overlap between consecutive chunks
- Prevents information loss at boundaries
- Typical: 50-100 tokens overlap
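A minimal sketch of sliding-window chunking with overlap, treating whitespace-separated words as “tokens” for simplicity (production chunkers also respect sentence and paragraph boundaries):

```python
def chunk(tokens, size=512, overlap=50):
    # Each chunk shares `overlap` tokens with the previous one,
    # so information at chunk boundaries is not lost.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = [f"w{i}" for i in range(1200)]
pieces = chunk(words)
# 3 chunks, covering tokens 0-511, 462-973, and 924-1199
```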
Alpha (Hybrid Search Weight)
- 0.0: Keyword-only (BM25)
- 0.5: Balanced hybrid
- 1.0: Semantic-only (vector)
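The alpha weight can be expressed as a convex combination of the two scores. This sketch assumes both scores are already normalized to [0, 1] (raw BM25 scores are unbounded and would need normalization first):

```python
def hybrid_score(semantic, keyword, alpha=0.5):
    # alpha = 1.0 -> pure vector search; alpha = 0.0 -> pure BM25.
    # Assumes both inputs are normalized to the [0, 1] range.
    return alpha * semantic + (1 - alpha) * keyword

hybrid_score(0.9, 0.2, alpha=1.0)  # semantic only -> 0.9
hybrid_score(0.9, 0.2, alpha=0.5)  # balanced blend of both scores
```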
Reranking
- Enable for top-N results
- Rerank top 20-50 candidates to find best 5-10
- Improves precision significantly
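Two-stage retrieval can be sketched as follows; the word-overlap scorer here is a cheap stand-in for a real reranker such as ColBERT:

```python
def rerank(query, candidates, scorer, top_n=2):
    # Second stage: re-score a small candidate set with a more
    # expensive scorer and keep only the best top_n results.
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)[:top_n]

def overlap_scorer(query, text):
    # Stand-in scorer: fraction of query words found in the candidate.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

candidates = ["refund policy details", "holiday schedule", "refund process steps"]
rerank("refund process", candidates, overlap_scorer, top_n=2)
```

In practice the first stage would hand the reranker its top 20-50 candidates, as described above.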
Using Knowledge Bases in Workflows
Knowledge Base Q&A Node
Query knowledge bases with AI-powered answers:

Inputs:
- Query: Your question
- Knowledge Base: Select the KB to search
- Folder Filter (optional): Limit the search to a specific folder
- Retrieval Settings: Configure search behavior

Outputs:
- Answer: AI-generated response grounded in documents
- Sources: Citations with page numbers
- Chunks: Raw retrieved text (optional)
Retriever Node
Get raw chunks without AI generation:

Use case: When you want to control the generation yourself or just need relevance search

Outputs:
- List of relevant chunks
- Similarity scores
- Source metadata
File to KB Doc Node
Upload documents to knowledge bases from workflows:

Use case: Automated document ingestion, ETL pipelines

Creating and Managing Knowledge Bases
Creating a Knowledge Base
- Navigate to Knowledge Bases in your workspace
- Click Create Knowledge Base
- Name your KB descriptively
- Choose embedding model (default recommended)
- Configure chunking settings (optional)
Uploading Documents
Via UI:
- Open the Knowledge Base
- Click Upload Documents
- Select files or drag & drop
- Optionally specify folder
- Wait for processing to complete
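Uploads can also be automated, for example from the File to KB Doc node or an ETL job. Below is a hypothetical sketch of assembling an upload payload; the field names and IDs are illustrative assumptions, not the real Noxus API schema:

```python
def build_upload_payload(path, kb_id, folder=None):
    # Hypothetical payload shape; consult the actual API docs
    # for the real endpoint and field names.
    payload = {"knowledge_base": kb_id, "filename": path.rsplit("/", 1)[-1]}
    if folder:
        payload["folder"] = folder
    return payload

build_upload_payload("reports/q1-summary.pdf", kb_id="kb_123", folder="Sales")
```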
Organizing with Folders
Create a folder structure:
- By department: Sales, Marketing, Engineering
- By topic: Products, Policies, Procedures
- By date: Q1-2024, Q2-2024
- By source: Website, Internal Docs, Customer Communications

Benefits:
- Faster searches (smaller search space)
- Better relevance (domain-specific)
- Easier management
- Clearer organization
Updating Documents
Re-upload with same filename: Replaces previous version

Version control: Keep document versions in separate folders (v1, v2, etc.)

Incremental updates: Upload only changed documents

Deleting Documents
Remove documents via UI or API:

- Deletes from vector store
- Removes embeddings
- Frees storage space
- Cannot be undone
Best Practices
Document Preparation
Clean PDFs: Use PDFs with selectable text, not scanned images (unless OCR is intended)

Consistent Formatting: Well-formatted documents chunk better

Meaningful Filenames: Helps with source attribution and organization

Remove Duplicates: Duplicate content wastes storage and confuses retrieval

Chunking Strategy
Default Settings: Start with defaults (512-token chunks, 50-token overlap)

Technical Docs: Smaller chunks (256) for precise code/API references

Narratives: Larger chunks (1024) to preserve context in stories/articles

Mixed Content: Medium chunks (512) work for most use cases

Embedding Model Selection
Default (FastEmbed): Fast, efficient, good for most use cases

OpenAI Embeddings: Higher quality, more expensive, best accuracy

Multilingual Models: For non-English or mixed-language documents

Domain-Specific: Coming soon; specialized embeddings for legal, medical, etc.

Query Optimization
Be Specific: “What’s the return window?” rather than “returns?”

Natural Language: Write queries as questions, not keywords

Provide Context: “For enterprise customers, what’s the SLA?” rather than “SLA?”

Iterate: Refine queries based on results

Folder Strategy
Balance Granularity: Not too many folders (hard to manage), not too few (defeats the purpose)

Consistent Naming: Use clear, consistent folder names

Document Purpose: Align folders with how you’ll query documents

Limitations and Considerations
Document Size Limits
- Individual file size: 100MB max
- Very large documents may timeout during processing
- Split large documents before upload if needed
Total Storage
- Depends on plan tier
- Monitor storage usage in KB dashboard
- Archive or remove unused documents
Processing Time
- Small documents (1-10 pages): Seconds
- Medium documents (10-100 pages): 1-5 minutes
- Large documents (100+ pages): 5-30 minutes
- Batch uploads: Processed in parallel
Embedding Costs
- Initial ingestion: One-time embedding cost per document
- Queries: Small embedding cost per query
- Re-indexing: Full cost to re-embed documents
Search Limitations
- Maximum query length: ~500 words
- Maximum results returned: 100 chunks
- Folder filters: Prefix matching only
Monitoring Knowledge Base Performance
KB Dashboard
Track key metrics:

- Total documents
- Total chunks indexed
- Storage used
- Query volume
- Average query latency
Query Analytics
Monitor query performance:

- Most common queries
- Slowest queries
- Queries with no results
- Retrieval quality metrics
Document Analytics
Understand document usage:

- Most-queried documents
- Documents never queried (candidates for removal)
- Documents with low relevance scores
Troubleshooting
“No relevant results found”
Causes:
- Documents haven’t been indexed yet
- Query doesn’t match content
- Similarity threshold too high
- Wrong folder filter

Solutions:
- Wait for indexing to complete
- Rephrase the query
- Lower the similarity threshold
- Remove or adjust the folder filter
“Results not relevant”
Causes:
- Poor chunking strategy
- Wrong retrieval strategy
- Low-quality embeddings

Solutions:
- Adjust chunk size/overlap
- Try hybrid search instead of pure semantic search
- Use reranking for precision
- Consider a better embedding model
“Slow queries”
Causes:
- Very large knowledge base
- Reranking enabled on many chunks
- Complex folder structures

Solutions:
- Use folder filters to narrow the search
- Reduce the top-k parameter
- Disable reranking or reduce rerank-top-n
- Consider splitting into multiple KBs
Advanced Topics
Ingestion Pipeline
Deep dive into document processing and embedding
Search Strategies
Master semantic, hybrid, and reranked search
Optimization
Optimize retrieval quality and performance
Best Practices
Proven patterns for effective knowledge bases
Knowledge Bases transform static documents into dynamic, queryable knowledge that powers intelligent AI interactions. Master them to build truly knowledgeable AI systems.