What is a Knowledge Base?
A Knowledge Base is a collection of documents that have been:

- Ingested: Uploaded and processed
- Chunked: Split into meaningful segments
- Embedded: Converted to vector representations
- Indexed: Stored for fast semantic search
How Knowledge Bases Work
Architecture
1. Ingestion
- Upload files (PDF, Word, Excel, images, etc.)
- Extract text and images
- Process multi-modal content
- Preserve metadata (source, page numbers, etc.)

2. Chunking
- Split documents into smaller segments
- Maintain context within chunks
- Configurable chunk size and overlap
- Smart splitting at sentence/paragraph boundaries

3. Embedding and Indexing
- Convert text chunks to vector representations
- High-dimensional vectors capture semantic meaning
- Multiple embedding models available
- Vectors stored in PostgreSQL with pgvector

4. Retrieval
- Query with natural language
- Find relevant chunks by vector similarity
- Hybrid search combines semantic + keyword matching
- Reranking improves precision

5. Generation
- Retrieved chunks provide context for LLMs
- AI generates answers grounded in your documents
- Citations include source and page numbers
- Reduces hallucination with factual grounding
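Taken together, these stages form a simple pipeline. The sketch below is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, and a plain Python list stands in for the pgvector store.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts. A real system would call
    # an embedding model and store the vectors in pgvector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingest, "embed", and index two chunks.
chunks = [
    "Returns are accepted within 30 days of purchase.",
    "Our offices are closed on public holidays.",
]
index = [(c, embed(c)) for c in chunks]

# Retrieve the top-k chunks most similar to the query embedding.
def retrieve(query, top_k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:top_k]]

retrieve("Within how many days are returns accepted?")
```

Note that the query shares only a few surface words with the matching chunk; with real embeddings, even paraphrases with no word overlap would match.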
Key Features
Semantic Search
Find information by meaning, not just keywords:

- Query: “What’s our refund policy?”
- Matches: “Returns accepted within 30 days…”, “Money-back guarantee…”, “Refund process…”
Multi-Document Querying
Query across hundreds or thousands of documents simultaneously:

- Entire document libraries
- Product catalogs
- Policy collections
- Research papers
- Knowledge repositories
Folder Organization
Organize documents in folders for targeted search:

- Sales: Customer-facing documents
- Internal: HR policies, procedures
- Product: Technical documentation
- Legal: Contracts, agreements
Multi-Modal Content
Process documents with both text and images:

- PDFs with images: Extract text and interpret charts/diagrams
- Scanned documents: OCR for text extraction
- Presentations: Process slides with visual content
- Reports: Handle mixed text/image content
Metadata Preservation
Track important document metadata:

- Source filename and path
- Page numbers for citations
- Upload date and version
- Folder location
- Custom metadata fields
Supported Document Types
Noxus Knowledge Bases support a wide range of formats:

Text Documents
- PDF: Full text extraction, image processing, page-level tracking
- Word (DOCX/DOC): Text and formatting preservation
- Plain Text (TXT, MD): Direct ingestion
- HTML: Web pages and formatted content
Spreadsheets
- Excel (XLSX/XLS): Per-sheet processing, table extraction
- CSV: Tabular data ingestion
- Google Sheets: Via integration
Presentations
- PowerPoint (PPTX/PPT): Slide text extraction
- Google Slides: Via integration
Images
- PNG, JPG, WEBP: OCR and vision-based processing
- Scanned PDFs: Automatic OCR or vision model processing
Archives
- ZIP files: Extract and process all contents
Retrieval Strategies
Knowledge Bases offer multiple search strategies:

Semantic Search (Vector Search)
How it works: Compares the query embedding to document chunk embeddings

Best for:
- Natural language queries
- Conceptual similarity
- Finding related information
- Cross-lingual search (with multilingual embeddings)
Keyword Search (BM25)
How it works: Traditional keyword matching with ranking

Best for:
- Exact term searches
- Technical terminology
- IDs, codes, specific names
- Precise matching
Hybrid Search
How it works: Combines semantic + keyword search with weighted scoring

Best for:
- Most general-purpose queries
- Balance between precision and recall
- Handles both conceptual and exact matches
Reranking
How it works: Second-stage ranking with a ColBERT model for precision

Best for:
- High-precision requirements
- Reducing false positives
- When top-k accuracy matters most
Retrieval Configuration
Fine-tune retrieval behavior:

Top-K (Number of Chunks)
- Default: 5-10 chunks
- More chunks = more context but slower, costlier
- Fewer chunks = faster but may miss information
Similarity Threshold
- Filter out low-relevance chunks
- Range: 0.0 (all results) to 1.0 (exact match only)
- Typical: 0.5-0.7 for good balance
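A threshold filter can be sketched in a few lines; the result shape and scores below are illustrative assumptions, not the actual Noxus API:

```python
# Keep only chunks whose similarity clears the threshold.
# The dict shape and the scores are illustrative, not the real API.
def filter_by_threshold(results, threshold=0.6):
    return [r for r in results if r["score"] >= threshold]

results = [
    {"text": "Returns accepted within 30 days", "score": 0.82},
    {"text": "Company picnic schedule", "score": 0.31},
]
filter_by_threshold(results)  # keeps only the first chunk
```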
Chunk Size
- Smaller chunks (256-512 tokens): More precise, more chunks needed
- Larger chunks (512-1024 tokens): More context per chunk, fewer retrievals
Chunk Overlap
- Overlap between consecutive chunks
- Prevents information loss at boundaries
- Typical: 50-100 tokens overlap
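A minimal sketch of sliding-window chunking with overlap, treating whitespace-separated words as “tokens” for simplicity (production chunkers also respect sentence and paragraph boundaries):

```python
def chunk(tokens, size=512, overlap=50):
    # Each chunk shares `overlap` tokens with the previous one,
    # so information at chunk boundaries is not lost.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = [f"w{i}" for i in range(1200)]
pieces = chunk(words)
# 3 chunks, covering tokens 0-511, 462-973, and 924-1199
```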
Alpha (Hybrid Search Weight)
- 0.0: Keyword-only (BM25)
- 0.5: Balanced hybrid
- 1.0: Semantic-only (vector)
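The alpha weight can be expressed as a convex combination of the two scores. This sketch assumes both scores are already normalized to [0, 1] (raw BM25 scores are unbounded and would need normalization first):

```python
def hybrid_score(semantic, keyword, alpha=0.5):
    # alpha = 1.0 -> pure vector search; alpha = 0.0 -> pure BM25.
    # Assumes both inputs are normalized to the [0, 1] range.
    return alpha * semantic + (1 - alpha) * keyword

hybrid_score(0.9, 0.2, alpha=1.0)  # semantic only -> 0.9
hybrid_score(0.9, 0.2, alpha=0.5)  # balanced blend of both scores
```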
Reranking
- Enable for top-N results
- Rerank top 20-50 candidates to find best 5-10
- Improves precision significantly
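Two-stage retrieval can be sketched as follows; the word-overlap scorer here is a cheap stand-in for a real reranker such as ColBERT:

```python
def rerank(query, candidates, scorer, top_n=2):
    # Second stage: re-score a small candidate set with a more
    # expensive scorer and keep only the best top_n results.
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)[:top_n]

def overlap_scorer(query, text):
    # Stand-in scorer: fraction of query words found in the candidate.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

candidates = ["refund policy details", "holiday schedule", "refund process steps"]
rerank("refund process", candidates, overlap_scorer, top_n=2)
```

In practice the first stage would hand the reranker its top 20-50 candidates, as described above.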
Using Knowledge Bases in Workflows
Knowledge Base Q&A Node
Query knowledge bases with AI-powered answers:

Inputs:
- Query: Your question
- Knowledge Base: Select the KB to search
- Folder Filter (optional): Limit the search to a specific folder
- Retrieval Settings: Configure search behavior

Outputs:
- Answer: AI-generated response grounded in documents
- Sources: Citations with page numbers
- Chunks: Raw retrieved text (optional)
Retriever Node
Get raw chunks without AI generation:

Use case: When you want to control the generation yourself or just need relevance search

Outputs:
- List of relevant chunks
- Similarity scores
- Source metadata
File to KB Doc Node
Upload documents to knowledge bases from workflows:

Use case: Automated document ingestion, ETL pipelines

Creating and Managing Knowledge Bases
Creating a Knowledge Base
- Navigate to Knowledge Bases in your workspace
- Click Create Knowledge Base
- Name your KB descriptively
- Choose embedding model (default recommended)
- Configure chunking settings (optional)
Uploading Documents
Via UI:
- Open the Knowledge Base
- Click Upload Documents
- Select files or drag & drop
- Optionally specify folder
- Wait for processing to complete
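Uploads can also be automated, for example from the File to KB Doc node or an ETL job. Below is a hypothetical sketch of assembling an upload payload; the field names and IDs are illustrative assumptions, not the real Noxus API schema:

```python
def build_upload_payload(path, kb_id, folder=None):
    # Hypothetical payload shape; consult the actual API docs
    # for the real endpoint and field names.
    payload = {"knowledge_base": kb_id, "filename": path.rsplit("/", 1)[-1]}
    if folder:
        payload["folder"] = folder
    return payload

build_upload_payload("reports/q1-summary.pdf", kb_id="kb_123", folder="Sales")
```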
Organizing with Folders
Create a folder structure:
- By department: Sales, Marketing, Engineering
- By topic: Products, Policies, Procedures
- By date: Q1-2024, Q2-2024
- By source: Website, Internal Docs, Customer Communications

Benefits:
- Faster searches (smaller search space)
- Better relevance (domain-specific)
- Easier management
- Clearer organization
Updating Documents
Re-upload with same filename: Replaces previous version

Version control: Keep document versions in separate folders (v1, v2, etc.)

Incremental updates: Upload only changed documents

Deleting Documents
Remove documents via UI or API:

- Deletes from vector store
- Removes embeddings
- Frees storage space
- Cannot be undone
Best Practices
Document Preparation
Clean PDFs: Use PDFs with selectable text, not scanned images (unless OCR is intended)

Consistent Formatting: Well-formatted documents chunk better

Meaningful Filenames: Helps with source attribution and organization

Remove Duplicates: Duplicate content wastes storage and confuses retrieval

Chunking Strategy
Default Settings: Start with defaults (512-token chunks, 50-token overlap)

Technical Docs: Smaller chunks (256) for precise code/API references

Narratives: Larger chunks (1024) to preserve context in stories/articles

Mixed Content: Medium chunks (512) work for most use cases

Embedding Model Selection
Default (FastEmbed): Fast, efficient, good for most use cases

OpenAI Embeddings: Higher quality, more expensive, best accuracy

Multilingual Models: For non-English or mixed-language documents

Domain-Specific: Coming soon; specialized embeddings for legal, medical, etc.

Query Optimization
Be Specific: “What’s the return window?” rather than “returns?”

Natural Language: Write queries as questions, not keywords

Provide Context: “For enterprise customers, what’s the SLA?” rather than “SLA?”

Iterate: Refine queries based on results

Folder Strategy
Balance Granularity: Not too many folders (hard to manage), not too few (defeats the purpose)

Consistent Naming: Use clear, consistent folder names

Document Purpose: Align folders with how you’ll query documents

Limitations and Considerations
Document Size Limits
- Individual file size: 100MB max
- Very large documents may timeout during processing
- Split large documents before upload if needed
Total Storage
- Depends on plan tier
- Monitor storage usage in KB dashboard
- Archive or remove unused documents
Processing Time
- Small documents (1-10 pages): Seconds
- Medium documents (10-100 pages): 1-5 minutes
- Large documents (100+ pages): 5-30 minutes
- Batch uploads: Processed in parallel
Embedding Costs
- Initial ingestion: One-time embedding cost per document
- Queries: Small embedding cost per query
- Re-indexing: Full cost to re-embed documents
Search Limitations
- Maximum query length: ~500 words
- Maximum results returned: 100 chunks
- Folder filters: Prefix matching only
Monitoring Knowledge Base Performance
KB Dashboard
Track key metrics:

- Total documents
- Total chunks indexed
- Storage used
- Query volume
- Average query latency
Query Analytics
Monitor query performance:

- Most common queries
- Slowest queries
- Queries with no results
- Retrieval quality metrics
Document Analytics
Understand document usage:

- Most-queried documents
- Documents never queried (candidates for removal)
- Documents with low relevance scores
Troubleshooting
“No relevant results found”
Causes:
- Documents haven’t been indexed yet
- Query doesn’t match content
- Similarity threshold too high
- Wrong folder filter

Solutions:
- Wait for indexing to complete
- Rephrase the query
- Lower the similarity threshold
- Remove or adjust the folder filter
“Results not relevant”
Causes:
- Poor chunking strategy
- Wrong retrieval strategy
- Low-quality embeddings

Solutions:
- Adjust chunk size/overlap
- Try hybrid search instead of pure semantic search
- Use reranking for precision
- Consider a better embedding model
“Slow queries”
Causes:
- Very large knowledge base
- Reranking enabled on many chunks
- Complex folder structures

Solutions:
- Use folder filters to narrow the search
- Reduce the top-k parameter
- Disable reranking or reduce rerank-top-n
- Consider splitting into multiple KBs
Advanced Topics
Ingestion Pipeline
Deep dive into document processing and embedding
Search Strategies
Master semantic, hybrid, and reranked search
Optimization
Optimize retrieval quality and performance
Best Practices
Proven patterns for effective knowledge bases
Knowledge Bases transform static documents into dynamic, queryable knowledge that powers intelligent AI interactions. Master them to build truly knowledgeable AI systems.