Data Hub
Data Hub is where you manage your Knowledge Bases.
Knowledge Bases (KBs) are collections of documents and data that provide context and information to AI agents. They serve as the foundation for enabling AI to access, understand, and utilize both structured and unstructured information.
Key Features
- Centralized Storage: Organize documents and data in one location
- Intelligent Processing: Content optimized for AI consumption
- Efficient Retrieval: Quick access to relevant information
- Flexible Sources: Support for multiple data formats and origins
Common Operations
Creating a Knowledge Base
You can create a new Knowledge Base within a group using the Add Knowledge Base endpoint. This returns the created KB object with its unique identifier.
Adding Documents
The platform provides several ways to add documents to a Knowledge Base:
- Upload files directly using the Upload Train endpoint
- Import from external sources like Google Drive, OneDrive, or SharePoint using the Generic Train endpoint
- Add documents with custom metadata using the Add Knowledge Base Document endpoint
Note: adding a document does not automatically trigger the ingestion process.
These operations support:
- Multiple file uploads
- Custom path prefixes for organization
- Automatic processing and indexing once ingestion has been triggered
Managing Documents
Document management is handled through various endpoints that allow you to:
- Retrieve documents with a specific status using the Get Knowledge Base Documents endpoint
- Remove documents using the Delete Document endpoint
- Update document metadata using the Update Document endpoint
Viewing and Updating
Knowledge Base details, including document status, can be retrieved through the Get Knowledge Base endpoint. You can also update KB properties using the Update Knowledge Base endpoint.
Monitoring Processing
You can monitor processing through the Running Jobs endpoint, which provides detailed information about ongoing and completed operations.
Supported Sources
Document Upload
Direct file uploads (PDFs, text files, images) with support for batch uploading
Google Drive
Import documents and files directly from your Google Drive
OneDrive
Access and import documents stored in Microsoft OneDrive
SharePoint
Access and import documents from SharePoint repositories
Websites
Web crawling with configurable depth and URL patterns
Coming Soon
Slack, Notion, and more
Knowledge Base Types
Knowledge Bases come in two primary types:
Entity Knowledge Bases
Permanent Knowledge Bases that are managed through the Data Hub and can be referenced by multiple agents or workflows.
Temporary Knowledge Bases
Created within the Workflow Editor for specific workflow use cases, with the option to promote them to Entity KBs.
Entity Knowledge Bases
Entity Knowledge Bases are permanent repositories that:
- Are created and managed through the Data Hub
- Can be shared across multiple agents and workflows
- Persist until explicitly deleted
- Support all document sources and management operations
- Provide centralized management of organizational knowledge
These are the standard Knowledge Bases that most users will interact with for long-term knowledge storage and retrieval.
Temporary Knowledge Bases
Temporary Knowledge Bases are workflow-specific repositories that:
- Are created directly within the Workflow Editor
- Are initially only available within the workflow where they were created
- Can be used for processing intermediate data or testing document structures
- Can be promoted to Entity Knowledge Bases when needed for broader use
- Provide a flexible way to experiment with different knowledge structures
Temporary KBs are ideal for workflow-specific data that may not need to be part of your permanent knowledge repository. If you later decide the knowledge is valuable for broader use, you can promote it to an Entity KB without losing any data.
Status Tracking
Knowledge Base Status
Knowledge Bases have the following status values that indicate their overall state:
Status | Description |
---|---|
created | KB has been created but no documents have been added yet |
training | KB has documents that are currently being processed |
trained | All documents in the KB have been successfully processed |
error | All documents in the KB have failed processing |
The KB status is automatically updated based on the status of its documents.
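The roll-up described in the table can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not the platform's implementation; the function name `derive_kb_status` is hypothetical. The table does not say what happens for mixed outcomes (for example, some documents `trained` and others `error`), so the sketch returns `None` for those cases rather than guessing.

```python
from collections import Counter
from typing import Iterable, Optional


def derive_kb_status(doc_statuses: Iterable[str]) -> Optional[str]:
    """Roll document statuses up into one KB status, following the table above.

    Hypothetical helper for illustration only.
    """
    counts = Counter(doc_statuses)
    total = sum(counts.values())

    if total == 0:
        return "created"   # no documents have been added yet
    if counts["training"] > 0:
        return "training"  # at least one document is still being processed
    if counts["trained"] == total:
        return "trained"   # every document was processed successfully
    if counts["error"] == total:
        return "error"     # every document failed processing

    # Assumption: combinations the table does not cover (mixed results,
    # or documents that are only "uploaded") are left undetermined here.
    return None
```

For example, `derive_kb_status(["trained", "training"])` yields `"training"`, while an empty list yields `"created"`.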
Document Status
Individual documents within a Knowledge Base have their own status values:
Status | Description |
---|---|
uploaded | Document has been uploaded but processing hasn’t started |
training | Document is currently being processed (chunked, embedded, etc.) |
trained | Document has been successfully processed and is available for queries |
error | Document processing failed |
You can filter documents by status using the Get Documents by Status endpoint.
Knowledge Base Workflows
Knowledge Bases are processed through a series of automated workflows that handle document ingestion, processing, and indexing. Understanding these workflows can help you optimize your knowledge base usage.
Document Processing Flow
When you add documents to a Knowledge Base, they go through the following processing steps:
Document Upload
Documents are uploaded to secure storage and registered in the Knowledge Base with ‘uploaded’ status
Text Extraction
Text is extracted from various file formats (PDF, DOCX, images, etc.) using specialized parsers
Chunking
Documents are split into smaller, semantically meaningful chunks for better retrieval
Embedding Generation
Vector embeddings are created for each chunk to enable semantic search
Indexing
Chunks and their embeddings are stored in a vector database for efficient retrieval
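To make the chunking, embedding, and indexing steps concrete, here is a toy, in-memory sketch. Everything in it is an assumption made for illustration: the platform's actual chunker, embedding model, and vector database are not described in this document. A fixed-size word window with overlap stands in for semantic chunking, and a normalised bag-of-words vector stands in for a real embedding model.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (stand-in for semantic chunking)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Vector:
    """Toy embedding: L2-normalised word counts (not a real embedding model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def build_index(text: str, chunk_size: int, overlap: int) -> List[Tuple[str, Vector]]:
    """Chunk the text and store each chunk next to its vector."""
    return [(c, embed(c)) for c in chunk_words(text, chunk_size, overlap)]


def search(index: List[Tuple[str, Vector]], query: str, top_k: int = 1) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The overlap between neighbouring windows is a common way to avoid cutting a sentence's context in half at a chunk boundary; whether the platform does the same is not stated here.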
Error Handling and Retries
If any step in the document processing flow fails:
- The document is marked with ‘error’ status
- Error details are captured in the processing run logs
- You can view failed documents using the Get Documents by Status endpoint with status=error
- You can retry processing all failed documents using the Retry All Errors endpoint
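Before retrying, it can help to see which documents actually failed. The sketch below groups a list of document records by status on the client side. The record shape (`id` and `status` keys) is an assumption for illustration; check the actual response of the documents endpoint before relying on these field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_status(documents: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each status to the IDs of the documents currently in that status.

    Assumes every record has hypothetical "id" and "status" fields.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        groups[doc["status"]].append(doc["id"])
    return dict(groups)


# Example: decide whether a retry is worth triggering.
docs = [
    {"id": "doc-1", "status": "trained"},
    {"id": "doc-2", "status": "error"},
    {"id": "doc-3", "status": "error"},
]
failed_ids = group_by_status(docs).get("error", [])
```

If `failed_ids` is non-empty, that is the point at which you would inspect the processing run logs or call the Retry All Errors endpoint.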
Integration with Agents
Knowledge Bases can be integrated with AI agents to provide context for conversations:
- Create a Knowledge Base and add documents
- Wait for the documents to be fully processed (status=trained)
- Create or update an agent with the Knowledge Base ID
- The agent will now use the Knowledge Base to provide context-aware responses
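The "wait until processed" step is typically a polling loop. The sketch below shows one way to write it, under stated assumptions: `fetch_statuses` is a hypothetical callable that you supply (for example, a thin wrapper around the Get Knowledge Base Documents endpoint) and that returns the current status string of every document. No real endpoint URL or client library is assumed.

```python
import time
from typing import Callable, List


def wait_until_trained(
    fetch_statuses: Callable[[], List[str]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every document is "trained".

    Returns True when all documents are trained, False on timeout, and
    raises RuntimeError as soon as any document reports "error".
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        statuses = fetch_statuses()
        if statuses and all(s == "trained" for s in statuses):
            return True
        if any(s == "error" for s in statuses):
            raise RuntimeError("at least one document failed processing")
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
```

Only attach the Knowledge Base ID to the agent once this returns `True`. Because adding a document does not by itself start ingestion, documents may stay in `uploaded` until processing is triggered, in which case the loop simply runs until the timeout.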
Batch Processing
For large document sets, the platform supports batch processing:
- Upload multiple documents in a single request
- Monitor processing status through the Running Jobs endpoint
- The system automatically manages concurrent processing to optimize performance
Best Practices
Organizing Documents
For optimal Knowledge Base management:
- Use consistent naming conventions for documents
- Organize documents in a logical folder structure using path prefixes
- Group related documents together for better context retrieval
- Consider document size and complexity when uploading (very large documents may need to be split)
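One way to keep path prefixes consistent is to build them with a small helper instead of typing them by hand. This is a client-side naming sketch only: the function name `with_prefix` is hypothetical, and how the platform interprets a path prefix is not specified in this document.

```python
import posixpath


def with_prefix(prefix: str, filename: str) -> str:
    """Join a folder-style prefix and a file name using forward slashes.

    Strips stray slashes and whitespace from the prefix and drops any
    local directory part from the file name.
    """
    cleaned = prefix.strip().strip("/")
    name = posixpath.basename(filename.replace("\\", "/"))
    return f"{cleaned}/{name}" if cleaned else name


# Example: every HR policy lands under the same logical folder.
target = with_prefix("/policies/hr/", "C:\\docs\\leave.pdf")
```

Here `target` is `"policies/hr/leave.pdf"` regardless of where the file lived locally.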
Performance Optimization & Recommendations
To get the best performance from your Knowledge Bases:
- Keep individual documents focused on specific topics
- Use descriptive filenames that reflect document content
- Remove unnecessary formatting, headers, footers, and boilerplate text
- For websites, configure crawl depth appropriately to avoid irrelevant content
- Regularly review and remove outdated or irrelevant documents
- Group related documents together in a folder for better context retrieval
- Create separate KBs for different topics rather than one KB containing a large number of documents
Supported File Types
Knowledge Bases can process a wide variety of file formats to accommodate different content types and sources. Understanding which file types are supported helps ensure successful document ingestion.
Document Formats
Category | Supported Formats |
---|---|
Documents | PDF, DOCX, DOC, PPTX, PPT |
Text Files | TXT, HTML, MD, JSON |
Images | JPG/JPEG, PNG |
Archives | ZIP |
Google Workspace | Google Docs, Google Slides |
Office Documents
Our platform supports all major document formats including PDF, Microsoft Word (DOCX, DOC), and PowerPoint (PPTX, PPT) files.
Web & Code Content
Process plain text (TXT), web pages (HTML), documentation (MD), and structured data (JSON) with full text extraction.
Visual Content
Extract text from images (JPG/JPEG, PNG) using advanced OCR technology to make visual content searchable.
Packaged Content
Upload ZIP archives containing multiple documents for batch processing, with automatic extraction and organization.
File Size Limits
File Type | Maximum Size | Notes |
---|---|---|
Documents | 50 MB | Includes PDF, DOCX, DOC, etc. |
Images | 20 MB | Text will be extracted using OCR |
Archives | 100 MB | Contents will be extracted and processed individually |
Very large files may take longer to process and could impact system performance. Consider splitting large documents into smaller, more focused files for optimal results.
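A pre-upload check against the limits in the table can catch oversized files before they are sent. The sketch below is a client-side illustration, not platform behaviour. Two assumptions are worth calling out: "MB" is taken to mean 1024 × 1024 bytes, and text files (TXT, HTML, MD, JSON) are allowed through unchecked because the table lists no limit for them.

```python
from pathlib import PurePath
from typing import Tuple

MB = 1024 * 1024  # assumption: the table's "MB" means 2**20 bytes

# Limits copied from the File Size Limits table.
LIMITS_MB = {"documents": 50, "images": 20, "archives": 100}

# Extensions taken from the Document Formats table.
CATEGORY_BY_EXTENSION = {
    "pdf": "documents", "docx": "documents", "doc": "documents",
    "pptx": "documents", "ppt": "documents",
    "jpg": "images", "jpeg": "images", "png": "images",
    "zip": "archives",
    "txt": "text", "html": "text", "md": "text", "json": "text",
}


def check_upload(filename: str, size_bytes: int) -> Tuple[bool, str]:
    """Return (ok, reason) for a file about to be uploaded."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    category = CATEGORY_BY_EXTENSION.get(ext)
    if category is None:
        return False, f"unsupported file type: .{ext}"

    limit_mb = LIMITS_MB.get(category)
    if limit_mb is None:
        # Assumption: no limit is listed for this category, so none is enforced.
        return True, "no size limit listed for this category"

    if size_bytes > limit_mb * MB:
        return False, f"{category} files are limited to {limit_mb} MB"
    return True, "ok"
```

For instance, `check_upload("scan.PNG", 21 * MB)` is rejected because images are capped at 20 MB, while `check_upload("bundle.zip", 100 * MB)` passes because it sits exactly on the archive limit.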
Text Extraction
The platform uses specialized parsers to extract text from different file types:
- PDF documents: Full text extraction with layout preservation
- Microsoft Office: Structured content extraction with formatting awareness
- Images: Optical Character Recognition (OCR) for text extraction
- Archives: Automatic extraction and processing of contained files
Special Considerations
- Password-protected files are not supported and will fail during processing
- Scanned documents are supported through OCR but may have lower accuracy
- Corrupted files will fail during processing with appropriate error messages
- Embedded content in documents (like images in PDFs) is processed when possible