Noxus provides a comprehensive operational framework designed to give you deep visibility into your AI infrastructure and the tools to manage it at scale.

Observability & Monitoring

Noxus leverages industry-standard tools to provide a 360-degree view of your deployment’s health and performance.

Prometheus Metrics

Standardized /metrics endpoints across all services provide real-time counters and histograms. Track flow execution rates, worker utilization, and system-wide throughput.
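These endpoints speak the standard Prometheus text exposition format. A minimal sketch of what one rendered counter looks like (the metric name and labels below are illustrative, not Noxus's actual metric names):

```python
# Minimal sketch of the Prometheus text exposition format that a
# /metrics endpoint returns. Metric and label names are illustrative.

def render_counter(name, help_text, value, labels=None):
    """Render one counter in Prometheus text format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )

print(render_counter(
    "flow_executions_total",
    "Total flow executions.",
    42,
    {"workspace": "default"},
))
```

Any Prometheus server can scrape output in this shape, which is why a single scrape configuration covers every service.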

OpenTelemetry Tracing

Distributed tracing powered by OpenTelemetry allows you to follow a single request across the frontend, backend, and worker pools to identify bottlenecks.
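The glue that ties those spans together is the W3C Trace Context `traceparent` header, which OpenTelemetry propagates on every hop. A hand-rolled sketch of building and parsing that header, for illustration only (real deployments let the OpenTelemetry SDK do this):

```python
# Sketch of the W3C Trace Context `traceparent` header, which
# OpenTelemetry propagates so spans from different services join one
# distributed trace. Format: version-trace_id-span_id-flags.
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 random bytes, hex
    span_id = span_id or secrets.token_hex(8)     # 8 random bytes, hex
    flags = "01" if sampled else "00"             # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Split a traceparent header back into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent()
print(parse_traceparent(header)["sampled"])  # True
```

Because the `trace_id` stays constant across services while each hop mints a new `span_id`, a tracing backend can reassemble the full request path and surface the slowest hop.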

Auditability & Compliance

Noxus maintains a high-fidelity record of all platform activity, ensuring you can meet strict regulatory and security requirements.

Platform Audit Logs

Every administrative and management action is recorded in a tamper-proof Audit Log. This includes:
  • Identity: User ID, email, and API key used for the action.
  • Context: Tenant and Workspace identifiers.
  • Action: The specific operation performed (e.g., create, update, delete, execute).
  • Resource: The type and ID of the resource affected (e.g., workflow, agent, knowledge_base).
  • Payload: The request body and metadata associated with the change.
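One common way to make a log of records like these tamper-evident is hash chaining: each entry stores a hash of its predecessor, so editing any record invalidates every later hash. The sketch below mirrors the fields listed above; the chaining scheme is illustrative, not a description of Noxus internals:

```python
# Hash-chained audit entries: altering any record breaks every
# subsequent hash. Field names mirror the audit log description above;
# the scheme itself is an illustrative assumption.
import hashlib
import json

def append_entry(log, entry):
    """Append an audit entry, chaining it to its predecessor via SHA-256."""
    prev_hash = log[-1]["hash"] if log else "0" * 64  # genesis marker
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append(dict(entry, prev_hash=prev_hash, hash=digest))
    return log

audit_log = []
append_entry(audit_log, {"user_id": "u1", "tenant": "t1", "action": "create",
                         "resource": "workflow", "resource_id": "wf_1"})
append_entry(audit_log, {"user_id": "u1", "tenant": "t1", "action": "delete",
                         "resource": "agent", "resource_id": "ag_9"})
print(audit_log[1]["prev_hash"] == audit_log[0]["hash"])  # True
```

Verification is a single pass: recompute each digest from the stored `prev_hash` and payload, and flag the first mismatch.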

API & Access Logs

Detailed logs of every incoming API call are maintained to track usage patterns and security events:
  • Performance: Request duration (ms) and response codes.
  • Routing: HTTP method and exact route accessed.
  • Attribution: Mapping of every call to a specific user, group, and API key.
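As a concrete (hypothetical) illustration, an entry carrying those fields might be serialized and parsed like this; Noxus's actual log format may differ:

```python
# Hypothetical access-log line format carrying the fields listed above
# (method, route, status, duration, attribution). Not Noxus's real
# wire format -- an assumption for illustration.

def format_access_log(method, route, status, duration_ms, user, api_key_id):
    return (f"{method} {route} status={status} duration_ms={duration_ms} "
            f"user={user} api_key={api_key_id}")

def parse_access_log(line):
    """Recover the structured fields from a formatted log line."""
    method, route, *pairs = line.split()
    fields = dict(p.split("=", 1) for p in pairs)
    fields.update(method=method, route=route)
    return fields

line = format_access_log("POST", "/v1/workflows/run", 200, 134, "alice", "key_123")
print(parse_access_log(line)["duration_ms"])  # '134'
```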

Maintenance & Backups

Ensure your AI solutions remain available and resilient through automated lifecycle management.

Automated Backups

Configure scheduled snapshots for your persistence layer (PostgreSQL) and Object Storage. We recommend a minimum 30-day retention for production environments.
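Enforcing that retention window amounts to pruning snapshots older than the cutoff. A minimal sketch, using the recommended 30-day minimum as the default:

```python
# Retention pruning sketch: keep snapshots inside the retention window,
# flag the rest for deletion. 30 days matches the recommended
# production minimum stated above.
from datetime import datetime, timedelta

def snapshots_to_prune(snapshot_times, now, retention_days=30):
    """Return the snapshot timestamps older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in snapshot_times if t < cutoff]

now = datetime(2024, 6, 1)
snaps = [now - timedelta(days=d) for d in (1, 15, 31, 60)]
print(len(snapshots_to_prune(snaps, now)))  # 2 snapshots past the window
```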

Disaster Recovery

Implement multi-region deployment patterns for critical workloads to ensure zero-downtime failover and RTO/RPO compliance.
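RPO compliance in particular reduces to a simple check: is the time since the last successful cross-region replication within the target? A sketch of that check (the thresholds are examples, not Noxus defaults):

```python
# RPO check sketch: the worst-case data loss window is the time since
# the last successful replication to the standby region. Thresholds
# here are example values, not Noxus defaults.
from datetime import datetime, timedelta

def meets_rpo(last_replicated, now, rpo_minutes):
    """True if the potential data loss window is within the RPO target."""
    return now - last_replicated <= timedelta(minutes=rpo_minutes)

now = datetime(2024, 6, 1, 12, 0)
print(meets_rpo(now - timedelta(minutes=4), now, rpo_minutes=5))   # True
print(meets_rpo(now - timedelta(minutes=20), now, rpo_minutes=5))  # False
```

RTO, by contrast, is measured during failover drills: the wall-clock time from declaring an incident to serving traffic from the standby region.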

Liquid Data Lifecycle

Define data retention rules to automatically move information between high-performance cache and low-cost object storage based on active usage.
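A tiering rule of this kind can be as simple as a threshold on time since last access. A sketch, with an assumed 7-day threshold (not a Noxus default):

```python
# Illustrative tiering rule: recently accessed data stays in the
# high-performance cache, colder data moves to low-cost object storage.
# The 7-day threshold is an assumed example, not a Noxus default.

def storage_tier(days_since_last_access, hot_threshold_days=7):
    """Pick a storage tier based on how recently the data was used."""
    if days_since_last_access <= hot_threshold_days:
        return "cache"
    return "object_storage"

print(storage_tier(2))   # cache
print(storage_tier(90))  # object_storage
```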

Scaling & Resource Management

Dynamic Worker Scaling

Leverage KEDA and HPA to scale your compute resources based on actual demand:
  • Queue-Driven: Automatically spin up workers as task volume increases and scale to zero during idle periods.
  • Workload Isolation: Deploy dedicated worker pools for specific workspaces or high-priority tasks.
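The queue-driven behavior boils down to a replica-count formula of the kind a KEDA queue scaler applies: one worker per batch of queued tasks, zero when the queue is empty. A sketch, with assumed example parameters:

```python
# KEDA-style queue-driven scaling sketch: desired replicas grow with
# queue depth and drop to zero when idle. tasks_per_worker and
# max_workers are assumed example values, not Noxus defaults.
import math

def desired_workers(queue_length, tasks_per_worker=10, max_workers=50):
    """Compute the worker replica count for the current queue depth."""
    if queue_length == 0:
        return 0  # scale to zero while idle
    return min(max_workers, math.ceil(queue_length / tasks_per_worker))

print(desired_workers(0))      # 0
print(desired_workers(25))     # 3
print(desired_workers(10000))  # 50 (capped at max_workers)
```

Dedicated worker pools for workload isolation are simply separate deployments, each running this same loop against its own queue with its own cap.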

Detailed Operations Guide

Explore the full technical guide for monitoring, logging, and scaling your Noxus deployment.