Noxus provides a comprehensive operational framework designed to give you deep visibility into your AI infrastructure and the tools to manage it at scale.

Observability & Monitoring

Noxus leverages industry-standard tools to provide a 360-degree view of your deployment’s health and performance.

Prometheus Metrics

Standardized /metrics endpoints across all services provide real-time counters and histograms. Track flow execution rates, worker utilization, and system-wide throughput.
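These endpoints speak the standard Prometheus text exposition format. A minimal sketch of what one rendered counter looks like (the metric name and labels below are illustrative, not Noxus's actual metric names):

```python
# Minimal sketch of the Prometheus text exposition format that a
# /metrics endpoint returns. Metric and label names are illustrative.

def render_counter(name, help_text, value, labels=None):
    """Render one counter in Prometheus text format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )

print(render_counter(
    "flow_executions_total",
    "Total flow executions.",
    42,
    {"workspace": "default"},
))
```

Any Prometheus server can scrape output in this shape, which is why a single scrape configuration covers every service.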

OpenTelemetry Tracing

Distributed tracing powered by OpenTelemetry allows you to follow a single request across the frontend, backend, and worker pools to identify bottlenecks.
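The glue that ties those spans together is the W3C Trace Context `traceparent` header, which OpenTelemetry propagates on every hop. A hand-rolled sketch of building and parsing that header, for illustration only (real deployments let the OpenTelemetry SDK do this):

```python
# Sketch of the W3C Trace Context `traceparent` header, which
# OpenTelemetry propagates so spans from different services join one
# distributed trace. Format: version-trace_id-span_id-flags.
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 random bytes, hex
    span_id = span_id or secrets.token_hex(8)     # 8 random bytes, hex
    flags = "01" if sampled else "00"             # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Split a traceparent header back into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent()
print(parse_traceparent(header)["sampled"])  # True
```

Because the `trace_id` stays constant across services while each hop mints a new `span_id`, a tracing backend can reassemble the full request path and surface the slowest hop.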

Auditability & Compliance

Noxus maintains a high-fidelity record of all platform activity, ensuring you can meet strict regulatory and security requirements.

Platform Audit Logs

Every administrative and management action is recorded in a tamper-proof Audit Log. This includes:
  • Identity: User ID, email, and API key used for the action.
  • Context: Tenant and Workspace identifiers.
  • Action: The specific operation performed (e.g., create, update, delete, execute).
  • Resource: The type and ID of the resource affected (e.g., workflow, agent, knowledge_base).
  • Payload: The request body and metadata associated with the change.
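One common way to make a log of records like these tamper-evident is hash chaining: each entry stores a hash of its predecessor, so editing any record invalidates every later hash. The sketch below mirrors the fields listed above; the chaining scheme is illustrative, not a description of Noxus internals:

```python
# Hash-chained audit entries: altering any record breaks every
# subsequent hash. Field names mirror the audit log description above;
# the scheme itself is an illustrative assumption.
import hashlib
import json

def append_entry(log, entry):
    """Append an audit entry, chaining it to its predecessor via SHA-256."""
    prev_hash = log[-1]["hash"] if log else "0" * 64  # genesis marker
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append(dict(entry, prev_hash=prev_hash, hash=digest))
    return log

audit_log = []
append_entry(audit_log, {"user_id": "u1", "tenant": "t1", "action": "create",
                         "resource": "workflow", "resource_id": "wf_1"})
append_entry(audit_log, {"user_id": "u1", "tenant": "t1", "action": "delete",
                         "resource": "agent", "resource_id": "ag_9"})
print(audit_log[1]["prev_hash"] == audit_log[0]["hash"])  # True
```

Verification is a single pass: recompute each digest from the stored `prev_hash` and payload, and flag the first mismatch.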

API & Access Logs

Detailed logs of every incoming API call are maintained to track usage patterns and security events:
  • Performance: Request duration (ms) and response codes.
  • Routing: HTTP method and exact route accessed.
  • Attribution: Mapping of every call to a specific user, group, and API key.
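As a concrete (hypothetical) illustration, an entry carrying those fields might be serialized and parsed like this; Noxus's actual log format may differ:

```python
# Hypothetical access-log line format carrying the fields listed above
# (method, route, status, duration, attribution). Not Noxus's real
# wire format -- an assumption for illustration.

def format_access_log(method, route, status, duration_ms, user, api_key_id):
    return (f"{method} {route} status={status} duration_ms={duration_ms} "
            f"user={user} api_key={api_key_id}")

def parse_access_log(line):
    """Recover the structured fields from a formatted log line."""
    method, route, *pairs = line.split()
    fields = dict(p.split("=", 1) for p in pairs)
    fields.update(method=method, route=route)
    return fields

line = format_access_log("POST", "/v1/workflows/run", 200, 134, "alice", "key_123")
print(parse_access_log(line)["duration_ms"])  # '134'
```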

Maintenance & Backups

Ensure your AI solutions remain available and resilient through automated lifecycle management.

Automated Backups

Configure scheduled snapshots for your persistence layer (PostgreSQL) and Object Storage. We recommend a minimum 30-day retention for production environments.
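Enforcing that retention window amounts to pruning snapshots older than the cutoff. A minimal sketch, using the recommended 30-day minimum as the default:

```python
# Retention pruning sketch: keep snapshots inside the retention window,
# flag the rest for deletion. 30 days matches the recommended
# production minimum stated above.
from datetime import datetime, timedelta

def snapshots_to_prune(snapshot_times, now, retention_days=30):
    """Return the snapshot timestamps older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in snapshot_times if t < cutoff]

now = datetime(2024, 6, 1)
snaps = [now - timedelta(days=d) for d in (1, 15, 31, 60)]
print(len(snapshots_to_prune(snaps, now)))  # 2 snapshots past the window
```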

Disaster Recovery

Implement multi-region deployment patterns for critical workloads to ensure zero-downtime failover and RTO/RPO compliance.
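RPO compliance in particular reduces to a simple check: is the time since the last successful cross-region replication within the target? A sketch of that check (the thresholds are examples, not Noxus defaults):

```python
# RPO check sketch: the worst-case data loss window is the time since
# the last successful replication to the standby region. Thresholds
# here are example values, not Noxus defaults.
from datetime import datetime, timedelta

def meets_rpo(last_replicated, now, rpo_minutes):
    """True if the potential data loss window is within the RPO target."""
    return now - last_replicated <= timedelta(minutes=rpo_minutes)

now = datetime(2024, 6, 1, 12, 0)
print(meets_rpo(now - timedelta(minutes=4), now, rpo_minutes=5))   # True
print(meets_rpo(now - timedelta(minutes=20), now, rpo_minutes=5))  # False
```

RTO, by contrast, is measured during failover drills: the wall-clock time from declaring an incident to serving traffic from the standby region.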

Liquid Data Lifecycle

Define data retention rules to automatically move information between high-performance cache and low-cost object storage based on active usage.
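A tiering rule of this kind can be as simple as a threshold on time since last access. A sketch, with an assumed 7-day threshold (not a Noxus default):

```python
# Illustrative tiering rule: recently accessed data stays in the
# high-performance cache, colder data moves to low-cost object storage.
# The 7-day threshold is an assumed example, not a Noxus default.

def storage_tier(days_since_last_access, hot_threshold_days=7):
    """Pick a storage tier based on how recently the data was used."""
    if days_since_last_access <= hot_threshold_days:
        return "cache"
    return "object_storage"

print(storage_tier(2))   # cache
print(storage_tier(90))  # object_storage
```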

Scaling & Resource Management

Dynamic Worker Scaling

Leverage KEDA and HPA to scale your compute resources based on actual demand:
  • Queue-Driven: Automatically spin up workers as task volume increases and scale to zero during idle periods.
  • Workload Isolation: Deploy dedicated worker pools for specific workspaces or high-priority tasks.
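The queue-driven behavior boils down to a replica-count formula of the kind a KEDA queue scaler applies: one worker per batch of queued tasks, zero when the queue is empty. A sketch, with assumed example parameters:

```python
# KEDA-style queue-driven scaling sketch: desired replicas grow with
# queue depth and drop to zero when idle. tasks_per_worker and
# max_workers are assumed example values, not Noxus defaults.
import math

def desired_workers(queue_length, tasks_per_worker=10, max_workers=50):
    """Compute the worker replica count for the current queue depth."""
    if queue_length == 0:
        return 0  # scale to zero while idle
    return min(max_workers, math.ceil(queue_length / tasks_per_worker))

print(desired_workers(0))      # 0
print(desired_workers(25))     # 3
print(desired_workers(10000))  # 50 (capped at max_workers)
```

Dedicated worker pools for workload isolation are simply separate deployments, each running this same loop against its own queue with its own cap.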

Detailed Operations Guide

Explore the full technical guide for monitoring, logging, and scaling your Noxus deployment.