Monitoring

Noxus provides deep visibility into its distributed architecture through standardized health endpoints, Prometheus-compatible metrics, and distributed tracing.

Observability Architecture

The platform is designed to be monitored at three distinct layers: the Service Layer, the Coordination Layer, and the Data Layer.

Key Performance Indicators (KPIs)

To ensure a stable production environment, we recommend monitoring the following signals:

API & Control Plane

Latency: P95/P99 response times for API endpoints.
Error Rates: 4xx and 5xx response codes.
Throughput: Requests per second (RPS).
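
As a rough illustration of how these three signals relate, the sketch below derives them from a batch of request samples. The `RequestSample` type, the `summarize` helper, and the nearest-rank percentile method are assumptions made for this example and are not part of Noxus; in a real deployment these values would more likely come from the histograms and counters exposed at /metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed API request (hypothetical shape, for illustration only)."""
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples, window_seconds):
    """Compute latency percentiles, error rates, and RPS for one time window."""
    durations = [s.duration_ms for s in samples]
    total = len(samples)
    return {
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "rate_4xx": sum(400 <= s.status < 500 for s in samples) / total,
        "rate_5xx": sum(500 <= s.status < 600 for s in samples) / total,
        "rps": total / window_seconds,
    }
```

Tracking 4xx and 5xx separately keeps client mistakes from masking server-side failures, which usually deserve a tighter alert threshold.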

Execution Plane

Queue Depth: Number of tasks waiting in the broker (Redis/RabbitMQ).
Processing Lag: Time between task creation and execution start.
Worker Utilization: CPU/Memory usage per worker pool.
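
The first two signals can be sketched from task timestamps alone. The `TaskRecord` type and both helpers below are hypothetical and only illustrate the definitions above; the sketch also assumes that a task still waiting in the queue counts its wait so far as lag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Timestamps for one task (hypothetical shape, epoch seconds)."""
    created_at: float
    started_at: Optional[float]  # None while the task is still queued


def queue_depth(tasks):
    """Number of tasks that have been created but not yet started."""
    return sum(1 for t in tasks if t.started_at is None)


def max_processing_lag(tasks, now):
    """Largest gap between creation and execution start, in seconds.

    Tasks that have not started yet are measured against `now`.
    """
    lags = [
        (t.started_at if t.started_at is not None else now) - t.created_at
        for t in tasks
    ]
    return max(lags, default=0.0)
```

Watching the two together helps separate a short burst (depth rises, lag stays low) from a worker pool that cannot keep up (both rise).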

Persistence (Postgres)

Connection Pressure: Active vs. maximum allowed connections.
Slow Queries: Queries exceeding the 500 ms threshold.
IOPS: Disk I/O operations per second, particularly during vector search operations.
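
A minimal sketch of the first two checks follows. The helper names are hypothetical, and the inputs are assumed to be collected elsewhere, for example from PostgreSQL's `pg_stat_activity` view and `max_connections` setting for connections, and from whatever query timing source your deployment uses.

```python
SLOW_QUERY_THRESHOLD_MS = 500  # threshold named in this guide


def connection_pressure(active, max_connections):
    """Fraction of the connection limit currently in use (0.0 to 1.0)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def slow_queries(durations_by_query, threshold_ms=SLOW_QUERY_THRESHOLD_MS):
    """Return only the queries whose duration exceeds the threshold."""
    return {
        query: ms
        for query, ms in durations_by_query.items()
        if ms > threshold_ms
    }
```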

Coordination (Redis)

Memory Saturation: Percentage of available memory used.
Eviction Rate: Frequency of keys being removed due to memory limits.
Command Latency: Time taken to process coordination requests.
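
Memory saturation and eviction rate can be derived from the output of the Redis `INFO` command, which reports `used_memory`, `maxmemory`, and `evicted_keys`. The sketch below parses that text directly; the helper names are assumptions for this example, and command latency is not covered because it needs a separate measurement.

```python
def parse_info(text):
    """Parse Redis INFO output ("key:value" lines) into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def memory_saturation(info):
    """Fraction of maxmemory in use, or None when no limit is configured."""
    limit = int(info.get("maxmemory", 0))
    if limit == 0:  # maxmemory 0 means Redis has no memory limit
        return None
    return int(info["used_memory"]) / limit


def eviction_rate(prev_info, curr_info, interval_seconds):
    """Keys evicted per second between two INFO snapshots."""
    delta = int(curr_info["evicted_keys"]) - int(prev_info["evicted_keys"])
    # The counter resets on restart; clamp so a reset never reads as negative.
    return max(delta, 0) / interval_seconds
```

Because `evicted_keys` is a cumulative counter, the rate has to be computed from two snapshots rather than read from a single one.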

Health & Metrics Endpoints

All Noxus services expose standardized endpoints for automated health checks and metrics collection:

Health Checks: /status/health (Used by Kubernetes Liveness/Readiness probes).
Prometheus Metrics: /metrics (Exposes internal service counters and histograms).
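
For ad hoc checks outside Kubernetes, both endpoints can be queried with the Python standard library. The sketch below makes several assumptions: that an HTTP 2xx response from /status/health means the service is healthy, that the metrics are served in the Prometheus text format without timestamps, and that the metric names in the usage note are placeholders rather than real Noxus metrics.

```python
import urllib.request


def probe_health(base_url, timeout=2.0):
    """Return True if <base_url>/status/health answers with a 2xx status.

    Assumes a 2xx response means healthy; the response body is ignored.
    """
    try:
        with urllib.request.urlopen(base_url + "/status/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def parse_metrics(text):
    """Parse a subset of the Prometheus text format into {series: value}.

    Comment lines (# HELP, # TYPE) are skipped. Samples carrying an
    optional timestamp are not handled by this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

For example, feeding `parse_metrics` the body returned by /metrics yields a dictionary keyed by series, such as a hypothetical `noxus_queue_depth` entry, which can then be compared against a threshold.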

Noxus provides a set of default Grafana dashboards in the noxus-infra repository. These preconfigured dashboards cover API performance, worker queue health, and resource utilization across your deployment.

In Kubernetes deployments, the official Helm charts automatically annotate pods for Prometheus scraping, so no per-service scrape configuration is needed.

Alerting Strategy

We recommend setting up alerts for the following critical conditions:

Service Availability: Any core service reporting a non-healthy status.
Queue Backlog: Task queue depth exceeding defined thresholds for more than 5 minutes.
Database Saturation: PostgreSQL connection usage exceeding 80%.
Model Provider Failures: Sustained 5xx errors from external AI providers (OpenAI, Anthropic, etc.).
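
Conditions such as the queue backlog only matter when they persist, so an alert should fire on a sustained breach rather than a single spike. The sketch below shows one way to express "above a threshold for more than 5 minutes"; the `SustainedAlert` class and the threshold of 1000 tasks are hypothetical, and in a Prometheus-based setup the same logic would normally live in an alerting rule instead of application code.

```python
class SustainedAlert:
    """Fires only when a value stays above a threshold longer than a hold period."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time the current breach began, if any

    def observe(self, value, now):
        """Record one reading taken at time `now` (seconds); return True to alert."""
        if value <= self.threshold:
            self._breach_started = None  # recovery resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started > self.hold_seconds


# Hypothetical usage: alert when queue depth exceeds 1000 for more than 5 minutes.
queue_backlog = SustainedAlert(threshold=1000, hold_seconds=5 * 60)
```

Resetting the timer on recovery means a queue that briefly drains and then refills must breach for a full hold period again before the alert fires.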

Logging

Pair metrics with centralized logs for faster root cause analysis.

Scaling

Use monitoring signals to drive automated scaling policies.
