Observability Architecture
The platform is designed to be monitored at three distinct layers: the Service Layer, the Coordination Layer, and the Data Layer.
Key Performance Indicators (KPIs)
To ensure a stable production environment, we recommend monitoring the following signals:
API & Control Plane
- Latency: P95/P99 response times for API endpoints.
- Error Rates: 4xx and 5xx response codes.
- Throughput: Requests per second (RPS).
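As a rough illustration of how the three API signals relate to raw request data, the sketch below derives them from a window of observed requests. The `(latency_ms, status_code)` tuple shape and the function names are assumptions made for this example, not part of Noxus; in practice these values would normally come from Prometheus histograms and counters rather than in-process lists.

```python
import math
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize a window of requests.

    requests: list of (latency_ms, status_code) tuples (hypothetical shape).
    window_seconds: length of the observation window in seconds.
    """
    latencies = [latency for latency, _ in requests]
    # Group status codes by class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(status // 100 for _, status in requests)
    total = len(requests)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "rate_4xx": classes[4] / total,
        "rate_5xx": classes[5] / total,
        "rps": total / window_seconds,
    }
```

Tracking 4xx and 5xx rates separately is deliberate: a rise in 4xx usually points at client behavior, while a rise in 5xx points at the service itself.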
Execution Plane
- Queue Depth: Number of tasks waiting in the broker (Redis/RabbitMQ).
- Processing Lag: Time between task creation and execution start.
- Worker Utilization: CPU/Memory usage per worker pool.
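Queue depth and processing lag can both be derived from task timestamps. The following is a minimal sketch under the assumption that each task records when it was created and when a worker started it; the `TaskRecord` type and its field names are hypothetical and do not describe the Noxus task schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    created_at: float            # epoch seconds when the task was enqueued
    started_at: Optional[float]  # epoch seconds when execution began, None if still waiting


def queue_depth(tasks):
    """Number of tasks that have not started executing yet."""
    return sum(1 for t in tasks if t.started_at is None)


def processing_lags(tasks):
    """Seconds between creation and execution start, for tasks that have started."""
    return [t.started_at - t.created_at for t in tasks if t.started_at is not None]


def oldest_waiting_age(tasks, now):
    """Age in seconds of the longest-waiting task, or 0.0 if nothing is waiting."""
    waiting = [now - t.created_at for t in tasks if t.started_at is None]
    return max(waiting, default=0.0)
```

The age of the oldest waiting task is a useful companion to queue depth: a short queue whose head has been waiting for a long time usually indicates stalled workers rather than high load.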
Persistence (Postgres)
- Connection Pressure: Active vs. maximum allowed connections.
- Slow Queries: Queries exceeding the 500ms threshold.
- IOPS: Disk I/O utilization for vector search operations.
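The first two Postgres signals reduce to simple arithmetic once the raw numbers are available. This sketch assumes the active connection count, the connection limit, and per-query timings have already been collected (for example from `pg_stat_activity`, the `max_connections` setting, and a query statistics source); the function names and the `(query_text, mean_time_ms)` input shape are illustrative only.

```python
SLOW_QUERY_THRESHOLD_MS = 500  # the threshold named in the list above


def connection_pressure(active, max_connections):
    """Fraction of the connection limit currently in use (0.0 to 1.0)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def slow_queries(query_stats, threshold_ms=SLOW_QUERY_THRESHOLD_MS):
    """Return queries whose mean time exceeds the threshold, slowest first.

    query_stats: iterable of (query_text, mean_time_ms) pairs (hypothetical shape).
    """
    slow = [(query, ms) for query, ms in query_stats if ms > threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

A query at exactly 500 ms is not reported, matching the wording "exceeding the 500ms threshold".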
Coordination (Redis)
- Memory Saturation: Percentage of available memory used.
- Eviction Rate: Frequency of keys being removed due to memory limits.
- Command Latency: Time taken to process coordination requests.
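Memory saturation and eviction rate can be computed from two successive snapshots of the Redis `INFO` output, which reports `used_memory`, `maxmemory`, and a cumulative `evicted_keys` counter. The sketch below assumes the snapshots have already been parsed into dictionaries of integers; how they are fetched is left out, and the function names are illustrative.

```python
def memory_saturation(info):
    """Percentage of the configured memory limit in use, or None if no limit is set."""
    maxmemory = info["maxmemory"]
    if maxmemory == 0:
        # Redis reports maxmemory as 0 when no limit is configured.
        return None
    return 100.0 * info["used_memory"] / maxmemory


def eviction_rate(prev_info, curr_info, interval_seconds):
    """Keys evicted per second between two snapshots taken interval_seconds apart."""
    delta = curr_info["evicted_keys"] - prev_info["evicted_keys"]
    if delta < 0:
        # The counter went backwards, so the server restarted; count from zero.
        delta = curr_info["evicted_keys"]
    return delta / interval_seconds
```

Because `evicted_keys` is cumulative, only the difference between snapshots is meaningful; the raw value grows for the lifetime of the server process.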
Health & Metrics Endpoints
All Noxus services expose standardized endpoints for automated health checks and metrics collection:
- Health Checks: /status/health (used by Kubernetes liveness/readiness probes).
- Prometheus Metrics: /metrics (exposes internal service counters and histograms).
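For environments without Kubernetes probes, the two endpoints can be polled directly. The following is a minimal sketch using only the Python standard library. It assumes that an HTTP 200 response means the endpoint is healthy, which the documentation above does not state explicitly; the base URL is a placeholder, and the function names are illustrative.

```python
import urllib.error
import urllib.request

HEALTH_PATH = "/status/health"
METRICS_PATH = "/metrics"


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500 or 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_service(base_url, fetch=fetch_status):
    """Probe both standard endpoints of one service.

    Assumption: a 200 response means the endpoint is working.
    """
    base = base_url.rstrip("/")
    return {
        "healthy": fetch(base + HEALTH_PATH) == 200,
        "metrics_available": fetch(base + METRICS_PATH) == 200,
    }
```

A call such as `check_service("http://localhost:8000")` (placeholder address) returns a dictionary with one boolean per endpoint. The `fetch` parameter exists so the probe logic can be exercised without a running service.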
Noxus provides a set of Default Grafana Dashboards in the noxus-infra repository. These pre-configured dashboards provide immediate visibility into API performance, worker queue health, and resource utilization across your deployment.
In Kubernetes deployments, the official Helm charts automatically annotate pods for Prometheus scraping, so metrics are collected without additional scrape configuration.
Alerting Strategy
We recommend setting up alerts for the following critical conditions:
- Service Availability: Any core service reporting a non-healthy status.
- Queue Backlog: Task queue depth exceeding defined thresholds for more than 5 minutes.
- Database Saturation: PostgreSQL connection usage exceeding 80%.
- Model Provider Failures: Sustained 5xx errors from external AI providers (OpenAI, Anthropic, etc.).
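The queue backlog condition is a sustained-threshold alert: it should fire only when the value stays above the limit for the whole hold period, not on a single spike. The sketch below shows one way to express that logic, alongside the simpler instantaneous check for database saturation. The class and function names are hypothetical, and any queue depth limit is deployment-specific. In a Prometheus setup the same behavior is normally expressed with the `for` clause of an alerting rule rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only after the value has stayed above `limit` for more than `hold_seconds`."""
    limit: float
    hold_seconds: float
    breach_started: Optional[float] = None  # time the current breach began, if any

    def observe(self, value, now):
        """Record one sample taken at time `now` (seconds); return True if the alert should fire."""
        if value <= self.limit:
            # Back under the limit: the breach is over, so reset the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started > self.hold_seconds


def db_saturated(active, max_connections, limit=0.80):
    """True when PostgreSQL connection usage exceeds the limit (80% by default)."""
    return active / max_connections > limit
```

For example, `SustainedThreshold(limit=1000, hold_seconds=300)` models the 5-minute queue backlog rule with an illustrative depth limit of 1000 tasks. Resetting the timer whenever the value drops back under the limit prevents several brief, unrelated spikes from being added together into one alert.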