# Monitoring

> Enterprise observability, real-time metrics, and health tracking for Noxus

Noxus provides deep visibility into its distributed architecture through standardized health endpoints, Prometheus-compatible metrics, and distributed tracing.

## Observability Architecture

The platform is designed to be monitored at three distinct layers: the **Service Layer**, the **Coordination Layer**, and the **Data Layer**.

```mermaid
flowchart LR
  FE[Noxus Frontend] --> PM[Prometheus]
  BE[Noxus Backend] --> PM
  W[Noxus Workers] --> PM
  RE[Noxus Relays] --> PM

  BE --> OT[OpenTelemetry Collector]
  W --> OT
  RE --> OT

  PM --> GR[Grafana]
```

***

## Key Performance Indicators (KPIs)

To ensure a stable production environment, we recommend monitoring the following signals:

<CardGroup cols={2}>
  <Card title="API & Control Plane" icon="server">
    * **Latency**: P95/P99 response times for API endpoints.
    * **Error Rates**: 4xx and 5xx response codes.
    * **Throughput**: Requests per second (RPS).
  </Card>

  <Card title="Execution Plane" icon="bolt">
    * **Queue Depth**: Number of tasks waiting in the broker (Redis/RabbitMQ).
    * **Processing Lag**: Time between task creation and execution start.
    * **Worker Utilization**: CPU/Memory usage per worker pool.
  </Card>

  <Card title="Persistence (Postgres)" icon="database">
    * **Connection Pressure**: Active vs. maximum allowed connections.
    * **Slow Queries**: Queries exceeding the 500 ms threshold.
    * **IOPS**: Disk I/O utilization for vector search operations.
  </Card>

  <Card title="Coordination (Redis)" icon="microchip">
    * **Memory Saturation**: Percentage of available memory used.
    * **Eviction Rate**: Frequency of keys being removed due to memory limits.
    * **Command Latency**: Time taken to process coordination requests.
  </Card>
</CardGroup>
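
As a rough illustration of how the latency and error-rate signals above could be derived from raw request samples, the following Python sketch computes nearest-rank percentiles and an error ratio. The function names and the sample data are hypothetical and are not part of Noxus; in a real deployment these values would more likely come from Prometheus queries over the exported histograms than from hand-written code.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Return the nearest-rank percentile (for example pct=95 for P95)."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Return the fraction of responses with a 4xx or 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return errors / len(status_codes)


# Hypothetical samples: response times in milliseconds and HTTP status codes.
latencies_ms = list(range(1, 101))
statuses = [200, 200, 404, 500]

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
errors = error_rate(statuses)
```

Percentiles are used here instead of averages because a small number of slow requests can hide behind a healthy-looking mean.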

***

## Health & Metrics Endpoints

All Noxus services expose standardized endpoints for automated health checks and metrics collection:

* **Health Checks**: `/status/health` (Used by Kubernetes Liveness/Readiness probes).
* **Prometheus Metrics**: `/metrics` (Exposes internal service counters and histograms).
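
For ad hoc checks outside Kubernetes, the two endpoints above can be probed with a few lines of Python. This is a minimal sketch under assumptions: it treats an HTTP 200 from `/status/health` as healthy (the response body format is not specified here), and it reads `/metrics` as standard Prometheus text exposition output. The helper names are hypothetical.

```python
import urllib.request


def fetch(base_url: str, path: str, timeout: float = 5.0) -> tuple[int, str]:
    """GET base_url + path and return (HTTP status, decoded body)."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def is_healthy(base_url: str) -> bool:
    """Assumption: a 200 response from /status/health means the service is healthy."""
    try:
        status, _ = fetch(base_url, "/status/health")
    except OSError:
        # Covers connection errors, timeouts, and HTTP 4xx/5xx raised by urllib.
        return False
    return status == 200


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into a {series: value} mapping."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: everything up to the last "}" is the series name.
            head, _, tail = line.rpartition("}")
            series, fields = head + "}", tail.split()
        else:
            series, *fields = line.split()
        if fields:
            # The first field is the value; an optional timestamp may follow.
            samples[series] = float(fields[0])
    return samples
```

With a service reachable at a placeholder address such as `http://localhost:8000`, `is_healthy("http://localhost:8000")` returns a boolean, and `parse_metrics(fetch("http://localhost:8000", "/metrics")[1])` returns the current sample values keyed by series.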

<Note>
  Noxus provides a set of **Default Grafana Dashboards** in the `noxus-infra`
  repository. These pre-configured dashboards provide immediate visibility into
  API performance, worker queue health, and resource utilization across your
  deployment.
</Note>

<Note>
  In Kubernetes deployments, the official Helm charts automatically annotate
  pods for Prometheus scraping, so metrics collection requires no additional
  configuration.
</Note>

***

## Alerting Strategy

We recommend setting up alerts for the following critical conditions:

1. **Service Availability**: Any core service reporting a non-healthy status.
2. **Queue Backlog**: Task queue depth exceeding defined thresholds for more than 5 minutes.
3. **Database Saturation**: PostgreSQL connection usage exceeding 80%.
4. **Model Provider Failures**: Sustained 5xx errors from external AI providers (OpenAI, Anthropic, etc.).
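
The queue backlog and database saturation conditions above can be sketched in Python as follows. This is an illustrative model of the alert logic only, not how Noxus or Prometheus evaluates rules; the class name, function name, and the queue threshold of 1000 tasks are assumptions, while the 5-minute window and the 80% limit come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only after a value stays above `threshold` for more than `hold_seconds`."""

    threshold: float
    hold_seconds: float
    _breach_started: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        """Record one observation at time `now` (seconds) and report whether to alert."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now  # first observation above the threshold
        return now - self._breach_started > self.hold_seconds


def connections_saturated(active: int, maximum: int, limit: float = 0.80) -> bool:
    """True when PostgreSQL connection usage exceeds `limit` (80% by default)."""
    return active / maximum > limit


# Hypothetical threshold of 1000 queued tasks, held for more than 5 minutes.
queue_alert = SustainedThreshold(threshold=1000, hold_seconds=300)
```

Requiring the breach to persist before firing keeps short bursts of queued work from paging anyone, which is the same idea as the `for` clause in a Prometheus alerting rule.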

<CardGroup cols={2}>
  <Card title="Logging" icon="clipboard-list" href="/deployment/operations/logging">
    Pair metrics with centralized logs for faster root cause analysis.
  </Card>

  <Card title="Scaling" icon="trending-up" href="/deployment/operations/scaling">
    Use monitoring signals to drive automated scaling policies.
  </Card>
</CardGroup>
