Noxus is designed with a decoupled architecture that allows you to scale the Control Plane (API/Frontend) and the Execution Plane (Workers) independently based on their unique workload profiles.

Service Scaling Model

The platform uses different scaling strategies for each component, optimizing for both performance and cost.

Control Plane

Frontend & Backend
  • Scaled via a standard HPA (Horizontal Pod Autoscaler); a sample manifest follows this list.
  • Triggers based on CPU and Memory utilization.
  • Optimized for consistent API responsiveness.
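
As a reference point, a minimal autoscaling/v2 HPA for the Backend could look like the sketch below. The Deployment name, replica bounds, and utilization targets are illustrative assumptions, not Noxus defaults.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: noxus-backend           # hypothetical Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noxus-backend
  minReplicas: 2                # keep a baseline for API responsiveness
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource            # HPA follows whichever metric demands more replicas
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```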

Execution Plane

Worker Pools
  • Scaled per-pool via KEDA or HPA.
  • Triggers based on task queue depth or resource usage.
  • Optimized for high-throughput AI processing.

Advanced Worker Pool Scaling

Worker pools are the most dynamic part of the Noxus infrastructure. They support sophisticated scaling patterns to handle unpredictable AI workloads.

KEDA-Driven Scaling (Queue-Based)

For most production environments, we recommend KEDA (Kubernetes Event-driven Autoscaling) for worker pools; a sample ScaledObject follows the list below:
  • Scale-to-Zero: Automatically shut down workers when no tasks are in the queue to save costs.
  • Rapid Bursts: Quickly spin up dozens of workers when a high-volume batch job is submitted.
  • Queue Awareness: Scaling is based on the actual number of pending tasks in Redis or RabbitMQ, not just CPU usage.
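
The sketch below shows what a KEDA ScaledObject for a worker pool could look like, using a Redis list as the queue source; a RabbitMQ trigger follows the same pattern. The Deployment name, Redis address, and queue key are hypothetical placeholders.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: noxus-worker-pool
spec:
  scaleTargetRef:
    name: noxus-worker           # hypothetical worker Deployment
  minReplicaCount: 0             # scale-to-zero when the queue is empty
  maxReplicaCount: 50            # ceiling for burst scaling
  pollingInterval: 15            # seconds between queue-depth checks
  cooldownPeriod: 120            # idle time before scaling back to zero
  triggers:
    - type: redis
      metadata:
        address: redis.noxus.svc.cluster.local:6379  # assumed in-cluster Redis service
        listName: noxus:task-queue                   # assumed task queue key
        listLength: "5"          # target pending tasks per replica
```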

Resource-Based Scaling (HPA)

For workloads with consistent, long-running tasks, standard HPA can be used to maintain a steady pool of workers based on CPU or Memory saturation.
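
One useful refinement for long-running tasks is the HPA behavior block, which slows scale-down so workers are not removed mid-task during brief utilization dips. Names and thresholds below are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: noxus-worker-steady     # hypothetical pool name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noxus-worker-steady
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes of low usage before shrinking
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120            # then remove at most one worker every 2 minutes
```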

Multi-Region & Multi-Zone Scaling

For global enterprises, Noxus supports scaling across multiple geographic regions and availability zones.
  • Regional Replicas: Deploy independent Frontend and Backend replicas in different regions to minimize latency for global users.
  • Zone Resilience: Distribute worker pools across multiple availability zones to ensure continuous operation during a zone failure (see the sketch after this list).
  • Independent Policies: Configure unique autoscaling rules for each region based on local traffic patterns.
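
For the zone-resilience point above, Kubernetes topology spread constraints are the standard mechanism. The excerpt below is a minimal sketch; the pod label is an assumption.

```yaml
# Excerpt from a worker Deployment's pod template (spec.template.spec)
topologySpreadConstraints:
  - maxSkew: 1                          # keep zone counts within one pod of each other
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but never block scheduling
    labelSelector:
      matchLabels:
        app: noxus-worker               # hypothetical pod label
```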

Scaling Best Practices

  • Monitor Bottlenecks: Always keep an eye on PostgreSQL and Redis performance, as these can become bottlenecks before your compute resources do.
  • Right-Size Pools: Create dedicated worker pools for different task types (e.g., a GPU pool for inference, a high-memory pool for document processing); a pod-spec sketch follows below.
  • Test Your Limits: Conduct regular load tests to understand the scaling latency of your infrastructure (how long it takes to spin up a new worker).
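
To illustrate the right-sizing point, the excerpt below pins a hypothetical GPU inference pool to dedicated nodes and reserves one GPU per worker. The node label, image, and resource figures are assumptions, and GPU scheduling additionally requires the NVIDIA device plugin on the cluster.

```yaml
# Excerpt from a GPU worker pool's pod template (spec.template.spec)
nodeSelector:
  noxus.ai/pool: gpu-inference          # hypothetical label on the GPU node group
containers:
  - name: worker
    image: noxus/worker:latest          # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        nvidia.com/gpu: 1               # extended resource; requests default to match limits
```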