Service Scaling Model
The platform uses different scaling strategies for its components to optimize for both performance and cost.
Control Plane
Frontend & Backend
- Scaled via standard HPA (Horizontal Pod Autoscaler).
- Triggers based on CPU and Memory utilization.
- Optimized for consistent API responsiveness.
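As a minimal sketch of the HPA setup described above, the Backend could be scaled on CPU and Memory utilization like this (the Deployment name noxus-backend, replica bounds, and target thresholds are illustrative assumptions, not fixed Noxus defaults):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: noxus-backend            # hypothetical name; match your Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noxus-backend
  minReplicas: 2                 # keep a floor for consistent API responsiveness
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # scale out above 80% average memory
```

An equivalent object with its own thresholds would typically be deployed for the Frontend.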
Execution Plane
Worker Pools
- Scaled per-pool via KEDA or HPA.
- Triggers based on task queue depth or resource usage.
- Optimized for high-throughput AI processing.
Advanced Worker Pool Scaling
Worker pools are the most dynamic part of the Noxus infrastructure. They support sophisticated scaling patterns to handle unpredictable AI workloads.
KEDA-Driven Scaling (Queue-Based)
For most production environments, we recommend using KEDA (Kubernetes Event-driven Autoscaling) for worker pools:
- Scale-to-Zero: Automatically shut down workers when no tasks are in the queue to save costs.
- Rapid Bursts: Instantly spin up dozens of workers when a high-volume batch job is submitted.
- Queue Awareness: Scaling is based on the actual number of pending tasks in Redis or RabbitMQ, not just CPU usage.
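A minimal KEDA ScaledObject for a Redis-backed queue might look like the following sketch (the Deployment name, Redis address, and list name are illustrative assumptions; adjust them to your environment):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: noxus-worker-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: noxus-worker-pool        # hypothetical worker Deployment
  minReplicaCount: 0               # scale-to-zero when the queue is empty
  maxReplicaCount: 50              # allow rapid bursts for batch jobs
  pollingInterval: 15              # seconds between queue-depth checks
  cooldownPeriod: 120              # wait before scaling back down to zero
  triggers:
    - type: redis
      metadata:
        address: redis.noxus.svc.cluster.local:6379   # assumed Redis service address
        listName: noxus-tasks                          # assumed task queue key
        listLength: "5"            # target pending tasks per worker replica
```

Under the hood, KEDA manages an HPA for the target Deployment, so queue-based and resource-based scaling follow the same reconciliation machinery.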
Resource-Based Scaling (HPA)
For workloads with consistent, long-running tasks, standard HPA can be used to maintain a steady pool of workers based on CPU or Memory saturation.
Multi-Region & Multi-Zone Scaling
For global enterprises, Noxus supports scaling across multiple geographic regions and availability zones.
- Regional Replicas: Deploy independent Frontend and Backend replicas in different regions to minimize latency for global users.
- Zone Resilience: Distribute worker pools across multiple availability zones to ensure continuous operation during a zone failure (see the sketch after this list).
- Independent Policies: Configure unique autoscaling rules for each region based on local traffic patterns.
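As a sketch of the Zone Resilience point, a worker Deployment's pod template can spread replicas across zones with a standard Kubernetes topology spread constraint (the pod label is an illustrative assumption):

```yaml
# Pod template fragment for a worker Deployment
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                 # keep per-zone replica counts within 1 of each other
      topologyKey: topology.kubernetes.io/zone   # well-known zone label on nodes
      whenUnsatisfiable: ScheduleAnyway          # prefer spreading, but do not block scheduling
      labelSelector:
        matchLabels:
          app: noxus-worker                      # hypothetical pod label
```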
Scaling Best Practices
- Monitor Bottlenecks: Always keep an eye on PostgreSQL and Redis performance, as these can become bottlenecks before your compute resources do.
- Right-Size Pools: Create dedicated worker pools for different task types (e.g., a GPU pool for inference, a high-memory pool for document processing); see the sketch after this list.
- Test Your Limits: Conduct regular load tests to understand the scaling latency of your infrastructure (how long it takes to spin up a new worker).
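To illustrate right-sizing, a dedicated GPU inference pool might pin its workers to GPU nodes and reserve accelerator and memory resources, as in this sketch (the Deployment name, image, node label, and resource amounts are all assumptions for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: noxus-worker-gpu           # hypothetical GPU inference pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: noxus-worker-gpu
  template:
    metadata:
      labels:
        app: noxus-worker-gpu
    spec:
      nodeSelector:
        gpu: "true"                # assumed node label for GPU nodes
      containers:
        - name: worker
          image: noxus/worker:latest   # hypothetical image
          resources:
            requests:
              memory: 8Gi          # headroom for model loading
            limits:
              nvidia.com/gpu: 1    # reserve one GPU per worker
```

A separate high-memory pool would use the same pattern with larger memory requests and no GPU limit, letting each pool scale on the metric that actually constrains its task type.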