
HPA Auto Scaling


Manually adjusting scale every time traffic surges is tedious and risky. With HPA (Horizontal Pod Autoscaler), Pods automatically scale up or down based on service load.

Why Use HPA?
  • 24/7 automatic response: Handles traffic surges even at midnight automatically.
  • Cost savings: Reduces Pods when traffic is low to save resources.
  • Improved stability: Prevents manual operation mistakes.

What is HPA?

HPA (Horizontal Pod Autoscaler) is a core Kubernetes feature that automatically adjusts the number of Pods based on CPU usage, memory usage, or custom metrics. It is called "horizontal" scaling because it increases the number of Pods (vertical scaling increases each Pod's resources instead).

How It Works

[Metrics Server] → [HPA Controller] → [Deployment Scale]
 Collect metrics     Compare targets     Adjust Pod count
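
Under the hood, the settings in this guide map onto a standard Kubernetes `HorizontalPodAutoscaler` resource. A minimal sketch of such a resource (the names `my-service-hpa` and `my-service` are illustrative, not what KIWI generates verbatim):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa      # illustrative name
spec:
  scaleTargetRef:           # the Deployment whose replica count HPA adjusts
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when average CPU exceeds 70% of requests
```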

Supported Metrics

HPA can scale based on various metrics.

  • CPU Usage (Most Common): Scales based on Pod CPU usage. Suitable for compute-intensive workloads like web servers and API servers.

  • Memory Usage: Scales based on Pod memory usage. Suitable for memory-intensive workloads like caching and data processing.

  • Custom Metrics (Advanced): Scales based on metrics collected from Prometheus, etc. Used when scaling based on business logic like requests per second or message queue length is needed.

Which Metric Should You Choose?

For most web services, starting with CPU usage is a good choice. If your workload is memory-intensive, add memory usage as well.


Prerequisites

  • Metrics Server must be installed on the Kubernetes cluster.
  • Service must be deployed on the Kubernetes cluster.

Permission Notice: If you cannot access this feature, please request permission from your organization manager.

If the Metrics Server is not installed, refer to the Metrics Server Installation guide.


HPA Configuration

Step 1: Open Service Deployment Settings

  1. Click the target service on the [Service Management] page
  2. Click the Deploy tab
  3. Click the Deployment Settings or HPA Settings button.

Step 2: Enable HPA

  1. Find the Auto Scaling (HPA) section.
  2. Turn on the Enable HPA toggle.

Step 3: Basic Settings

  • Minimum Replicas (Recommended: 2 or more): Minimum number of Pods to maintain even under low load. Recommend at least 2 for high availability. Setting to 1 creates a single point of failure.
  • Maximum Replicas (Recommended: 10, adjust based on resources): Maximum number of Pods to scale up to even under high load. Set appropriately considering cluster resources and costs.
  • Scale Up Stabilization (Recommended: 0 seconds): Wait time before adding Pods. Generally set to 0 seconds to respond immediately to load increases.
  • Scale Down Stabilization (Recommended: 300 seconds): Wait time before reducing Pods. If too short, flapping occurs where Pods are frequently created/deleted due to load fluctuations. Recommend minimum 5 minutes (300 seconds).
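
In the Kubernetes HPA v2 API, the two stabilization settings above correspond to the `behavior` section of the HPA spec. A sketch using the recommended values:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react immediately to load increases
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes before removing Pods (prevents flapping)
```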

Step 4: Metric Settings

CPU-Based Scaling

  1. Click the Add Metric button.
  2. Metric Type: Select CPU
  3. Set target value:
    • Average Usage: 50-80% recommended.
    • Example: 70% → Scale up when average CPU exceeds 70%

Memory-Based Scaling

  1. Click the Add Metric button.
  2. Metric Type: Select Memory
  3. Set target value:
    • Average Usage: 70-90% recommended.
    • Example: 80% → Scale up when average memory exceeds 80%
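
Both metric types above are expressed as `Resource` metrics in the HPA spec. A sketch using the example targets (70% CPU, 80% memory):

```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale up when average CPU exceeds 70% of requests
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # scale up when average memory exceeds 80% of requests
```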

Step 5: Save and Apply

  1. Click the Save button.
  2. KIWI creates the HPA resource in the cluster.
  3. Confirm the "HPA has been applied" message

HPA Status Monitoring

View in Service Management

You can check HPA status on the deployed service card:

  • Current Replicas: Current number of running Pods.
  • Target Replicas: Target number of Pods determined by HPA
  • Current Metric: Current CPU/memory usage
  • Target Metric: Configured target value

View in Operations Modal

  1. Click the Operations button for the service.
  2. Select the Scaling tab
  3. Check HPA status and history:
    • Scale up/down events
    • Metric trend graphs
    • Replica change history

Advanced Settings

Scaling Policies

Fine-tune scaling behavior.

Scale Up Policy

  • Type (Pods / Percent): Specify scaling unit as absolute count (Pods) or percentage (Percent). For example, with Pods, scales by a fixed count; with Percent, scales by percentage of current Pod count.
  • Value: Number or percentage of Pods to add. For example, if type is Pods and value is 4, adds maximum 4 Pods at a time.
  • Period: Time window over which the policy applies. For example, with 15 seconds, the scaling limit applies per 15-second window.

Example: Type Pods, Value 4, Period 15 seconds → Add up to 4 Pods every 15 seconds.

Scale Down Policy

  • Type (Pods / Percent): Specify reduction unit as absolute count (Pods) or percentage (Percent). Percent is recommended for gradual reduction.
  • Value: Number or percentage of Pods to reduce. For example, if type is Percent and value is 10, reduces by 10% of current Pod count.
  • Period: Time interval for policy application. Longer intervals than scale up are recommended for scale down.

Example: Type Percent, Value 10, Period 60 seconds → Remove up to 10% of current Pods every 60 seconds.
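
The two policy examples above translate directly into the `behavior.policies` fields of the HPA v2 spec. A sketch, assuming the example values:

```yaml
behavior:
  scaleUp:
    policies:
      - type: Pods
        value: 4            # add at most 4 Pods...
        periodSeconds: 15   # ...per 15-second window
  scaleDown:
    policies:
      - type: Percent
        value: 10           # remove at most 10% of current Pods...
        periodSeconds: 60   # ...per 60-second window
```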

Multiple Metrics

You can combine multiple metrics for scaling.

  1. Configure multiple metrics with Add Metric
  2. HPA selects the highest replica count among all metrics.
  3. Example: CPU 70%, Memory 80% both configured.
    • CPU requires 4, Memory requires 6 → Scales to 6

Custom Metrics (Advanced)

Custom metrics available with Prometheus integration:

  1. Select Custom Metrics option.
  2. Enter metric name (e.g., http_requests_per_second)
  3. Set target value (e.g., 1000)
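
Custom metrics typically require a metrics adapter (for example, Prometheus Adapter) so the metric is exposed through the Kubernetes custom metrics API. A hedged sketch of a `Pods`-type metric matching the example values above:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # must be exposed via the custom metrics API
      target:
        type: AverageValue
        averageValue: "1000"             # target 1000 requests per second per Pod
```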

Practical Use Scenarios

Scenario 1: Web Application

Requirements: Auto scale based on traffic

Settings:

  • Min: 2, Max: 20
  • CPU: 70% target
  • Scale down stabilization: 300 seconds.

Scenario 2: API Server

Requirements: Fast scale up, gradual scale down

Settings:

  • Min: 3, Max: 50
  • CPU: 60% target
  • Scale up: Immediate (0 seconds)
  • Scale down: 10-minute stabilization.

Scenario 3: Batch Processor

Requirements: Memory-intensive workload

Settings:

  • Min: 1, Max: 10
  • Memory: 80% target
  • Scale down: 5-minute stabilization.

Resource Request Settings

For HPA to work accurately, Pod resource requests must be configured.

How to Set Resource Requests

  1. Check Resources section in Deployment Settings
  2. Configure resource requests:
  • CPU Request (Recommended: 100m - 500m): Amount of CPU guaranteed for the Pod. Set based on application's average CPU usage. HPA calculates usage based on this value.
  • Memory Request (Recommended: 128Mi - 512Mi): Amount of memory guaranteed for the Pod. Set based on application's average memory usage.
  • CPU Limit (Recommended: 2-4x request): Maximum CPU the Pod can use. Set to 2-4x of request to handle burst traffic.
  • Memory Limit (Recommended: 1.5-2x request): Maximum memory the Pod can use. Pod is terminated by OOM if this is exceeded, so set with margin.

Example Configuration

```yaml
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

With this configuration, an HPA CPU target of 70% means:

  • HPA tries to maintain an average usage of 200m × 70% = 140m per Pod.
  • It scales up when average CPU usage exceeds 140m (70% of the 200m request).

Troubleshooting

HPA Not Scaling

  • When metrics are not displayed: Metrics Server is not installed. First install Metrics Server by referring to the Metrics Server Installation guide.
  • When displayed as Unknown: HPA cannot collect metrics. Check if resource requests are configured on Pods. HPA cannot calculate usage without resource requests.
  • When metrics stay below the target: Current load is low, so scaling is not needed. This is normal behavior; Pods will scale automatically when load increases.

Excessive Scaling

  • When scale up/down repeats frequently (flapping): Scale down stabilization time is too short. Increase stabilization time to 300 seconds (5 minutes) or more.
  • When too many Pods are created: CPU/memory target values are set too low. Increase target values to 70-80%.
  • When Pods are Pending due to resource shortage: Maximum replicas reached or cluster resources are insufficient. Adjust maximum replica value or add nodes.

Scale Down Delay

  • Cause: Scale down stabilization time
  • Resolution: Shorten stabilization time if needed (but watch for flapping)

Best Practices

Here are recommendations for using HPA effectively.

Configuration Recommendations

Proven Settings

The settings below have been validated in many projects. Use them as a reference.

  1. Minimum Replicas: 2 or more - Essential for high availability. One replica creates a single point of failure.
  2. Target CPU: 70-80% - Too low causes unnecessary costs.
  3. Scale Down Stabilization: At least 300 seconds - Prevents flapping (frequent scaling up/down).
  4. Resource Requests: Based on actual usage - Required for HPA to work accurately.

Monitoring

Regularly check that HPA is working well.

  1. Check Metrics: Regularly review HPA status and current replica count.
  2. Alert Settings: Set up alerts for when maximum replicas are reached.
  3. Cost Monitoring: Track cloud costs associated with scaling.

Testing

Validate HPA behavior before production deployment.

  1. Load Testing: Increase traffic to verify scale up works correctly.
  2. Stabilization Testing: Reduce traffic to verify scale down works appropriately.
  3. Limit Testing: Check how the service behaves when maximum replicas are reached.

Test Before Production

Don't apply HPA settings directly to production. Test thoroughly in a staging environment first.


Disable HPA

To disable HPA:

  1. Turn off the Enable HPA toggle in Deployment Settings
  2. Click the Save button.
  3. The HPA resource is deleted and the service switches back to manual scaling.