Kubernetes Horizontal Pod Autoscaling

Introduction

There are two common scaling methods: vertical scaling and horizontal scaling.

Vertical scaling means giving an existing machine or instance more resources, such as RAM or CPU. Horizontal scaling, on the other hand, means adding more instances: more replicas of an app to make full use of the resources available on a node, or more nodes in the cluster.

However, adding replicas on a fixed set of nodes has its limits. Once the nodes' resources are maxed out, you need larger nodes or additional ones. This article focuses on horizontal scaling with the Kubernetes Horizontal Pod Autoscaler (HPA), which automatically increases or decreases the number of Pod replicas based on demand.

Implementation Process

1. Build a Docker image for your application.

2. Deploy the image using a Deployment and LoadBalancer service.

3. Configure HPA to automatically scale resources.

To use HPA for auto-scaling based on CPU/Memory, Kubernetes must have the metrics-server installed. If you’re using a cloud provider, the metrics-server is usually installed by default. For local Kubernetes setups, you need to manually install the metrics-server.

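If you’re using Kind for a local Kubernetes setup, note that the kubelet certificates on Kind are typically not signed by the cluster CA, so the metrics-server may need the --kubelet-insecure-tls argument (for local testing only) before its Pod becomes Ready. Install it after the cluster has been created: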

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Or install it with Helm:

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm upgrade --install metrics-server metrics-server/metrics-server --namespace kube-system

To check that the metrics-server is installed and running:

kubectl get pods -n kube-system

1. Build a Docker Image for the Application

Use the following code to create a Node.js Express server:

import express from 'express'

const port = 3000
const app = express()

app
  .get('/', (_, res) => {
    res.send('This is NodeJS Typescript Application! Current time is ' + Date.now())
  })
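  // CPU-heavy endpoint used to generate load: sums the integers 0..value-1
  .get('/sum', (req, res) => {
    const value = Number(req.query.value)
    // Array(NaN) or a negative length would throw a RangeError, so validate first
    if (!Number.isInteger(value) || value < 0) {
      res.status(400).json({error: 'value must be a non-negative integer'})
      return
    }
    const start = Date.now()
    const result = Array(value)
      .fill(0)
      .map((_, i) => i)
      .reduce((a, b) => a + b, 0) // initial value avoids a TypeError when value is 0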
    const now = Date.now()
    const duration = now - start
    res.json({duration, now, result})
  })
  .listen(port, () => {
    console.log(`Server is running http://localhost:${port}`)
  })

Next, let's build the Docker image and push it to Google Artifact Registry or Docker Hub. You can refer to my guide on how to do this here.

2. Deploy the image using a Deployment and a LoadBalancer service

Create a `deployment.yml` file that includes the configuration for the Deployment to deploy the image you built, along with a LoadBalancer service, as shown below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-name
  labels:
    name: label-name
spec:
  selector:
    matchLabels:
      app: label-name
  template:
    metadata:
      labels:
        app: label-name
    spec:
      restartPolicy: Always
      containers:
        - name: express-ts
          image: express-ts
          resources:
            requests: # reserved for the Pod; HPA utilization is measured against these
              memory: "100Mi"
              cpu: "100m"
            limits: # hard cap enforced at runtime
              memory: "300Mi"
              cpu: "300m"
---
apiVersion: v1
kind: Service
metadata:
  name: service-name
  labels:
    service: label-name
spec:
  selector:
    app: label-name
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80 # port exposed by the Service
      targetPort: 3000 # container port the app listens on

I've explained the details about deployment and the LoadBalancer service in this article.

The Deployment above also sets resource requests and limits. You can configure CPU, memory, or both, depending on which metric you want to scale on.

If you define resource requests, HPA can target a percentage of the requested value (target type Utilization).
If you don't define resource requests, HPA cannot compute a percentage, so you must target an absolute value instead (target type AverageValue).
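To make the "percentage of the requested value" concrete, here is a minimal sketch of how utilization relates to the request. The helper names parseCpuMillicores and cpuUtilizationPercent are hypothetical and exist only for this illustration; the parser handles just the two CPU quantity forms used in this article ("100m" and plain cores such as "0.3").

```typescript
// Hypothetical helper: parse a Kubernetes CPU quantity into millicores.
// Only the forms "100m" (millicores) and "0.3" (cores) are supported here.
function parseCpuMillicores(quantity: string): number {
  const match = /^(\d+(?:\.\d+)?)(m?)$/.exec(quantity.trim())
  if (!match) {
    throw new Error(`Unsupported CPU quantity: ${quantity}`)
  }
  const value = Number(match[1])
  return Math.round(match[2] === 'm' ? value : value * 1000)
}

// Hypothetical helper: utilization is measured against the request, not the limit.
function cpuUtilizationPercent(usage: string, request: string): number {
  const requested = parseCpuMillicores(request)
  if (requested === 0) {
    throw new Error('A non-zero CPU request is required to compute utilization')
  }
  return (parseCpuMillicores(usage) / requested) * 100
}

console.log(cpuUtilizationPercent('80m', '100m'))  // 80
console.log(cpuUtilizationPercent('300m', '100m')) // 300
```

With the manifest above (request 100m, limit 300m), a Pod running at its CPU limit reports 300% utilization, because the percentage is relative to the request.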

3. Configuring HPA for Auto-scaling Resources

You can include the HPA configuration either in your `deployment.yml` file or in a separate file with the following content:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-name # the Deployment to scale
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu # scale based on CPU
        target:
          type: Utilization
          averageUtilization: 80 # target 80%
  behavior:
    scaleDown:
      policies:
        - type: Pods
          periodSeconds: 30
          value: 3
      stabilizationWindowSeconds: 120

minReplicas, maxReplicas: The minimum and maximum number of Pod replicas HPA may run.
metrics: The metric to scale on; in this case, CPU.
averageUtilization: The target average CPU usage across all Pods, as a percentage of each Pod's CPU request. HPA adds replicas when the average rises above this target and removes them when it falls below.
behavior: Optional. Here it limits scale-down to at most 3 Pods every 30 seconds.
stabilizationWindowSeconds: How far back HPA looks before scaling down. It only removes Pods once the recommended replica count has stayed lower for this whole window (the default for scale-down is 300 seconds, i.e. 5 minutes).
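The fields above feed the replica calculation described in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1 (0.1 by default). The following is a simplified sketch of that formula, not the controller's actual code; the HpaInput type and desiredReplicas function are names invented for this example, and it ignores details such as Pods that are not ready or missing metrics.

```typescript
// Hypothetical input shape for this illustration only.
interface HpaInput {
  currentReplicas: number
  currentUtilization: number // average across Pods, percent of the CPU request
  targetUtilization: number  // averageUtilization from the manifest
  minReplicas: number
  maxReplicas: number
}

// Default tolerance: no scaling when the ratio is within 10% of 1.0.
const TOLERANCE = 0.1

function desiredReplicas(input: HpaInput): number {
  const ratio = input.currentUtilization / input.targetUtilization
  if (Math.abs(ratio - 1) <= TOLERANCE) {
    return input.currentReplicas
  }
  const raw = Math.ceil(input.currentReplicas * ratio)
  // Clamp the result to the configured bounds.
  return Math.min(input.maxReplicas, Math.max(input.minReplicas, raw))
}

const bounds = { targetUtilization: 80, minReplicas: 1, maxReplicas: 10 }

console.log(desiredReplicas({ ...bounds, currentReplicas: 2, currentUtilization: 160 })) // 4
console.log(desiredReplicas({ ...bounds, currentReplicas: 1, currentUtilization: 85 }))  // 1 (within tolerance)
console.log(desiredReplicas({ ...bounds, currentReplicas: 5, currentUtilization: 300 })) // 10 (capped by maxReplicas)
console.log(desiredReplicas({ ...bounds, currentReplicas: 4, currentUtilization: 20 }))  // 1
```

Note the second case: with the default tolerance, utilization slightly above the 80% target does not trigger a scale-up on its own.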

Next, apply the file to create the resources:

kubectl apply -f deployment.yml

Note: I've defined all resources in a single file for simplicity, but in practice, you should separate each resource into individual YAML files for better management.

Once the resources are created, you can inspect them with kubectl get deployment,service,hpa.

Make sure the API responds at the EXTERNAL-IP of the LoadBalancer service before continuing to test the HPA.

Testing HPA

You can test the API using any method you like. The script below sends 10 requests every second. Replace the URL with the EXTERNAL-IP of your LoadBalancer service.

const numOfRequest = 10
const url = 'http://172.23.0.3/sum?value=10000000'
let idx = 0

setInterval(() => {
  Promise.all(
    Array(numOfRequest)
      .fill(0)
      .map(() =>
        fetch(url)
          .then(res => res.json())
          .then(data => console.log('Completed', ++idx, data.duration))
          .catch(console.error)
      )
  )
}, 1000)

Once the script is running, CPU usage on the Pods climbs, which triggers the auto-scaling process.

You can check whether HPA has scaled the Deployment by watching it, for example with kubectl get hpa hpa-name --watch.

You will notice that the number of replicas gradually increases when average CPU utilization exceeds the 80% target (the averageUtilization field) and gradually decreases once the load has stayed low for the stabilization window (the stabilizationWindowSeconds field).
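The delayed scale-down can be pictured as follows: for scale-down, HPA looks at the replica recommendations computed during the stabilization window and acts on the highest one, so a brief dip in load does not remove Pods immediately. This is a rough sketch of that idea only; the Recommendation type and stabilizedScaleDown function are hypothetical names, and the sketch ignores the scale-down policy (3 Pods per 30 seconds) that is applied on top.

```typescript
// Hypothetical record of one past recommendation.
interface Recommendation {
  atSeconds: number // time at which the recommendation was computed
  replicas: number  // replica count recommended at that time
}

// Pick the highest recommendation inside the window; never scale up here.
function stabilizedScaleDown(
  history: Recommendation[],
  nowSeconds: number,
  windowSeconds: number,
  currentReplicas: number
): number {
  const recent = history
    .filter(r => r.atSeconds <= nowSeconds && nowSeconds - r.atSeconds <= windowSeconds)
    .map(r => r.replicas)
  if (recent.length === 0) {
    return currentReplicas
  }
  return Math.min(currentReplicas, Math.max(...recent))
}

// Load drops after t=0, so the recommendations shrink over time.
const history: Recommendation[] = [
  { atSeconds: 0, replicas: 6 },
  { atSeconds: 60, replicas: 4 },
  { atSeconds: 120, replicas: 2 },
  { atSeconds: 180, replicas: 2 },
]

console.log(stabilizedScaleDown(history, 120, 120, 6)) // 6 (the t=0 recommendation is still in the window)
console.log(stabilizedScaleDown(history, 180, 120, 6)) // 4
console.log(stabilizedScaleDown(history, 300, 120, 4)) // 2
```

With the 120-second window from the manifest, the replica count only drops to 2 once every higher recommendation has aged out of the window.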
