
Kubernetes Beginner Series 9 — Resource Management and Autoscaling

Kubernetes Series (9/12)
  1. Kubernetes Beginner Series 1 — What Is Kubernetes
  2. Kubernetes Beginner Series 2 — Cluster Architecture
  3. Kubernetes Beginner Series 3 — Pod
  4. Kubernetes Beginner Series 4 — Controllers
  5. Kubernetes Beginner Series 5 — Services and Networking
  6. Kubernetes Beginner Series 6 — Ingress and Gateway API
  7. Kubernetes Beginner Series 7 — ConfigMap and Secret
  8. Kubernetes Beginner Series 8 — Storage: PV, PVC, StorageClass
  9. Kubernetes Beginner Series 9 — Resource Management and Autoscaling
  10. Kubernetes Beginner Series 10 — RBAC and Security: The Principle of Least Privilege
  11. Kubernetes Beginner Series 11 — Observability: Logs, Metrics, and Traces
  12. Kubernetes Beginner Series 12 — Helm and Package Management

Why You Must Specify Resources

The moment you create a Pod in Kubernetes, the cluster scheduler asks one question: “Which node should this Pod be placed on?”

To answer that, it needs to know how many resources the Pod will consume: how much CPU it requires and how much memory it needs. Without this information, the scheduler places the Pod on an arbitrary node. If you’re unlucky, resource-hungry Pods pile up on a single node and start killing each other with OOM errors.

That’s why Kubernetes has you specify resources in the Pod spec. These values form the foundation for scheduling, OOM handling, and autoscaling. Misconfigure them, and you end up in a strange situation where nodes have plenty of headroom but Pods are starved.

requests and limits

Resource specification has two axes: requests, which the scheduler reserves for the container, and limits, the hard cap it can never exceed:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m"      # 0.25 vCPU
          memory: "256Mi"
        limits:
          cpu: "500m"      # 0.5 vCPU
          memory: "512Mi"

CPU uses the m (millicore) unit. 1000m equals 1 vCPU. 250m means 0.25 cores. Memory uses Mi (Mebibyte) and Gi (Gibibyte). Be careful — M (Megabyte) and Mi are different (base-10 vs base-2). The convention is to stick with Mi/Gi.

The key point here is that CPU and memory are handled differently:

  - CPU is compressible. A container that tries to exceed its CPU limit gets throttled; it slows down but keeps running.
  - Memory is incompressible. A container that exceeds its memory limit gets OOM-killed.

So it’s safer to set the memory limit with headroom above actual usage, while CPU limits require careful thought. An overly low CPU limit makes applications sluggish during GC or momentary spikes.

How It Affects Scheduling

The Kubernetes scheduler uses a Pod’s requests to find “a node with enough room for this Pod.”

flowchart LR
    A[Pod created<br/>requests: 500m CPU / 1Gi Mem] --> B[Scheduler]
    B --> C{Node 1<br/>Available<br/>200m / 512Mi}
    B --> D{Node 2<br/>Available<br/>800m / 2Gi}
    B --> E{Node 3<br/>Available<br/>1000m / 4Gi}
    C -.->|Rejected| F[Insufficient for request]
    D --> G[Eligible]
    E --> G
    G --> H[Scoring then<br/>final node selection]

One important point: the scheduler judges based on the sum of requests, not actual usage. If a node has 4 vCPUs and Pods have reserved 3.5 vCPUs via requests, then even if actual CPU usage is only 5%, a Pod requesting more than 0.5 vCPU cannot be placed on that node.

So if you set requests too high, you get node waste. Conversely, if you set them too low, Pods pile onto nodes that are actually busy, degrading overall performance. This is why you need to tune based on actual usage from monitoring.
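This is also why tuning starts with comparing reservations against reality. Two commands help here; the second assumes metrics-server is installed (covered in the HPA section below):

# Sum of requests/limits the scheduler has booked on a node
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# Actual current usage, for comparison
kubectl top node <node-name>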

QoS Classes — Who Dies First in an OOM Situation

When a node runs low on memory, the kernel triggers the OOM Killer to forcibly terminate processes. Kubernetes assigns QoS classes to determine which Pod to kill first.

The classification criteria are straightforward:

  - Guaranteed: every container sets both requests and limits, and they are equal
  - Burstable: at least one container sets requests or limits, but the Pod doesn’t meet the Guaranteed criteria
  - BestEffort: no container sets any requests or limits

# Guaranteed
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Burstable
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# BestEffort (no resources block at all)

When a node runs low on memory, the order for picking OOM victims goes like this:

flowchart TB
    A[Node memory pressure] --> B{Any BestEffort Pods?}
    B -->|Yes| C[Evict first]
    B -->|No| D{Any Burstable Pods with<br/>high usage relative to limit?}
    D -->|Yes| E[Evict]
    D -->|No| F[Guaranteed are last to go]

It’s safest to run production workloads as Guaranteed whenever possible. Pods with high restart costs — like databases or caches — should always be Guaranteed. On the other hand, batch jobs or dev tool Pods can remain BestEffort. When resources run short, they get evicted first, protecting critical workloads.
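The assigned class is recorded in the Pod’s status, so you can check it directly:

kubectl get pod app -o jsonpath='{.status.qosClass}'
# Burstable (the earlier example sets requests and limits to different values)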

HPA — Horizontal Autoscaling

When traffic increases, scale out by adding more Pods; when it decreases, scale back in. That’s the role of the HorizontalPodAutoscaler (HPA). It automatically adjusts a Deployment’s replicas based on metrics like CPU utilization.

Let’s look at how the HPA controller periodically queries the metrics-server to make scaling decisions:

sequenceDiagram
    participant H as HPA Controller
    participant MS as metrics-server
    participant P as Pods
    participant D as Deployment
    loop Every 15 seconds
        H->>MS: Query current average CPU utilization
        MS->>P: Collect kubelet metrics
        P-->>MS: CPU: 700m (requests 500m → 140%)
        MS-->>H: Average utilization 140%
        H->>H: 140/70 = 2.0x → Calculate required replicas
        H->>D: spec.replicas = N (scale up)
        D->>P: Create new Pods
    end
    Note over H: Scale-down requests are executed after 5 min stabilization

HPA requires metrics-server to be installed in the cluster. Managed Kubernetes services typically include it by default.
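On a bare cluster (kubeadm, kind, and the like) you can check for it and install it yourself; the manifest URL is the one published with metrics-server releases:

# If this errors, metrics-server is missing
kubectl top nodes

# Install the latest release
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

With the metrics pipeline working, here’s a complete HPA manifest: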

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

Breaking down the configuration:

  - scaleTargetRef: the Deployment whose replica count the HPA manages
  - minReplicas: 2 / maxReplicas: 10: the replica count always stays within this range
  - averageUtilization: 70: target average CPU utilization across the Pods
  - scaleUp: reacts immediately (stabilizationWindowSeconds: 0), adding at most 100% of the current replicas per 60 seconds
  - scaleDown: waits 5 minutes (300 seconds) before scaling in, removing at most 50% of the replicas per 60 seconds

The important thing here is what “70% CPU utilization” is measured against. HPA calculates it relative to requests. If a Pod’s CPU requests are 500m and actual usage is 350m, utilization is 70%. So for HPA to work properly, Pods must have requests set.
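The scaling decision itself follows a simple ratio: desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization). With 4 replicas running at 140% average utilization against a 70% target, that gives ceil(4 * 140 / 70) = 8 replicas.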

Custom Metrics-Based HPA

Scaling on CPU or memory alone is often insufficient. Async workers should scale based on queue length, and API servers are better served by requests per second (RPS). HPA can scale on arbitrary metrics through the Custom Metrics API.

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Maintain an average of 100 RPS per Pod

To provide these metrics, you need to install a component like Prometheus Adapter that exposes Prometheus metrics through the Kubernetes Custom Metrics API. It’s a bit of work upfront, but once set up, you can define practical policies like “add more Pods when latency increases.”
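As a rough sketch, a Prometheus Adapter rule that converts a hypothetical http_requests_total counter into the http_requests_per_second metric above could look like this (the exact config depends on your adapter version and metric labels):

rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

You can then confirm the metric is exposed with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1".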

VPA — Vertical Autoscaling

While HPA scales the number of Pods, the VerticalPodAutoscaler (VPA) adjusts the size (requests/limits) of individual Pods. Something like: “This Pod was declared with 250m CPU, but it consistently uses 400m. Let me bump up its requests.”

Let’s capture how HPA and VPA modify the same Pod along different axes in a single diagram:

flowchart LR
    subgraph HPA_DEMO["HPA (Horizontal)"]
        H1["Pod\n250m / 256Mi"] --> HOUT["Pod x3\n250m / 256Mi each"]
    end
    subgraph VPA_DEMO["VPA (Vertical)"]
        V1["Pod\n250m / 256Mi"] --> VOUT["Pod x1\n500m / 512Mi"]
    end

VPA is not installed by default and must be deployed separately. It has three operating modes:

Mode      Behavior
Off       Only computes recommendations without applying them (analysis only)
Initial   Sets values only at Pod creation time
Auto      Adjusts values by recreating running Pods

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

An important caveat: Do not use VPA and HPA simultaneously on the same CPU/memory metrics. They’ll counteract each other and create erratic scaling loops. If HPA is based on CPU, VPA should only adjust memory or run in Off mode for recommendations only.

In practice, a common pattern is to run VPA with updateMode: "Off" and manually adjust requests based on its recommendations. It’s useful for reducing resource waste.
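After the VPA has observed the workload for a while, the recommendations appear in the object’s status (output condensed, values illustrative):

kubectl describe vpa web-vpa
# Recommendation:
#   Container Recommendations:
#     Container Name:  app
#     Lower Bound:     cpu: 200m / memory: 300Mi
#     Target:          cpu: 400m / memory: 400Mi
#     Upper Bound:     cpu: 1 / memory: 1Gi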

LimitRange — Defaults and Maximums

LimitRange enforces “no smaller than this, no bigger than that” at the namespace level.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: backend
spec:
  limits:
    - type: Container
      default:              # Default limits
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:       # Default requests
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

With this in place, even if a developer creates a Pod without specifying requests/limits, default values are automatically applied. This prevents BestEffort Pods from proliferating. Additionally, requests exceeding max are rejected at the admission stage.
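You can verify the injection by creating a Pod with no resources block and inspecting what admission filled in (probe is just a throwaway name):

kubectl run probe --image=nginx -n backend
kubectl describe pod probe -n backend | grep -E -A 2 "Limits|Requests"
#     Limits:
#       cpu:     500m
#       memory:  512Mi
#     Requests:
#       cpu:     100m
#       memory:  128Mi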

ResourceQuota — Namespace Total Limits

While LimitRange constrains individual Pods, ResourceQuota caps the total resources for an entire namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: backend
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    persistentvolumeclaims: "10"
    services.loadbalancers: "2"
    pods: "50"

The backend team’s namespace cannot exceed a total of 10 vCPUs / 20Gi of memory in requests. LoadBalancer-type Services are limited to 2. Pods are limited to 50.

This prevents a single team from consuming all cluster resources. Conversely, when you’re troubleshooting “why is my Pod stuck in Pending?”, quota exhaustion is often the cause — worth keeping in mind.

# Check current usage
kubectl describe resourcequota -n backend
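# Example output (values illustrative):
# Name:             backend-quota
# Resource          Used   Hard
# --------          ----   ----
# pods              23     50
# requests.cpu      4      10
# requests.memory   8Gi    20Gi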

How to Determine the Right Values

“So what should I set requests and limits to?” is an eternal question. There’s no formula, but a few principles commonly used in practice:

  1. Start loose. For new services, you don’t know the usage pattern, so set generous values
  2. Tighten after load testing. Observe actual usage with Prometheus/Grafana and adjust
  3. Use P95 to P99 usage as your baseline. If you target the average, spikes will blow things up
  4. Set limits to roughly 1.5-2x requests. For JVM apps, set Xmx to 70-80% of the memory limit (see the sketch after this list)
  5. Guaranteed for DBs/caches, Burstable for web/API servers, BestEffort for batch jobs
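
For item 4, here is a minimal sketch of the JVM pattern as a container fragment. It assumes a JVM that supports -XX:MaxRAMPercentage (JDK 10+), which sizes the heap from the container’s memory limit instead of a hard-coded -Xmx; myapp:1.0 is a hypothetical image:

containers:
  - name: app
    image: myapp:1.0
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"   # heap gets ~75% of the memory limit
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"   # ~2x requests, per the rule of thumb above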

Running VPA in Off mode makes this process much easier. It continuously generates recommendations based on usage patterns.

Hands-On — Observing HPA in Action

Let’s see HPA in action with a simple load test. This is a slightly modified version of the official php-apache example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-load
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-load
  template:
    metadata:
      labels:
        app: cpu-load
    spec:
      containers:
        - name: app
          image: registry.k8s.io/hpa-example   # formerly k8s.gcr.io/hpa-example
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
            limits:
              cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: cpu-load
spec:
  selector:
    app: cpu-load
  ports:
    - port: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-load-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-load
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

After applying, generate load from another terminal:

kubectl apply -f hpa-demo.yaml

# Generate load (Ctrl+C to stop)
kubectl run -i --tty load --rm --image=busybox --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://cpu-load; done"

# Watch HPA status from another terminal
kubectl get hpa cpu-load-hpa --watch
# NAME           REFERENCE             TARGETS     MINPODS   MAXPODS   REPLICAS
# cpu-load-hpa   Deployment/cpu-load   0%/50%      1         10        1
# cpu-load-hpa   Deployment/cpu-load   180%/50%    1         10        1
# cpu-load-hpa   Deployment/cpu-load   180%/50%    1         10        4
# cpu-load-hpa   Deployment/cpu-load   120%/50%    1         10        8

Once load begins, CPU utilization exceeds the target, and replicas start increasing shortly after. When you stop the load, it scales back down to 1 after the 5-minute stabilizationWindow.

Try running this simple example the first time you set up autoscaling. It gives you a concrete feel for “so this is how HPA works.”


The next part covers security within the cluster. We’ll look at what ServiceAccounts are, how RBAC splits permissions, and how NetworkPolicy restricts communication between Pods.

-> Part 10: RBAC and Security



