Table of contents
- Why You Must Specify Resources
- requests and limits
- How It Affects Scheduling
- QoS Classes — Who Dies First in an OOM Situation
- HPA — Horizontal Autoscaling
- VPA — Vertical Autoscaling
- LimitRange — Defaults and Maximums
- ResourceQuota — Namespace Total Limits
- How to Determine the Right Values
- Hands-On — Observing HPA in Action
Why You Must Specify Resources
The moment you create a Pod in Kubernetes, the cluster scheduler asks one question: “Which node should this Pod be placed on?”
To answer that, it needs to know how many resources the Pod will consume: how much CPU it requires and how much memory it needs. Without this information, the scheduler just places it on an arbitrary node. If you’re unlucky, resource-hungry Pods pile up on a single node and start getting killed off by OOM errors.
That’s why Kubernetes has you specify resources in the Pod spec. These values form the foundation for scheduling, OOM handling, and autoscaling. Misconfigure them, and you end up in a strange situation where nodes have plenty of headroom but Pods are starved.
requests and limits
Resource specification has two axes:
- requests: The minimum guarantee — “I need at least this much.” The basis for scheduler node selection
- limits: The upper bound — “Don’t use more than this.” Enforced by kubelet/runtime
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:1.0
    resources:
      requests:
        cpu: "250m"      # 0.25 vCPU
        memory: "256Mi"
      limits:
        cpu: "500m"      # 0.5 vCPU
        memory: "512Mi"
CPU uses the m (millicore) unit. 1000m equals 1 vCPU. 250m means 0.25 cores. Memory uses Mi (Mebibyte) and Gi (Gibibyte). Be careful — M (Megabyte) and Mi are different (base-10 vs base-2). The convention is to stick with Mi/Gi.
The key point here is that CPU and memory are handled differently:
- CPU: A compressible resource. When the limit is exceeded, the Pod gets throttled. It slows down but doesn’t die
- Memory: An incompressible resource. When the limit is exceeded, the Pod gets OOMKilled. It simply dies
So it’s safer to set the memory limit with headroom above actual usage, while CPU limits require careful thought. An overly low CPU limit makes applications sluggish during GC or momentary spikes.
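One way to tell which of the two happened to a struggling container is to check its last terminated state; the pod name app below matches the earlier example:
# Prints "OOMKilled" if the container was killed for breaching its memory limit
kubectl get pod app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# CPU throttling never terminates the container; it shows up instead in cAdvisor
# metrics such as container_cpu_cfs_throttled_periods_total (e.g. via Prometheus)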
How It Affects Scheduling
The Kubernetes scheduler uses a Pod’s requests to find “a node with enough room for this Pod.”
flowchart LR
A[Pod created<br/>requests: 500m CPU / 1Gi Mem] --> B[Scheduler]
B --> C{Node 1<br/>Available<br/>200m / 512Mi}
B --> D{Node 2<br/>Available<br/>800m / 2Gi}
B --> E{Node 3<br/>Available<br/>1000m / 4Gi}
C -.->|Rejected| F[Insufficient for request]
D --> G[Eligible]
E --> G
G --> H[Scoring then<br/>final node selection]
One important point: the scheduler judges based on the sum of requests, not actual usage. If a node has 4 vCPUs and Pods have reserved 3.5 vCPUs via requests, then even if actual CPU usage is only 5%, a Pod requesting more than the remaining 0.5 vCPU (say, 600m) cannot be placed on that node.
So if you set requests too high, you get node waste. Conversely, if you set them too low, Pods pile onto nodes that are actually busy, degrading overall performance. This is why you need to tune based on actual usage from monitoring.
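You can see this requests-based bookkeeping on any node: the Allocated resources section of kubectl describe node sums up requests, while kubectl top shows actual usage (substitute a node name from your cluster):
# Sum of Pod requests vs. allocatable capacity on the node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
# Actual usage, for comparison (needs metrics-server)
kubectl top node <node-name>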
QoS Classes — Who Dies First in an OOM Situation
When a node runs low on memory, the kernel triggers the OOM Killer to forcibly terminate processes. Kubernetes assigns QoS classes to determine which Pod to kill first.
The classification criteria are straightforward:
- Guaranteed: All containers have requests == limits, and both CPU and memory are specified
- Burstable: At least one request is set, but the Guaranteed criteria aren’t fully met
- BestEffort: No requests/limits at all
# Guaranteed
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Burstable
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# BestEffort (no resources block at all)
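Kubernetes records the resulting class on each Pod, so it’s easy to verify which one a given spec ended up with (assuming the Pod from the earlier example, named app):
# Prints Guaranteed, Burstable, or BestEffort
kubectl get pod app -o jsonpath='{.status.qosClass}'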
When a node runs low on memory, the order for picking OOM victims goes like this:
flowchart TB
A[Node memory pressure] --> B{Any BestEffort Pods?}
B -->|Yes| C[Evict first]
B -->|No| D{Any Burstable Pods with<br/>high usage relative to limit?}
D -->|Yes| E[Evict]
D -->|No| F[Guaranteed are last to go]
It’s safest to run production workloads as Guaranteed whenever possible. Pods with high restart costs — like databases or caches — should always be Guaranteed. On the other hand, batch jobs or dev tool Pods can remain BestEffort. When resources run short, they get evicted first, protecting critical workloads.
HPA — Horizontal Autoscaling
When traffic increases, scale out by adding more Pods; when it decreases, scale back in. That’s the role of the HorizontalPodAutoscaler (HPA). It automatically adjusts a Deployment’s replicas based on metrics like CPU utilization.
Let’s look at how the HPA controller periodically queries the metrics-server to make scaling decisions:
sequenceDiagram
participant H as HPA Controller
participant MS as metrics-server
participant P as Pods
participant D as Deployment
loop Every 15 seconds
H->>MS: Query current average CPU utilization
MS->>P: Collect kubelet metrics
P-->>MS: CPU: 700m (requests 500m → 140%)
MS-->>H: Average utilization 140%
H->>H: 140/70 = 2.0x → Calculate required replicas
H->>D: spec.replicas = N (scale up)
D->>P: Create new Pods
end
Note over H: Scale-down requests are executed after 5 min stabilization
HPA requires metrics-server to be installed in the cluster. Managed Kubernetes services typically include it by default.
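If you’re not sure whether metrics-server is present, two quick checks:
# metrics-server normally runs in kube-system
kubectl -n kube-system get deployment metrics-server
# If this prints numbers instead of an error, the metrics API is working
kubectl top pods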
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
Breaking down the configuration:
- min/max Replicas: Never goes below 2, never exceeds 10
- target 70%: Scales up when the average CPU utilization across all Pods exceeds 70% of requests
- scaleUp: Reacts immediately, can increase by up to 100% every minute (2 -> 4 -> 8)
- scaleDown: Waits 5 minutes for metrics to stabilize before scaling down, reduces by up to 50% per minute
The important thing here is what “70% CPU utilization” is measured against. HPA calculates it relative to requests. If a Pod’s CPU requests are 500m and actual usage is 350m, utilization is 70%. So for HPA to work properly, Pods must have requests set.
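For reference, the target replica count follows the formula from the Kubernetes docs; for example, 2 replicas averaging 140% utilization against a 70% target become 4:
desiredReplicas = ceil( currentReplicas * currentUtilization / targetUtilization )
                = ceil( 2 * 140 / 70 ) = 4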
Custom Metrics-Based HPA
Scaling on CPU or memory alone is often insufficient. Async workers should scale based on queue length, and API servers are better served by requests per second (RPS). HPA can scale on arbitrary metrics through the Custom Metrics API.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # Maintain an average of 100 RPS per Pod
To provide these metrics, you need to install a component like Prometheus Adapter that exposes Prometheus metrics through the Kubernetes Custom Metrics API. It’s a bit of work upfront, but once set up, you can define practical policies like “add more Pods when latency increases.”
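Once the adapter is running, you can confirm that the Custom Metrics API actually exposes your metric before pointing an HPA at it (jq is used here only for readability):
# Lists every metric served through the Custom Metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'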
VPA — Vertical Autoscaling
While HPA scales the number of Pods, the VerticalPodAutoscaler (VPA) adjusts the size (requests/limits) of individual Pods. Something like: “This Pod was declared with 250m CPU, but it consistently uses 400m. Let me bump up its requests.”
Let’s capture how HPA and VPA modify the same Pod along different axes in a single diagram:
flowchart LR
subgraph HPA_DEMO["HPA (Horizontal)"]
H1["Pod<br/>250m / 256Mi"] --> HOUT["Pod x3<br/>250m / 256Mi each"]
end
subgraph VPA_DEMO["VPA (Vertical)"]
V1["Pod<br/>250m / 256Mi"] --> VOUT["Pod x1<br/>500m / 512Mi"]
end
VPA is not installed by default and must be deployed separately. It has three operating modes:
| Mode | Behavior |
|---|---|
| Off | Only computes recommendations without applying them (analysis only) |
| Initial | Sets values only at Pod creation time |
| Auto | Adjusts values by recreating running Pods |
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
An important caveat: Do not use VPA and HPA simultaneously on the same CPU/memory metrics. They’ll counteract each other and create erratic scaling loops. If HPA is based on CPU, VPA should only adjust memory or run in Off mode for recommendations only.
In practice, a common pattern is to run VPA with updateMode: "Off" and manually adjust requests based on its recommendations. It’s useful for reducing resource waste.
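The recommendations live in the VPA object’s status, so reading them is a single describe:
# The Recommendation section lists lower bound / target / upper bound per container
kubectl describe vpa web-vpa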
LimitRange — Defaults and Maximums
LimitRange enforces “no smaller than this, no bigger than that” at the namespace level.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: backend
spec:
  limits:
  - type: Container
    default:          # Default limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:   # Default requests
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
With this in place, even if a developer creates a Pod without specifying requests/limits, default values are automatically applied. This prevents BestEffort Pods from proliferating. Additionally, requests exceeding max are rejected at the admission stage.
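To confirm the defaults are taking effect, describe the LimitRange, or inspect a Pod that was created without a resources block; the admission-injected values show up in its spec:
# Shows the configured default/min/max values
kubectl describe limitrange default-limits -n backend
# A Pod created without resources now carries the injected defaults
kubectl get pod <pod-name> -n backend -o jsonpath='{.spec.containers[0].resources}'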
ResourceQuota — Namespace Total Limits
While LimitRange constrains individual Pods, ResourceQuota caps the total resources for an entire namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: backend
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    persistentvolumeclaims: "10"
    services.loadbalancers: "2"
    pods: "50"
The backend team’s namespace cannot exceed a total of 10 vCPUs / 20Gi of memory in requests. LoadBalancer-type Services are limited to 2. Pods are limited to 50.
This prevents a single team from consuming all cluster resources. Conversely, when you’re troubleshooting “why is my Pod stuck in Pending?”, quota exhaustion is often the cause — worth keeping in mind.
# Check current usage
kubectl describe resourcequota -n backend
How to Determine the Right Values
“So what should I set requests and limits to?” is an eternal question. There’s no formula, but a few principles commonly used in practice:
- Start loose. For new services, you don’t know the usage pattern, so set generous values
- Tighten after load testing. Observe actual usage with Prometheus/Grafana and adjust
- Use P95 to P99 usage as your baseline. If you target the average, spikes will blow things up
- Set limits to roughly 1.5-2x requests. For JVM apps, set Xmx to 70-80% of the memory limit (see the sketch below)
- Guaranteed for DBs/caches, Burstable for web/API servers, BestEffort for batch jobs
Running VPA in Off mode makes this process much easier. It continuously generates recommendations based on usage patterns.
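As a sketch of the JVM rule of thumb from the list above (the image name and values are illustrative, not a prescription): a 512Mi memory limit at roughly 2x the request, with the heap capped at about 75% of the limit to leave room for metaspace, threads, and off-heap memory.
containers:
- name: api
  image: my-java-app:1.0        # placeholder image
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "512Mi"           # ~2x the request
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xmx384m"           # ~75% of the 512Mi limit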
Hands-On — Observing HPA in Action
Let’s see HPA in action with a simple load test. This is a slightly modified version of the official php-apache example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-load
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-load
  template:
    metadata:
      labels:
        app: cpu-load
    spec:
      containers:
      - name: app
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
          limits:
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: cpu-load
spec:
  selector:
    app: cpu-load
  ports:
  - port: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-load-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-load
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
After applying, generate load from another terminal:
kubectl apply -f hpa-demo.yaml
# Generate load (Ctrl+C to stop)
kubectl run -i --tty load --rm --image=busybox --restart=Never \
-- /bin/sh -c "while sleep 0.01; do wget -q -O- http://cpu-load; done"
# Watch HPA status from another terminal
kubectl get hpa cpu-load-hpa --watch
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# cpu-load-hpa Deployment/cpu-load 0%/50% 1 10 1
# cpu-load-hpa Deployment/cpu-load 180%/50% 1 10 1
# cpu-load-hpa Deployment/cpu-load 180%/50% 1 10 4
# cpu-load-hpa Deployment/cpu-load 120%/50% 1 10 8
Once load begins, CPU utilization exceeds the target, and replicas start increasing shortly after. When you stop the load, it scales back down to 1 after the 5-minute stabilizationWindow.
Try running this simple example the first time you set up autoscaling. It gives you a concrete feel for “so this is how HPA works.”
The next part covers security within the cluster. We’ll look at what ServiceAccounts are, how RBAC splits permissions, and how NetworkPolicy restricts communication between Pods.



