Table of contents
- Why Open Up the Cluster
- Revisiting the Full Architecture
- Control Plane: The Brain of the Cluster
- Worker Node: The Servers That Do the Actual Work
- The Ripple Effect of a Single Deployment
- Getting a Feel for High Availability
Why Open Up the Cluster
In Part 1, we established that Kubernetes automatically reconciles to whatever state you declare. But who actually does the reconciling, and how? To answer that, we need to crack open the cluster internals.
Understanding the cluster’s internal structure gives three benefits. First, when an outage occurs, you can quickly determine which component to suspect. Second, when designing security, you can see what needs to be locked down. Third, it provides the foundation for learning advanced features like operators and custom controllers.
Most books throw in a single diagram and move on. This part takes a more thorough approach, painting a picture of why each component exists and when it springs into action.
Revisiting the Full Architecture
The official documentation organizes the cluster architecture like this. It helps to read along while comparing the diagram below with the Mermaid diagrams we’ll draw.
Source: Kubernetes Official Documentation — CC BY 4.0
Let’s start by drawing a more detailed version of the diagram from Part 1.
flowchart TB
subgraph CP["Control Plane (Master Node)"]
API[API Server]
ETCD[(etcd)]
SCH[Scheduler]
CM[Controller Manager]
CCM[Cloud Controller Manager]
end
subgraph WN["Worker Node"]
KUB[kubelet]
KP[kube-proxy]
CR[Container Runtime<br/>containerd / CRI-O]
POD1[Pod A]
POD2[Pod B]
CR --> POD1
CR --> POD2
KUB --> CR
end
USER[kubectl / CI / Operator] -->|REST API| API
API <-->|watch / write| ETCD
SCH -->|bind pod to node| API
CM -->|reconcile| API
CCM -->|cloud API| API
API <-->|watch / report| KUB
API <-->|watch service| KP
KP -.->|iptables/IPVS rules| POD1
KP -.->|iptables/IPVS rules| POD2
Keep this diagram in mind as we walk through each component.
Control Plane: The Brain of the Cluster
The Control Plane is the collection of components that decides and manages “what state the cluster should be in.” It is often also called the master node. In small clusters it runs on a single machine; in production environments, it’s configured across three or more machines for availability.
API Server — The Single Point of Communication
The API Server is the front door of the cluster. kubectl, CI pipelines, operators, and all internal components communicate through the API Server. If the API Server goes down, the cluster effectively freezes: pods that are already running continue, but new deployments, changes, and even kubectl queries are impossible until it comes back.
The API Server’s role can be summarized in three points:
- Authentication/Authorization: Verifies who sent the request and whether they have permission for that action
- Admission: Validates whether the request conforms to the schema and doesn’t violate policies
- Storage and Distribution: Stores valid requests in etcd and enables interested components to learn about changes
“Enables interested components to learn about changes” is the key part. Kubernetes revolves around a watch mechanism centered on the API Server. Components like the Scheduler, Controller Manager, and kubelet subscribe with “notify me when a resource I care about changes,” and react when changes occur.
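You can see the watch mechanism for yourself. A small sketch, assuming a running cluster and the default namespace: kubectl's --watch flag holds a connection open and streams changes, which is the same primitive the Scheduler, Controller Manager, and kubelet build on.
# Stream pod changes instead of polling; the API Server pushes
# events over the open connection as they happen.
kubectl get pods --watch

# The same mechanism over the raw REST API (requires kubectl proxy,
# which listens on localhost:8001 by default):
kubectl proxy &
curl "http://localhost:8001/api/v1/namespaces/default/pods?watch=true"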
etcd — The Cluster’s Memory
etcd is a distributed key-value store. Pod specs, service configurations, secrets — everything goes here. Only the API Server writes directly to etcd. All other components must go through the API Server.
Why this structure? To maintain the single gateway principle. If multiple components wrote to etcd directly, each would need to implement authorization checks, validation, and audit logging on its own. By routing through the API Server, all of this can be handled consistently in one place.
etcd uses the Raft consensus algorithm (Raft: a distributed consensus protocol in which nodes elect a leader and keep data consistent through majority voting). This is why an odd number of instances, typically 3 or 5, is recommended in production. A majority of members must be alive for writes to succeed, so a 2-node cluster loses quorum the moment one node fails: one out of two is not a majority. Backups are just as important, because losing etcd means losing the cluster's memory.
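As a sketch of what such a backup looks like (the endpoint and certificate paths below are kubeadm defaults and will vary by installation):
# Take a point-in-time snapshot of etcd using the v3 API.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Sanity-check the snapshot before trusting it.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db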
Scheduler — Deciding Where to Place Pods
When a new pod is created, the Scheduler determines “which Worker Node should run it.” Users don’t specify “put this pod on this node” directly in YAML. They simply declare “run this pod.” The actual placement is the Scheduler’s job.
The Scheduler makes decisions in two stages:
- Filtering: Eliminates nodes that can’t run the pod. Nodes with insufficient resources, mismatched taint/tolerations, or nodeSelector conditions that don’t match are excluded from the candidate list
- Scoring: Scores the remaining candidates to determine the best node. It considers rules like distributing resources evenly and avoiding clustering pods of the same service on one node
Once the decision is made, the Scheduler updates the API Server with “this pod is assigned to this node.” The kubelet on that node then detects this change through its watch and actually runs the container. The Scheduler itself doesn’t start containers. It only assigns.
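The inputs the Scheduler consults live in the pod spec itself. A minimal sketch (the label and taint values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:            # Filtering: nodes without this much free CPU/memory are excluded
          cpu: "500m"
          memory: "256Mi"
  nodeSelector:              # Filtering: only nodes carrying this label remain candidates
    disktype: ssd
  tolerations:               # Filtering: lets nodes tainted with this key stay in the list
    - key: "dedicated"
      operator: "Equal"
      value: "web"
      effect: "NoSchedule"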
Controller Manager — The Engine That Converges to Desired State
Inside the Controller Manager, dozens of controllers run. Each controller runs a simple loop for the resources it cares about:
// simplified controller loop (helper names are illustrative)
for {
    desired := fetchDesiredState()  // "what should be", read from the API Server
    actual := observeActualState()  // the actual state of the cluster
    if !equal(desired, actual) {
        reconcile(desired, actual)  // send change requests to the API Server
    }
}
This is called the Reconciliation Loop. It’s the heartbeat pattern of Kubernetes.
For example, the Deployment controller ensures “this Deployment should have one ReplicaSet,” and the ReplicaSet controller ensures “this ReplicaSet should have N pods.” If a pod dies, the ReplicaSet controller detects it, notes “there are currently 4 but there should be 5,” and requests the API Server to “create 1 more pod.”
Thanks to this principle, self-healing works. Users just write “there should be 5,” and the controller handles the actual reconciliation.
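In manifest form, that entire contract is one field. A minimal sketch:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5                # the declared desired state
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: nginx:1.27
Delete one of the pods by hand and the ReplicaSet controller recreates it within seconds; no imperative recovery logic was ever written.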
Cloud Controller Manager — Cloud Vendor Integration
This is the integration point needed when running Kubernetes on clouds like AWS, GCP, or Azure. For instance, when you create a Service of type LoadBalancer, the Cloud Controller Manager actually provisions an AWS ELB. In on-premises environments, this component is either absent or replaced by alternatives like MetalLB.
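For example, a Service like the following is what triggers that provisioning (a sketch; the name and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer         # the Cloud Controller Manager sees this and calls
  selector:                  # the cloud API to provision an external load balancer
    app: web
  ports:
    - port: 80
      targetPort: 8080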
Worker Node: The Servers That Do the Actual Work
If the Control Plane is the conductor, Worker Nodes are the musicians. Actual application containers run here.
kubelet — The Node’s Agent
kubelet is an agent that runs one per Worker Node. It communicates with the API Server, watches “which pods are assigned to my node,” and manages whether those pods are actually running.
Here’s a summary of what kubelet does:
- Periodically reports “I’m alive” to the API Server (node heartbeat)
- Requests the container runtime to run pods assigned to it
- Restarts containers when they die
- Performs health checks (liveness/readiness probes) and reports results (see the sketch after this list)
- Reports pod status and events to the API Server, and exposes endpoints for logs and metrics
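Probes are declared per container. A minimal sketch (the endpoint paths and timings are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.27
      livenessProbe:         # kubelet restarts the container when this fails
        httpGet:
          path: /healthz
          port: 80
        periodSeconds: 10
      readinessProbe:        # kubelet pulls the pod out of Service endpoints when this fails
        httpGet:
          path: /ready
          port: 80
        periodSeconds: 5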
An important point is that kubelet doesn’t interact with Docker directly. It communicates with the container runtime through a standard interface called CRI (Container Runtime Interface).
Container Runtime — The Actual Container Executor
The container runtime’s job is to pull container images and run them. Docker used to be the de facto standard, but Kubernetes 1.24 removed the built-in Docker integration (dockershim), and most environments now run containerd or CRI-O.
containerd was originally a component inside Docker itself: the container execution engine, extracted from the larger Docker tooling and standardized. When kubelet asks containerd via CRI to “pull this image and start a container,” containerd pulls the image and uses runc to launch the actual process.
Here’s the layered architecture in diagram form:
flowchart TB
KUB[kubelet] -->|CRI| CTR[containerd / CRI-O]
CTR -->|OCI| RUNC[runc]
RUNC -->|syscalls| LINUX[Linux Kernel<br/>cgroups / namespaces]
Understanding this layering makes debugging failures easier. If a container doesn’t start, work down the stack in order: kubelet first, then containerd, then the runtime level, as sketched below.
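On a systemd-based node, that inspection order looks roughly like this (crictl must be installed separately; the container ID placeholder is left for you to fill in):
# 1. kubelet: did it even attempt to start the container?
journalctl -u kubelet --since "10 min ago"

# 2. containerd: what does the runtime itself think is running?
crictl ps -a                 # includes exited containers
crictl logs <container-id>

# 3. containerd service logs, where runc-level errors usually surface
journalctl -u containerd --since "10 min ago"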
kube-proxy — The Service Network Worker
We’ll cover this in detail in Part 5, but briefly, kube-proxy is the component that implements the Service abstraction on nodes. When a user creates a Service, kube-proxy adjusts the node’s iptables rules (or IPVS) so that traffic arriving at the service IP gets forwarded to actual pod IPs.
Thanks to this, even when pods die and new ones come up with different IPs, the Service IP remains stable. This is why you access pods through Services rather than pointing to them directly.
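You can observe this indirection on a running cluster (a sketch; "web" stands in for your Service’s name, and the last command assumes kube-proxy’s iptables mode):
# The ClusterIP stays stable across pod restarts...
kubectl get service web

# ...while the pod IPs behind it churn
kubectl get endpoints web

# On a node, kube-proxy's rules map the Service to those pod IPs
sudo iptables-save | grep web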
The Ripple Effect of a Single Deployment
Let’s trace a single deployment to see how all these components work together.
sequenceDiagram
participant U as User
participant API as API Server
participant E as etcd
participant DC as Deployment Controller
participant RC as ReplicaSet Controller
participant S as Scheduler
participant K as kubelet
participant CR as containerd
U->>API: kubectl apply -f deployment.yaml
API->>E: Store Deployment
API-->>U: 201 Created
DC->>API: Watch Deployment
DC->>API: Request ReplicaSet creation
API->>E: Store ReplicaSet
RC->>API: Watch ReplicaSet
RC->>API: Request N Pod creation
API->>E: Store Pods (nodeName unset)
S->>API: Watch unassigned Pods
S->>API: Bind Pod to nodeName
K->>API: Watch Pods on own node
K->>CR: Request container execution
CR-->>K: Execution complete
K->>API: Report Pod status as Running
API->>E: Update status
The key insight of this flow is that no one commands anyone else directly. Everyone only talks to the API Server and watches for changes in the resources they care about. Thanks to this distributed cooperation, even if one component goes down briefly, the entire system doesn’t halt.
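You can watch this chain fire in real time. A sketch, assuming deployment.yaml is the manifest you are applying:
# Terminal 1: stream the cascade of objects as they appear
kubectl get deployments,replicasets,pods --watch

# Terminal 2: kick it off
kubectl apply -f deployment.yaml

# Afterwards, the event log shows each component's contribution:
# scheduler binding, image pulls, container starts
kubectl get events --sort-by=.metadata.creationTimestamp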
Getting a Feel for High Availability
In real production environments, Control Plane components are configured across multiple machines. etcd runs on 3 or 5 instances, and multiple API Server instances sit behind a load balancer. The Scheduler and Controller Manager use leader election so that only one instance is active at a time.
Worker Nodes are also typically configured with at least three. By distributing pods across multiple nodes, if one node dies, replicas on other nodes keep the service running.
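That spreading is itself declarative. A minimal sketch using topologySpreadConstraints (the labels are illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # pod counts per node may differ by at most 1
          topologyKey: kubernetes.io/hostname  # spread across individual nodes
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: app
          image: nginx:1.27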
You don’t need to understand this complex setup from the start. Just remember that there’s a perspective of “how should each component be distributed to eliminate single points of failure.”
In the next part, we’ll dive deep into pods themselves. We’ll practice and learn why the pod — not the container — is the unit of deployment, and what the sidecar pattern of putting multiple containers in one pod is all about.
-> Part 3: Pod



