ioob.dev

Kubernetes Beginner Series 2 — Cluster Architecture

· 8 min read
Kubernetes Series (2/12)
  1. Kubernetes Beginner Series 1 — What Is Kubernetes
  2. Kubernetes Beginner Series 2 — Cluster Architecture
  3. Kubernetes Beginner Series 3 — Pod
  4. Kubernetes Beginner Series 4 — Controllers
  5. Kubernetes Beginner Series 5 — Services and Networking
  6. Kubernetes Beginner Series 6 — Ingress and Gateway API
  7. Kubernetes Beginner Series 7 — ConfigMap and Secret
  8. Kubernetes Beginner Series 8 — Storage: PV, PVC, StorageClass
  9. Kubernetes Beginner Series 9 — Resource Management and Autoscaling
  10. Kubernetes Beginner Series 10 — RBAC and Security: The Principle of Least Privilege
  11. Kubernetes Beginner Series 11 — Observability: Logs, Metrics, and Traces
  12. Kubernetes Beginner Series 12 — Helm and Package Management
Why Open Up the Cluster

In Part 1, we established that Kubernetes automatically reconciles to whatever state you declare. But who actually does the reconciling, and how? To answer that, we need to crack open the cluster internals.

Understanding the cluster’s internal structure gives three benefits. First, when an outage occurs, you can quickly determine which component to suspect. Second, when designing security, you can see what needs to be locked down. Third, it provides the foundation for learning advanced features like operators and custom controllers.

Most books throw in a single diagram and move on. This part takes a more thorough approach, painting a picture of why each component exists and when it springs into action.

Revisiting the Full Architecture

The official documentation organizes the cluster architecture like this. It helps to read along while comparing the diagram below with the Mermaid diagrams we’ll draw.

Kubernetes cluster architecture official diagram — internal structure of the Control Plane and Worker Nodes

Source: Kubernetes Official Documentation — CC BY 4.0

Let’s start by drawing a more detailed version of the diagram from Part 1.

flowchart TB
    subgraph CP["Control Plane (Master Node)"]
        API[API Server]
        ETCD[(etcd)]
        SCH[Scheduler]
        CM[Controller Manager]
        CCM[Cloud Controller Manager]
    end

    subgraph WN["Worker Node"]
        KUB[kubelet]
        KP[kube-proxy]
        CR[Container Runtime<br/>containerd / CRI-O]
        POD1[Pod A]
        POD2[Pod B]
        CR --> POD1
        CR --> POD2
        KUB --> CR
    end

    USER[kubectl / CI / Operator] -->|REST API| API
    API <-->|watch / write| ETCD
    SCH -->|bind pod to node| API
    CM -->|reconcile| API
    CCM -->|cloud API| API
    API <-->|watch / report| KUB
    API <-->|watch service| KP
    KP -.->|iptables/IPVS rules| POD1
    KP -.->|iptables/IPVS rules| POD2

Keep this diagram in mind as we walk through each component.

Control Plane: The Brain of the Cluster

The Control Plane is the collection of components that decides and manages “what state the cluster should be in.” It was historically called the master node, and you’ll still see that term. In small clusters it runs on a single machine; in production environments, it’s spread across three or more machines for availability.

API Server — The Single Point of Communication

The API Server is the front door of the cluster. kubectl, CI pipelines, operators, and all internal components communicate through the API Server. If the API Server goes down, the cluster effectively freezes: Pods that are already running continue, but new deployments or changes are impossible, and you can no longer query cluster state.

The API Server’s role can be summarized in three points:

  1. Authenticates, authorizes, and validates every request
  2. Persists the resulting state to etcd — it is the only component that talks to etcd directly
  3. Enables interested components to learn about changes

“Enables interested components to learn about changes” is the key part. Kubernetes revolves around a watch mechanism centered on the API Server. Components like the Scheduler, Controller Manager, and kubelet subscribe with “notify me when a resource I care about changes,” and react when changes occur.
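The watch pattern is easier to grasp as code. Below is a toy Python sketch — not the real Kubernetes API or client library — where components register interest in a resource kind and the “API server” notifies them on every change:

```python
# Toy sketch of the watch pattern (not the real Kubernetes API):
# components register interest in a resource kind, and the "API server"
# notifies them whenever a resource of that kind changes.

from collections import defaultdict

class ToyAPIServer:
    def __init__(self):
        self.store = {}                    # stands in for etcd
        self.watchers = defaultdict(list)  # resource kind -> callbacks

    def watch(self, kind, callback):
        self.watchers[kind].append(callback)

    def apply(self, kind, name, spec):
        self.store[(kind, name)] = spec      # persist the change first
        for notify in self.watchers[kind]:   # then fan out to watchers
            notify(name, spec)

api = ToyAPIServer()
events = []
api.watch("Pod", lambda name, spec: events.append((name, spec)))
api.apply("Pod", "web-1", {"image": "nginx"})
print(events)  # [('web-1', {'image': 'nginx'})]
```

The real mechanism is an HTTP streaming connection against the API, but the shape is the same: nobody polls in a tight loop, and nobody commands anyone directly — changes are persisted once and fanned out to whoever subscribed.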

etcd — The Cluster’s Memory

etcd is a distributed key-value store. Pod specs, service configurations, secrets — everything goes here. Only the API Server writes directly to etcd. All other components must go through the API Server.

Why this structure? To maintain the single gateway principle. If multiple components wrote to etcd directly, each would need to implement authorization checks, validation, and audit logging on its own. By routing through the API Server, all of this can be handled consistently in one place.

etcd uses the Raft consensus algorithm (Raft — a distributed consensus protocol where multiple nodes elect a leader and maintain data consistency through majority voting). This is why an odd number like 3 or 5 instances is recommended in production. Since more than half the nodes must be alive for writes to succeed, a 2-node setup means a single failure halts the cluster. Backups are also important — losing etcd means losing the cluster’s memory.
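The quorum arithmetic behind the “odd number” recommendation fits in a few lines. A sketch:

```python
# Raft quorum arithmetic: a write succeeds only if a strict majority
# of members acknowledge it. This is why 3 or 5 etcd members are
# recommended over 2 or 4 — an even count adds cost, not tolerance.

def quorum(members: int) -> int:
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    return members - quorum(members)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that 2 members tolerate zero failures (worse than 1 in practice, since either crash halts writes), and 4 members tolerate only one failure — exactly the same as 3.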

Scheduler — Deciding Where to Place Pods

When a new pod is created, the Scheduler determines “which Worker Node should run it.” Users don’t specify “put this pod on this node” directly in YAML. They simply declare “run this pod.” The actual placement is the Scheduler’s job.

The Scheduler makes decisions in two stages:

  1. Filtering: Eliminates nodes that can’t run the pod. Nodes with insufficient resources, mismatched taint/tolerations, or nodeSelector conditions that don’t match are excluded from the candidate list
  2. Scoring: Scores the remaining candidates to determine the best node. It considers rules like distributing resources evenly and avoiding clustering pods of the same service on one node

Once the decision is made, the Scheduler updates the API Server with “this pod is assigned to this node.” The kubelet on that node then detects this change through its watch and actually runs the container. The Scheduler itself doesn’t start containers. It only assigns.
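The two-stage decision can be sketched in miniature. This toy version filters on a single resource (free CPU) and scores by headroom; the real Scheduler runs many filter and score plugins, but the shape is the same:

```python
# Toy two-stage scheduler: filter out nodes that can't fit the pod,
# then score the survivors (here: most free CPU wins).

def schedule(pod_cpu, nodes):
    # Filtering: drop nodes without enough free CPU
    feasible = [n for n in nodes if n["free_cpu"] >= pod_cpu]
    if not feasible:
        return None  # no node fits: the pod stays Pending
    # Scoring: prefer the node with the most headroom
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [
    {"name": "node-a", "free_cpu": 0.5},
    {"name": "node-b", "free_cpu": 2.0},
    {"name": "node-c", "free_cpu": 1.0},
]
print(schedule(1.0, nodes))  # node-b
print(schedule(4.0, nodes))  # None -> pod would stay Pending
```

The `None` case is worth remembering: when no node passes filtering, the pod simply stays in `Pending`, which is one of the most common symptoms you’ll debug with `kubectl describe pod`.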

Controller Manager — The Engine That Converges to Desired State

Inside the Controller Manager, dozens of controllers run. Each controller runs a simple loop for the resources it cares about:

while true {
    desired := Read "what should be" from the API Server
    actual  := Check the actual state of the cluster
    if desired != actual {
        Perform actions to reconcile (send change requests to API Server)
    }
}

This is called the Reconciliation Loop. It’s the heartbeat pattern of Kubernetes.

For example, the Deployment controller ensures “this Deployment should have one ReplicaSet,” and the ReplicaSet controller ensures “this ReplicaSet should have N pods.” If two of five pods die, the ReplicaSet controller detects it, notes “there are currently 3 but there should be 5,” and requests the API Server to “create 2 more pods.”

Thanks to this principle, self-healing works. Users just write “there should be 5,” and the controller handles the actual reconciliation.
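One iteration of the pseudocode above, concretized for a ReplicaSet-style controller, might look like this (a sketch, not the actual controller code):

```python
# One reconcile pass: compare the desired replica count with the
# actual pods and emit the minimal corrective requests.

def reconcile(desired: int, actual: list) -> list:
    diff = desired - len(actual)
    if diff > 0:
        return [f"create pod-{i}" for i in range(diff)]   # scale up
    if diff < 0:
        return [f"delete {name}" for name in actual[:-diff]]  # scale down
    return []  # converged: desired == actual, nothing to do

print(reconcile(5, ["pod-a", "pod-b", "pod-c", "pod-d"]))  # ['create pod-0']
print(reconcile(5, ["pod-a", "pod-b", "pod-c", "pod-d", "pod-e"]))  # []
```

Note the controller doesn’t create pods itself — it only sends requests to the API Server, and the rest of the machinery (Scheduler, kubelet) takes over from there.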

Cloud Controller Manager — Cloud Vendor Integration

This is the integration point needed when running Kubernetes on clouds like AWS, GCP, or Azure. For instance, when you create a Service of type LoadBalancer, the Cloud Controller Manager actually provisions an AWS ELB. In on-premises environments, this component is either absent or replaced by alternatives like MetalLB.

Worker Node: The Servers That Do the Actual Work

If the Control Plane is the conductor, Worker Nodes are the musicians. Actual application containers run here.

kubelet — The Node’s Agent

kubelet is an agent that runs one per Worker Node. It communicates with the API Server, watches “which pods are assigned to my node,” and manages whether those pods are actually running.

Here’s a summary of what kubelet does:

  1. Watches the API Server for pods bound to its node
  2. Asks the container runtime to start, stop, and restart those pods’ containers
  3. Runs liveness/readiness probes and restarts containers according to the restart policy
  4. Reports pod and node status back to the API Server

An important point is that kubelet doesn’t interact with Docker directly. It communicates with the container runtime through a standard interface called CRI (Container Runtime Interface).
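Why does this interface matter? Because the kubelet codes against CRI rather than any specific runtime, runtimes can be swapped without touching the kubelet. A toy sketch (the real CRI is a gRPC API, and these class names are illustrative):

```python
# Sketch of why CRI matters: the kubelet depends only on a small
# runtime interface, so any runtime that implements it can be
# plugged in without changing the kubelet.

from abc import ABC, abstractmethod

class ContainerRuntime(ABC):  # stands in for the CRI gRPC API
    @abstractmethod
    def run_container(self, image: str) -> str: ...

class Containerd(ContainerRuntime):
    def run_container(self, image: str) -> str:
        return f"containerd started {image} via runc"

class CRIO(ContainerRuntime):
    def run_container(self, image: str) -> str:
        return f"cri-o started {image} via runc"

def kubelet_sync_pod(runtime: ContainerRuntime, image: str) -> str:
    # The kubelet never knows (or cares) which runtime it's talking to
    return runtime.run_container(image)

print(kubelet_sync_pod(Containerd(), "nginx:1.27"))
print(kubelet_sync_pod(CRIO(), "nginx:1.27"))
```

This is exactly how the ecosystem moved off Docker with minimal disruption: the interface stayed, only the implementation behind it changed.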

Container Runtime — The Actual Container Executor

The container runtime’s job is to pull container images and run them. Docker used to be the standard, but Kubernetes 1.24 removed the built-in Docker integration (dockershim), and now most environments use containerd or CRI-O.

containerd was actually a component used inside Docker itself. It’s the container execution engine extracted and standardized from the larger Docker tool. When kubelet requests containerd via CRI to “pull this image and start a container,” containerd pulls the image and uses runc to launch the actual process.

Here’s the layered architecture in diagram form:

flowchart TB
    KUB[kubelet] -->|CRI| CTR[containerd / CRI-O]
    CTR -->|OCI| RUNC[runc]
    RUNC -->|syscalls| LINUX[Linux Kernel<br/>cgroups / namespaces]

Understanding this makes debugging failures easier. If a container doesn’t start, work down the stack: check kubelet logs first, then the container runtime’s logs.

kube-proxy — The Service Network Worker

We’ll cover this in detail in Part 5, but briefly, kube-proxy is the component that implements the Service abstraction on nodes. When a user creates a Service, kube-proxy adjusts the node’s iptables rules (or IPVS) so that traffic arriving at the service IP gets forwarded to actual pod IPs.

Thanks to this, even when pods die and new ones come up with different IPs, the Service IP remains stable. This is why you access pods through Services rather than pointing to them directly.
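Stripped of iptables details, the state kube-proxy maintains is just a mapping from a stable service IP to the current set of pod IPs. A toy sketch (the real thing rewrites kernel rules, not Python dicts):

```python
# Toy version of the state kube-proxy maintains: a stable service IP
# mapped to the current backend pod IPs. Pods churn; the service IP
# that clients use never changes.

service = {"ip": "10.96.0.10", "backends": ["10.244.1.5", "10.244.2.7"]}

def forward(dst_ip: str, packet: str) -> str:
    if dst_ip == service["ip"]:
        # pick one backend (real kube-proxy balances via iptables/IPVS)
        i = hash(packet) % len(service["backends"])
        return service["backends"][i]
    return dst_ip  # not a service IP: deliver as-is

# A pod dies and its replacement comes up with a new IP:
# kube-proxy rewrites the rules, but the service IP stays the same.
service["backends"] = ["10.244.1.5", "10.244.3.9"]
print(forward("10.96.0.10", "req-1"))  # still lands on a live pod
```

Clients keep sending traffic to `10.96.0.10` throughout; only the hidden backend list changes underneath them.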

The Ripple Effect of a Single Deployment

Let’s trace a single deployment to see how all these components work together.

sequenceDiagram
    participant U as User
    participant API as API Server
    participant E as etcd
    participant DC as Deployment Controller
    participant RC as ReplicaSet Controller
    participant S as Scheduler
    participant K as kubelet
    participant CR as containerd

    U->>API: kubectl apply deployment.yaml
    API->>E: Store Deployment
    API-->>U: 201 Created
    DC->>API: Watch Deployment
    DC->>API: Request ReplicaSet creation
    API->>E: Store ReplicaSet
    RC->>API: Watch ReplicaSet
    RC->>API: Request N Pod creation
    API->>E: Store Pods (nodeName unset)
    S->>API: Watch unassigned Pods
    S->>API: Bind Pod to nodeName
    K->>API: Watch Pods on own node
    K->>CR: Request container execution
    CR-->>K: Execution complete
    K->>API: Report Pod status as Running
    API->>E: Update status

The key insight of this flow is that no one commands anyone else directly. Everyone only talks to the API Server and watches for changes in the resources they care about. Thanks to this distributed cooperation, even if one component goes down briefly, the entire system doesn’t halt.

Getting a Feel for High Availability

In real production environments, Control Plane components are configured across multiple machines. etcd runs on 3 or 5 instances, and multiple API Server instances sit behind a load balancer. The Scheduler and Controller Manager use leader election so that only one instance is active at a time.
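The leader-election idea reduces to a small lease protocol: whoever holds the lease is active, everyone else stands by. A toy in-memory sketch (Kubernetes actually stores these as Lease objects through the API Server):

```python
# Toy lease-based leader election, the pattern Scheduler and
# Controller Manager replicas use: only the lease holder is active.

class Lease:
    def __init__(self):
        self.holder = None

    def try_acquire(self, candidate: str) -> bool:
        if self.holder is None:
            self.holder = candidate       # first claimant becomes leader
            return True
        return self.holder == candidate   # only the holder may renew

    def release(self):
        self.holder = None                # e.g. the holder crashed

lease = Lease()
print(lease.try_acquire("scheduler-1"))  # True  -> active
print(lease.try_acquire("scheduler-2"))  # False -> standby
lease.release()
print(lease.try_acquire("scheduler-2"))  # True  -> standby takes over
```

The real protocol adds expiry times so a crashed leader’s lease lapses on its own, but the effect is the same: multiple replicas for availability, exactly one doing the work at any moment.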

Worker Nodes are also typically configured with at least three. By distributing pods across multiple nodes, if one node dies, replicas on other nodes keep the service running.

You don’t need to understand this complex setup from the start. Just remember that there’s a perspective of “how should each component be distributed to eliminate single points of failure.”


In the next part, we’ll dive deep into pods themselves. We’ll practice and learn why the pod — not the container — is the unit of deployment, and what the sidecar pattern of putting multiple containers in one pod is all about.

-> Part 3: Pod



