Architecture: How Kubernetes Works

Codelooru

You have a fleet of containers. They need to run reliably, scale under load, recover from failures, and communicate with each other — all without you manually SSHing into servers at 2am. That's the problem Kubernetes was built to solve.

This post walks through how Kubernetes is actually architected: what the pieces are, how they talk to each other, and why they were designed that way. No prior Kubernetes experience needed — just a general comfort with the idea of running software on servers.


The big picture: a cluster

Everything in Kubernetes lives inside a cluster. A cluster is simply a set of machines (physical or virtual) that Kubernetes manages as a single unit. Those machines are called nodes, and they fall into two roles:

  • Control plane nodes — the brain. They make decisions about the cluster.
  • Worker nodes — the muscle. They actually run your application containers.

Here's what that looks like at the highest level:

[Diagram: a Kubernetes cluster — the control plane (API Server, Scheduler, Controller Manager, etcd) alongside worker nodes (kubelet, kube-proxy, and the pods running your containers)]

The control plane never runs your application workloads. Its only job is to manage the cluster state — deciding what should run where, watching for failures, and reconciling reality with what you asked for. Worker nodes do the actual work of running containers.


The control plane

The control plane is made up of four components. Each has a very specific, focused responsibility.

The API server

Every interaction with Kubernetes — whether from you via kubectl, from an internal component, or from an external tool — goes through the API server (kube-apiserver). It is the single entry point for all cluster operations.

The API server is stateless. It validates requests, enforces authentication and authorisation, and then reads from or writes to etcd. It doesn't make scheduling decisions. It doesn't run controllers. It just exposes a RESTful API and is the only component allowed to talk directly to etcd.

etcd

etcd is a distributed key-value store. It is where Kubernetes keeps all of its state: what nodes exist, what pods are scheduled, what the desired configuration looks like, what has actually happened. If you lose etcd without a backup, you lose your cluster.

etcd uses the Raft consensus algorithm to stay consistent across multiple replicas. In production, you typically run three or five etcd instances so that a single failure doesn't bring the cluster down.

One important design point: no component other than the API server writes to etcd directly. Everything goes through the API server. This keeps the data layer clean and makes auditing and access control straightforward.

The scheduler

When a new pod needs to run and no node has been assigned yet, the scheduler (kube-scheduler) picks the right node for it.

It does this in two phases:

  1. Filtering — eliminate nodes that can't run the pod. Not enough CPU? Node is tainted? Wrong zone? Out.
  2. Scoring — rank the remaining nodes. Spread pods evenly? Prefer nodes with the image already cached? Score them and pick the winner.

The scheduler doesn't actually start the pod. It just writes the chosen node name into the pod's spec in etcd (via the API server). The kubelet on that node notices and takes it from there.
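To make the scheduler's inputs concrete, here's a minimal pod sketch (the name and image are illustrative, not from the original post). The resource requests are exactly what the filtering phase checks against each node's free capacity:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx:1.25
      resources:
        requests:          # filtering: nodes without this much free capacity are eliminated
          cpu: "500m"
          memory: 256Mi
  # After scheduling, the chosen node appears in the spec:
  # nodeName: worker-2
```

The commented-out `nodeName` is the field the scheduler writes (via the API server) once scoring picks a winner — the kubelet on that node takes over from there.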

The controller manager

The controller manager (kube-controller-manager) runs a collection of control loops — small programs that continuously watch the cluster state and drive it toward the desired state.

There are many controllers bundled together: the ReplicaSet controller (ensures the right number of pod replicas exist), the Node controller (notices when nodes go offline), the Job controller (runs pods to completion), the Endpoints controller (keeps the list of pod IPs behind each Service up to date), and more.

Each controller follows the same pattern:

  1. Observe the current state.
  2. Diff it against the desired state.
  3. Act — create, delete, or update resources to close the gap.
  4. Loop forever.

This reconciliation loop is one of the most important ideas in Kubernetes. You don't tell Kubernetes what to do step by step — you declare the desired state, and the controllers figure out how to get there and stay there.


The worker node

Worker nodes run two Kubernetes-specific processes alongside your application containers.

kubelet

The kubelet is an agent that runs on every worker node. It watches the API server for pods that have been assigned to its node, and then ensures those pods are running and healthy.

The kubelet doesn't manage containers directly. It talks to a container runtime — such as containerd or CRI-O — via the Container Runtime Interface (CRI). The runtime is what actually pulls images and starts containers. kubelet just tells it what to do and reports back.

If a container crashes, kubelet restarts it. If the pod is deleted from the API server, kubelet stops the container. It is the ground-truth enforcer on each node.

kube-proxy

kube-proxy handles networking. Its job is to maintain network rules on the node so that traffic destined for a Kubernetes Service gets routed to the right pod, even as pods come and go.

In most modern clusters, kube-proxy uses iptables or ipvs rules to do this efficiently in the kernel. When a Service is created, kube-proxy writes rules that say: "traffic to this virtual IP should be load-balanced across these backend pod IPs."


Pods: the smallest deployable unit

Kubernetes doesn't schedule individual containers — it schedules pods. A pod is a group of one or more containers that share:

  • A network namespace (same IP address, same ports)
  • Storage volumes
  • A lifecycle (they start and stop together)

Most pods contain a single container. The multi-container pattern is used for tightly coupled helpers — a log shipper alongside a web server, or a proxy sidecar alongside a service mesh workload.

Pods are ephemeral by design. They are not expected to be long-lived. When a node fails, the pods on it don't migrate — they are destroyed and replaced elsewhere. The pod's IP address changes. Any local storage is gone. This is intentional: it forces you to build stateless, resilient applications.

[Diagram: a pod with a shared IP (e.g. 10.0.0.42) — an app container on port 8080 and a sidecar log shipper, sharing a volume and talking over localhost]
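As a sketch of that sidecar pattern (container names, image tags, and paths here are illustrative), a two-container pod sharing a volume might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-shipper
spec:
  volumes:
    - name: logs
      emptyDir: {}              # scratch volume shared by both containers; dies with the pod
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-shipper         # sidecar: reads what the app writes to the shared volume
      image: busybox:1.36
      command: ["sh", "-c", "tail -F /var/log/app/access.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
```

Both containers see the same network namespace, so the sidecar could also reach the app at `localhost:8080` without any Service in between.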

How the pieces connect: request lifecycle

Let's make this concrete. You run kubectl apply -f deployment.yaml. Here's exactly what happens:

[Diagram: kubectl → API server → etcd (the source of truth); the scheduler and kubelet each watch the API server, and the container runtime pulls the image and starts the container]

  1. You submit the deployment manifest. The API server authenticates you, validates the manifest, and persists it to etcd.
  2. The scheduler is watching the API server for unscheduled pods. It sees the new pods, runs its filtering and scoring logic, and writes the node assignment back via the API server.
  3. The kubelet on the chosen node is watching the API server for pods assigned to it. It sees the new pod and tells the container runtime to pull the image and start the container.
  4. The controller manager's ReplicaSet controller keeps watching to ensure the desired replica count is maintained from this point on.

Services: stable networking for ephemeral pods

Pods come and go. Their IP addresses change. So how does one pod reliably talk to another?

The answer is a Service. A Service is a stable virtual IP address (called a ClusterIP) with a DNS name, backed by a set of pods matched by a label selector. Traffic to the Service IP gets load-balanced across the current healthy pods, regardless of how many times they've been replaced.

There are a few Service types worth knowing:

  • ClusterIP — internal only. Reachable within the cluster. The default.
  • NodePort — exposes the service on a static port on every node's IP. Useful for development.
  • LoadBalancer — provisions a cloud load balancer and routes external traffic in. Used in production on managed Kubernetes (EKS, GKE, AKS).

[Diagram: a client pod resolves my-svc.default.svc via DNS to the Service's stable virtual IP; traffic is load-balanced across backend pods A (10.0.1.4:8080), B (10.0.1.7:8080), and C (10.0.2.1:8080)]

DNS in a Kubernetes cluster is handled by CoreDNS, which runs as a pod in the kube-system namespace. When a pod does a DNS lookup for my-service.my-namespace.svc.cluster.local, CoreDNS resolves it to the Service's ClusterIP.
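A minimal ClusterIP Service might look like this (the names and label are illustrative). The selector is what ties the stable virtual IP to whichever pods currently carry the label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  type: ClusterIP        # the default: internal-only virtual IP
  selector:
    app: my-app          # any healthy pod with this label becomes a backend
  ports:
    - port: 80           # the port clients connect to on the Service IP
      targetPort: 8080   # the container port traffic is forwarded to
```

Inside the cluster, other pods would reach this at `my-svc.<namespace>.svc.cluster.local:80`, no matter how often the backing pods are replaced.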


Higher-level abstractions

You rarely create pods directly. Kubernetes provides higher-level objects that manage pods for you.

Deployment

A Deployment manages a ReplicaSet, which in turn manages pods. You tell the Deployment "I want 3 replicas of this container image." It ensures there are always 3 running. Rolling updates and rollbacks are handled automatically — new pods are brought up before old ones are taken down, keeping your service available throughout.
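That "I want 3 replicas" declaration looks roughly like this as a manifest (the app name and image are placeholders, not from the original post):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                  # desired state: the ReplicaSet controller keeps 3 running
  selector:
    matchLabels:
      app: my-app
  template:                    # the pod template each replica is stamped from
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # hypothetical image
          ports:
            - containerPort: 8080
```

Changing `image` to a new tag and re-applying is what triggers the rolling update described above.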

StatefulSet

A StatefulSet is for workloads that need stable identity. Each pod gets a persistent hostname (db-0, db-1, db-2) and a persistent volume that follows it. Used for databases, message queues, and anything where pod identity matters.

DaemonSet

A DaemonSet ensures one pod runs on every node. Used for cluster-wide agents: log collectors, monitoring exporters, network plugins.

Job and CronJob

A Job runs a pod to completion. A CronJob does it on a schedule. Used for batch processing, database migrations, report generation.
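A CronJob sketch, for flavour (schedule, name, and image are illustrative) — it stamps out a Job on the given cron schedule, and each Job runs its pod to completion:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"           # standard cron syntax: every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # retry the pod if it exits non-zero
          containers:
            - name: report
              image: registry.example.com/report-gen:1.0   # hypothetical image
```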

[Diagram: a Deployment manages a ReplicaSet, which manages Pods; a StatefulSet manages its Pods directly]

Ingress: routing external HTTP traffic

A Service of type LoadBalancer gives you one external IP per service. In a real application you might have dozens of services — you don't want dozens of load balancers. That's what Ingress solves.

An Ingress is a Kubernetes resource that defines HTTP/S routing rules: which hostname or URL path maps to which backend Service. A single Ingress can route api.myapp.com to one Service, myapp.com/static to another, and handle TLS termination for all of them.

But an Ingress resource on its own does nothing. It needs an Ingress controller — a pod running in the cluster that watches for Ingress resources and configures an actual reverse proxy (nginx, Envoy, HAProxy, or a cloud-native LB) to implement the rules. Popular choices are ingress-nginx and the AWS/GKE/Azure native controllers.
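The routing rules from the example above would look roughly like this as an Ingress manifest (assuming ingress-nginx as the controller and a pre-existing TLS Secret — both are illustrative choices):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx          # which Ingress controller implements these rules
  tls:
    - hosts: [api.myapp.com, myapp.com]
      secretName: myapp-tls        # TLS cert/key stored as a Secret
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api          # api.myapp.com → Service "api"
                port:
                  number: 80
    - host: myapp.com
      http:
        paths:
          - path: /static
            pathType: Prefix
            backend:
              service:
                name: static       # myapp.com/static → Service "static"
                port:
                  number: 80
```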

[Diagram: internet HTTPS traffic reaches the Ingress controller, which reads the Ingress rules and terminates TLS — routing api.myapp.com → Service api, myapp.com/ → Service frontend, myapp.com/static → Service static]

ConfigMaps and Secrets

Hardcoding configuration into container images is bad practice — you'd need a new image for every environment. Kubernetes solves this with ConfigMaps and Secrets.

A ConfigMap stores non-sensitive configuration as key-value pairs: database hostnames, feature flags, log levels. Pods consume them either as environment variables or as files mounted into the container's filesystem.

A Secret works the same way but is intended for sensitive data — passwords, API keys, TLS certificates. Secrets are base64-encoded in etcd (and can be encrypted at rest if you configure the cluster to do so). The separation exists because it lets you apply tighter access controls to Secrets via RBAC — a pod that needs a database hostname doesn't need access to database credentials.

Both are namespace-scoped, so a Secret in production is not visible to pods in staging.
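A sketch of both consumption styles side by side (all names, keys, and the image are illustrative; the `db-credentials` Secret is assumed to exist already):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DB_HOST: db.internal.example     # non-sensitive settings as plain key-value pairs
  LOG_LEVEL: info
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0   # hypothetical image
      envFrom:
        - configMapRef:
            name: app-config       # every ConfigMap key becomes an env var
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials # a Secret created separately, never baked into the image
              key: password
```

Note the asymmetry this enables: RBAC can grant this pod's ServiceAccount read access to `app-config` without granting it anything on Secrets.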


Horizontal Pod Autoscaler

The HorizontalPodAutoscaler (HPA) automatically scales the number of pod replicas in a Deployment or StatefulSet based on observed metrics — typically CPU or memory utilisation, but also custom metrics from your own monitoring stack.

You define a target: "keep average CPU utilisation at 60%." The HPA controller (part of the controller manager) checks metrics every 15 seconds. If utilisation climbs above the target, it increases the replica count. If it drops, it scales back down — with a configurable cooldown to prevent thrashing.

HPA handles the pod count. If the cluster itself runs out of node capacity, the Cluster Autoscaler — a separate component — provisions new nodes from your cloud provider. Together they give you full elastic scaling.
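The "keep average CPU at 60%" target from above, expressed as an autoscaling/v2 manifest (the target Deployment name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:                  # what gets scaled
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10                  # hard ceiling, regardless of load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # "keep average CPU utilisation at 60%"
```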


RBAC: access control

Kubernetes has a built-in authorisation system called Role-Based Access Control (RBAC). Every action against the API server — listing pods, creating deployments, reading secrets — can be permitted or denied based on the identity making the request.

The model has three building blocks:

  • ServiceAccount — an identity assigned to a pod. (Human users are authenticated separately, e.g. via certificates or an external identity provider.) Every pod has one, defaulting to the namespace's default ServiceAccount.
  • Role / ClusterRole — a set of permissions. A Role is namespace-scoped ("can read pods in the staging namespace"). A ClusterRole is cluster-wide ("can read nodes anywhere").
  • RoleBinding / ClusterRoleBinding — attaches a Role to a ServiceAccount (or user or group).

In practice this means a pod running your application should have a ServiceAccount with only the permissions it actually needs. A pod that reads ConfigMaps doesn't need permission to delete Secrets. Least privilege, enforced at the API server.
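A least-privilege sketch along those lines (namespace, names, and the ServiceAccount are illustrative): a Role that can only read ConfigMaps, bound to one pod identity.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: staging
rules:
  - apiGroups: [""]                  # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]  # read-only; no writes, and no access to Secrets
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reads-configmaps
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: my-app                     # the pod's identity
    namespace: staging
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```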


Networking and CNI

Kubernetes mandates a flat networking model: every pod gets a unique IP address, and any pod can reach any other pod directly without NAT. But Kubernetes itself doesn't implement this — it delegates to a CNI (Container Network Interface) plugin.

Popular CNI plugins include Calico, Flannel, Cilium, and Weave. Each implements the pod IP allocation and cross-node routing differently (overlay networks, BGP, eBPF). The choice affects performance, observability, and which NetworkPolicy features you get.

NetworkPolicy is Kubernetes' firewall primitive. By default all pods can talk to all pods. A NetworkPolicy lets you restrict that — "only pods with label app: frontend may connect to pods with label app: api on port 8080." CNI plugins are also responsible for enforcing these policies.
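The quoted rule — only `app: frontend` pods may reach `app: api` pods on port 8080 — looks roughly like this as a manifest (the policy name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: api              # the pods this policy protects
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only these pods may connect...
      ports:
        - protocol: TCP
          port: 8080        # ...and only on this port
```

Once any Ingress policy selects a pod, all other inbound traffic to it is denied by default — which is what turns the flat network into something you can actually lock down.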


Persistent storage

Pods don't have durable storage by default. To persist data across pod restarts, Kubernetes uses Volumes, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs).

The separation is intentional: a PersistentVolume is a cluster-level storage resource (an EBS volume, an NFS share, a local disk). A PersistentVolumeClaim is a pod's request for storage — "I need 20Gi of block storage." Kubernetes binds PVCs to PVs automatically. The pod just mounts the PVC; it doesn't care where the storage actually comes from.

This abstraction lets the same application manifest work whether you're running on AWS, GCP, on-prem, or your laptop (with a local storage provider).
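The claim-then-mount split looks like this in practice (names and the database image are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi          # "I need 20Gi" — the cluster binds or provisions a matching PV
---
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
    - name: db
      image: postgres:16
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data      # the pod mounts the claim, never the PV directly
```

Whether `data` ends up backed by EBS, NFS, or a local disk is decided by the cluster's storage classes, not by this manifest.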


How Kubernetes handles failures

Fault tolerance isn't a feature you bolt on in Kubernetes — it's built into the reconciliation model.

  • Pod crashes — the kubelet restarts the container automatically. Kubernetes tracks restart counts and applies exponential backoff if a container keeps crashing (CrashLoopBackOff).
  • Node goes offline — the Node controller marks it NotReady after a grace period (roughly 40 seconds by default). If the node stays unreachable past the eviction timeout (default ~5 minutes), its pods are evicted, and controllers such as the ReplicaSet controller create replacements that the scheduler places elsewhere.
  • Unhealthy containers — liveness probes let Kubernetes know when to restart a container (it's running but stuck). Readiness probes let Kubernetes know when a container is ready to receive traffic — the Service will stop routing to a pod that fails its readiness check, even if the pod is still running.
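Both probes are declared on the container itself; a sketch with HTTP checks (paths, ports, the image, and timings are all illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0   # hypothetical image
      livenessProbe:            # failing repeatedly → kubelet restarts the container
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:           # failing → the Service stops routing traffic here
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```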

Putting it all together

Here's a complete picture of a production Kubernetes cluster, with all the layers in place:

[Diagram: the full production cluster — control plane (API Server, etcd, Scheduler, Controller Manager incl. HPA, CoreDNS in kube-system); worker nodes running kubelet, kube-proxy/CNI, and app pods; and the surrounding layers: Ingress controller (host/path routing, TLS termination), Services, ConfigMaps/Secrets, RBAC, NetworkPolicy (enforced by the CNI plugin), HorizontalPodAutoscaler, and PersistentVolumes (EBS, NFS, local disk)]

Namespaces and multi-tenancy

Kubernetes uses namespaces to partition a single cluster into virtual sub-clusters. Resources in one namespace are isolated from those in another by default. You can apply resource quotas, network policies, and RBAC rules per namespace.

Common patterns: a namespace per team, a namespace per environment (dev/staging/prod), or a namespace per application. The default namespace is where resources land if you don't specify one. System components live in kube-system.
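A per-namespace quota, as mentioned above, is itself just another resource (the namespace name and limits here are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "10"       # total CPU all pods in the namespace may request
    requests.memory: 20Gi
    pods: "50"               # hard cap on pod count
```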


What Kubernetes doesn't do

It's worth being explicit about what sits outside Kubernetes' scope, because it's often misunderstood:

  • It doesn't build your container images — that's your CI pipeline.
  • It doesn't provide application-level logging or metrics out of the box — you bring your own stack (Prometheus, Grafana, ELK, Datadog, etc.).
  • It doesn't do service mesh — mutual TLS between services, circuit breaking, and traffic shifting are the domain of tools like Istio or Linkerd.
  • It doesn't manage your cloud infrastructure — VPCs, subnets, IAM roles. That's Terraform, Pulumi, or your cloud console.

Kubernetes is the orchestration layer. The ecosystem around it — Helm for packaging, Argo CD for GitOps, cert-manager for TLS, the CNI plugins for networking — is what makes a full platform.


Summary

Kubernetes separates the cluster into a control plane and worker nodes. The control plane — API server, etcd, scheduler, and controller manager — handles desired state management. Worker nodes — kubelet and kube-proxy — run your actual workloads.

The foundational idea is the reconciliation loop: declare what you want, and Kubernetes continuously works to make reality match the declaration. That's what gives it resilience: there's no "run this command once and hope." There's only "desired state vs actual state, close the gap."

Understanding this architecture is what makes every kubectl command, every YAML manifest, and every failure mode click into place.


Related on this blog: Explained: Node Affinity in Kubernetes


