When you deploy a workload to a Kubernetes cluster, the scheduler makes a decision: which node should run this pod? By default it balances across available nodes, picking whichever has the capacity. Most of the time that's fine. But in the real world, infrastructure is rarely uniform. You've got GPU nodes for ML workloads, high-memory nodes for caching tiers, nodes in specific availability zones for compliance, and bare-metal nodes for latency-sensitive services. You need to tell the scheduler where things belong — and more importantly, what to do when the perfect node isn't available.
That's where Node Affinity comes in. It's a declarative way to express scheduling preferences and requirements using label selectors, giving you fine-grained control without coupling your workload spec to specific hostnames. This post breaks down the concept, the mechanics, and the tradeoffs — with diagrams where a picture is genuinely worth a thousand words.
## The problem: why default scheduling isn't always enough
Before going into Node Affinity specifically, it helps to understand the broader scheduling landscape. Kubernetes has three primary mechanisms for constraining pod placement:
- `nodeSelector` — the blunt instrument. A simple key-value label match. If the node doesn't have the label, the pod doesn't schedule. No fallback, no nuance.
- Node Affinity — the evolved version of `nodeSelector`. Supports richer expressions (`In`, `NotIn`, `Exists`, `Gt`, etc.) and — critically — distinguishes between hard requirements and soft preferences.
- Taints & Tolerations — the inverse approach. Instead of pods saying "I want to go there," nodes say "keep away unless you tolerate this." Complementary to Node Affinity, not a replacement.
Three mechanisms feeding into the scheduler — Node Affinity sits in the middle, balancing expressiveness and flexibility.
## Node Affinity: the two flavours
Node Affinity lives under `spec.affinity.nodeAffinity` in your pod spec (or pod template in a Deployment). There are exactly two rule types, and understanding the distinction between them is the foundation of everything else:
| Rule Type | Meaning | If unmet |
|---|---|---|
| `requiredDuringSchedulingIgnoredDuringExecution` | Hard requirement | Pod stays `Pending` |
| `preferredDuringSchedulingIgnoredDuringExecution` | Soft preference | Falls back to any eligible node |
The `IgnoredDuringExecution` suffix is important: if a node's labels change after a pod is already running there, the pod is not evicted. It continues running even if the rule would now reject the placement.
> **Key insight:** Think of `required` rules as a gate and `preferred` rules as a hint. Gates block; hints guide. Using the wrong one is the most common Node Affinity mistake.
required rules block scheduling entirely; preferred rules influence it without blocking.
## Anatomy of a Node Affinity spec
A real example — this Deployment targets GPU nodes in us-east-1a, with a soft preference for high-memory nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-job
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-t4
                - nvidia-a100
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - high-memory
          - weight: 20
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - standard
```
The `nodeSelectorTerms` list under `required` is evaluated as an OR — the pod schedules on any node satisfying at least one term. Within a single term, the `matchExpressions` list is an AND — all expressions must be satisfied simultaneously.
Terms in nodeSelectorTerms combine with OR; expressions inside a single term combine with AND.
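The two combination rules are easy to mis-remember, so here is a minimal Python sketch of the filter logic — the function names are mine, not the scheduler's, and only the set-style operators are covered:

```python
def node_matches(node_labels, node_selector_terms):
    """Terms are OR'd: the node passes if at least one term matches."""
    return any(term_matches(node_labels, term) for term in node_selector_terms)

def term_matches(node_labels, term):
    """Expressions inside a single term are AND'd: all must hold."""
    return all(expr_matches(node_labels, expr) for expr in term["matchExpressions"])

def expr_matches(labels, expr):
    """Evaluate one matchExpression against a node's label map."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")
```

With the deployment above, a node labelled `accelerator=nvidia-t4` in `us-east-1a` passes; a node with the right GPU in the wrong zone fails, because both expressions sit in the same term.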
### Available operators
| Operator | Behaviour | Typical use case |
|---|---|---|
| `In` | Label value is in the given list | Match one of several GPU types |
| `NotIn` | Label value is not in the list | Exclude spot/preemptible nodes |
| `Exists` | Key exists (any value) | Require a node to be labelled at all |
| `DoesNotExist` | Key is absent | Avoid nodes with a "draining" label |
| `Gt` | Integer label value is greater than | Minimum CPU generation or disk tier |
| `Lt` | Integer label value is less than | Cap on a measured property |
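For example, `Gt` takes a single entry in `values` and compares the node's label value against it as an integer — a sketch, assuming nodes carry a hypothetical `cpu-generation` label:

```yaml
- key: cpu-generation
  operator: Gt
  values:
  - "7"    # matches only nodes whose cpu-generation label parses to an integer greater than 7
```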
## Preferred rules and the weight system
Each `preferred` entry carries a `weight` from 1 to 100. The scheduler scores every node that passes the hard filters by summing the weights of all `preferred` terms that node satisfies. The highest-scoring node wins.

Think of it as a bidding system for nodes. A node with `node-type=high-memory` earns 80 points; one with `node-type=standard` earns 20. If both pass the `required` gate, the scheduler strongly prefers the high-memory node — but it won't block scheduling if neither matches. A node satisfying no `preferred` terms earns zero points but is still eligible.
All three nodes pass the required gate. Weights from preferred rules determine the winner.
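The scoring step can be sketched in a few lines of Python — a simplification that handles only the `In` operator, with function names of my own choosing:

```python
def score_node(node_labels, preferred_terms):
    """Sum the weights of every preferred term the node satisfies.

    Only the In operator is handled here, for brevity; the real
    scheduler evaluates the full matchExpressions grammar.
    """
    total = 0
    for term in preferred_terms:
        exprs = term["preference"]["matchExpressions"]
        if all(node_labels.get(e["key"]) in e.get("values", []) for e in exprs):
            total += term["weight"]
    return total
```

Applied to the deployment above: a `node-type=high-memory` node scores 80, a `node-type=standard` node scores 20, and an unlabelled node scores 0 but remains eligible.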
## Node Affinity in OpenShift
OpenShift uses Kubernetes Node Affinity natively — the spec is identical. But it adds important operator-level tooling on top.
### Machine Config Pools and node labels
In OpenShift, nodes are grouped into Machine Config Pools (MCPs) — master, worker, and any custom pools you define. MCPs automatically propagate labels onto their member nodes, so you can write Node Affinity rules against MCP-propagated labels rather than labelling each node manually. A custom MCP called `gpu-workers` will label its nodes with `node-role.kubernetes.io/gpu-workers: ""`.
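With the role label in place, an `Exists` expression is enough to pin pods to the pool — a sketch, assuming the custom `gpu-workers` pool above:

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: node-role.kubernetes.io/gpu-workers
      operator: Exists    # role labels are empty-valued, so Exists is the right operator
```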
### Infrastructure nodes
OpenShift has a concept of infrastructure nodes for cluster-internal workloads (the router, registry, monitoring stack). These carry the label `node-role.kubernetes.io/infra: ""`. Using Node Affinity against this label keeps platform components off your application worker pool — and keeps your Red Hat subscription licensing clean, since infrastructure node costs are treated differently.
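A minimal affinity stanza for pinning a platform component to infra nodes might look like this — note that infra nodes are often also tainted, in which case the pod needs a matching toleration as well:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/infra
          operator: Exists
```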
> **OpenShift 4.x tip:** When using Node Affinity alongside the cluster autoscaler, make sure your `MachineAutoscaler` targets machine sets whose nodes carry the labels your affinity rules expect. If the autoscaler spins up nodes from an unlabelled machine set, your pods will remain `Pending` even though capacity was just added.
## Node Affinity vs. `nodeSelector` vs. Taints & Tolerations
- Use `nodeSelector` for simple, non-negotiable label requirements with no need for expression operators.
- Use Node Affinity when you need `In`/`NotIn`/`Exists` operators, weighted soft preferences, or multiple OR-able condition sets.
- Use Taints & Tolerations when nodes should repel all workloads by default — dedicated node pools that nothing lands on unless it explicitly opts in.
- Combine Node Affinity + Tolerations for full precision: tolerations allow a pod to land on a tainted node; affinity ensures it only lands on the right one.
Taints keep the regular pod off the GPU node. Affinity ensures the ML pod lands only on the GPU node, not just anywhere it has a toleration.
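A sketch of the pairing, assuming the GPU nodes are tainted with a hypothetical `dedicated=gpu:NoSchedule` taint and labelled `accelerator=nvidia-t4`:

```yaml
spec:
  tolerations:               # allows the pod onto the tainted GPU nodes
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  affinity:
    nodeAffinity:            # ensures it lands ONLY on GPU nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-t4
```

Without the affinity half, the toleration merely permits the GPU nodes — the pod could still land anywhere else in the cluster.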
## Practical patterns
### Pattern 1: zone-pinned services
For services co-located with a database in a specific availability zone:
```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - eu-west-1b
```
### Pattern 2: prefer on-demand, tolerate spot
For stateless workloads that can tolerate interruption but prefer reliability:
```yaml
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  preference:
    matchExpressions:
    - key: node.kubernetes.io/lifecycle
      operator: In
      values:
      - on-demand
- weight: 10
  preference:
    matchExpressions:
    - key: node.kubernetes.io/lifecycle
      operator: In
      values:
      - spot
```
### Pattern 3: exclude draining nodes
During rolling upgrades, label nodes with `draining=true` before cordoning. This keeps fresh pods off them:
```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: draining
      operator: DoesNotExist
```
## Common mistakes and how to avoid them
**Mistake 1: Using `required` when you meant `preferred`.** The symptom is pods stuck in `Pending` with the event "0/N nodes are available: node(s) didn't match Pod's node affinity". Start with `preferred` during development; tighten to `required` once node labelling is confirmed consistent.

**Mistake 2: Confusing AND vs. OR logic.** Two entries in `nodeSelectorTerms` are OR'd. Two entries inside a term's `matchExpressions` are AND'd. Placing two requirements as separate terms when you meant both to be required simultaneously is a very easy trap.
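Side by side, with hypothetical labels — the trap and the fix:

```yaml
# OR — two terms: a node needs accelerator OR disktype
nodeSelectorTerms:
- matchExpressions:
  - key: accelerator
    operator: Exists
- matchExpressions:
  - key: disktype
    operator: Exists

# AND — one term, two expressions: a node needs accelerator AND disktype
nodeSelectorTerms:
- matchExpressions:
  - key: accelerator
    operator: Exists
  - key: disktype
    operator: Exists
```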
**Mistake 3: Label case mismatches.** Node labels are case-sensitive. `accelerator=nvidia-T4` and `accelerator=nvidia-t4` are different labels. Always verify with `kubectl get nodes --show-labels` before writing affinity rules.

**Mistake 4: Expecting running pods to be evicted on label changes.** `IgnoredDuringExecution` means pods already running on a node are unaffected if its labels change. New pods won't schedule there, but existing ones stay put. Plan accordingly during maintenance windows.
## Debugging Node Affinity issues
```bash
# See scheduling events and affinity failures
kubectl describe pod <pod-name> -n <namespace>

# List all node labels
kubectl get nodes --show-labels

# Filter nodes by a specific label
kubectl get nodes -l accelerator=nvidia-t4

# OpenShift: check Machine Config Pool labels
oc get machineconfigpool
oc describe machineconfigpool worker
```
If the describe output says "didn't match node affinity", cross-check your `matchExpressions` values against the actual node labels line by line. Nine times out of ten it's a typo or a case mismatch.
## Wrapping up
Node Affinity is one of those Kubernetes features that looks intimidating in the spec but makes intuitive sense once you have the mental model. Hard requirements act as gates; soft preferences act as scoring hints. Terms are OR'd; expressions within a term are AND'd. The `IgnoredDuringExecution` suffix is everywhere because the stronger variants aren't stable yet.

For most production workloads: use `required` for non-negotiable placement constraints (zone, hardware type, compliance isolation) and `preferred` for cost optimisation and soft topology goals. Layer taints and tolerations on top when you need dedicated node pools that must actively repel general workloads.
If you're on OpenShift, lean on Machine Config Pools for label management — it keeps affinity rules consistent across the fleet without per-node label drift. And always verify labels before you deploy; that one is cheaper to check than to debug at 2am.