Explained: Node Affinity in Kubernetes


When you deploy a workload to a Kubernetes cluster, the scheduler makes a decision: which node should run this pod? By default it balances across available nodes, picking whichever has the capacity. Most of the time that's fine. But in the real world, infrastructure is rarely uniform. You've got GPU nodes for ML workloads, high-memory nodes for caching tiers, nodes in specific availability zones for compliance, and bare-metal nodes for latency-sensitive services. You need to tell the scheduler where things belong — and more importantly, what to do when the perfect node isn't available.

That's where Node Affinity comes in. It's a declarative way to express scheduling preferences and requirements using label selectors, giving you fine-grained control without coupling your workload spec to specific hostnames. This post breaks down the concept, the mechanics, and the tradeoffs — with diagrams where a picture is genuinely worth a thousand words.


The problem: why default scheduling isn't always enough

Before going into Node Affinity specifically, it helps to understand the broader scheduling landscape. Kubernetes has three primary mechanisms for constraining pod placement:

  • nodeSelector — the blunt instrument. A simple key-value label match. If the node doesn't have the label, the pod doesn't schedule. No fallback, no nuance.
  • Node Affinity — the evolved version of nodeSelector. Supports richer expressions (In, NotIn, Exists, Gt, etc.) and — critically — distinguishes between hard requirements and soft preferences.
  • Taints & Tolerations — the inverse approach. Instead of pods saying "I want to go there," nodes say "keep away unless you tolerate this." Complementary to Node Affinity, not a replacement.

Three mechanisms feeding into the scheduler — Node Affinity sits in the middle, balancing expressiveness and flexibility.
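To see why nodeSelector is the blunt instrument, here is what a GPU requirement looks like in that form — a minimal sketch with a hypothetical pod and placeholder image name:

```yaml
# nodeSelector: a single exact key=value match. No operators, no
# preferences, no fallback — the pod stays Pending unless a node
# carries exactly this label.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  nodeSelector:
    accelerator: nvidia-t4
  containers:
  - name: main
    image: registry.example.com/gpu-job:latest   # placeholder image
```

There is no way to say "t4 or a100" here, and no way to say "preferably". Both of those gaps are what Node Affinity fills.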


Node Affinity: the two flavours

Node Affinity lives under spec.affinity.nodeAffinity in your pod spec (or pod template in a Deployment). There are exactly two rule types, and understanding the distinction between them is the foundation of everything else:

Rule Type                                        | Meaning          | If unmet
-------------------------------------------------|------------------|--------------------------------
requiredDuringSchedulingIgnoredDuringExecution   | Hard requirement | Pod stays Pending
preferredDuringSchedulingIgnoredDuringExecution  | Soft preference  | Falls back to any eligible node

The IgnoredDuringExecution suffix is important: if a node's labels change after a pod is already running there, the pod is not evicted. It continues running even if the rule would now reject the placement.

Key insight: Think of required rules as a gate and preferred rules as a hint. Gates block; hints guide. Using the wrong one is the most common Node Affinity mistake.

required rules block scheduling entirely; preferred rules influence it without blocking.


Anatomy of a Node Affinity spec

A real example — this Deployment targets GPU nodes in us-east-1a, with a soft preference for high-memory nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-job
spec:
  selector:
    matchLabels:
      app: ml-training-job
  template:
    metadata:
      labels:
        app: ml-training-job
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-t4
                - nvidia-a100
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - high-memory
          - weight: 20
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - standard
      containers:
      - name: trainer
        image: ml-training:latest   # placeholder image

The nodeSelectorTerms list under required is evaluated as an OR — the pod schedules on any node satisfying at least one term. Within a single term, the matchExpressions list is an AND — all expressions must be satisfied simultaneously.


Terms in nodeSelectorTerms combine with OR; expressions inside a single term combine with AND.
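The OR semantics are easiest to see with two single-expression terms. This (hypothetical) rule admits any node in us-east-1a, or any A100 node in any zone — each term stands alone:

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:        # Term 1: node is in us-east-1a ...
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - us-east-1a
  - matchExpressions:        # ... OR Term 2: node has an A100, any zone
    - key: accelerator
      operator: In
      values:
      - nvidia-a100
```

Had both expressions been placed inside one term, a matching node would need to satisfy both at once.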

Available operators

Operator     | Behaviour                                  | Typical use case
-------------|--------------------------------------------|--------------------------------------
In           | Value is in the given list                 | Match one of several GPU types
NotIn        | Value is not in the list                   | Exclude spot/preemptible nodes
Exists       | Key exists (any value)                     | Require a node to be labelled at all
DoesNotExist | Key is absent                              | Avoid nodes with a "draining" label
Gt           | Label value (as integer) is greater than   | Minimum CPU generation or disk tier
Lt           | Label value (as integer) is less than      | Cap on a measured property

Preferred rules and the weight system

Each preferred entry carries a weight from 1 to 100. The scheduler scores every node that passes the hard filters by summing the weights of all preferred terms that node satisfies. The highest-scoring node wins.

Think of it as a bidding system for nodes. A node with node-type=high-memory earns 80 points; one with node-type=standard earns 20. If both pass the required gate, the scheduler strongly prefers the high-memory node — but it won't block scheduling if neither matches. A node satisfying no preferred terms earns zero points but is still eligible.


All three nodes pass the required gate. Weights from preferred rules determine the winner.


Node Affinity in OpenShift

OpenShift uses Kubernetes Node Affinity natively — the spec is identical. But it adds important operator-level tooling on top.

Machine Config Pools and node labels

In OpenShift, nodes are grouped into Machine Config Pools (MCPs) — master, worker, and any custom pools you define. A pool selects its member nodes via a node selector, conventionally keyed on a role label: a custom MCP called gpu-workers typically matches nodes labelled node-role.kubernetes.io/gpu-workers: "". Writing your Node Affinity rules against those role labels keeps them consistent with the pool definition, rather than relying on ad hoc per-node labelling.
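A custom pool following the standard pattern might be sketched like this — treat it as a hedged outline rather than a verbatim manifest; the gpu-workers name is from the example above:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: gpu-workers
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, gpu-workers]   # inherit worker configs plus pool-specific ones
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/gpu-workers: ""
```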

Infrastructure nodes

OpenShift has a concept of infrastructure nodes for cluster-internal workloads (the router, registry, monitoring stack). These carry the label node-role.kubernetes.io/infra: "". Using Node Affinity against this label keeps platform components off your application worker pool — and keeps your Red Hat subscription licensing clean, since infrastructure node costs are treated differently.
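A platform component can be pinned to infra nodes with a hard rule on that label — a sketch, using Exists because the label's value is empty:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/infra
          operator: Exists   # test for presence; the value is ""
```

If your infra nodes are also tainted to repel application workloads, the component additionally needs a matching toleration.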

OpenShift 4.x tip: When using Node Affinity alongside the cluster autoscaler, make sure your MachineAutoscaler targets machine sets whose nodes carry the labels your affinity rules expect. If the autoscaler spins up nodes from an unlabelled machine set, your pods will remain Pending even though capacity was just added.

Node Affinity vs. nodeSelector vs. Taints & Tolerations

  • Use nodeSelector for simple, non-negotiable label requirements with no need for expression operators.
  • Use Node Affinity when you need In/NotIn/Exists operators, weighted soft preferences, or multiple OR-able condition sets.
  • Use Taints & Tolerations when nodes should repel all workloads by default — dedicated node pools that nothing lands on unless it explicitly opts in.
  • Combine Node Affinity + Tolerations for full precision: tolerations allow a pod to land on a tainted node; affinity ensures it only lands on the right one.

Taints keep the regular pod off the GPU node. Affinity ensures the ML pod lands only on the GPU node, not just anywhere it has a toleration.
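The ML pod from the diagram, sketched as a spec (names and image are hypothetical; the taint and label values match the diagram):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:               # allowed onto the tainted GPU node...
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:                  # ...and required to land only there
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia
  containers:
  - name: trainer
    image: registry.example.com/ml-training:latest   # placeholder image
```

The toleration alone would let the pod land anywhere; the affinity alone would let other pods crowd the GPU node. Together they give exclusive, targeted placement.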


Practical patterns

Pattern 1: zone-pinned services

For services co-located with a database in a specific availability zone:

requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - eu-west-1b

Pattern 2: prefer on-demand, tolerate spot

For stateless workloads that can tolerate interruption but prefer reliability:

preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  preference:
    matchExpressions:
    - key: node.kubernetes.io/lifecycle
      operator: In
      values:
      - on-demand
- weight: 10
  preference:
    matchExpressions:
    - key: node.kubernetes.io/lifecycle
      operator: In
      values:
      - spot

Pattern 3: exclude draining nodes

During rolling upgrades, label nodes with draining=true before cordoning. This keeps fresh pods off them:

requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: draining
      operator: DoesNotExist

Common mistakes and how to avoid them

Mistake 1: Using required when you meant preferred. The symptom is pods stuck in Pending with the event "0/N nodes are available: node(s) didn't match Pod's node affinity". Start with preferred during development; tighten to required once node labelling is confirmed consistent.
Mistake 2: Confusing AND vs. OR logic. Two entries in nodeSelectorTerms are OR'd. Two entries inside a term's matchExpressions are AND'd. Placing two requirements as separate terms when you meant both to be required simultaneously is a very easy trap.
Mistake 3: Label case mismatches. Node labels are case-sensitive. accelerator=nvidia-T4 and accelerator=nvidia-t4 are different labels. Always verify with kubectl get nodes --show-labels before writing affinity rules.
Mistake 4: Expecting running pods to be evicted on label changes. IgnoredDuringExecution means pods already running on a node are unaffected if its labels change. New pods won't schedule there, but existing ones stay put. Plan accordingly during maintenance windows.
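To make mistake 2 concrete, here are the two shapes side by side. Both fragments name the same requirements, but only the first demands both at once:

```yaml
# ONE term, two expressions: zone AND accelerator must both match.
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-east-1a]
    - key: accelerator
      operator: Exists

# TWO terms, one expression each: either condition alone suffices.
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-east-1a]
  - matchExpressions:
    - key: accelerator
      operator: Exists
```

The only difference is where the second `- matchExpressions:` dash sits — an easy thing to get wrong in YAML.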

Debugging Node Affinity issues

# See scheduling events and affinity failures
kubectl describe pod <pod-name> -n <namespace>

# List all node labels
kubectl get nodes --show-labels

# Filter nodes by a specific label
kubectl get nodes -l accelerator=nvidia-t4

# OpenShift: check Machine Config Pool labels
oc get machineconfigpool
oc describe machineconfigpool worker

If the describe output says "didn't match node affinity", cross-check your matchExpressions values against the actual node labels line by line. Nine times out of ten it's a typo or a case mismatch.


Wrapping up

Node Affinity is one of those Kubernetes features that looks intimidating in the spec but makes intuitive sense once you have the mental model. Hard requirements act as gates; soft preferences act as scoring hints. Terms are OR'd; expressions within a term are AND'd. The IgnoredDuringExecution suffix is everywhere because the RequiredDuringExecution counterparts — which would evict running pods when node labels change — have never shipped as stable API.

For most production workloads: use required for non-negotiable placement constraints (zone, hardware type, compliance isolation) and preferred for cost optimisation and soft topology goals. Layer taints and tolerations on top when you need dedicated node pools that must actively repel general workloads.

If you're on OpenShift, lean on Machine Config Pools for label management — it keeps affinity rules consistent across the fleet without per-node label drift. And always verify labels before you deploy; that one is cheaper to check than to debug at 2am.


