You deploy a new service to your Kubernetes cluster. The pods come up healthy. You open a shell inside one of them and try to reach another service by name — http://payments-service — and nothing happens. Timeout. You try the full name: http://payments-service.billing.svc.cluster.local. Still nothing. You try the service's ClusterIP directly, and it works fine. Something in the cluster is resolving names, but it's not resolving yours.
If you've spent time with Kubernetes, you've been in this exact situation. The component responsible for cluster-wide name resolution is CoreDNS — and once you understand how it works, debugging these failures goes from guesswork to a methodical five-minute process.
The problem CoreDNS was built to solve
DNS inside a Kubernetes cluster is not the same problem as DNS on the internet. On the public internet, DNS names change rarely. A server resolves google.com once and caches the result for hours. Inside a cluster, services, pods, and endpoints can come and go in seconds. A pod that was at 10.244.1.5 thirty seconds ago might be gone. A new one at 10.244.2.9 just took its place. The DNS server needs to know about that change immediately.
Before CoreDNS, Kubernetes used kube-dns — a multi-container setup running dnsmasq, a custom stub resolver, and a health check sidecar wired together. It worked, but it was fragile. Scaling it required tuning multiple components independently. Adding custom behavior — like rewriting certain DNS names, or forwarding specific domains to an internal corporate resolver — meant patching the code or running workarounds alongside it.
CoreDNS replaced kube-dns as the default Kubernetes DNS server in version 1.11. Rather than hard-coding behavior, it's built entirely around a plugin chain. Every piece of functionality — Kubernetes service resolution, caching, forwarding, health checks, metrics — is a plugin. You configure which plugins run, and in what order, through a single configuration file called the Corefile. That's the entire model.
The plugin chain — how a query actually flows
When CoreDNS receives a DNS query, it doesn't route it to a fixed handler. It runs the query through a chain of plugins, one by one. Each plugin can do one of three things: handle the query and return a response, pass it to the next plugin in the chain, or return an error.
This model is middleware, the same pattern web frameworks use: an HTTP request passes through a stack of filters — logging, authentication, rate limiting — before reaching the actual handler. In CoreDNS, the plugins are the stack and the query is the request.
A query for payments-service.billing.svc.cluster.local enters the chain. The errors plugin watches for failures downstream. The log plugin records the query. The cache plugin checks its cache — if it has a valid answer, it returns immediately and the remaining plugins never run. If not, the query reaches the kubernetes plugin, which consults the Kubernetes API for a matching service. If it finds one, it returns the ClusterIP. If not — for example, if the query is for something outside the cluster — the forward plugin sends it upstream to a public resolver like 8.8.8.8.
One subtlety: the order in which plugins execute is not the order they appear in the Corefile. Execution order is fixed when CoreDNS is compiled, in a file called plugin.cfg; the Corefile controls which plugins are enabled and how each is configured. In the standard build, kubernetes runs before forward — which is why internal names are answered from the cluster and only unmatched queries go upstream. Order matters; it's just decided at build time, not in your config.
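The chain model can be sketched in a few lines of Python. This is illustrative only — the plugin names and signatures are simplified, and this toy chain simply runs handlers in list order:

```python
# Toy model of CoreDNS's plugin chain -- illustrative only. Plugin
# behavior is simplified, and this sketch runs handlers in list order.

def cache_plugin(store):
    def handle(query, next_plugin):
        if query in store:               # cache hit: short-circuit the chain
            return store[query]
        answer = next_plugin(query)      # miss: let the rest of the chain answer
        if answer is not None:
            store[query] = answer        # remember it for next time
        return answer
    return handle

def kubernetes_plugin(services):
    def handle(query, next_plugin):
        if query.endswith(".cluster.local"):
            return services.get(query)   # ClusterIP, or None for NXDOMAIN
        return next_plugin(query)        # not our zone: pass it along
    return handle

def forward_plugin(upstream):
    def handle(query, next_plugin):
        return upstream.get(query)       # last resort: ask the upstream table
    return handle

def build_chain(plugins):
    handler = lambda query: None         # end of chain: no answer
    for plugin in reversed(plugins):
        handler = (lambda p, nxt: lambda q: p(q, nxt))(plugin, handler)
    return handler
```

Wiring a cache, a cluster-service table, and an upstream table together reproduces the flow described above: internal names are answered from the service table, everything else falls through to the forwarder, and repeat queries hit the cache.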
The Corefile — configuring the chain
Every CoreDNS instance is configured by a Corefile. In Kubernetes, this lives in a ConfigMap called coredns in the kube-system namespace. The format is structured around server blocks — each block defines which domains CoreDNS is authoritative for, on which port, and which plugin chain to run for those domains.
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
This is the default Corefile installed by most Kubernetes distributions. The .:53 block means: handle all queries (. is the root zone) on port 53. Let's walk through what each plugin does.
errors — logs any errors that occur downstream in the chain to stderr. It wraps everything that follows, so any plugin further down the chain can surface errors here.
health and ready — expose HTTP endpoints (/health at port 8080, /ready at port 8181) that Kubernetes uses for liveness and readiness probes. The lameduck 5s instruction tells CoreDNS to keep serving queries for 5 seconds after receiving a shutdown signal — giving in-flight requests time to complete before the pod dies.
kubernetes — this is the core plugin for cluster DNS. It watches the Kubernetes API server for services and endpoints, and resolves names under cluster.local, in-addr.arpa, and ip6.arpa. The pods insecure directive enables A records for pods addressed by IP (e.g., 10-244-1-5.default.pod.cluster.local) without verifying that a matching pod actually exists — the "insecure" part, kept for backward compatibility with kube-dns. The fallthrough directive means: if this plugin can't find a match for reverse DNS queries (in-addr.arpa, ip6.arpa), pass the query to the next plugin rather than returning NXDOMAIN.
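The pod record naming scheme is mechanical — dots in the IP become dashes. A sketch (a hypothetical helper, not CoreDNS code):

```python
# Build the A-record name the `pods` directive answers for a pod IP:
# dots in the IP become dashes (hypothetical helper, not CoreDNS code).
def pod_dns_name(pod_ip: str, namespace: str, zone: str = "cluster.local") -> str:
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{zone}"
```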
prometheus — exposes a metrics endpoint on port 9153 in Prometheus format. DNS query counts, latency histograms, cache hit rates — all available here.
forward — forwards queries that didn't match anything in the kubernetes plugin to upstream resolvers. The /etc/resolv.conf argument tells it to use whatever nameservers are configured on the node. This is how queries for external names like api.stripe.com escape the cluster and reach the internet.
cache 30 — caches positive and negative responses for up to 30 seconds. Without this, every DNS query would hit the Kubernetes API or the upstream resolver. With it, repeated lookups for stable names are answered immediately from memory.
loop — detects forwarding loops (CoreDNS forwarding to itself in a cycle) and shuts down gracefully rather than spinning indefinitely.
reload — watches the Corefile for changes and reloads the configuration automatically, without restarting the pod. You can edit the ConfigMap and have it take effect within seconds.
loadbalance — randomizes the order of A records in responses for names that resolve to multiple IPs. A primitive form of DNS-based load balancing.
How Kubernetes service names resolve
Every pod in Kubernetes gets a /etc/resolv.conf injected by the kubelet. It looks something like this:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The nameserver address — 10.96.0.10 — is the ClusterIP of the CoreDNS service itself. All DNS traffic from pods goes here first. The search list tells the resolver which domain suffixes to try appending to unqualified names. The ndots:5 option is significant: it means that if a query name contains fewer than 5 dots, the resolver will try appending each search domain before treating it as an absolute name.
So when a pod queries just payments-service, the resolver first tries payments-service.default.svc.cluster.local, then payments-service.svc.cluster.local, then payments-service.cluster.local, and finally payments-service. as an absolute name. The first one that returns a result wins. For an internet hostname like api.stripe.com — which has only 2 dots — the same search list is tried before the absolute name, meaning a lookup for api.stripe.com generates 4 DNS queries instead of 1.
This is the infamous ndots:5 problem — and it's a real source of latency in high-throughput applications making many external API calls. Each failed search-domain attempt is a round trip to CoreDNS, which forwards it to the upstream resolver, waits, gets NXDOMAIN back, and tries the next suffix. The fix is simple: use fully qualified domain names with a trailing dot in your application config (api.stripe.com.), which tells the resolver to skip the search list entirely. Alternatively, you can lower ndots in the pod's DNS config.
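The expansion logic can be simulated directly. This is a simplified sketch of glibc-style resolver behavior under the search list shown above — real resolvers handle more options, but the query counts match the ones just described:

```python
# Simplified sketch of glibc-style search-list expansion. Returns the
# sequence of query names a resolver would try, in order.
def candidate_queries(name,
                      search=("default.svc.cluster.local",
                              "svc.cluster.local",
                              "cluster.local"),
                      ndots=5):
    if name.endswith("."):            # trailing dot: absolute name, no search list
        return [name]
    if name.count(".") >= ndots:      # "enough" dots: try the absolute name first
        return [name + "."] + [f"{name}.{s}." for s in search]
    # fewer than ndots dots: try each search suffix, then the absolute name
    return [f"{name}.{s}." for s in search] + [name + "."]
```

A bare payments-service expands to four candidates starting in the pod's own namespace; api.stripe.com (only 2 dots, below ndots=5) also generates four queries, while api.stripe.com. with a trailing dot generates exactly one.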
Where CoreDNS sits in the cluster
CoreDNS runs as a Deployment in the kube-system namespace, typically with two replicas for availability. It's exposed via a Service with a stable ClusterIP — the same IP that gets written into every pod's /etc/resolv.conf. The kubelet writes that address when it sets up the pod, taking it from its --cluster-dns configuration.
CoreDNS watches the Kubernetes API continuously using a watch on the Services and Endpoints resources. When a new service is created, CoreDNS doesn't need to be restarted or reconfigured — it picks up the change immediately through the watch and starts answering queries for that service name. This is what makes it suitable for an environment where the service catalog changes constantly.
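The model is easy to sketch: an in-memory map that API watch events mutate, so lookups always reflect the latest state. This is a simplification — the real plugin consumes Kubernetes watch streams via client-go informers — but the shape is the same:

```python
# Sketch of a watch-driven service registry: API events update an
# in-memory map, so lookups reflect cluster state without any restart.
# (Simplified; the real kubernetes plugin uses client-go informers.)
class ServiceRegistry:
    def __init__(self):
        self.records = {}  # service DNS name -> ClusterIP

    def apply_event(self, event_type, name, cluster_ip=None):
        if event_type in ("ADDED", "MODIFIED"):
            self.records[name] = cluster_ip
        elif event_type == "DELETED":
            self.records.pop(name, None)

    def resolve(self, name):
        return self.records.get(name)  # None -> NXDOMAIN
```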
Common customizations
The Corefile ConfigMap is yours to edit. A few customizations come up regularly in production clusters.
Forwarding a specific domain to an internal resolver. If your company runs an internal DNS server for corp.example.com, you can tell CoreDNS to forward only those queries there, while still handling everything else normally:
corp.example.com:53 {
    forward . 10.10.0.2
}
.:53 {
    # default config
}
Multiple server blocks can coexist in the same Corefile. CoreDNS picks the most specific matching block for each query. The corp.example.com block will handle only queries for that domain; everything else falls through to the . block.
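Block selection amounts to a longest-suffix match on the query name. A sketch (simplified — real CoreDNS matches zones label by label and per listening port):

```python
# Pick the server block whose zone is the longest suffix matching the
# query; "." matches everything. (Simplified model of zone selection.)
def pick_block(query, zones):
    best = None
    for zone in zones:
        if zone == "." or query == zone or query.endswith("." + zone):
            if best is None or len(zone) > len(best):
                best = zone
    return best
```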
Rewriting service names. The rewrite plugin lets you alias one name to another. This is useful if you're migrating a service and want the old name to still resolve during the transition:
.:53 {
    rewrite name payments.billing.svc.cluster.local payments-v2.billing.svc.cluster.local
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
Tuning the cache. The default 30-second TTL is a balance between responsiveness to changes and query volume. In a cluster where services rarely change and you want to reduce DNS load, you can raise this. If you're running rolling deployments where pods are replaced frequently and stale IPs cause connection failures, you may want to lower it — or configure the cache with denial and success TTLs separately:
cache {
    success 9984 60   # capacity in entries, then TTL cap in seconds
    denial 9984 5
}
This caches successful resolutions for 60 seconds but lets negative responses (NXDOMAIN) expire in 5 seconds, so newly created services become resolvable quickly.
Failure modes you'll actually encounter
DNS failures in Kubernetes tend to fall into a few recognizable patterns, and CoreDNS's visibility makes most of them diagnosable.
CoreDNS pod not running. If the CoreDNS deployment is down — crashlooping, evicted, or simply not scheduled — all cluster DNS fails simultaneously. Every pod that tries to resolve any name gets timeouts. The failure mode is distinctive: direct IP connections work, names don't. Check kubectl get pods -n kube-system first.
ConfigMap misconfiguration. A syntax error in the Corefile will cause CoreDNS to fail to load the new config and log the error. The reload plugin keeps the old config running rather than crashing, but the change won't take effect. Always check the CoreDNS pod logs (kubectl logs -n kube-system -l k8s-app=kube-dns) immediately after editing the ConfigMap.
Cache serving stale data. After a service is deleted and recreated with a different ClusterIP, clients that have the old IP cached will fail to connect until the cache expires. A 30-second TTL means up to 30 seconds of failures. In practice this is rarely a problem for services (ClusterIPs are stable), but it can bite during certain migration patterns. The TTL from the kubernetes plugin is controlled by the ttl directive on that plugin, independent of the cache plugin's TTL.
Upstream resolver unavailable. If the node's upstream resolver is unreachable, queries for external names will time out. Internal service names will still resolve. This asymmetry — internal works, external doesn't — is the diagnostic signature. Check what /etc/resolv.conf on the node contains, and whether those addresses are reachable from the node.
Search domain explosion under load. In a high-request-rate service making many external API calls, the ndots:5 behavior described earlier can generate 3–4x the DNS query volume you'd expect. CoreDNS can become a bottleneck. The solutions are: use fully qualified names in application config, scale up the CoreDNS deployment (it's stateless and scales horizontally), or enable NodeLocal DNSCache — a DaemonSet that runs a local DNS cache on every node, absorbing most queries before they ever reach the CoreDNS pods.
Summary
CoreDNS is straightforward once you understand the two ideas it's built on. First: every query passes through a plugin chain, and the Corefile is just a description of that chain. Second: the kubernetes plugin is what connects CoreDNS to the live state of the cluster — it watches the API and answers queries for services and pods in real time.
Everything else follows from those two ideas. The ndots:5 behavior is a consequence of how the search list interacts with short external names — understand it once and you'll never be puzzled by unexpected latency again. Customizations like split-horizon DNS or name rewriting are just additional server blocks and plugin directives in the same Corefile. Debugging is a matter of looking at the right things in order: are the pods running, is the Corefile valid, is the cache warm, can the upstream resolver be reached.
CoreDNS is not the most glamorous component in a Kubernetes cluster, but it's one of the most foundational. Every service-to-service call that uses a hostname passes through it. When it works well, it's invisible. When it doesn't, understanding the plugin chain is how you find the problem in minutes rather than hours.
Part of the Explained series — concepts in tech, clearly.