Architecture: Confluent Cloud Gateway

Your Kafka producer has one job at startup: read bootstrap.servers, open a connection, and start sending. That single line of config is also its biggest liability. The host and port are baked in when the application boots, and a Kafka client is not designed to swap them out while running. So the day your active cluster goes dark, every producer and consumer pointing at it is stuck until that cluster comes back or the connection times out. You cannot quietly redirect them to the healthy cluster you have been replicating to for exactly this moment, because they have never heard of it.

This is the gap Confluent Cloud Gateway is built to close. It is a self-managed, Kafka protocol-aware proxy that sits between your clients and your clusters, giving clients a single stable endpoint while you retain the freedom to change what sits behind it. This post walks through how the gateway is built, how a request actually moves through it, and what really happens during a regional disaster, including the parts the word "automatic" tends to gloss over.

The big picture

Without a gateway, a Kafka client talks directly to brokers. It bootstraps against a seed broker, receives cluster metadata listing every broker by host and port, then opens direct connections to each one. The broker addresses the client learns are real, routable addresses. That tight coupling is what makes endpoint changes so painful.

Confluent Cloud Gateway breaks that coupling by inserting a proxy that speaks Kafka fluently. Clients connect to the gateway instead of to brokers. The gateway intercepts the protocol, forwards requests to the correct upstream cluster, and rewrites the responses so the client only ever sees gateway-owned virtual endpoints. The backend becomes invisible. When the backend changes, the client does not notice, because every address it has ever been handed belongs to the gateway.

Two concepts make this work, and the whole architecture rests on them. A streaming domain is a logical representation of one Kafka cluster inside the gateway: its name, its bootstrap servers, its broker node ID ranges. A route is the endpoint clients actually connect to, and each route is bound to exactly one streaming domain. Clients connect to routes; routes point at streaming domains; streaming domains map to real clusters.

The power sits in that final binding. Change which streaming domain a route points to, and you have redirected every client on that route to a different cluster, without touching a single client. This is the mechanism behind migrations, blue-green upgrades, and disaster recovery alike. They are all the same operation: repoint a route.

Streaming domains: modeling your clusters

A streaming domain is configuration, not a running component. It tells the gateway how to reach one backend cluster. At minimum it carries a name, the cluster's bootstrap servers with their listener IDs, and the nodeIdRanges that describe which broker node IDs exist in that cluster. The gateway needs the node ID ranges because of how it virtualizes brokers, which the next section covers.

A gateway can hold many streaming domains at once. For disaster recovery this is the starting point: you define one streaming domain for your active cluster and one for your passive cluster, both present in the gateway's configuration from day one. Both are known to the gateway; only one is currently wired to live traffic.

streamingDomains:
  - name: kafka1-domain
    type: kafka
    kafkaCluster:
      name: kafka-cluster-1
      bootstrapServers:
        - id: internal-listener
          endpoint: "kafka-1:44444"
  - name: kafka2-domain
    type: kafka
    kafkaCluster:
      name: kafka-cluster-2
      bootstrapServers:
        - id: internal-listener
          endpoint: "kafka-2:22222"

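This declares two backend clusters to the gateway. The example shows only names and bootstrap servers; the nodeIdRanges described above are not shown. Note that nothing here says which cluster is active. That decision lives in the route, defined separately, and that separation is the entire point.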

Routes: the endpoint clients see

A route is the client-facing listener. It has its own endpoint, the host and port clients put in their bootstrap.servers, and it references one streaming domain plus the specific bootstrap server ID within it. When a client connects to a route's endpoint, the gateway knows which streaming domain to forward to because the route told it.

routes:
  - name: switchover-route
    endpoint: "gateway.internal:19092"
    streamingDomain:
      name: kafka1-domain
      bootstrapServerId: internal-listener

Here the switchover-route points at kafka1-domain. Every client using this route is talking to cluster 1. To move them all to cluster 2, you change the route's streaming domain reference to kafka2-domain and the bootstrap server ID to match. That edit is the failover, the migration, and the upgrade switch all in one. The clients never change.
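To make the repoint concrete, here is a minimal Python sketch that performs the same edit on the configuration held as plain dictionaries. It is an illustration only: the dictionary mirrors the YAML shown above, the function name repoint_route is hypothetical, and loading the real file or applying the result to a running gateway is out of scope.

```python
import copy

# In-memory mirror of the YAML above (assumption: the real file is loaded elsewhere).
CONFIG = {
    "streamingDomains": [
        {"name": "kafka1-domain", "type": "kafka",
         "kafkaCluster": {"name": "kafka-cluster-1", "bootstrapServers": [
             {"id": "internal-listener", "endpoint": "kafka-1:44444"}]}},
        {"name": "kafka2-domain", "type": "kafka",
         "kafkaCluster": {"name": "kafka-cluster-2", "bootstrapServers": [
             {"id": "internal-listener", "endpoint": "kafka-2:22222"}]}},
    ],
    "routes": [
        {"name": "switchover-route", "endpoint": "gateway.internal:19092",
         "streamingDomain": {"name": "kafka1-domain",
                             "bootstrapServerId": "internal-listener"}},
    ],
}


def repoint_route(config, route_name, domain_name, bootstrap_server_id):
    """Return a copy of config with one route bound to another streaming domain."""
    new_config = copy.deepcopy(config)
    domains = {d["name"]: d for d in new_config["streamingDomains"]}
    if domain_name not in domains:
        raise ValueError(f"unknown streaming domain: {domain_name}")
    servers = domains[domain_name]["kafkaCluster"]["bootstrapServers"]
    if bootstrap_server_id not in {s["id"] for s in servers}:
        raise ValueError(f"{domain_name} has no bootstrap server {bootstrap_server_id}")
    for route in new_config["routes"]:
        if route["name"] == route_name:
            # Only the binding changes; the client-facing endpoint is untouched.
            route["streamingDomain"] = {"name": domain_name,
                                        "bootstrapServerId": bootstrap_server_id}
            return new_config
    raise ValueError(f"unknown route: {route_name}")


switched = repoint_route(CONFIG, "switchover-route", "kafka2-domain", "internal-listener")
print(switched["routes"][0]["streamingDomain"]["name"])  # kafka2-domain
print(switched["routes"][0]["endpoint"])                 # gateway.internal:19092
```

Validating the target domain and bootstrap server ID before touching the route is a design choice in this sketch: a typo should fail loudly rather than produce a route that points at nothing.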

The route also controls how individual brokers get virtualized, through its brokerIdentificationStrategy. The default port strategy gives each backend broker its own port on the gateway: clients reach broker 0 on one port, broker 1 on the next, and so on. The gateway uses the streaming domain's node ID ranges to set this mapping up, which is why those ranges matter. When the gateway rewrites a metadata response, it replaces the real broker addresses with these virtual ports, so the client's picture of the cluster is entirely gateway-owned.
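As a rough illustration of the port strategy, the sketch below assigns one gateway port to each backend node ID. The exact port arithmetic the gateway uses is not described in this post, so the layout here (consecutive ports starting right after the route's bootstrap port) is an assumption, as are the function name and the (start, end) tuple form for node ID ranges.

```python
def virtual_ports(bootstrap_port, node_id_ranges):
    """Map each backend node ID to a gateway port.

    Assumption: broker ports are allocated consecutively after the bootstrap
    port, in ascending node ID order. node_id_ranges is a list of inclusive
    (start, end) pairs.
    """
    node_ids = sorted({n for start, end in node_id_ranges
                       for n in range(start, end + 1)})
    return {node_id: bootstrap_port + 1 + i for i, node_id in enumerate(node_ids)}


ports = virtual_ports(19092, [(0, 2)])
print(ports)  # {0: 19093, 1: 19094, 2: 19095}
```

Whatever the real allocation looks like, the sketch shows why the ranges have to be declared up front: the set of ports to open is derived from the set of node IDs, before any client has connected.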

Trial mode allows up to four routes with no license key. Going beyond four routes requires an enterprise license applied through the GATEWAY_LICENSES configuration. For a two-cluster disaster recovery setup, four routes is usually plenty.

Deployment: where it runs and who owns it

The first practical question is where the gateway runs and who operates it. It runs wherever you put it, and you manage it. Confluent Cloud Gateway supports deployment on-premises, in private cloud VPCs, or in hybrid environments. It is a self-managed solution that gives you full control over deployment, configuration, and operations. That control is the trade: Confluent does not run this for you the way it runs your Kafka clusters in Confluent Cloud.

The gateway runs as a stateless service, typically as a set of containers. You deploy it with Confluent for Kubernetes using a Gateway custom resource, or with Docker using the confluentinc/confluent-gateway-for-cloud image. Because it is stateless, you scale it horizontally by running more instances behind a load balancer, and any instance can serve any request.

Where the gateway runs has a direct consequence for disaster recovery: the gateway itself must be at least as resilient as the thing it is protecting. If your gateway runs in the same region that just failed, it went down with the cluster, and your clients now cannot reach the proxy that was supposed to redirect them. A gateway meant to survive a regional outage has to live somewhere that outlives the region, with its own redundancy. The gateway protects your clients from cluster failure; nothing protects your clients from gateway failure except how you deploy the gateway.

The request lifecycle, end to end

Follow a producer's send() from client to broker and back. The path makes the rewriting behavior concrete.

The client bootstraps against the route endpoint and asks for metadata. The gateway forwards that to the active cluster and gets back the real broker list. Before passing it on, the gateway rewrites it: every real broker address becomes a virtual endpoint the gateway owns. The client receives this virtualized map and believes it is the cluster topology. From then on, every produce and fetch goes to a virtual broker port on the gateway, which proxies it to the real broker behind the scenes. At no point does the client hold a real backend address. That is precisely why the backend can be swapped underneath it.
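The rewriting step can be pictured with a small Python sketch. It is a simplified model, not the gateway's implementation: the Broker type, the function name, the backend host names, and the port table are all hypothetical, and keeping node IDs unchanged is an assumption made for the illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Broker:
    node_id: int
    host: str
    port: int


def rewrite_metadata(brokers, gateway_host, port_for_node):
    """Swap every real broker address for a gateway-owned virtual endpoint.

    Node IDs are left as-is in this sketch (an assumption); only host and
    port are replaced.
    """
    return [replace(b, host=gateway_host, port=port_for_node[b.node_id])
            for b in brokers]


# What the backend cluster reports (hypothetical addresses).
real = [Broker(0, "kafka-1-b0.internal", 9092),
        Broker(1, "kafka-1-b1.internal", 9092)]

# What the client is handed instead.
virtual = rewrite_metadata(real, "gateway.internal", {0: 19093, 1: 19094})
for b in virtual:
    print(b.node_id, f"{b.host}:{b.port}")
# 0 gateway.internal:19093
# 1 gateway.internal:19094
```

No backend host name survives into the rewritten list, which is the property that lets the cluster behind the route change without the client's view becoming stale.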

Failover: what "automatic" does and does not mean

This is the distinction that matters most, so it is worth being exact. The gateway automatically routes traffic according to whichever streaming domain the route currently points to. It does not automatically decide to fail over. Those are different claims, and the difference is the whole story.

The routing is automatic: every request on a route is forwarded to that route's streaming domain with no per-request intervention. The decision to switch a route from the active streaming domain to the passive one is an operator action. Something or someone has to detect that the active cluster is unhealthy and trigger the switch by repointing the route. The gateway does not health-check your cluster and flip the route on its own.

Mechanically, triggering a switchover means updating the route to reference the passive streaming domain and its bootstrap server ID. Client switchover requires a restart of the gateway service. That restart severs existing connections, which forces clients to re-bootstrap, and consumer groups can rebalance as a result. If the restart takes longer than session.timeout.ms, which defaults to 45 seconds in Java clients since Kafka 3.0, the rebalance is unavoidable. So the clients themselves do not restart, but they do re-bootstrap and reconnect, now landing on the passive cluster.

Compare this to a database with automatic failover, where a standby is promoted by the database's own failure detector with no external trigger. The gateway does not include that detector. The current release reduces the recovery time and removes the need to touch clients, but the failover trigger is still yours to provide, whether through monitoring scripts, manual API calls, or an orchestrator.
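Since the trigger is yours to provide, here is one hedged sketch of what a minimal external detector could look like in Python. Everything in it is an assumption made for illustration: the class and function names are hypothetical, a bare TCP connect is a weak health signal compared with a real Kafka-level check, and the threshold of three consecutive failures is arbitrary.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverTrigger:
    """Call on_failover once, after `threshold` consecutive failed probes."""

    def __init__(self, probe, on_failover, threshold=3):
        self.probe = probe
        self.on_failover = on_failover
        self.threshold = threshold
        self.failures = 0
        self.fired = False

    def tick(self):
        if self.fired:
            return  # never trigger a second failover from the same detector
        if self.probe():
            self.failures = 0  # a single success resets the count
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.fired = True
            self.on_failover()


# Simulated probe results: one blip, a recovery, then a sustained outage.
results = iter([False, True, False, False, False, False])
events = []
trigger = FailoverTrigger(lambda: next(results),
                          lambda: events.append("failover"), threshold=3)
for _ in range(6):
    trigger.tick()
print(events)  # ['failover']
```

In a real deployment the probe might be something like lambda: tcp_probe("kafka-1", 44444), called on a timer, with on_failover handing off to the orchestrator. Requiring consecutive failures trades a slower reaction for fewer false failovers, and that delay counts against your recovery time.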

Replication and the active/passive vs active/active question

The gateway moves connections. It does not move data. Replication between your two clusters is a separate concern that you set up outside the gateway, typically with Cluster Linking. This division of labor is the key to understanding both the active/passive and the active/active topology.

In an active/passive setup, one cluster takes all writes and the other receives a replicated copy. Cluster Linking does this asynchronously, preserving offsets so a consumer can resume on the passive cluster at the right position. The catch is that mirror topics created by Cluster Linking stay read-only for as long as they are mirroring, to keep their offsets faithful to the source. When the gateway switches a route to the passive cluster, producers immediately hit an error because they cannot write to a read-only mirror topic. So a route switch alone is not a complete failover. Something has to convert the mirror topics to read/write as part of the same operation, and that conversion is a Cluster Linking action, not a gateway action.

This is why a production failover needs an orchestrator wrapping the whole sequence: drop the in-flight connections, reverse or stop the cluster link so the new active cluster can take writes, promote its topics to read/write, switch the gateway route, and accept connections again. The gateway performs exactly one step in that chain. The rest is replication management you coordinate alongside it.
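The sequence above can be sketched as an ordered list of named steps that stops at the first failure. This is a skeleton under stated assumptions, not a working orchestrator: the step names follow the paragraph above, every action is a placeholder that only writes to a log, and the real calls to Cluster Linking and to the gateway are deliberately left out.

```python
def run_failover(steps):
    """Run (name, action) steps in order; stop at the first one that raises.

    Returns the list of completed step names and an error string (or None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


log = []
steps = [
    # Placeholders only: each lambda stands in for a real operation.
    ("drop_connections", lambda: log.append("drop in-flight client connections")),
    ("stop_or_reverse_link", lambda: log.append("stop or reverse the cluster link")),
    ("make_topics_writable", lambda: log.append("convert mirror topics to read/write")),
    ("switch_route", lambda: log.append("repoint the gateway route")),
    ("accept_connections", lambda: log.append("accept client connections again")),
]

completed, error = run_failover(steps)
print(completed)
print(error)  # None
```

Stopping at the first failure matters here because the steps are not independent: switching the route before the topics are writable would send producers straight into errors.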

Active/active, where both clusters take writes simultaneously, is harder and more constrained. The documented guidance is to avoid client switchover when ordering and data consistency are required, for example with Kafka Streams applications, because the switchover spans more than one cluster and those guarantees do not hold across a boundary the application does not know exists. Two clusters taking independent writes can diverge, and the gateway has no mechanism to reconcile them; it only chooses which one a client currently talks to. Active/active across regions with this gateway tends to mean partitioning workloads so a given topic has a single writer at a time, rather than true concurrent writes to the same topics on both sides. If you need genuine concurrent multi-region writes with conflict handling, that is a data-layer design problem the gateway does not solve.

A regional disaster, step by step

Put it together with the worst case: one region goes completely down, taking the active cluster with it. Assume active/passive across two regions, Cluster Linking replicating active to passive, and a gateway deployed to survive the failed region.

The moment the region fails, producers and consumers pointed at the gateway route stop getting responses, because the route still points at the now-dead active cluster. The gateway is up, the route is intact, but its target is gone. Nothing self-corrects yet, because nothing has told the gateway the cluster is dead. Meanwhile producers buffer in memory up to their limits, and the default delivery timeout of two minutes is ticking. If the switchover does not complete inside that window, those buffered records expire and you take data loss, which is the practical link between your recovery time and your recovery point.
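The link between recovery time and data loss is simple arithmetic, sketched below in Python. It is a simplified model under stated assumptions: times are seconds since the outage began, only the producer's delivery timeout is considered (buffer size limits and reconnect delays are ignored), and the function names are hypothetical.

```python
DELIVERY_TIMEOUT_S = 120  # producer default: delivery.timeout.ms = 120000


def record_survives(sent_at_s, writable_at_s, delivery_timeout_s=DELIVERY_TIMEOUT_S):
    """True if a record buffered at sent_at_s is still deliverable when the
    new cluster starts accepting writes at writable_at_s."""
    return writable_at_s - sent_at_s < delivery_timeout_s


def loss_window_s(writable_at_s, delivery_timeout_s=DELIVERY_TIMEOUT_S):
    """Seconds of production, counted from the start of the outage, whose
    buffered records expire before the switchover completes."""
    return max(0, writable_at_s - delivery_timeout_s)


# Detection 30 s + switchover 60 s: everything buffered is still deliverable.
print(loss_window_s(30 + 60))    # 0

# Detection 60 s + switchover 90 s: the first 30 s of buffered records expire.
print(loss_window_s(60 + 90))    # 30
print(record_survives(10, 150))  # False
print(record_survives(40, 150))  # True
```

Detection time and switchover time draw on the same two-minute budget, so a cautious detector and a slow orchestrator can each be acceptable alone and still cause loss together.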

Recovery is the orchestrated sequence. Your monitoring detects the outage and triggers the failover. The orchestrator converts the passive cluster's mirror topics to read/write so it can accept writes. With the source region unreachable, that is Cluster Linking's failover operation rather than a graceful promote, which expects to reach the source. Reversing replication so the old passive becomes the new source generally has to wait until the failed region is reachable again. The orchestrator then repoints the gateway route to the passive streaming domain and restarts the gateway service. Clients re-bootstrap against the same endpoint they always used, reconnect, rebalance their consumer groups, and resume, now producing to and consuming from what was the passive cluster. From the application's point of view, nothing about its configuration changed. From the platform's point of view, a coordinated multi-step recovery just happened, and the gateway handled one step of it.

A newer capability, intelligent fencing, gives finer control over this window. Fencing lets you deliberately block client traffic on a route during the switch and unblock it once the backend is ready, so clients are held rather than failing chaotically while the promotion and repoint happen. It tightens the choreography; it does not remove the need for the choreography.

Summary

Confluent Cloud Gateway solves one specific, stubborn problem: Kafka clients are welded to their bootstrap endpoints, and the gateway unwelds them by owning every address the client ever sees. Routes bound to streaming domains let you change the backend cluster without changing a single client, which is why migrations, upgrades, and disaster recovery all reduce to the same move, repointing a route.

The insight worth carrying away is where the boundary of the gateway's responsibility sits. It routes connections automatically along the current path, but it does not detect failure or decide to fail over, and it does not replicate or reconcile data. Those belong to your monitoring, your orchestrator, and Cluster Linking. In an active/passive regional disaster, the gateway performs exactly one step of a multi-step recovery, and the seamlessness clients experience is the product of that whole sequence working together, not of the gateway acting alone. Read against a database's self-promoting failover, the gateway is the mechanism that makes a triggered switchover invisible to clients, not the trigger itself. A fully managed switchover that closes that last gap is on Confluent's roadmap but is not what ships today.

References

Confluent Documentation: Deploy and Manage Confluent Cloud Gateway. Overview of the gateway, routes, and streaming domains.

Confluent Documentation: Confluent Cloud Gateway Deployment Process (Docker). Configuration reference, route and streaming domain fields, license modes.

Confluent Documentation: Confluent Cloud Gateway Migration Process. Client switchover mechanics, gateway restart and rebalance behavior, transaction handling.

Confluent Documentation: Confluent Cloud Release Notes. General availability and the 1.1.0 feature set.

Confluent Blog: Disaster Recovery in 60 Seconds: A POC for Seamless Client Failover on Confluent Cloud. Manual trigger requirement, read-only mirror topic constraint, orchestration workflow, roadmap note.

Confluent Blog: Introducing Confluent Platform 8.2. Intelligent fencing and unfencing, expanded non-Java client support.

Confluent Blog: Kafka Client Migrations With KCP and Confluent Cloud Gateway. Atomic route switch, fenced and switchover configuration, offset syncing.
