Architecture: Amazon Aurora

Codelooru mythos

You have a production MySQL database on RDS. It works fine at low traffic. Then your load doubles, and you start running into the ceiling: replication lag, failover times measured in minutes, and the uncomfortable knowledge that your data lives on a single EBS volume. You can throw more compute at it, but the storage layer stays the same. The database is the bottleneck, and there is no obvious way out short of a full re-architecture.

Amazon Aurora was built specifically for this situation. It is a relational database engine, wire-compatible with MySQL and PostgreSQL, that decouples storage from compute in a way that fundamentally changes what a managed relational database can do. This post walks through how Aurora is actually architected: what the pieces are, why they were designed that way, and what happens from the moment a write lands to the moment a reader can see it.


The big picture

The central idea in Aurora is straightforward but consequential: compute and storage are completely separate, and storage is a distributed service, not a volume attached to an instance.

In traditional RDS, each instance has its own EBS volume. Replication works by shipping binary log events from the primary to replicas, which replay them on their own separate storage. Every write gets done multiple times, replication lag is real and visible, and failover involves promoting a replica that might be seconds behind.

Aurora replaces that model entirely. There is one storage layer, shared across all compute instances. The writer sends only redo log records to that storage layer, not full data pages. The storage layer, spread across six nodes in three Availability Zones, applies those log records itself and serves pages back on demand. Readers never fall behind in the traditional sense because they all read from the same underlying data.

Figure: Aurora architecture, the big picture. The compute layer (one writer, up to 15 read replicas) sends log records only, with no dirty pages shipped, to a shared distributed storage volume of six storage nodes across three AZs. Compute and storage scale independently. All instances share the same storage volume.

The storage layer

Aurora's storage layer is the part that makes the rest of the design possible. It is not a managed EBS volume. It is a purpose-built distributed storage service that AWS calls the Aurora Storage Volume, internally composed of many small chunks called Protection Groups.

Each Protection Group is 10 GB of data, replicated across six storage nodes: two in each of three Availability Zones. Those nodes are individual storage microservices backed by local SSDs. The full database volume is a sequence of Protection Groups that grow on demand up to 128 TiB without any downtime or provisioning step.
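The sizing arithmetic above can be made concrete with a small sketch. The function names are made up for illustration, and treating "10 GB" as 10 GiB is an assumption of this example rather than something the post states.

```python
import math

SEGMENT_GIB = 10              # data held by one Protection Group (assumed GiB)
COPIES_PER_GROUP = 6          # two copies in each of three AZs
MAX_VOLUME_GIB = 128 * 1024   # 128 TiB volume ceiling


def protection_groups_needed(data_gib: float) -> int:
    """Number of Protection Groups needed to hold data_gib of data."""
    if not 0 <= data_gib <= MAX_VOLUME_GIB:
        raise ValueError("volume size outside the 0 to 128 TiB range")
    return math.ceil(data_gib / SEGMENT_GIB)


def segment_copies(data_gib: float) -> int:
    """Total segment copies spread across storage nodes for that volume."""
    return protection_groups_needed(data_gib) * COPIES_PER_GROUP
```

A 25 GiB database would need three Protection Groups and therefore eighteen segment copies. The volume grows by adding groups, not by resizing a disk.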

What gets sent to storage

A traditional database engine has to push dirty pages, the actual modified database blocks, down to its storage alongside the log. Aurora's writer ships only redo log records, which describe the change that was made rather than the resulting state. The storage nodes receive these records and apply them to their local page copies themselves, using a component Aurora calls the log applicator.

A single write transaction that modifies a few rows might touch several pages, but the redo log records describing those modifications are orders of magnitude smaller than the pages themselves. Aurora's own design papers noted this reduced write network I/O by roughly an order of magnitude compared to mirrored MySQL.
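To illustrate what "applying log records to a local page copy" means, here is a minimal sketch. The record layout (`lsn`, `page_id`, `offset`, `data`) and the 64-byte page are assumptions chosen for the example; Aurora's real record format is not described in this post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RedoRecord:
    lsn: int        # log sequence number, increasing over time
    page_id: int    # page the change belongs to
    offset: int     # byte offset inside the page
    data: bytes     # new bytes to write at that offset


@dataclass
class Page:
    page_lsn: int = 0  # LSN of the last record applied to this page
    body: bytearray = field(default_factory=lambda: bytearray(64))


def apply_records(pages: dict[int, Page], records: list[RedoRecord]) -> None:
    """Apply redo records in LSN order, skipping ones already applied."""
    for rec in sorted(records, key=lambda r: r.lsn):
        page = pages.setdefault(rec.page_id, Page())
        if rec.lsn <= page.page_lsn:
            continue  # already reflected in the page, so replay is harmless
        page.body[rec.offset:rec.offset + len(rec.data)] = rec.data
        page.page_lsn = rec.lsn
```

The record carries only the changed bytes and where they go, which is why it is so much smaller than the page it modifies.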

Quorum writes and reads

Aurora uses a quorum model for durability. With six storage nodes, it requires 4 of 6 acknowledgments before confirming a write to the caller, and 3 of 6 for a consistent read. This is designed so that an entire AZ going dark does not affect write availability: the two nodes in the failed AZ are two votes of six, leaving four to meet the write quorum. Aurora can also tolerate one additional node failure on top of a full AZ outage without losing reads, because three nodes across the remaining two AZs still meet the read quorum.
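The quorum numbers can be checked with a few lines of arithmetic. This is a sketch of the general quorum rules (read and write sets must overlap, and any two write sets must overlap) applied to the 6/4/3 figures above; the function names are invented for the example.

```python
TOTAL_NODES = 6
WRITE_QUORUM = 4
READ_QUORUM = 3


def quorum_is_consistent(total: int, write_q: int, read_q: int) -> bool:
    """True if every read set overlaps every write set, and writes overlap each other."""
    return read_q + write_q > total and 2 * write_q > total


def availability(failed_nodes: int) -> dict[str, bool]:
    """Which operations still reach quorum with the given number of nodes down."""
    alive = TOTAL_NODES - failed_nodes
    return {"write": alive >= WRITE_QUORUM, "read": alive >= READ_QUORUM}
```

With two nodes down (one AZ) both writes and reads still reach quorum. With three down (one AZ plus one more node) reads survive but writes do not.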

Figure: Aurora quorum writes across six storage nodes in three AZs. The writer instance sends log records to all six nodes; nodes 1 to 4 have confirmed and nodes 5 and 6 are still pending, so the write is acknowledged with 4 of 6 confirmed. Write quorum: 4 of 6. Read quorum: 3 of 6. One whole AZ can fail without data loss.

Storage self-healing

Storage nodes are not passive. Each node continuously gossips with the others in its Protection Group, comparing which log records each has applied. When a node falls behind due to a transient failure or a slow disk, it identifies the gap and peers with a healthy node to fill it. This happens autonomously, without any involvement from the compute layer. The storage layer heals itself.

This is why Aurora can survive node failures without triggering a database failover. The compute tier does not notice a short storage node outage; it simply waits for four acknowledgments and gets them from the five remaining nodes.
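The gap-filling idea can be sketched in a few lines. This is a toy model, not Aurora's protocol: each node's log is represented as a dictionary from a log sequence number to record bytes, and both function names are hypothetical.

```python
def find_gaps(local_log: dict[int, bytes], peer_log: dict[int, bytes]) -> list[int]:
    """Sequence numbers the peer holds that the local node is missing."""
    return sorted(peer_log.keys() - local_log.keys())


def fill_from_peer(local_log: dict[int, bytes], peer_log: dict[int, bytes]) -> list[int]:
    """Copy missing records from a healthy peer; return what was fetched."""
    gaps = find_gaps(local_log, peer_log)
    for lsn in gaps:
        local_log[lsn] = peer_log[lsn]
    return gaps
```

The key property is that the comparison and the copy happen entirely between storage nodes. Nothing in this exchange involves the writer or the readers.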


The compute layer

Compute in Aurora means database instances: the things that run the MySQL or PostgreSQL engine and accept SQL connections. There are two kinds.

The writer instance

Every Aurora cluster has exactly one writer instance. All DDL and DML goes through it. The writer runs the database engine, maintains a buffer cache of recently used pages, and generates the redo log records that get sent to the storage layer.

When the writer commits a transaction, it sends the log records to all six storage nodes simultaneously and waits for four acknowledgments. Once it has them, the commit is durable and the client is notified. The writer does not wait for log records to be applied to pages; that happens asynchronously inside the storage nodes after the acknowledgment is returned. This is one of the reasons Aurora's write latency is low relative to what you might expect from a six-way replicated system.
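The "send to six, wait for four" pattern can be sketched with `asyncio`. Storage nodes are simulated here with sleeps, and `store_on_node` and `quorum_commit` are invented names, so treat this as an illustration of the waiting logic only.

```python
import asyncio


async def store_on_node(node_id: int, record: bytes, delay: float) -> int:
    """Simulated storage node: persists the record after `delay` seconds."""
    await asyncio.sleep(delay)
    return node_id


async def quorum_commit(record: bytes, delays: list[float], quorum: int = 4) -> list[int]:
    """Send the record to every node in parallel; return once `quorum` have acked."""
    tasks = [
        asyncio.create_task(store_on_node(node_id, record, delay))
        for node_id, delay in enumerate(delays)
    ]
    acked: list[int] = []
    for finished in asyncio.as_completed(tasks):
        acked.append(await finished)
        if len(acked) >= quorum:
            break
    # In a real system the slow nodes would still receive the record later;
    # they are cancelled here only so that the demo exits cleanly.
    for task in tasks:
        task.cancel()
    return acked


if __name__ == "__main__":
    print(asyncio.run(quorum_commit(b"redo", [0.01, 0.02, 0.03, 0.04, 0.5, 0.5])))
```

The commit latency is set by the fourth-fastest node, not the slowest one, which is why two slow or unreachable nodes do not hold up the client.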

Reader instances

A cluster can have up to fifteen reader instances. They also run the full database engine and maintain their own buffer caches, but they do not generate redo log records. Instead, they read pages directly from the shared storage layer on cache misses.

When the writer commits a transaction, it sends cache invalidation signals to all reader instances, telling them which pages are now stale. A reader that gets a cache miss on an invalidated page fetches the current version from storage. Because all compute instances share the same storage, replica lag is not a replication delay in the traditional sense; it is the time it takes for a reader to receive an invalidation signal and evict the stale page from its cache. In practice this is typically under 20 milliseconds.
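The invalidation flow described above can be modeled with a toy reader cache. The classes below are hypothetical stand-ins: `SharedStorage` is a plain dictionary, and invalidation is a direct method call rather than a network message.

```python
class SharedStorage:
    """Stand-in for the shared storage volume: page_id -> current page value."""

    def __init__(self) -> None:
        self.pages: dict[int, str] = {}

    def write(self, page_id: int, value: str) -> None:
        self.pages[page_id] = value

    def read(self, page_id: int) -> str:
        return self.pages[page_id]


class Reader:
    """Reader instance with its own buffer cache in front of shared storage."""

    def __init__(self, storage: SharedStorage) -> None:
        self.storage = storage
        self.cache: dict[int, str] = {}
        self.storage_reads = 0

    def invalidate(self, page_ids: list[int]) -> None:
        for page_id in page_ids:
            self.cache.pop(page_id, None)  # evict if cached, ignore otherwise

    def get(self, page_id: int) -> str:
        if page_id not in self.cache:      # cache miss: go to shared storage
            self.cache[page_id] = self.storage.read(page_id)
            self.storage_reads += 1
        return self.cache[page_id]
```

Between a write and the arrival of the invalidation, `get` still returns the old cached value. That window is what the post means by replica lag.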

Figure: Aurora write path and reader cache invalidation. The client application sends SQL to the writer instance, which generates redo log records and ships them to the six storage nodes. The nodes acknowledge (4 of 6) and apply the log to pages. The writer sends cache invalidations to the readers' buffer caches, and readers fetch pages from storage on a cache miss. The writer ships only redo log records. Storage nodes reconstruct pages. Readers are typically stale by no more than about 20 ms.

Endpoints and routing

Aurora exposes several DNS endpoints so that applications can connect to the right instance for their workload without needing to know the cluster topology.

The cluster endpoint always resolves to the current writer instance. In a failover, Aurora updates its DNS record to point at the new writer. Applications using the cluster endpoint for writes need no reconfiguration; they just reconnect.

The reader endpoint load-balances connections across all healthy reader instances. This is the right target for read-heavy workloads: reporting queries, analytics, and read-only traffic from specific services. As you add or remove readers, the endpoint adjusts automatically.

Aurora also supports custom endpoints, which let you define a named endpoint routing to a specific subset of instances. This is useful when you want to direct different workload types to instances with different sizes, or when you want to pin a particular application to a specific reader.

Finally, each instance has its own instance endpoint. These are rarely used in application code directly, but are useful for diagnostic queries and maintenance operations where you need to reach a specific instance.
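On the application side, using these endpoints usually comes down to choosing a hostname per statement or per connection pool. The sketch below shows one simple way to do that. The hostnames are placeholders, and the routing rule is a deliberately crude assumption, not something Aurora provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    writer: str  # cluster endpoint, always resolves to the current writer
    reader: str  # reader endpoint, spreads connections across readers


# Placeholder hostnames for illustration only.
ENDPOINTS = ClusterEndpoints(
    writer="demo.cluster-example.us-east-1.rds.amazonaws.com",
    reader="demo.cluster-ro-example.us-east-1.rds.amazonaws.com",
)

READ_ONLY_VERBS = {"SELECT", "SHOW"}


def endpoint_for(sql: str, endpoints: ClusterEndpoints) -> str:
    """Pick the reader endpoint for plain reads and the writer for everything else."""
    normalized = " ".join(sql.upper().split())
    verb = normalized.split(" ", 1)[0] if normalized else ""
    locks_rows = "FOR UPDATE" in normalized or "FOR SHARE" in normalized
    if verb in READ_ONLY_VERBS and not locks_rows:
        return endpoints.reader
    return endpoints.writer
```

Anything ambiguous goes to the writer, which is the safe default. Reads that must see a transaction's own uncommitted writes also belong on the writer connection, so real applications usually route per transaction or per pool rather than per statement.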


Failover

Failover in Aurora is fundamentally different from failover in traditional RDS, and the difference comes directly from the shared storage design.

In standard MySQL RDS, a failover means: the primary goes down, a replica is promoted, that replica is slightly behind, and the new primary has to run crash recovery before accepting connections. This process typically takes 60 to 120 seconds, sometimes longer.

In Aurora, because all compute instances share the same storage and readers are never far behind, failover is simpler. Aurora detects that the writer is unresponsive, selects an existing reader to promote, acquires the write lock at the storage layer, and updates the cluster endpoint DNS record. The promoted instance does not need to replay a binary log or catch up from a different volume; it already reads from the same storage. With at least one reader present, failover typically completes in 10 to 15 seconds.

There are no data loss scenarios during a normal failover. The storage layer has already acknowledged every committed transaction across four of six nodes. The promoted reader has access to exactly that data.
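Because a failover drops existing connections, the application needs some form of reconnect logic. The following is a minimal retry sketch with exponential backoff. `connect` and `operation` are caller-supplied callables, and the retry counts and delays are arbitrary example values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    connect: Callable[[], object],
    operation: Callable[[object], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Open a fresh connection and run `operation`, retrying on ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            conn = connect()  # re-resolve the endpoint on every attempt
            return operation(conn)
        except ConnectionError:
            if attempt == attempts - 1:
                raise         # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Retrying is only safe when the operation is idempotent or was never committed. A connection that drops mid-commit leaves the outcome unknown, so the application should check before repeating a write.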


Aurora Serverless

Aurora Serverless v2 is a capacity mode, not a different product. The underlying storage layer is identical. What changes is how compute is managed.

In a standard Aurora cluster, you choose an instance size and it runs continuously. In Serverless v2, the cluster automatically adjusts its compute capacity in fine-grained increments measured in Aurora Capacity Units (ACUs). Each ACU is approximately 2 GB of memory with corresponding CPU. You set a minimum and maximum ACU range, and Aurora scales within that range in response to load, scaling up in seconds when traffic spikes and scaling down during quiet periods.
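As a rough illustration of the ACU arithmetic, the sketch below clamps a desired capacity to a configured range and converts it to approximate memory. The half-ACU step size is an assumption of this example, and the function names are made up.

```python
GIB_PER_ACU = 2   # approximate memory per ACU, per the description above
ACU_STEP = 0.5    # assumed scaling granularity for this sketch


def clamp_acu(desired: float, min_acu: float, max_acu: float) -> float:
    """Round the desired capacity to the step size and keep it inside the range."""
    if min_acu > max_acu:
        raise ValueError("min_acu must not exceed max_acu")
    stepped = round(desired / ACU_STEP) * ACU_STEP
    return min(max(stepped, min_acu), max_acu)


def approx_memory_gib(acu: float) -> float:
    """Approximate memory available at a given capacity."""
    return acu * GIB_PER_ACU
```

With a range of 0.5 to 8 ACUs, a demand of 20 ACUs is capped at 8, which corresponds to roughly 16 GiB of memory. Setting the maximum too low therefore caps the buffer cache as well as the CPU.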

Unlike the original Serverless v1 (now deprecated), v2 does not have a built-in cold-start problem. v1 could scale to zero, but spinning back up after inactivity could take 20 to 30 seconds. v2 launched with a minimum of 0.5 ACUs, which keeps the instance warm; newer engine versions also accept a minimum of 0 ACUs with automatic pause, which trades idle cost for a resume delay. You can mix Serverless v2 instances with provisioned instances in the same cluster: a serverless writer that tracks application load alongside a fixed-size reader for a reporting workload that benefits from predictable capacity.


Global databases

Aurora Global Database extends a single Aurora cluster across multiple AWS regions. The primary region runs a normal Aurora cluster. Secondary regions get their own compute instances, which read from a local copy of the storage replicated from the primary.

Replication to secondary regions happens at the storage layer, not through the database engine. The primary region's storage ships redo log records to the secondary region's storage, which applies them there. Because this bypasses the database engine entirely, the replication overhead is low: Aurora targets under one second of cross-region lag.

In a regional failure, a secondary region can be promoted to primary in roughly a minute. This is an active failover you initiate; Aurora does not do it automatically. Applications need to update connection strings to point at the new primary region's cluster endpoint.


How a write moves through the system end to end

It helps to trace a single transaction from client to durable storage to make the design concrete.

An application sends an UPDATE statement to the cluster endpoint. The cluster endpoint resolves to the writer instance's IP address. The writer receives the SQL, parses it, runs the optimizer, and executes it against the pages in its buffer cache. As it modifies pages in memory, it generates redo log records describing each change.

When the transaction commits, the writer sends those log records to all six storage nodes simultaneously and waits. Four of six nodes write the records to their local SSDs and respond with an acknowledgment. The writer considers the commit durable and sends the client a success response. The writer also multicasts cache invalidation signals to all reader instances, telling them which buffer cache pages are now invalid.

A reader receives an invalidation for a page. It evicts that page from its buffer cache. The next query that touches that page misses the cache. The reader asks the storage layer for the current version. The storage node, which has since applied the redo log records to its local page copy, returns the updated page. The reader serves the query with current data.


Failure modes and fault tolerance

Storage node failure: If a storage node fails, the quorum model ensures writes and reads continue through the remaining nodes. The self-healing mechanism kicks in: when the node recovers or is replaced, it identifies missing log records by gossiping with peers and re-applies them. The compute layer sees no interruption.

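AZ failure: Losing an entire Availability Zone takes out two of six storage nodes. The write quorum of 4 of 6 is still met by the four remaining nodes, although with no slack: every surviving node has to acknowledge. The read quorum of 3 of 6 is met with room to spare. Storage self-healing will re-replicate data from the surviving nodes once the AZ recovers. The compute layer is interrupted only if the writer instance itself was running in the failed AZ, in which case the writer failover path applies.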

Writer instance failure: The compute failover scenario described earlier. With a reader already present, this takes 10 to 15 seconds. Applications need to handle reconnection; in-flight uncommitted transactions are lost, as they would be in any database failover.

What Aurora cannot protect against: A bug or corruption that propagates successfully through the quorum is durable. This is not unique to Aurora; it applies to any replicated system. Point-in-time restore, backed by continuous backups to S3, is the recovery path for data corruption. Aurora retains automated backups for up to 35 days and supports restoring to any second within that window.


Putting it all together

Figure: Complete Amazon Aurora architecture. The application connects through the endpoints: the cluster endpoint always points to the writer, and the reader endpoint load-balances across readers. In the compute layer, the writer instance handles DDL and DML and produces the redo log, while 1 to 15 readers serve SELECTs from their buffer caches with under 20 ms of lag. Log records flow down to storage and page reads flow back up. The storage layer has 6 nodes in 3 AZs (two per AZ, self-healing via gossip), a write quorum of 4/6, and auto-grows to 128 TiB. All compute instances share the same storage volume.

What Aurora does not do

Aurora is not a distributed SQL database in the CockroachDB or Spanner sense. It has a single writer. There is no horizontal write scaling; if your write throughput exceeds what one instance can handle, you need sharding, a CQRS architecture, or a different engine altogether.

Aurora also does not handle multi-master write conflicts out of the box. Aurora Multi-Master (MySQL only) did allow multiple writers, but it required application-level conflict handling, stayed niche, and was tied to the now-retired Aurora MySQL version 1. The standard single-writer model is what almost all Aurora deployments use.

Operational concerns like VPC configuration, IAM authentication, parameter groups, enhanced monitoring, and Performance Insights are real and worth understanding, but they are RDS-level concepts that apply to Aurora the same way they apply to any RDS engine. They are outside the scope of this architecture post.


Summary

Aurora's core insight is that the storage layer is where the real work of durability and replication should happen, not the compute layer. By shipping only redo log records to a distributed storage service, Aurora eliminates the traditional tradeoff between replication lag and write overhead. The storage layer handles durability through a 4-of-6 quorum, heals itself autonomously, and serves pages to any compute instance that requests them, making the distinction between writer and reader much thinner than in conventional replicated databases.

The result is a database that looks like MySQL or PostgreSQL to your application but behaves like a storage-disaggregated cloud service underneath: failovers in seconds rather than minutes, replica lag in milliseconds rather than seconds, and storage that grows automatically without a maintenance window. Independent scaling of compute and storage is not a marketing point; it is the direct consequence of the architectural decision made at the very bottom of the stack.




