Background
Amazon Aurora Global Database is a multi-region deployment of Aurora MySQL or Aurora PostgreSQL.
With Aurora Global Database, you deploy a primary cluster in a region of your choice and add up to 5 additional regions as “secondary clusters.” The secondary clusters can serve reads, can forward writes to the primary cluster, and can be used as planned or unplanned failover candidates.
We recently had an inquiry for a MySQL cluster multi-region deployment. The customer looked into Aurora Global Database and decided that it would not work for their needs, because “failover is manual, and there is no way to fail back.” What does this mean?
Unplanned Failover in Aurora Global Database
During a regional outage, these are the manual steps to fail over Aurora Global Database:
- Stop applications from writing.
- Identify the failover candidate region. If more than 2 regions are deployed, choose the cluster with the least replication lag.
- Detach the target cluster from Aurora Global Database. This cluster is now an independent, writable cluster.
- Reroute applications to write to the newly promoted cluster.
- Add another region to the cluster, essentially building Aurora Global Database from scratch.
- Add remaining regions.
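The candidate-selection step above can be sketched in a few lines of Python. This is only an illustration: the function name, region identifiers, and lag readings are hypothetical, and in practice the readings would have to be collected per region (for example from a replication-lag metric such as CloudWatch's AuroraGlobalDBReplicationLag) before a comparison like this is possible.

```python
from typing import Mapping


def pick_failover_candidate(lag_ms_by_region: Mapping[str, float], failed_region: str) -> str:
    """Return the surviving secondary region with the smallest replication lag."""
    # Ignore the failed region; it cannot be promoted.
    candidates = {r: lag for r, lag in lag_ms_by_region.items() if r != failed_region}
    if not candidates:
        raise ValueError("no surviving secondary region to promote")
    return min(candidates, key=lambda region: candidates[region])


# Hypothetical lag readings in milliseconds, for illustration only.
readings = {"us-west-2": 180.0, "eu-west-1": 420.0}
candidate = pick_failover_candidate(readings, failed_region="us-east-1")
print(candidate)  # us-west-2
```

The selection itself is trivial; what takes time during an outage is gathering the readings by hand and then carrying out the remaining steps.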
How Much Time for an Outage?
The customer had an SLA of 15 minutes for any regional outage, and concluded they could not do the above steps within 15 minutes. Let’s suppose we have 3 regions: US EAST, US WEST, and EUROPE WEST, with US EAST as the Primary. If the failed region is US EAST, we would need to do the following:
- Stop application writes (at this point applications probably cannot write and are reporting errors).
- Inspect US WEST and EUROPE WEST to see which cluster has less replication lag. This requires switching back and forth between regions in the console.
- Let’s suppose we choose US WEST. We detach it. Now US WEST is an independent cluster, and EUROPE WEST is outdated, which means we now have to configure any applications reading from EUROPE WEST to read from US WEST.
- Manually reroute all applications in all regions. This could mean updating the connection strings on all connection pools, or reconfiguring RDS proxy to point to the new primary cluster.
- Rebuild Aurora Global Database – Add cluster in EUROPE WEST (the old cluster there is stale).
- Rebuild Aurora US EAST when it’s back online.
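To make the knock-on effects of the detach step concrete, here is a small Python sketch that derives the follow-up work from the scenario above. The `RecoveryPlan` type and `plan_manual_recovery` function are hypothetical names invented for this illustration; they model the reasoning in the list, not any AWS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryPlan:
    new_primary: str
    reroute_readers_from: List[str]  # regions whose readers must move to the new primary
    rebuild_now: List[str]           # stale secondaries that must be re-added
    rebuild_when_back: List[str]     # failed region, rebuilt once it is online again


def plan_manual_recovery(regions: List[str], failed: str, promoted: str) -> RecoveryPlan:
    """Work out what must be rerouted and rebuilt after detaching `promoted`."""
    if failed not in regions or promoted not in regions or failed == promoted:
        raise ValueError("failed and promoted must be two different known regions")
    # Every surviving region other than the promoted one is now stale.
    stale = [r for r in regions if r not in (failed, promoted)]
    return RecoveryPlan(
        new_primary=promoted,
        reroute_readers_from=stale,
        rebuild_now=stale,
        rebuild_when_back=[failed],
    )


plan = plan_manual_recovery(
    ["US EAST", "US WEST", "EUROPE WEST"], failed="US EAST", promoted="US WEST"
)
print(plan.rebuild_now)        # ['EUROPE WEST']
print(plan.rebuild_when_back)  # ['US EAST']
```

Note that the region that never failed (EUROPE WEST) still ends up on both the reroute and rebuild lists.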
This failover process cannot easily be performed in 15 minutes, and applications report errors for the duration of the outage. There is also no guarantee that the steps can be completed at all: during a regional outage, the AWS console will probably be overloaded, since millions of customers will be attempting to reconfigure their services at the same time. Finally, failover means rebuilding Aurora Global Database from scratch, so in the above scenario users in EUROPE WEST are penalized even though the outage is not happening there.
How Does Tungsten Cluster Handle Regional Outages?
Topology 1: Composite Active/Passive
Composite Active/Passive clusters can be deployed in multiple regions, in AWS, another cloud provider, on premises, or a mix of any of those (yes, you can deploy one region in AWS and another in GCP and simply switch back and forth between them). There is a single active cluster and one or more passive clusters. Here’s how to fail over in the event of a regional outage:
Using Tungsten CLI:
- Type:
Failover
- Done
Using Tungsten Dashboard GUI:
- Click:
Failover
- Done
Tungsten Cluster does all of the other steps for you, while keeping your applications online, in just a few seconds. It automatically selects the region with the least lag, pauses application traffic (using the included Tungsten Connector), promotes the target region to primary, unpauses application traffic while sending it to the new region, and configures the other passive regions to replicate from the new primary region. The unaffected regions are not penalized, applications stay online, and failback is possible when the failed region comes back online. There is also no dependency on an overloaded management console to perform failover.
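The sequence described above can be modelled as a toy Python simulation. To be clear, this is not Tungsten Cluster's implementation or API: `ToyConnector` and `automated_failover` are hypothetical names, and the lag values are made up. The sketch only shows the order of operations and why the remaining passive regions do not need to be rebuilt.

```python
from typing import Dict, Tuple


class ToyConnector:
    """Hypothetical stand-in for a proxy that can hold and redirect traffic."""

    def __init__(self, write_target: str) -> None:
        self.write_target = write_target
        self.paused = False


def automated_failover(
    lag_seconds: Dict[str, float], failed: str, connector: ToyConnector
) -> Tuple[str, Dict[str, str]]:
    """Promote the least-lagged survivor and re-point the other passive regions."""
    survivors = {r: lag for r, lag in lag_seconds.items() if r != failed}
    if not survivors:
        raise ValueError("no passive region available to promote")
    target = min(survivors, key=lambda r: survivors[r])  # 1. select least-lagged region
    connector.paused = True                               # 2. pause application traffic
    connector.write_target = target                       # 3. promote target, redirect writes
    connector.paused = False                              # 4. resume traffic on new primary
    # 5. The remaining passive regions replicate from the new primary (no rebuild).
    replication_source = {r: target for r in survivors if r != target}
    return target, replication_source


# Made-up lag values in seconds, for illustration only.
connector = ToyConnector(write_target="US EAST")
new_primary, sources = automated_failover(
    {"US EAST": 0.0, "US WEST": 0.2, "EUROPE WEST": 0.9}, failed="US EAST", connector=connector
)
print(new_primary, sources)  # US WEST {'EUROPE WEST': 'US WEST'}
```

Because applications talk to the connector rather than to a specific cluster, no connection strings change in this model; only the connector's write target does.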
Topology 2: Dynamic Active/Active
Dynamic Active/Active clusters are also deployed in multiple regions (clouds, on-premises, etc.). All clusters are technically writable, but the Tungsten Connector routes writes to just one region. This topology provides the benefits of multiple active clusters without the risk of conflicts caused by multiple Primaries (although using multiple Primaries is possible and fully supported by Tungsten Cluster). Here’s how to perform a failover in Dynamic Active/Active:
Do Nothing
Correct, there’s no user action required, because the Tungsten Connector routes queries to another online region in the event of an entire cluster failure. When the failed region comes back online, replication replays the missed transactions to bring the cluster back in sync, and traffic can once again be routed to it, providing seamless failback.
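The routing behaviour can be illustrated with a minimal Python sketch. It assumes a simple ordered preference list of regions, which is an assumption made for this example rather than a description of how the Tungsten Connector is actually configured; `route_write` is a hypothetical name.

```python
from typing import Iterable, Sequence


def route_write(preferred_order: Sequence[str], online: Iterable[str]) -> str:
    """Send writes to the first region in the preference list that is online."""
    online_set = set(online)
    for region in preferred_order:
        if region in online_set:
            return region
    raise RuntimeError("no region is online")


order = ["US EAST", "US WEST", "EUROPE WEST"]

# Normal operation: all regions online, writes go to the preferred region.
print(route_write(order, ["US EAST", "US WEST", "EUROPE WEST"]))  # US EAST

# US EAST fails: writes move to the next online region without operator action.
print(route_write(order, ["US WEST", "EUROPE WEST"]))  # US WEST

# US EAST returns and has caught up: traffic can route back to it (failback).
print(route_write(order, ["US EAST", "US WEST", "EUROPE WEST"]))  # US EAST
```

In this model, failover and failback are the same routing decision evaluated against a different set of online regions.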
Keepin’ It Global
Tungsten Cluster can easily help satisfy the 15-minute SLA, and it lets admins work on other critical items during an outage instead of rebuilding clusters. In addition, every Tungsten Cluster installation comes with 24/7 support staffed by engineers with at least 20 years of database and systems admin experience. Managed databases do make certain things easy, like ingesting data (getting data out is another story, and another blog), but this convenience comes at a high cost, with less flexibility and potential downtime. Ask for a demo and let us show you how easy it is to deploy MySQL at geo scale.