Continuent Blog: Challenges with Galera Cluster, MariaDB Cluster, Percona XtraDB Cluster

As you may have heard, the core Galera development team worked for Continuent on our earlier solution, Continuent m/cluster (remarkably similar to Galera), around 2002-2006. Continuent’s CEO and Founder, Eero Teerikorpi, chose to abandon that development path in favor of the Tungsten approach, which focuses on making MySQL or MariaDB more available, scalable and performant, especially for larger deployments or clusters spread over long distances.

That decision enabled Tungsten to serve its niche - mission-critical MySQL or MariaDB applications - for a long time with great success. Customers include some of the best-known and busiest applications in Financial Services, Gaming, Telecommunications and other industries. So why is asynchronous replication so often assumed to be inferior to synchronous replication (the foundation of Galera Cluster, aka MariaDB Cluster, and its fork, Percona XtraDB Cluster)?

Perhaps it’s because of the naming: “asynchronous” points to a difference between nodes and says nothing about the performance benefits in real-world applications. On the other hand, “synchronous” seems to suggest you can have your cake and eat it too. In this blog, we’ll look at why, in practice, synchronous replication can be a weak foundation for a clustering solution.

Theory vs. Practice

Synchronous replication aspires to an attractive ideal. In a perfect world, we’d like all nodes to have exactly the same data so that the distributed nodes can act like one. As the name implies, it’s assumed that asynchronous replication runs the risk of a lack of consistency across nodes, meaning a user might read stale data. But as we know from experience, theory is always different from practice. And, as we’ll discuss later in this blog, there’s a better way to get all nodes to act as one (hint: it’s not synchronous replication).

Synchronous In-Practice

There are certain use cases for which Galera is appropriate. However, the Galera website itself documents that, in the real world, getting synchronous replication to be as performant as asynchronous replication is an area of continuous research and development. The limitations of synchronous replication often make it an impractical approach, and even with the certification method, the problems created by the synchronous approach run deep. A quote from this blog, “Geo-MySQL Reality Check: How Galera Cluster Caused Downtime,” explains:

“The speed of light is an ultimate limit for the double phase commit needed with synchronous replication… Synchronous solutions like Galera struggle with any deployment over a WAN due to a single issue: Physics. Latency kills performance.”

It is not just a WAN problem - it can also be an issue over shorter distances within a local cluster or single data center. As the Galera documentation mentions, the greater the number of nodes, the greater the delay, as the application must wait for nodes to acknowledge the write before committing a transaction. One slow node can slow the whole cluster down.
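The effect of distance and of a single slow node can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not a description of Galera internals: it assumes signals travel through fiber at roughly 200,000 km/s, that each commit costs at least one round trip to every peer, and that the commit waits for the slowest peer. The function names and the 6,000 km distance are hypothetical.

```python
from __future__ import annotations

# Assumption: light in optical fiber covers roughly 200,000 km per second.
FIBER_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Physical lower bound on one round trip over fiber, in milliseconds."""
    return 2.0 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


def sync_commit_wait_ms(round_trips_ms: list[float]) -> float:
    """Simplified model: the commit waits for the slowest peer to acknowledge."""
    return max(round_trips_ms, default=0.0)


if __name__ == "__main__":
    # Hypothetical peers: two in the same data center, one 6,000 km away.
    peers_ms = [0.5, 0.7, min_round_trip_ms(6000)]
    wait_ms = sync_commit_wait_ms(peers_ms)
    print(f"remote round trip: {min_round_trip_ms(6000):.1f} ms")
    print(f"commit wait: {wait_ms:.1f} ms")
    print(f"upper bound on serial commits per second: {1000.0 / wait_ms:.1f}")
```

In this model, the two local peers barely matter: the one distant (or slow) peer sets the commit time for every write, no matter how fast the other nodes are.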

Consider the risk of unplanned downtime, mentioned in the “Geo-MySQL Reality Check” blog:

“The risk of Galera Clusters shutting itself down. The important point here is that within a 3-node Galera cluster, you can lose one of the cluster nodes and have the cluster remain online. Great. But if you lose two (2) nodes, that single remaining node will also shut down, which means your cluster is completely offline.”

That’s not ideal for a solution whose raison d’etre is to prevent downtime…
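The arithmetic behind the quoted scenario is majority quorum. The sketch below assumes every node carries equal weight and that nodes are lost unexpectedly (a crash or a network partition) rather than shut down cleanly. The function names are illustrative and not part of any Galera API.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Strict majority rule: more than half of the members must be reachable."""
    return 2 * reachable_nodes > cluster_size


def tolerable_failures(cluster_size: int) -> int:
    """Largest number of lost nodes that still leaves a strict majority."""
    return (cluster_size - 1) // 2


if __name__ == "__main__":
    for size in (3, 5):
        for lost in range(size):
            remaining = size - lost
            state = "keeps quorum" if has_quorum(remaining, size) else "loses quorum"
            print(f"{size}-node cluster, {lost} lost: {state}")
```

With three nodes, one survivor out of three is not a majority, which is why the remaining node in the quoted scenario stops serving traffic even though it is healthy.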

That is why one global organization that ran both Galera and Tungsten Cluster is now converting all of its Galera deployments to Tungsten. “It just works.”

Asynchronous In-Practice

For a properly provisioned system, the asynchronous replication lag between nodes is a moot point. Lag becomes a relevant risk only in unusual circumstances, such as an under-resourced or misconfigured system facing an unusually high write load. But as mentioned earlier, in reality, with customers running very demanding systems, Tungsten deployments don’t see this problem, partly because the Tungsten Clustering components work together on top of the asynchronous Tungsten Replicator foundation.

Getting the Database Nodes to Act As One

Replication technology is the foundation of database clustering solutions. So while it is not the only difference between Galera Cluster and Tungsten Cluster, it’s the root of many differences that evolved over years of development after the two paths diverged. As promised earlier, the right way to get a distributed, clustered system to act as one and prevent stale reads is not always synchronous replication, which can impose undue load and risk. For many use cases, the right approach is a fully integrated cluster stack on top of asynchronous replication, in which each component solves a particular problem.

For example, an intelligent MySQL proxy (the Tungsten Connector) and an orchestrator (the Tungsten Manager) together can check whether a node is up to date and route reads and writes to the most appropriate node based on that and other criteria. In addition, because of the intelligence of the Tungsten Manager, you can lose two nodes in a 3-node cluster and operations will continue.
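The general idea of freshness-aware routing can be sketched in a few lines. This is a hypothetical illustration only, not the Tungsten Connector's or Tungsten Manager's actual logic or API: the `Node` fields, the `pick_read_node` function, and the lag threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str            # "primary" or "replica" (hypothetical labels)
    online: bool
    lag_seconds: float   # replication lag as reported by monitoring


def pick_read_node(nodes: List[Node], max_lag_seconds: float) -> Optional[Node]:
    """Prefer the freshest online replica within the lag limit, else the primary."""
    fresh = [
        n for n in nodes
        if n.online and n.role == "replica" and n.lag_seconds <= max_lag_seconds
    ]
    if fresh:
        return min(fresh, key=lambda n: n.lag_seconds)
    # No replica is fresh enough: fall back to the primary to avoid stale reads.
    for n in nodes:
        if n.online and n.role == "primary":
            return n
    return None


if __name__ == "__main__":
    cluster = [
        Node("db1", "primary", True, 0.0),
        Node("db2", "replica", True, 0.2),
        Node("db3", "replica", True, 12.0),
    ]
    chosen = pick_read_node(cluster, max_lag_seconds=1.0)
    print(chosen.name if chosen else "no node available")
```

The point of the design is that staleness is handled at routing time: a lagging replica is simply skipped for reads that need fresh data, rather than every write being made to wait for every node.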

Conclusion

If you’re currently using Galera Cluster and facing issues with database availability, locking, or performance, or are simply tired of the complexity involved in managing clusters across sites or regions, migrating from Galera to Tungsten is simple and can be done with zero downtime.

Our customers come to us for our Software, but they stay (current average over 10 years) because of our pride in Support. Part of this ‘Customer Success’ may be due to our average response time of less than 3 minutes, but it also reflects effectiveness, as our front-line Support is composed of highly experienced and attentive engineers.

To learn more about replication types, please visit this blog: Comparing Replication Technologies for MySQL Clustering.

To learn more about Geo-MySQL Galera Clusters, visit this blog: Geo-Distributed Galera Clusters.

For a complete comparison of Tungsten Cluster and Galera Cluster (and its MariaDB and Percona derivatives), visit the High Noon page.

With our customers, many of whom run applications with very high spikes during peak hours, asynchronous-based Tungsten Cluster has been tested and proven to “just work.” Reach out to learn more about how Galera Cluster compares with Tungsten Cluster!

Published In

Categories:

24/7/365 Support, Advanced Replication, Performance

Series:

Competitor Comparisons

Tags:

MySQL, MariaDB, Galera, Synchronous Replication, Asynchronous Replication

Author

Sara Captain

Director of Product Marketing

Sara has worn various hats at Continuent since 2014. Listening to Continuent customers over the years, Sara fell in love with the Continuent Tungsten suite of products. She started learning Linux and MySQL administration with the support of Continuent's amazing team, so she can help with keeping Customers happy. Prior to Continuent she worked in consulting with a focus on leveraging data.
