Introduction
At Continuent, we provide enterprise-level high availability and continuous operations solutions for MySQL databases. At the heart of our offering is Tungsten Cluster, a solution designed first and foremost to protect your data.
A Tungsten Cluster requires an odd number of database nodes to establish a voting quorum using group communications—typically, this means a three-node cluster. This quorum mechanism ensures that, in the event of a primary database failure, the system can automatically and correctly failover to a healthy node.
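For example, in a three-node cluster the simple majority is floor(3/2) + 1 = 2, so any partition containing at least two database nodes can keep operating, while a single isolated node cannot. This is the SIMPLE MAJORITY SIZE: 2 you will see in the Manager logs later in this post.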
Because Tungsten Cluster’s quorum relies on jGroups protocols for communication, a highly reliable network is essential for proper functionality. While the cluster is designed to handle intermittent network failures, there may be situations where the Tungsten Manager executes a fail-safe operation to protect data integrity. In such cases, database nodes may isolate themselves from client traffic to prevent a split-brain scenario.
Although the Tungsten Cluster is built to withstand many hardware and software failures, the underlying network reliability ultimately defines the limits of its fault-tolerance capabilities.
Recently, a customer asked us about Fail-safe Shun, and we thought it would be helpful to share our insights with you. This post will cover the basic steps the Tungsten Manager takes when a network failure isolates it from the other nodes—helping you better understand how Fail-safe Shun works and why it matters.
Fail-Safe Shun - The Inside Scoop
In the case of a network failure, on what basis does Tungsten initiate a fail-safe shun? Is it triggered by ICMP unreachability or JGroups unreachability?
A fail-safe shun only happens if the Manager is in a non-primary jGroups partition:
2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…
A partition (a group) is formed by JGroups. When a Manager first starts, it looks for a group to join; if no group is available, the Manager creates one. Managers that start afterwards join the existing group, and a Manager that is shut down leaves it. The logs above show that the db1 Manager was alone in its group and restarted in fail-safe shun mode.
But how did we arrive here? Every 3 seconds a heartbeat is sent to the other nodes. If there is no reply for 10 seconds, then a heartbeat gap detection rule will fire:
2025/02/02 23:12:14 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db2'
2025/02/02 23:12:16 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db3'
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:16 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:16 | (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 5 10.88.75.104
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:21 | HOST db2/10.88.75.105: NOT REACHABLE
2025/02/02 23:12:21 | (ping) result: false, duration: 5.01s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:21 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
2025/02/02 23:12:26 | HOST db3/10.88.75.106: NOT REACHABLE
2025/02/02 23:12:26 | (ping) result: false, duration: 5.00s, notes: ping -c 1 -w 5 10.88.75.106
In this case, as seen in the ping checks above, db1 viewed db2 and db3 as down or not reachable because the network was down. This situation triggers a MembershipInvalidAlarm. The alarm checks whether the other Managers are reachable through group communication. If they are not reachable via jGroups, the Manager increments the dispatch number and checks again after 10 seconds. When the dispatch number equals the max dispatch AND the node still sees itself as alone in the cluster, the Manager restarts in fail-safe shun mode.
2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0550$u58$_INVESTIGATE$u58$_TIME_KEEPER_FOR_INVALID_MEMBERSHIP_ALARM71605603] - TIMER EXPIRED, INCREMENTED ALARM: MembershipInvalidAlarm: FAULT: MEMBER db3@pod(UNKNOWN), MAX DISPATCH=3, DISPATCH=3, EXPIRED=true
2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0530$u58$_INVESTIGATE_MEMBERSHIP_VALIDITY1008640238] - CONSEQUENCE
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:36 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:36 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…
The Tungsten Manager initiates a fail-safe shun when a MemberHeartbeatGap invokes a MembershipInvalidAlarm. Only a MembershipInvalidAlarm can cause a fail-safe shun, and only if the host is alone in the group.
Please note that the values above are MAX DISPATCH=3 and DISPATCH=3. The max dispatch value is controlled by the policy.invalid.membership.retry.threshold property.
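As a rough illustration of that retry loop, here is a sketch in shell pseudocode. This is NOT actual Tungsten Manager code, and the two helper functions are hypothetical placeholders for the real checks:

# Illustrative sketch only -- NOT actual Tungsten Manager code.
# The two functions below are hypothetical placeholders.
jgroups_members_reachable() { false; }                  # stand-in for the group communication check
restart_in_failsafe_shun()  { echo "RESTARTING SAFE"; } # stand-in for the fail-safe restart

dispatch=0
max_dispatch=3   # policy.invalid.membership.retry.threshold
while [ "$dispatch" -lt "$max_dispatch" ]; do
    sleep 10                        # the 10-second alarm timer
    if jgroups_members_reachable; then
        exit 0                      # group re-formed; the alarm is cleared
    fi
    dispatch=$((dispatch + 1))      # "TIMER EXPIRED, INCREMENTED ALARM"
done
restart_in_failsafe_shun            # still alone after MAX DISPATCH checks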
In this case the network between db1 and db2 was down for only about 20 seconds and was back online at 23:12:33 (while db3 remained unreachable), as we can see from the logs:
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:33 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:33 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105
At this point it was too late for the group communications to re-establish the quorum and form a group, so the Manager restarted at 23:12:36.
How To Tune The Delay Before Fail-Safe Shun
To increase the delay before a fail-safe shun is invoked, set the policy.invalid.membership.retry.threshold property in the INI and run tpm update. The Managers would then be able to survive a network blip that lasts a maximum of 50 seconds:
property=policy.invalid.membership.retry.threshold=6
The formula works out as follows: 6 retries x 10 seconds = 60 seconds, minus a 10-second safety period for jGroups to re-form the group, which gives a 50-second delay before a fail-safe shun occurs.
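A minimal sketch of the full change, assuming the common /etc/tungsten/tungsten.ini location and the [defaults] section (adjust the path and section for your installation):

# In /etc/tungsten/tungsten.ini (path and section are assumptions; verify yours):
#
#   [defaults]
#   property=policy.invalid.membership.retry.threshold=6
#
# Apply the updated configuration on each node:
tpm update

# Sanity-check the resulting tolerance per the formula above:
echo $(( 6 * 10 - 10 ))   # prints 50 (seconds)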
Questions & Answers
Q1. Does an ICMP packet drop relate to an invalid membership issue in Tungsten?
A1. No. Only a MemberHeartbeatGap (not getting a heartbeat reply for 10 seconds) will trigger a MembershipInvalidAlarm.
Q2. Does an ICMP packet drop cause a fail-safe shun to initiate?
A2. No. A Fail-safe shun will only happen if the Manager is alone in the JGroups partition.
Q3. In the case of a network failure, on what basis does Tungsten initiate the initial fail-safe shun? Is it triggered by ICMP unreachability or JGroups unreachability?
A3. Tungsten initiates a fail-safe shun when a MemberHeartbeatGap invokes a MembershipInvalidAlarm. Only a MembershipInvalidAlarm can cause a fail-safe shun, and only if the host is alone in the group.
Q4. In some cases, we did not observe a ping failure in the Tungsten logs, yet Tungsten still initiated a fail-safe shun. If this is due to JGroups, how can we trace it through the tungsten logs beyond generic messages like "Quorum lost" and similar entries?
A4. A fail-safe shun can happen even when there is no ping failure (i.e. the hosts are up and reachable) if the other Managers are down (stopped or restarted).
Q5. What are your recommendations for moving from ICMP to a TCP-based network health check in Tungsten? What are the pros and cons of this transition?
A5. Not recommended - ICMP is the best practice as seen over years of experience in the field.
When ICMP is failing, TCP will fail too. If a network is unstable, using TCP to somehow mask the instability is not the correct approach to checking network health for a database cluster.
Our implementation of JGroups uses TCP to communicate with the group members and the ping utility uses ICMP.
Q6. Packet drops and overrun errors do not include timestamps, and other services remained unaffected while only Tungsten was impacted. How can we analyze Tungsten logs to determine whether the issue was caused by a network problem or a network blip, especially given the lack of detailed JGroups logging?
A6. The correct way to analyze the Manager logs is to follow the provided example above. Start by locating the triggered events and then follow the included explanations.
Grep for "HEARTBEAT GAP DETECTED FOR MEMBER" and look for which events trigger other events.
Also, JGroups logging is disabled by default. The best practice is to keep JGroups logging disabled because it will rapidly fill the logs and is very difficult to interpret.
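For example (a sketch; the Manager log is typically tungsten-manager/log/tmsvc.log under the installation directory, so adjust the path to match your environment):

# Locate the heartbeat gaps that start the chain of events:
grep "HEARTBEAT GAP DETECTED FOR MEMBER" /opt/continuent/tungsten/tungsten-manager/log/tmsvc.log

# Then follow the chain through alarms and quorum checks to the conclusion:
grep -E "MembershipInvalidAlarm|CHECKING FOR QUORUM|CONCLUSION|RESTARTING SAFE" \
  /opt/continuent/tungsten/tungsten-manager/log/tmsvc.log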
Q7. Which protocol does the Heartbeat mechanism use to send signals in the Tungsten system (TCP, ICMP, or UDP)?
A7. Heartbeat events are sent via TCP.
Q8. Which protocol does JGroups use by default (TCP or UDP)?
A8. Our jGroups implementation uses TCP to communicate.
Q9. What is the difference between the following entries in Tungsten logs?
GC VIEW OF CURRENT DB MEMBERS, VALIDATED DB MEMBERS, REACHABLE DB MEMBERS
A9.
GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3 - this is how the group communication (JGroups GC) sees members
VALIDATED DB MEMBERS ARE: db1, db3 - managers that are up and running and respond to the pingMember command (TCP)
REACHABLE DB MEMBERS ARE: db1, db3, db2 - hosts reachable by ping (the OS ping utility, ICMP)
Q10. Does the coordinator decide to initiate a fail-safe shun across the cluster, or does each host’s Manager independently invoke the fail-safe script?
A10. Each Manager independently invokes the fail-safe script if it is alone in its partition.
Q11. We observed that DB1 initiated a fail-safe shun at 23:12:36, DB2 at 23:12:33, and DB3 at 23:12:35. Based on the current settings, can you analyze the logs to determine the exact duration of the network blip before the fail-safe shun was triggered across all three database nodes?
A11. This can be seen only from your network monitoring logs. In the Manager logs we can only see the result of the “ping -c 1 -w 5 hostname” check performed during the membership validity check, when the timer is incremented; during each 10-second period the network can be flapping (up-down-up-down), which is NOT visible from the Manager logs. What we can see during the membership validity check is only whether the network at that exact moment was up or down.
Q12. If we change the property policy.invalid.membership.retry.threshold=6, what would be the potential impact on failover behavior or data loss?
A12. Failover can happen:
- If the primary MySQL server goes down. This has nothing to do with our case.
- If the host with the primary MySQL server goes down. This has nothing to do with our case.
- If the host with the primary MySQL server gets isolated AND the other two hosts can still form a majority of the quorum. This has nothing to do with our case.
- If the host with the primary MySQL server gets isolated AND the other hosts also get isolated from each other. In this case failover does not happen (we only have singular Managers isolated from each other) and the cluster will go into fail-safe shun. This is our case.
This means that setting the threshold to 6 will only keep the primary online long enough to ride out the network blip. Data loss after a failover can happen ONLY if the operator of the cluster issues a “recover” command in force mode without paying attention to the error displayed. The error is displayed when there are transactions left in the old primary's binlog that were not replicated to the new Primary; in that case the “recover” command will issue an error and will not proceed, and the operator needs to manually apply those transactions to the new Primary. If the operator forces the “recover” command anyway, data loss will occur.
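For illustration only, the dangerous pattern looks roughly like this inside cctrl (a sketch; verify the exact syntax for your version, and note that the inline annotations are ours):

cctrl> recover           # refused with an error while unreplicated transactions remain
cctrl> set force true    # overrides the safety check -- use with extreme care
cctrl> recover           # now proceeds; the unreplicated transactions are lost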
Q13. Apart from the heartbeat gap detection invalid alarm, where can we find logs in Tungsten that indicate non-validated database members?
A13. The message "Non-validated database members" in the Manager’s tmsvc.log means that either no Manager was up and running on that host, or that the Manager was unreachable because of a network breakage.
Q14. What is the difference between a fail-safe shun invocation and an automatic failover initiation when the first node detects a network issue? If the affected node is the primary, will it trigger an automatic failover, or a fail-safe shun? If the affected node is a secondary, will it only trigger a fail-safe shun? Here are the associated logs:
From Db1:
2025/02/02 23:12:33 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:33 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:33 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1, db3
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 | FATAL | db1 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.
From Db2:
2025/02/02 23:12:28 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2025/02/02 23:12:28 | VALIDATED DB MEMBERS ARE: db2
2025/02/02 23:12:28 | REACHABLE DB MEMBERS ARE: db2
2025/02/02 23:12:33 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2025/02/02 23:12:33 | VALIDATED DB MEMBERS ARE: db2
2025/02/02 23:12:33 | REACHABLE DB MEMBERS ARE: db2
2025/02/02 23:12:33 | FATAL | db2 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.
From Db3:
2025/02/02 23:12:35 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:35 | VALIDATED DB MEMBERS ARE: db1, db3
2025/02/02 23:12:35 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:35 | GC VIEW OF CURRENT DB MEMBERS IS: db3
2025/02/02 23:12:35 | VALIDATED DB MEMBERS ARE: db3
2025/02/02 23:12:35 | REACHABLE DB MEMBERS ARE: db3
2025/02/02 23:12:35 | FATAL | db3 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.
A14. Automatic failover is initiated by the coordinator. For a coordinator to be able to fail over, it must be in a partition that holds a majority of nodes. If every Manager is in its own single-member partition, no failover can be expected; only a fail-safe shun can happen at this point. Please check the answer to Q12 to see when a failover occurs.
Final Thoughts
We hope this blog post has helped you understand the intricacies of the Tungsten fail-safe shun. As you can see, several factors are at play, and it's not always as simple as a single ping failure. By understanding the underlying mechanisms, you can better troubleshoot and manage your Tungsten clusters.
Remember, the key takeaway is that a fail-safe shun is a protective measure. It's Tungsten's way of saying, "I'm isolated, and I need to protect the data." While it can be disruptive, it's far better than the alternative: a split-brain scenario that could lead to data loss.
If you have any further questions, don't hesitate to reach out to our support team. We're always here to help you keep your Tungsten clusters running.
Smooth Sailing!