Introduction
Tungsten Clustering provides database failover capabilities.
For this to work, the Tungsten Manager layer is responsible for health-checking the cluster nodes and reacting to issues based on a defined set of rules.
These rules control what to test for, how often to check, and what to do in the event that a situation triggers a response.
In this blog post, we will explore this process in depth, provide some examples and explain some of the ways to tune the rule parameters.
This post was created in response to a customer case requesting more information about how failover works.
Heartbeat Gaps
Question 1: We are aware that the heartbeat gap could potentially be due to a network connectivity issue. However, we would like to understand the exact heartbeat gap detection logic on the Tungsten side. Specifically:
- What is a heartbeat gap?
- What checks are performed between Tungsten nodes internally before concluding a network connectivity issue?
- How many retries or attempts are made before declaring a heartbeat gap?
Answer 1:
- A heartbeat gap is created when a Manager detects a potentially missing cluster member.
- Each Manager sends a PING (the default) or a TCP Echo (if configured) to the other Managers every 3 seconds.
- If the PING fails, a MemberHeartbeatGapAlarm is raised/incremented every 10 seconds. At the same time, membership validity is checked and a MembershipInvalidAlarm is raised/incremented if necessary.
- After the third MembershipInvalidAlarm (3 x 10 = 30 seconds), if the Manager still finds itself in a non-primary partition (for 3 members the majority is 2), it will restart itself. See the timeline sketch below.
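To make the timing concrete, here is a minimal shell sketch of the detection timeline using the default values described above. It only illustrates the arithmetic; it is not actual Manager code.
# Default Manager liveness timing (illustrative only):
PING_INTERVAL=3      # seconds between Manager-to-Manager PINGs
ALARM_INTERVAL=10    # seconds between alarm increments after a failed PING
ALARM_THRESHOLD=3    # alarms raised before a Manager stuck in a non-primary partition restarts itself
echo "Worst case from first missed PING to Manager restart: $(( ALARM_INTERVAL * ALARM_THRESHOLD )) seconds"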
In the logs on db1 we can see that db1 restarted itself:
2024/11/21 23:11:58 | ========================================================================
2024/11/21 23:11:58 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:58 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:58 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:58 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2
2024/11/21 23:11:58 | VALIDATED DB MEMBERS ARE: db1
2024/11/21 23:11:58 | REACHABLE DB MEMBERS ARE: db2, db1
2024/11/21 23:11:58 | ========================================================================
2024/11/21 23:11:58 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:58 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:58 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE...
On db2 we can see that it also restarted itself because it was not able to reach any other members online:
2024/11/21 23:11:57 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:57 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:57 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:57 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2024/11/21 23:11:57 | VALIDATED DB MEMBERS ARE: db2
2024/11/21 23:11:57 | REACHABLE DB MEMBERS ARE: db2
2024/11/21 23:11:57 | ========================================================================
2024/11/21 23:11:57 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:57 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:57 | INFO | db2 | INFO [EnterpriseRulesConsequenceDispatcher] - SHUTTING DOWN ROUTER GATEWAY DUE TO LOST QUORUM
On db3 we can see that it was not able to reach the other nodes and restarted itself:
2024/11/21 23:11:59 | ========================================================================
2024/11/21 23:11:59 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:59 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:59 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:59 | GC VIEW OF CURRENT DB MEMBERS IS: db3
2024/11/21 23:11:59 | VALIDATED DB MEMBERS ARE: db3
2024/11/21 23:11:59 | REACHABLE DB MEMBERS ARE: db3
2024/11/21 23:11:59 | ========================================================================
2024/11/21 23:11:59 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:59 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:59 | WARN | db3 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE...
Behavior of cctrl During a Network Partition
Question 2: We observed that the cctrl output from one node displayed a fail-safe shun state, while other DB nodes in the same cluster showed the Tungsten members as online. However, the issue did not auto-recover, even though there was no ongoing network issue, and it was resolved only after performing a stopall and startall on the affected node.
Answer 2: If there is a network partition with db1 and db2 in one partition and db3 in another, then logging into cctrl on db1 will show db1 and db2 online and db3 shunned. Logging into cctrl on db3 will show db3 online and db1 and db2 shunned. After a while db3 will restart itself, and it will come back as shunned if it is still in the minority partition.
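When diagnosing this kind of split view, it helps to compare what each node thinks the cluster looks like. Assuming the standard Tungsten shell tools are on the path (and that your cctrl version accepts commands on stdin), something along these lines can be used:
# Run on each DB node and compare the output:
echo "ls" | cctrl

# The recovery used in this case was a full restart of the Tungsten
# services on the stuck node, using the bundled wrapper scripts:
stopall
startall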
Availability Checks
Question 3: By default, Tungsten uses ICMP for availability checks with a 30-second interval, which seems too aggressive for ICMP. Could you suggest how we can make this less aggressive while maintaining auto-recovery functionality?
Answer 3: In the Tungsten Manager you can set the availability check method to "ping" or "echo". The "ping" method uses the OS-supplied ping utility, which sends ICMP packets. The "echo" method uses the xinetd echo server, which is TCP-based. The default method is "ping".
On all Managers you can configure the availability check method in the tungsten.ini file to use echo instead of ping, like this:
[defaults]
...
mgr-ping-method=echo
However, before issuing tpm update, this needs a bit more setup from the user:
After installing the xinetd package, the echo server can be enabled in the /etc/xinetd.d/echo-stream file by changing the disable = yes line to disable = no:
service echo
{
# This is for quick on or off of the service
disable = no
...
After this, restart the xinetd service with sudo systemctl restart xinetd on ALL hosts. A tpm update is also needed on all hosts. If this gives an error like:
ERROR >> db1 >> Unable to contact db1 via TCP Echo (ManagerPingMethodCheck)
This error means that the echo server is not up and running, is still disabled, or a firewall is preventing access to it. You can check from the command line with:
echo hello | nc db2 7
Ncat: Connection refused.
echo hello | nc db2 7
hello
The first case shows an error and the second one shows that the echo server is up and running.
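To check the echo port on every host in one pass, a small loop like the following can be handy. The hostnames db1, db2 and db3 are just the examples used throughout this post, and the -w 2 timeout keeps the loop from hanging on unreachable hosts (exact nc options vary slightly between variants):
for h in db1 db2 db3; do
  echo "--- $h ---"
  echo hello | nc -w 2 "$h" 7    # expect "hello" back; "Connection refused" means the echo service is down on that host
done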
To verify that everything is working, issue the cluster validate command in cctrl:
[LOGICAL] /west > cluster validate
========================================================================
CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
SIMPLE MAJORITY SIZE: 2
GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
VALIDATED DB MEMBERS ARE: db1, db3, db2
REACHABLE DB MEMBERS ARE: db1, db3, db2
========================================================================
CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
VALIDATION STATUS=VALID CLUSTER
ACTION=NONE
The REACHABLE DB MEMBERS shows that the ping was successful to all members.
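If you want to run this check non-interactively, for example from a monitoring script, piping the command into cctrl generally works (again assuming your cctrl version accepts commands on stdin, as used earlier):
echo "cluster validate" | cctrl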
Here are the relevant timeouts and threshold values from the manager.properties file with explanations:
- Timeout used in group communication to ping a manager (msec): policy.liveness.manager.ping.timeout=2000
- Timeout used in the host ping method (sec): policy.liveness.hostPingTimeout=2
- Number of failed ping attempts before a datasource is marked as failed: policy.liveness.hostPing.fail.threshold=3
These are the default values. When pinging a manager through group communication, we wait 2 seconds (2000 ms) for the response. If there is no answer within 2 seconds, the manager is considered down. If the manager is down, the host is checked for availability (using either the echo or the OS ping method). If that check does not return within 2 seconds, the host is considered down. The ping is then retried every 10 seconds, up to fail.threshold times (default 3). If it is still failing after 3 x 10 = 30 seconds, the datasource goes into the FAILED state. If that datasource is the Primary, a FAILOVER will happen.
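You can confirm the values currently in effect by looking at the Manager's properties file on each node. The path below assumes the default /opt/continuent installation prefix; adjust it for your environment:
grep '^policy.liveness' /opt/continuent/tungsten/tungsten-manager/conf/manager.properties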
Here is an example from the logs of what happens (using the default values) after we stop the Manager on db2 and also stop the xinetd service on that host, so that from the Manager's point of view this host is unreachable using ping:
09:43:55.548 Member Heartbeat Gap detected
09:43:58.553 Checking for quorum
09:43:58.553 Checking for unreachable host 1
09:44:08.770 Checking for unreachable host 2
09:44:19.486 Checking for unreachable host 3
09:44:28.500 Datasource DB2 set to failed
Approx. 30 seconds later DB2 goes into the FAILED state.
Now let's raise these values in tungsten.ini (how to apply the change is shown after the snippet):
property=policy.liveness.hostPing.fail.threshold=5
property=policy.liveness.manager.ping.timeout=5000
property=policy.liveness.hostPingTimeout=5
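As with any tungsten.ini change, the new properties only take effect after the configuration is pushed out. For an INI-based installation that means running the update on every host:
# On every host, after editing /etc/tungsten/tungsten.ini:
tpm update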
With these values set, the following events are visible in the logs:
09:57:38.008 Member Heartbeat Gap detected
09:57:40.113 Checking for quorum
09:57:40.214 Checking for unreachable host 1
09:57:51.533 Checking for unreachable host 2
09:58:00.552 Checking for unreachable host 3
09:58:10.370 Checking for unreachable host 4
09:58:20.889 Checking for unreachable host 5
09:58:31.111 Datasource DB2 set to failed
Approximately 50 seconds (5x10) later DB2 goes to a FAILED state. Note that here we also increased the ping timeouts both for group communication and for host ping from 2 seconds to 5 seconds.
The timeout values should be kept below 10 seconds, because the ping is retried every 10 seconds, up to the threshold number of times.
+---------------------------------------------------------------------------------+
|db2(slave:FAILED(NODE 'db2' IS UNREACHABLE), progress=13, latency=0.185) |
|STATUS [CRITICAL] [2024/12/04 09:58:31 AM CET][SSL] |
|REASON[NODE 'db2' IS UNREACHABLE] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
But after starting the xinetd service on host db2 and starting the Manager, the cluster will fix itself if it is in AUTOMATIC mode (see the note after the status output below).
+---------------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=13, latency=0.185) |
|STATUS [OK] [2024/12/04 10:59:43 AM CET][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
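Automatic recovery only happens when the cluster policy is AUTOMATIC. You can verify the current policy and, if needed, restore it from cctrl; here it is driven non-interactively from the shell, in the same style as the earlier checks:
echo "ls" | cctrl | grep -i coordinator   # the COORDINATOR line shows the current policy mode (e.g. AUTOMATIC)
echo "set policy automatic" | cctrl       # switch back to automatic recovery if the cluster was left in MAINTENANCE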
Summary:
- We showed how the default availability check can be changed from the ICMP-based ping method to the TCP-based echo method.
- We showed how raising the timeout and threshold values postpones the failure detection and any resulting failover.
Wrap-Up
In this blog post, we explored the Tungsten Manager failover process in depth, provided some examples and explained some of the ways to tune the rule parameters.
Smooth sailing!