Introduction
Tungsten Clustering provides database failover capabilities.
For this to work, the Tungsten Manager layer is responsible for health-checking the cluster nodes and reacting to issues based on a defined set of rules.
These rules control what to test for, how often to check, and what to do in the event that a situation triggers a response.
In this blog post, we will explore this process in depth, provide some examples and explain some of the ways to tune the rule parameters.
This post was created in response to a customer case requesting more information about how failover works.
Heartbeat Gaps
Question 1: We are aware that the heartbeat gap could potentially be due to a network connectivity issue. However, we would like to understand the exact heartbeat gap detection logic on the Tungsten side. Specifically:
- What is a heartbeat gap?
- What checks are performed between Tungsten nodes internally before concluding a network connectivity issue?
- How many retries or attempts are made before declaring a heartbeat gap?
Answer 1:
- A heartbeat gap is created when a Manager detects a potentially missing cluster member.
- Each Manager sends a PING (the default) or a TCP Echo (if configured) to the other Managers every 3 seconds.
- If the PING fails, a MemberHeartbeatGapAlarm is raised/incremented every 10 seconds. At the same time, membership validity is checked and a MembershipInvalidAlarm is raised/incremented if necessary.
- After the third MembershipInvalidAlarm (3 x 10 = 30 seconds), if the Manager still finds itself in a non-primary partition (for 3 members the majority is 2), it will restart itself. See the timeline sketch below.
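To make the timing concrete, here is a minimal shell sketch of the detection timeline using the default values described above. It only illustrates the arithmetic; it is not actual Manager code.
# Default Manager liveness timing (illustrative only):
PING_INTERVAL=3      # seconds between Manager-to-Manager PINGs
ALARM_INTERVAL=10    # seconds between alarm increments after a failed PING
ALARM_THRESHOLD=3    # alarms raised before a Manager stuck in a non-primary partition restarts itself
echo "Worst case from first missed PING to Manager restart: $(( ALARM_INTERVAL * ALARM_THRESHOLD )) seconds"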
In the logs on db1 we can see that db1 restarted itself:
2024/11/21 23:11:58 | ========================================================================
2024/11/21 23:11:58 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:58 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:58 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:58 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2
2024/11/21 23:11:58 | VALIDATED DB MEMBERS ARE: db1
2024/11/21 23:11:58 | REACHABLE DB MEMBERS ARE: db2, db1
2024/11/21 23:11:58 | ========================================================================
2024/11/21 23:11:58 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:58 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:58 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE...
On db2 we can see that it also restarted itself because it was not able to reach any other members online:
2024/11/21 23:11:57 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:57 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:57 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:57 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2024/11/21 23:11:57 | VALIDATED DB MEMBERS ARE: db2
2024/11/21 23:11:57 | REACHABLE DB MEMBERS ARE: db2
2024/11/21 23:11:57 | ========================================================================
2024/11/21 23:11:57 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:57 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:57 | INFO | db2 | INFO [EnterpriseRulesConsequenceDispatcher] - SHUTTING DOWN ROUTER GATEWAY DUE TO LOST QUORUM
On db3 we can see that it was not able to reach the other nodes and restarted itself:
2024/11/21 23:11:59 | ========================================================================
2024/11/21 23:11:59 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2024/11/21 23:11:59 | POTENTIAL QUORUM SET MEMBERS ARE: db3, db2, db1
2024/11/21 23:11:59 | SIMPLE MAJORITY SIZE: 2
2024/11/21 23:11:59 | GC VIEW OF CURRENT DB MEMBERS IS: db3
2024/11/21 23:11:59 | VALIDATED DB MEMBERS ARE: db3
2024/11/21 23:11:59 | REACHABLE DB MEMBERS ARE: db3
2024/11/21 23:11:59 | ========================================================================
2024/11/21 23:11:59 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2024/11/21 23:11:59 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2024/11/21 23:11:59 | WARN | db3 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE...
Behavior of cctrl During a Network Partition
Question 2: We observed that the cctrl output from one node displayed a fail-safe shun state, while other DB nodes in the same cluster showed the Tungsten members as online. However, the issue did not auto-recover, even though there was no ongoing network issue, and it was resolved only after performing a stopall and startall on the affected node.
Answer 2: If there is a network partition with db1 and db2 in one partition and db3 in another, then logging into cctrl on db1 will show db1 and db2 online and db3 shunned. Logging into cctrl on db3 will show db3 online and db1 and db2 shunned. After a while db3 will restart itself, and it will come back as shunned if it is still in the minority partition.
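When diagnosing this kind of split view, it helps to compare what each node thinks the cluster looks like. Assuming the standard Tungsten shell tools are on the path (and that your cctrl version accepts commands on stdin), something along these lines can be used:
# Run on each DB node and compare the output:
echo "ls" | cctrl

# The recovery used in this case was a full restart of the Tungsten
# services on the stuck node, using the bundled wrapper scripts:
stopall
startall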
Availability Checks
Question 3: By default, Tungsten uses ICMP for availability checks with a 30-second interval, which seems too aggressive for ICMP. Could you suggest how we can make this less aggressive while maintaining auto-recovery functionality?
Answer 3: In the Tungsten Manager you can set the availability check method to "ping" or "echo". The "ping" method uses the OS-supplied ping utility, which sends ICMP packets. The "echo" method uses the xinetd echo server, which is TCP-based. The default method is "ping".
On all Managers you can configure the availability check method in the tungsten.ini file to use echo instead of ping, like this:
[defaults]
...
mgr-ping-method=echo
However, before issuing tpm update, this needs a bit more setup from the user:
After installing the xinetd package, the echo server can be enabled in the /etc/xinetd.d/echo-stream file by changing the disable = yes line to disable = no:
service echo
{
# This is for quick on or off of the service
disable = no
...
After this, restart the xinetd service with sudo systemctl restart xinetd on ALL hosts. A tpm update is also needed on all hosts. If this gives an error like:
ERROR >> db1 >> Unable to contact db1 via TCP Echo (ManagerPingMethodCheck)
This error means that the echo server is not up and running, is still disabled, or a firewall is preventing access to it. You can check from the command line with:
echo hello | nc db2 7
Ncat: Connection refused.
echo hello | nc db2 7
hello
The first case shows an error and the second one shows that the echo server is up and running.
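To check the echo port on every host in one pass, a small loop like the following can be handy. The hostnames db1, db2 and db3 are just the examples used throughout this post, and the -w 2 timeout keeps the loop from hanging on unreachable hosts (exact nc options vary slightly between variants):
for h in db1 db2 db3; do
  echo "--- $h ---"
  echo hello | nc -w 2 "$h" 7    # expect "hello" back; "Connection refused" means the echo service is down on that host
done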
To verify that everything is working, issue the cluster validate command in cctrl:
[LOGICAL] /west > cluster validate
========================================================================
CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
SIMPLE MAJORITY SIZE: 2
GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
VALIDATED DB MEMBERS ARE: db1, db3, db2
REACHABLE DB MEMBERS ARE: db1, db3, db2
========================================================================
CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
VALIDATION STATUS=VALID CLUSTER
ACTION=NONE
The REACHABLE DB MEMBERS shows that the ping was successful to all members.
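If you want to run this check non-interactively, for example from a monitoring script, piping the command into cctrl generally works (again assuming your cctrl version accepts commands on stdin, as used earlier):
echo "cluster validate" | cctrl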
Here are the relevant timeouts and threshold values from the manager.properties file with explanations:
- Timeout used in group communication to ping a manager (msec): policy.liveness.manager.ping.timeout=2000
- Timeout used in the host ping method (sec): policy.liveness.hostPingTimeout=2
- Number of failed ping attempts before a datasource is marked as failed: policy.liveness.hostPing.fail.threshold=3
These are the default values. When pinging a manager through group communication, we wait 2 seconds (2000 ms) for the response. If there is no answer within 2 seconds, the manager is considered down. If the manager is down, the host is checked for availability (using either the echo or the OS ping method). If that check does not return within 2 seconds, the host is considered down. The ping is then retried every 10 seconds, up to fail.threshold times (default 3). If it is still failing after 3 x 10 = 30 seconds, the datasource goes into the FAILED state. If that datasource is the Primary, a FAILOVER will happen.
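You can confirm the values currently in effect by looking at the Manager's properties file on each node. The path below assumes the default /opt/continuent installation prefix; adjust it for your environment:
grep '^policy.liveness' /opt/continuent/tungsten/tungsten-manager/conf/manager.properties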
Here is an example from the logs of what happens (using the default values) after we stop the Manager on db2 and also stop the xinetd service on that host, so that from the Manager's point of view this host is unreachable using ping:
09:43:55.548 Member Heartbeat Gap detected
09:43:58.553 Checking for quorum
09:43:58.553 Checking for unreachable host 1
09:44:08.770 Checking for unreachable host 2
09:44:19.486 Checking for unreachable host 3
09:44:28.500 Datasource DB2 set to failed
Approx. 30 seconds later DB2 goes into the FAILED state.
Now let's raise these values in tungsten.ini (how to apply the change is shown after the snippet):
property=policy.liveness.hostPing.fail.threshold=5
property=policy.liveness.manager.ping.timeout=5000
property=policy.liveness.hostPingTimeout=5
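As with any tungsten.ini change, the new properties only take effect after the configuration is pushed out. For an INI-based installation that means running the update on every host:
# On every host, after editing /etc/tungsten/tungsten.ini:
tpm update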
With these values set, the following events are visible in the logs:
09:57:38.008 Member Heartbeat Gap detected
09:57:40.113 Checking for quorum
09:57:40.214 Checking for unreachable host 1
09:57:51.533 Checking for unreachable host 2
09:58:00.552 Checking for unreachable host 3
09:58:10.370 Checking for unreachable host 4
09:58:20.889 Checking for unreachable host 5
09:58:31.111 Datasource DB2 set to failed
Approximately 50 seconds (5x10) later DB2 goes to a FAILED state. Note that here we also increased the ping timeouts both for group communication and for host ping from 2 seconds to 5 seconds.
The timeout values should be kept below 10 seconds, because the ping is retried every 10 seconds, up to the threshold number of times.
+---------------------------------------------------------------------------------+
|db2(slave:FAILED(NODE 'db2' IS UNREACHABLE), progress=13, latency=0.185) |
|STATUS [CRITICAL] [2024/12/04 09:58:31 AM CET][SSL] |
|REASON[NODE 'db2' IS UNREACHABLE] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
But after starting the xinetd service on host db2 and starting the Manager, the cluster will fix itself if it is in AUTOMATIC mode (see the note after the status output below).
+---------------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=13, latency=0.185) |
|STATUS [OK] [2024/12/04 10:59:43 AM CET][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
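Automatic recovery only happens when the cluster policy is AUTOMATIC. You can verify the current policy and, if needed, restore it from cctrl; here it is driven non-interactively from the shell, in the same style as the earlier checks:
echo "ls" | cctrl | grep -i coordinator   # the COORDINATOR line shows the current policy mode (e.g. AUTOMATIC)
echo "set policy automatic" | cctrl       # switch back to automatic recovery if the cluster was left in MAINTENANCE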
Summary:
- We showed how the default availability check can be changed from the ICMP-based ping method to the TCP-based echo method.
- We showed how raising the timeout and threshold values postpones the failure detection and any resulting failover.
Wrap-Up
In this blog post, we explored the Tungsten Manager failover process in depth, provided some examples and explained some of the ways to tune the rule parameters.
Smooth sailing!