The Question
Recently, a customer asked us:
What would cause a node switch to fail in a Tungsten Cluster?
For example, we saw the following during a recent session where a switch failed:
cctrl> switch to db3
SELECTED SLAVE: db3@alpha
SET POLICY: MAINTENANCE => MAINTENANCE
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
Exception encountered during SWITCH.
Failed while setting the replicator 'db1' role to 'slave'
ClusterManagerException: Exception while executing command 'replicatorStatus' on manager 'db1'
Exception=Failed to execute '/alpha/db1/manager/ClusterManagementHelper/replicatorStatus alpha db3'
Reason=
CLUSTER_MEMBER(true)
STATUS(FAIL)
+----------------------------------------------------------------------------+
|alpha |
+----------------------------------------------------------------------------+
|Handler Exception: SYSTEM |
|Cause:Exception |
| MANAGER |
|CLUSTER_MEMBER(true) |
|STATUS(FAIL) |
|Exception: ConnectionException |
|Message: getResponseQueue():No response queue found for id: 1552059204364 |
The Answer
The Tungsten Manager is either unable to communicate with a remote resource or has insufficient memory.
Here are some possibilities to consider:
- Network blockage - if the Manager cannot communicate with the target layer (that is, a Replicator or another Manager), the above error will occur.
- Manager tuning - if restarting the Manager on all nodes clears the issue, the Manager was most likely starved for resources.
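To test the network-blockage possibility from outside cctrl, one option is a plain TCP connection test from each node to the others. The following is a minimal sketch, not part of Tungsten. The function names are illustrative, and the host names and port numbers in the usage note are placeholders that you should replace with the ports your Managers and Replicators actually listen on.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, ports, timeout=2.0):
    """Map every (host, port) pair to a reachability boolean."""
    return {
        (host, port): port_reachable(host, port, timeout)
        for host in hosts
        for port in ports
    }


def blocked(results):
    """Return the sorted (host, port) pairs that could not be reached."""
    return sorted(pair for pair, ok in results.items() if not ok)
```

For example, `blocked(check_hosts(["db1", "db2", "db3"], [11999, 2112]))` lists the pairs that failed. The ports 11999 and 2112 are assumed example values, not taken from this post. A failure shows only that the TCP connection could not be opened: a firewall rule, a stopped service, and an unresolved host name all look the same here.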
The Solution
So what may be done to alleviate the problem?
- Manager tuning - earlier versions of Tungsten Clustering did not allocate sufficient resources to the Manager's JVM, so the fix is to apply three Manager configuration changes via tpm update.
- Network blockage - Make sure the replicators are all online and caught up, and check the manager's view of the cluster using the following commands on every node:
cctrl> ls
cctrl> members
cctrl> ping
cctrl> ls resources
cctrl> cluster validate
cctrl> show alarms
Here are examples:
tungsten@db1:/home/tungsten # cctrl
[LOGICAL] /east > members
east/db1(ONLINE)/
east/db2(ONLINE)/
east/db3(ONLINE)/
[LOGICAL] /east > ping
NETWORK CONNECTIVITY: PING TIMEOUT=2
NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
HOST db1/ ALIVE
(ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2
NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
HOST db2/ ALIVE
(ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2
NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
HOST db3/ ALIVE
(ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2
[LOGICAL] /east > cluster validate
========================================================================
CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
QUORUM SET MEMBERS ARE: db1, db3, db2
SIMPLE MAJORITY SIZE: 2
GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
VALIDATED DB MEMBERS ARE: db1, db3, db2
REACHABLE DB MEMBERS ARE: db1, db3, db2
========================================================================
MEMBERSHIP IS VALID BASED ON VIEW/VALIDATED CONSOLIDATED MEMBERS CONSISTENCY
CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
VALIDATION STATUS=VALID CLUSTER
ACTION=NONE
[LOGICAL] /east > ls resources
+----------------------------------------------------------------------------+
|RESOURCES |
+----------------------------------------------------------------------------+
| db1:DATASERVER: ONLINE |
| db1:MANAGER: ONLINE |
| db1:MEMBER: ONLINE |
| db1:REPLICATOR: ONLINE |
| db2:DATASERVER: ONLINE |
| db2:MANAGER: ONLINE |
| db2:MEMBER: ONLINE |
| db2:REPLICATOR: ONLINE |
| db3:DATASERVER: ONLINE |
| db3:MEMBER: ONLINE |
| db3:REPLICATOR: ONLINE |
| west:CLUSTER: ONLINE |
+----------------------------------------------------------------------------+
[LOGICAL] /east > show alarms
+----------------------------------------------------------------------------+
|ALARMS |
+----------------------------------------------------------------------------+
| |
+----------------------------------------------------------------------------+
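When this output has to be compared across every node, scanning it by eye gets tedious. The sketch below is a small illustrative parser, not part of Tungsten. It assumes the output of ls resources and cluster validate has been captured as text in the row format shown in the examples above, and the function names are hypothetical.

```python
import re

# Rows in `ls resources` output look like "| db1:MANAGER: ONLINE |".
ROW = re.compile(r"\|\s*([\w.-]+):([A-Z]+):\s*([A-Z_:-]+)\s*\|")


def parse_resources(text):
    """Return (name, kind, state) tuples for every resource row in the text."""
    return [m.groups() for m in ROW.finditer(text)]


def not_online(rows):
    """Return the rows whose state is anything other than ONLINE."""
    return [row for row in rows if row[2] != "ONLINE"]


def hosts_without(rows, kind="MANAGER"):
    """Return hosts that have a MEMBER row but no row of the given kind."""
    members = {name for name, k, _ in rows if k == "MEMBER"}
    have = {name for name, k, _ in rows if k == kind}
    return sorted(members - have)


def validation_status(text):
    """Return the word after 'VALIDATION STATUS=' or None if it is absent."""
    m = re.search(r"VALIDATION STATUS=(\w+)", text)
    return m.group(1) if m else None
```

A result worth a closer look is any non-empty list from `not_online` or `hosts_without`, or a `validation_status` other than `VALID`. Whether a particular finding points to a real fault depends on the cluster, so treat the output as a prompt for further checks in cctrl and not as a diagnosis.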
The Wrap-Up
In this blog post we discussed what would cause a node switch to fail in a Tungsten Cluster and what may be done about it.
To learn about Continuent solutions in general, check out
The Library
Please read the docs!
For more information about Tungsten clusters, please visit
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!