The Question
Recently, a customer asked us:
What would cause a node switch to fail in a Tungsten Cluster?
For example, we saw the following during a recent session where a switch failed:

cctrl> switch to db3
SELECTED SLAVE: db3@alpha
SET POLICY: MAINTENANCE => MAINTENANCE
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
Exception encountered during SWITCH.
Failed while setting the replicator 'db1' role to 'slave'
ClusterManagerException: Exception while executing command 'replicatorStatus' on manager 'db1'
Exception=Failed to execute '/alpha/db1/manager/ClusterManagementHelper/replicatorStatus alpha db3'
Reason=
CLUSTER_MEMBER(true)
STATUS(FAIL)
+----------------------------------------------------------------------------+
|alpha                                                                        |
+----------------------------------------------------------------------------+
|Handler Exception: SYSTEM                                                    |
|Cause:Exception                                                              |
|Message:javax.management.MBeanException: MANAGER                             |
|CLUSTER_MEMBER(true)                                                         |
|STATUS(FAIL)                                                                 |
|Exception: ConnectionException                                               |
|Message: getResponseQueue():No response queue found for id: 1552059204364    |
The Answer
The Tungsten Manager is unable to communicate with a remote resource or has insufficient memory
Here are some possibilities to consider:
- Network blockage - if the Manager is unable to communicate with the target layer (i.e. the Replicator or another Manager), then the error shown above will occur (see the connectivity check sketch just below this list)
- Manager tuning - if restarting the Manager on all nodes clears the issue, this indicates that the Manager is starved for resources
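For a quick first look at the network side, it can help to confirm from each node that every other node both answers ping and accepts connections on the Manager's group-communication port. The sketch below makes several assumptions: the host names db1/db2/db3 and port 7800 are taken from the examples later in this post, and it relies on ping and nc being available on the host; adjust all of these for your own deployment.

#!/bin/bash
# Minimal connectivity sketch - host names and port are assumptions,
# substitute the values for your own cluster.
NODES="db1 db2 db3"   # node names as used in the examples in this post
PORT=7800             # Manager group-communication port seen in the examples below

for host in $NODES; do
    echo "=== $host ==="
    # Same ping invocation the Manager itself uses in its connectivity checks
    ping -c 1 -w 2 "$host" > /dev/null && echo "ping: ALIVE" || echo "ping: FAILED"
    # Simple TCP check of the group-communication port
    nc -z -w 2 "$host" "$PORT" && echo "port $PORT: open" || echo "port $PORT: blocked"
done

If, on the other hand, a rolling restart of the Manager on every node is what actually clears the problem, that points to resource starvation rather than the network, and the tuning settings in the Solution section below are the place to start.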
The Solution
So what may be done to alleviate the problem?
- Manager tuning - earlier versions of Tungsten Clustering did not allocate sufficient resources to the Manager's JVM, so make the following three configuration changes via tpm update:
mgr-heap-threshold=200
property=wrapper.java.initmemory=80
mgr-java-mem-size=250
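For an INI-based installation, one way these settings might be applied is to add them to the [defaults] section of the Tungsten configuration file and then run tpm update. The file path /etc/tungsten/tungsten.ini below is an assumption (it is a common default); adjust the path, section, and update procedure to match your own deployment.

# /etc/tungsten/tungsten.ini (path assumed - adjust for your deployment)
[defaults]
# ... existing settings ...
mgr-heap-threshold=200
mgr-java-mem-size=250
property=wrapper.java.initmemory=80

Then apply the change:

shell> tpm update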
- Network blockage - make sure the replicators are all online and caught up, and check the Manager's view of the cluster using the following commands on every node (a quick replicator status check is sketched after the examples below):
cctrl> ls
cctrl> members
cctrl> ping
cctrl> ls resources
cctrl> cluster validate
cctrl> show alarms
Here are examples:
tungsten@db1:/home/tungsten # cctrl

[LOGICAL] /east > members
east/db1(ONLINE)/10.0.0.126:7800
east/db2(ONLINE)/10.0.0.185:7800
east/db3(ONLINE)/10.0.0.7:7800

[LOGICAL] /east > ping
NETWORK CONNECTIVITY: PING TIMEOUT=2
NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
HOST db1/10.0.0.126: ALIVE (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.126
NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
HOST db2/10.0.0.185: ALIVE (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.0.0.185
NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
HOST db3/10.0.0.7: ALIVE (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.7

[LOGICAL] /east > cluster validate
========================================================================
CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
QUORUM SET MEMBERS ARE: db1, db3, db2
SIMPLE MAJORITY SIZE: 2
GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
VALIDATED DB MEMBERS ARE: db1, db3, db2
REACHABLE DB MEMBERS ARE: db1, db3, db2
========================================================================
MEMBERSHIP IS VALID BASED ON VIEW/VALIDATED CONSOLIDATED MEMBERS CONSISTENCY
CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
VALIDATION STATUS=VALID
CLUSTER ACTION=NONE

[LOGICAL] /east > ls resources
+----------------------------------------------------------------------------+
|RESOURCES                                                                    |
+----------------------------------------------------------------------------+
| db1:DATASERVER: ONLINE                                                      |
| db1:MANAGER: ONLINE                                                         |
| db1:MEMBER: ONLINE                                                          |
| db1:REPLICATOR: ONLINE                                                      |
| db2:DATASERVER: ONLINE                                                      |
| db2:MANAGER: ONLINE                                                         |
| db2:MEMBER: ONLINE                                                          |
| db2:REPLICATOR: ONLINE                                                      |
| db3:DATASERVER: ONLINE                                                      |
| db3:MEMBER: ONLINE                                                          |
| db3:REPLICATOR: ONLINE                                                      |
| west:CLUSTER: ONLINE                                                        |
+----------------------------------------------------------------------------+

[LOGICAL] /east > show alarms
+----------------------------------------------------------------------------+
|ALARMS                                                                       |
+----------------------------------------------------------------------------+
|                                                                             |
+----------------------------------------------------------------------------+
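To cover the "replicators are all online and caught up" part of the checklist, trepctl can be queried on each node as well. The loop below is only a sketch: the db1/db2/db3 host names are taken from the examples above, it assumes passwordless SSH as the tungsten OS user, and it simply pulls the state and appliedLatency fields out of trepctl status.

# Sketch: confirm every replicator is ONLINE and not lagging.
# Host names and ssh access are assumptions - adjust for your deployment.
for host in db1 db2 db3; do
    echo "=== $host ==="
    ssh tungsten@"$host" "trepctl status" | grep -E 'state|appliedLatency'
done

On a healthy node this should report state : ONLINE and a small appliedLatency value; any OFFLINE or ERROR state, or a large latency, should be resolved before attempting the switch again.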
Summary
The Wrap-Up
In this blog post we discussed what would cause a node switch to fail in a Tungsten Cluster and what may be done about it.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
The Library
Please read the docs!
For more information about Tungsten clusters, please visit https://docs.continuent.com
Tungsten Clustering is the most flexible, performant global database layer available today - use it underneath your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us