Overview
The Skinny
In this blog post we will discuss how to best integrate various Continuent-bundled cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.
Agenda
What's Here?
- Briefly explore the bundled cluster monitoring tools
- Describe the procedure for establishing alerting via PagerDuty
- Examine some of the multiple monitoring tools included with the Continuent Tungsten Clustering software, and provide examples of how to send an email to PagerDuty from each of the tools.
Exploring the Bundled Cluster Monitoring Tools
A Brief Summary
Continuent provides multiple methods out of the box to monitor the cluster health. The most popular is the suite of Nagios/NRPE scripts (i.e. cluster-home/bin/check_tungsten_*). We also have Zabbix scripts (i.e. cluster-home/bin/zabbix_tungsten_*). Additionally, there is a standalone script available, tungsten_monitor, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.
Implementing a Simple PagerDuty Alert
How To Add a PagerDuty Email Endpoint for Alerting
- Create a new user to get the alerts:
  Configuration -> Users -> Click on the [+ Add Users] button
  - Enter the desired email address and invite. Be sure to respond to the invitation before proceeding.
- Create a new escalation policy:
  Configuration -> Escalation Policies -> Click on the [+ New Escalation Policy] button
  - Enter the policy name at the top, e.g. Continuent Alert Escalation Policy
  - "Notify the following users or schedules" - click in the box and select the new user created in the first step
  - "escalates after" - set to 1 minute, or your desired value
  - "If no one acknowledges, repeat this policy X times" - set to 1 time, or your desired value
  - Finally, click on the green [Save] button at the bottom
- Create a new service:
  Configuration -> Services -> Click on the [+ New Service] button
  - General Settings: Name - Enter the service name, e.g. Continuent Alert Emails from Monitoring (what you type in this box will automatically populate the
  - Integration Settings: Integration Type - Click on the second radio choice "Integrate via email"
  - Integration Settings: Integration Name - Email (automatically set for you, no action needed here)
  - Integration Settings: Integration Email - Adjust the address prefix, e.g. alerts, then copy the full email address somewhere for later use
  - Incident Settings: Escalation Policy - Select the Escalation Policy you created in the second step, e.g. "Continuent Alert Escalation Policy"
  - Incident Settings: Incident Timeouts - Check the box in front of Auto-resolution
  - Finally, click on the green [Add Service] button at the bottom
At this point, you should have an email address like "alerts@yourCompany.pagerduty.com" available for testing.
Go ahead and send a test email to that email address to make sure the alerting is working.
If the test works, you have successfully set up a PagerDuty email endpoint to use for alerting. Congratulations!
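One way to send that test message from a shell is sketched below. It assumes a mailx-style mail command and a working local mail transfer agent on the host. The helper name send_test_alert and the message text are illustrative only; they are not part of PagerDuty or Tungsten Clustering.

```shell
#!/bin/sh
# Sketch: send a test message to the PagerDuty integration address.
# ASSUMPTION: a mailx-style `mail` command and a working local MTA.
# send_test_alert is a hypothetical helper name, not a bundled tool.

# send_test_alert TO [SUBJECT]
send_test_alert() {
    to="$1"
    subject="${2:-Test alert from $(hostname)}"
    # The message body is read from stdin; -s sets the subject line.
    printf 'This is a test of the PagerDuty email integration.\n' \
        | mail -s "$subject" "$to"
}

# Usage (use the address of the service you created above):
# send_test_alert alerts@yourCompany.pagerduty.com
```

If the message arrives, PagerDuty should open an incident on the new service and notify the user named in the escalation policy.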
How to Send Alerts to PagerDuty using the tungsten_monitor Script
Invoking the Bundled Script via cron
The tungsten_monitor script provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available.
Each time tungsten_monitor runs, it will execute a standard set of checks:
- Check that all Tungsten services for this host are running
- Check that all replication services and datasources are ONLINE
- Check that replication latency does not exceed a specified amount
- Check that the local connector is responsive
- Check disk usage
Additional checks may be enabled using various command line options.
The tungsten_monitor script is able to send you an email when problems are found.
It is suggested that you run the script as root so it is able to use the mail program without warnings.
Alerts are cached to prevent them from being sent multiple times and flooding your inbox. You may pass --reset to clear out the cache or --lock-timeout to adjust the amount of time this cache is kept. The default is 3 hours.
An example root crontab entry to run tungsten_monitor every five minutes:
*/5 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null
An alternate example root crontab entry to run tungsten_monitor every five minutes, in case your version of cron does not support the */5 step syntax:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null
All messages will be sent to /opt/continuent/share/tungsten_monitor/lastrun.log
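Because the crontab entries above discard all output, the log file is the only local trace of each run. The sketch below checks whether that file has been modified recently, which is a rough way to confirm that cron is still invoking the script. It assumes the log is rewritten on every run, as its name suggests, and that your find supports -mmin (GNU and BSD find do). The helper name log_is_fresh is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that tungsten_monitor has written its log recently.
# ASSUMPTION: lastrun.log is rewritten on each run.
# log_is_fresh is a hypothetical helper, not a bundled tool.

LOG=/opt/continuent/share/tungsten_monitor/lastrun.log

# log_is_fresh FILE [MINUTES]
# Succeeds (exit 0) if FILE exists and was modified less than
# MINUTES minutes ago (default 10).
log_is_fresh() {
    file="$1"
    minutes="${2:-10}"
    [ -f "$file" ] || return 1
    # find prints the path only when the file changed within the window.
    [ -n "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]
}

# Usage:
# log_is_fresh "$LOG" 10 || echo "tungsten_monitor has not run in the last 10 minutes"
# tail -n 20 "$LOG"
```

A ten-minute window leaves room for one missed run when the script is scheduled every five minutes.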
The online documentation is here:
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-tungsten_monitor.html
Big Brother is Watching You!
The Power of Nagios and the check_tungsten_* scripts
We have two very descriptive blog posts about how to implement the Nagios-based cluster monitoring solution:
https://www.continuent.com/global-multimaster-cluster-monitoring-using-nagios/
https://www.continuent.com/essential-cluster-monitoring-using-nagios-and-nrpe/
We also have Nagios-specific documentation to assist with configuration:
http://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html
In the event you are unable to get Nagios working with Tungsten Clustering, please open a support case via our ZenDesk-based support portal https://continuent.zendesk.com/
For more information about getting support, visit https://docs.continuent.com/support-process/troubleshooting-support.html
There are many available NRPE-based check scripts, and the online documentation for each is listed below:
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-tungsten_health_check.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_services.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_progress.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_policy.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_online.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_latency.html
Big Brother Tells You
Tell the Nagios server how to contact PagerDuty
The key is to have a contact defined for the PagerDuty-specific email address, which is handled by the Nagios configuration file /opt/local/etc/nagios/objects/contacts.cfg:
objects/contacts.cfg
define contact{
    use                 generic-contact
    contact_name        pagerduty
    alias               PagerDuty Alerting Service Endpoint
    email               alerts@yourCompany.pagerduty.com
}
define contactgroup{
    contactgroup_name   admin
    alias               PagerDuty Alerts
    members             pagerduty,anotherContactIfDesired,etc
}
Teach the Targets
Tell NRPE on the Database Nodes What To Do
The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node:
/etc/nagios/nrpe.cfg
command[check_tungsten_online]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
command[check_tungsten_latency]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2.5 -c 4.0
command[check_tungsten_progress_alpha]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s alpha
command[check_tungsten_progress_beta]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s beta
command[check_tungsten_progress_gamma]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s gamma
Note that sudo is used so that the nrpe user can run the tungsten-owned check scripts as the tungsten user, using the sudo wildcard configuration.
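The sudo wildcard configuration is not shown in this post, so the sketch below illustrates one plausible form of it rather than quoting the Continuent documentation. It prints a single sudoers rule that lets a given account run any check_tungsten_* script as the tungsten user without a password. The account name (nrpe here, nagios on some distributions), the drop-in file name, and the helper name nrpe_sudoers_rule are all assumptions.

```shell
#!/bin/sh
# Sketch: generate a wildcard sudoers rule for the NRPE account.
# ASSUMPTION: the rule text is an illustration, not copied from the
# Continuent documentation. nrpe_sudoers_rule is a hypothetical helper.

CHECK_GLOB='/opt/continuent/tungsten/cluster-home/bin/check_tungsten_*'

# nrpe_sudoers_rule USER
# Prints a rule allowing USER to run the check scripts as tungsten.
nrpe_sudoers_rule() {
    printf '%s ALL=(tungsten) NOPASSWD: %s\n' "$1" "$CHECK_GLOB"
}

# Usage: validate the rule with visudo before installing it.
# nrpe_sudoers_rule nrpe > /tmp/nrpe-tungsten
# visudo -cf /tmp/nrpe-tungsten &&
#     sudo install -m 0440 /tmp/nrpe-tungsten /etc/sudoers.d/nrpe-tungsten
```

Because the rule lists no arguments after the path, sudo permits any arguments, which is what lets options such as -w 2.5 -c 4.0 pass through.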
Additionally, there is no harm in defining commands that may never be called. This keeps administration simple: keep the master copy in one place, push updates to all nodes as needed, and then restart nrpe.
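Pushing one master copy to every node could be scripted along the following lines. This is only a sketch: it assumes passwordless ssh, scp and sudo on each node, and a systemd unit named nrpe (on some distributions the unit is called nagios-nrpe-server instead). The helper name push_nrpe_cfg and the /tmp staging path are hypothetical.

```shell
#!/bin/sh
# Sketch: copy one master nrpe.cfg to every node and restart NRPE.
# ASSUMPTIONS: passwordless ssh/scp and sudo on each node, and a
# systemd unit named "nrpe". push_nrpe_cfg is a hypothetical helper.

# push_nrpe_cfg MASTER_FILE HOST...
# Stops at the first host that fails and returns non-zero.
push_nrpe_cfg() {
    src="$1"
    shift
    for host in "$@"; do
        # Stage the file, move it into place, then restart the daemon.
        if scp "$src" "$host:/tmp/nrpe.cfg" &&
           ssh "$host" 'sudo cp /tmp/nrpe.cfg /etc/nagios/nrpe.cfg && sudo systemctl restart nrpe'
        then
            echo "updated $host"
        else
            echo "FAILED $host" >&2
            return 1
        fi
    done
}

# Usage:
# push_nrpe_cfg ./nrpe.cfg db1 db2 db3 db4 db5 db6 db7 db8 db9
```

Stopping at the first failure keeps a bad configuration file from being rolled out to the rest of the nodes.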
Big Brother Sees You
Tell the Nagios server to begin watching
Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:
objects/services.cfg
# Service definition
define service{
    service_description     check_tungsten_online for all cluster nodes
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command           check_nrpe!check_tungsten_online
    contact_groups          admin
    use                     generic-service
}
# Service definition
define service{
    service_description     check_tungsten_latency for all cluster nodes
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command           check_nrpe!check_tungsten_latency
    contact_groups          admin
    use                     generic-service
}
# Service definition
define service{
    service_description     check_tungsten_progress for alpha
    host_name               db1,db2,db3
    check_command           check_nrpe!check_tungsten_progress_alpha
    contact_groups          admin
    use                     generic-service
}
# Service definition
define service{
    service_description     check_tungsten_progress for beta
    host_name               db4,db5,db6
    check_command           check_nrpe!check_tungsten_progress_beta
    contact_groups          admin
    use                     generic-service
}
# Service definition
define service{
    service_description     check_tungsten_progress for gamma
    host_name               db7,db8,db9
    check_command           check_nrpe!check_tungsten_progress_gamma
    contact_groups          admin
    use                     generic-service
}
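Before relying on these service definitions, it can help to exercise each NRPE command by hand from the Nagios server. The sketch below runs one command against a list of hosts with the check_nrpe plugin and translates the standard Nagios plugin exit codes (0, 1, 2, 3) into their names. The plugin path is an assumption and varies by installation; the helper names status_name and run_check are hypothetical.

```shell
#!/bin/sh
# Sketch: run one NRPE command against several hosts by hand.
# ASSUMPTION: the check_nrpe path below; adjust it to your install.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# status_name CODE
# Maps a Nagios plugin exit code to its name; 3 and any
# unexpected value are reported as UNKNOWN.
status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# run_check COMMAND HOST...
# Prints "host: STATUS: plugin output" for each host.
run_check() {
    cmd="$1"
    shift
    for host in "$@"; do
        output=$("$CHECK_NRPE" -H "$host" -c "$cmd" 2>&1)
        rc=$?
        echo "$host: $(status_name "$rc"): $output"
    done
}

# Usage:
# run_check check_tungsten_online db1 db4 db7
# run_check check_tungsten_progress_alpha db1 db2 db3
```

A host that reports UNKNOWN here usually points at an NRPE or sudo problem rather than a cluster problem, so it is worth resolving before Nagios starts sending alerts to PagerDuty.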
Summary
The Wrap-Up
In this blog post we discussed how to best integrate various cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
The Library
Please read the docs!
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.
Below is a list of the Nagios NRPE plugin scripts provided by Tungsten Clustering. The documentation page for each is linked in the Nagios section above.
- check_tungsten_latency - reports warning or critical status based on the replication latency levels provided.
- check_tungsten_online - checks whether all the hosts in a given service are online and running. This command only needs to be run on one node within the service; the command returns the status for all nodes. The service name may be specified by using the -s SVCNAME option.
- check_tungsten_policy - checks whether the policy is in AUTOMATIC mode and returns a CRITICAL if not.
- check_tungsten_progress - executes a heartbeat operation and validates that the sequence number has incremented within a specific time period. The default is one (1) second, and may be changed using the -t SECS option.
- check_tungsten_services - confirms that the services and processes are running; their state is not confirmed. To check state with a similar interface, use the check_tungsten_online command.
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us.