useful tools
Monitoring your infrastructure – Zabbix
Hi there – I’m afraid I’ve neglected the blog for the last few months – it’s been a busy spring and summer. I’ll try to post more regular articles now that the evenings are closing in again!
As you grow your infrastructure, one of the growing pains you’ll encounter is how to keep an eye on how your systems are running. Sure, you can come in every morning and log in to each of your servers, maybe scan the logs and run a few commands like htop and dstat to verify that things are working OK. This approach doesn’t scale very well (it might work for 2 or 3 machines, but will be problematic with 50 machines). So you need something to monitor your infrastructure, ideally something that will do all of the following:
- Monitor all of your systems and notify you if there is a “problem” on any system.
- Store historical data for some key performance parameters on the system (it is useful to understand what kind of loads your systems normally run and whether these loads are increasing over time).
- Provide the two capabilities above via an interface that is easy to configure and use.
- Provide a nice graphical display of this data – for scanning for performance problems.
- Automatically perform actions on systems being monitored in response to certain events.
There are many open source and commercial applications for monitoring systems which meet the above requirements. See Wikipedia’s Comparison of network monitoring systems for a partial list of so-called Network Monitoring / Management Systems. HP OpenView is the 800 lb gorilla of commercial network/system management tools but seems to have morphed into a whole suite of management and monitoring tools now.
In Linux circles, the traditional solution for this problem has been Nagios. It has a reputation for being stable and reliable and has a huge community of users. On the other hand (based on my experiences while evaluating a number of different tools), it is configured with a series of text files which take some getting to grips with, and a lot of functionality (like graphing and database storage) is provided through plugins (which themselves require installation and configuration). I found the default configuration to be ugly and a little unfriendly, and while there is a large community to help you, the core documentation is not great. There is a fork of Nagios called Icinga which set out to address some of those problems. I haven’t checked how it’s progressing (but a quick look at their website suggests they have made a few releases). Kris Buytaert has a nice presentation about some of the main open source system monitoring tools from 2008 (which still seems pretty relevant).
After evaluating a few different systems, I settled on Zabbix as one which seemed to meet most of my requirements. It is a GPL licensed network management system. One of the main reasons I went with Zabbix is because it includes a very nice, fully functional web interface. The agents for Zabbix (the part of Zabbix that sits on the system being monitored) are included in most common distributions (and while the distributions don’t always include the most recent release of the Zabbix agent, newer releases of Zabbix work well with older releases of the agents). Also, Zabbix is backed by a commercial/support entity which continues to make regular releases, which is a good sign. For those with really large infrastructures, Zabbix also seems to include a nicely scalable architecture. I only plan on using it to monitor about 100 systems so this functionality isn’t particularly important to me yet.
While our chosen distributions (Ubuntu and Debian) include recent Zabbix releases, I opted to install the latest stable release by hand directly from Zabbix – as some of the most recent functionality and performance improvements were of interest to me. We configured Zabbix to work with our MySQL database but it should work with Postgres or Oracle equally well. It does put a reasonable load on your database but that can be tuned depending on how much data you want to store, for how long and so on.
I’ve been using Zabbix in production for about 18 months now. As of this morning, it tells me it is monitoring 112 servers and 7070 specific parameters from those servers. The servers are mainly Linux servers, although Zabbix does have support for monitoring Windows systems also and we do have one token Windows system (to make fun of). Zabbix also allows us to monitor system health outside of the operating system level if a server supports the Intelligent Platform Management Interface (IPMI). We’re using this to closely monitor the temperature, power and fan performance on one of our more critical systems (a 24TB NAS from Scalable Informatics). Finally, as well as monitoring OS and system health parameters, Zabbix includes Web monitoring functionality which allows you to monitor the availability and performance of web based services over time. This functionality allows Zabbix to periodically log into a web application and run through a series of typical steps that a customer would perform. We’ve found this really useful for monitoring the availability and behaviour of our web apps over time (we’re monitoring 20 different web applications with a bunch of different scenarios).
As well as monitoring our systems and providing useful graphs to analyse performance over time, we are using Zabbix to send alerts when key services or systems become unavailable or error conditions like disks filling up or systems becoming overloaded occur. At the moment we are only sending email alerts but Zabbix also includes support for SMS and Jabber notifications depending on what support arrangements your organisation has.
On the downside, Zabbix’s best feature (from my perspective) is also the source of a few of its biggest problems. The web interface makes it really easy to begin using Zabbix, but it does have limitations and can make configuring large numbers of systems a little tiresome (although Zabbix does include a templating system to apply a series of checks or tests to a group of similar systems). While Zabbix comes with excellent documentation, some things can take a while to figure out (the part of Zabbix for sending alerts can be confusing to configure). To be fair to the Zabbix team, they are receptive to bugs and suggestions and are continuously improving the interface and addressing these limitations.
At the end of the day, it doesn’t matter so much what software you are using to monitor your systems. What is important is that you have basic monitoring functionality in place. There are a number of very good free and commercial solutions available. While it can take time to put monitoring in place for everything in your infrastructure, even tracking the availability of your main production servers can reap huge benefits, and may allow you to rectify many problems before your customers (or indeed management) notice that a service has gone down. Personally, I’d recommend Zabbix, which has done a great job for us, but there are many great alternatives out there too. For those of you reading this and already using a monitoring system: what are you using, and are you happy with it?
Parallel ssh
I’m increasingly working on clusters of systems, be they traditional HPC clusters running some MPI based software or less traditional clusters running software such as Hadoop’s HDFS and MapReduce.
In both cases, the underlying operating systems are largely the same – pretty standard Linux systems running one of the main Linux distributions (Debian, Ubuntu, Red Hat Enterprise Linux, CentOS, SuSE Linux Enterprise Server or OpenSUSE).
There are various tools for creating standard system images and pushing those to each of the cluster nodes – and I use those (more in a future post), but often, you need to perform the same task on a bunch of the cluster nodes or maybe all of them. This task is best achieved by simply ssh’ing into each of the nodes and running some command (be it a status command such as uptime, or ps or a command to install a new piece of software).
Normally, one or more users on the cluster will have been configured to use password-less logins with ssh so a first-pass at running ssh commands on multiple systems would be to script the ssh calls from a management cluster node. The following is an example script for checking the uptime on each node of our example cluster (which has nodes from cluster02 to cluster20, I’m assuming we’re running on cluster01).
#!/bin/bash
#
for addr in {2..20}
do
    num=`printf "%02d" $addr`
    echo -n "cluster${num}:" && ssh cluster${num} uptime
done
The script works, but the downside is that you have to create a new script each time you have a new command to run, or a slightly different sequence of actions you want to perform (you could improve the above by passing the command to be run as an argument to the script, but even then the approach is limited).
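As a rough sketch of that improvement, the loop can be wrapped in a function that takes the node range and the command as arguments. The function name run_on_nodes and the SSH_CMD override are my own inventions for illustration (SSH_CMD defaults to ssh and only exists so the loop can be dry-run with something like echo); the cluster naming follows the example above.

```shell
#!/bin/bash
# run_on_nodes FIRST LAST COMMAND...
# Runs COMMAND on clusterFIRST..clusterLAST, one node at a time.
# SSH_CMD is a hypothetical override (defaults to ssh) for dry runs.
run_on_nodes() {
    first=$1
    last=$2
    shift 2
    addr=$first
    while [ "$addr" -le "$last" ]; do
        node=$(printf "cluster%02d" "$addr")
        # </dev/null stops ssh from consuming the caller's stdin;
        # sed prefixes every output line with the node name.
        ${SSH_CMD:-ssh} "$node" "$@" </dev/null 2>&1 | sed "s/^/${node}: /"
        addr=$((addr + 1))
    done
}

# When the script is called with a command, run it on cluster02..cluster20.
if [ "$#" -gt 0 ]; then
    run_on_nodes 2 20 "$@"
fi
```

Saved as, say, a script on cluster01, this would be invoked with the command as its arguments (for example uptime). It still visits the nodes one after another, so it stays slow on larger clusters.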
What you really need at this stage is a parallel ssh, an ssh command which can be instructed to run the same command against multiple nodes. Ideally, the ssh command can merge the output from multiple systems if the output is the same – making it easier for the person running the parallel ssh command to understand which cluster nodes share the same status.
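To make the idea of merging output concrete, here is a minimal sketch that groups hosts by identical output. It assumes the input is one "host: output" line per host, and the function name group_by_output is made up for this example. It is not how any particular parallel ssh tool does it, and it only copes with single-line output per host.

```shell
# group_by_output: read "host: output" lines on stdin and print each
# distinct output once, preceded by the comma-separated hosts that gave it.
group_by_output() {
    awk '
    {
        sep = index($0, ": ")
        if (sep == 0) next                 # skip lines without a host prefix
        host = substr($0, 1, sep - 1)
        out = substr($0, sep + 2)
        if (!(out in hosts)) {
            order[++n] = out               # remember first-seen order
            hosts[out] = host
        } else {
            hosts[out] = hosts[out] "," host
        }
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%s\n    %s\n", hosts[order[i]], order[i]
    }'
}
```

Feeding it three lines where cluster01 and cluster03 agree and cluster02 differs prints two groups, which makes the odd node out easy to spot.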
A quick search through Debian’s packages and a Google for parallel ssh turns up a few candidates,
This linux.com article reviews a number of these shells.
After looking at a few of these, I’ve settled on using pdsh. Each of the tools listed above uses a slightly different approach: some provide multiple xterms in which to run commands, while others provide a lot of flexibility in how the output is combined. What I like about pdsh is that it provides a pretty straightforward syntax to invoke commands and, most importantly for me, it cleanly merges the output from multiple hosts, allowing me to very quickly see the differences in a command’s output from different hosts.
You will need to configure your password-less ssh operation as normal. Once you have done that, on Debian or Ubuntu, edit /etc/pdsh/rcmd_default and change the contents of this file to a single line containing the following,
ssh
(create the file if it doesn’t exist).
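If you would rather script that edit, something like the following should do it. The helper name set_rcmd_default is hypothetical, and its two optional arguments are there purely so it can be tried against a scratch path first; the defaults are the Debian/Ubuntu path and value given above.

```shell
# set_rcmd_default [FILE] [MODULE]
# Writes the rcmd module name as the only line of FILE.
# Defaults match the text above; run as root for the real path.
set_rcmd_default() {
    file=${1:-/etc/pdsh/rcmd_default}
    module=${2:-ssh}
    mkdir -p "$(dirname "$file")" && printf '%s\n' "$module" > "$file"
}

# Typical use, as root:
#   set_rcmd_default
# Per-user alternative that needs no root:
#   export PDSH_RCMD_TYPE=ssh
```

pdsh also accepts the module for a single invocation with -R ssh, which is handy for checking that ssh works before changing the default.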
Now you can run a command, such as date (to verify if NTP is working correctly) on multiple hosts with the following,
pdsh -w cluster[01-05] date
This runs date on cluster01,cluster02,cluster03,cluster04 and cluster05 and returns the output. To consolidate the output from multiple nodes into a compact display format, pdsh comes with a second tool called dshbak, used as follows,
pdsh -w cluster[01-05] date | dshbak -c
Personally, I find this output most readable. On Debian and Ubuntu systems, to invoke dshbak by default, edit /usr/bin/pdsh (which is a shell script wrapper) and change the invocation line from
exec -a pdsh /usr/bin/pdsh.bin "$@"
to
exec -a pdsh /usr/bin/pdsh.bin "$@" | /usr/bin/dshbak -c
Now when you invoke pdsh, its output will be piped through dshbak by default.
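An alternative that avoids editing a packaged file (which a package upgrade may overwrite) is a small shell function in your ~/.bashrc. The name pdshc is made up for this sketch; it simply pipes pdsh through dshbak -c as above.

```shell
# pdshc: run pdsh with the given arguments and consolidate identical
# output with dshbak -c. The name pdshc is hypothetical.
pdshc() {
    pdsh "$@" | dshbak -c
}

# Usage (the quotes stop the shell from treating [01-05] as a glob):
#   pdshc -w 'cluster[01-05]' date
```

The trade-off is that plain pdsh keeps its unmerged output, so you choose per command which form you want.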
sudo via ssh
By default, if you attempt to run sudo through ssh, when you respond to the password prompt from sudo – it will echo the password back on your console. To avoid this, provide the -t option to ssh which forces the remote session to behave as if run through a normal tty (and thus masks the password), for example,
ssh -t foo.example.com sudo shutdown -r now