Monitoring your infrastructure – Zabbix

Tuesday, September 14th, 2010 | hardware, linux, useful tools | 2 Comments

Hi there – I’m afraid I’ve neglected the blog for the last few months – it’s been a busy spring and summer. I’ll try to post more regular articles now that the evenings are closing in again!

As you grow your infrastructure – one of the growing pains you’ll encounter is how to keep an eye on how your systems are running. Sure, you can come in every morning and log in to each of your servers, maybe scan the logs and run a few commands like htop and dstat to verify that things are working ok. This approach doesn’t scale very well (it might work for 2 or 3 machines, but will be problematic with 50 machines). So you need something to monitor your infrastructure, ideally something that will do all of the following:

  1. Monitor all of your systems and notify you if there is a “problem” on any system.
  2. Store historical data for some key performance parameters on the system (it is useful to understand what kind of loads your systems normally run and whether these loads are increasing over time).
  3. Provide 1. and 2. via an easy to configure and use interface.
  4. Provide a nice graphical display of this data – for scanning for performance problems.
  5. Automatically perform actions on systems being monitored in response to certain events.

There are many open source and commercial applications for monitoring systems which meet the above requirements – see Wikipedia’s Comparison of network monitoring systems for a partial list of so-called Network Monitoring / Management Systems. HP OpenView is the 800 lb gorilla of commercial network/system management tools but seems to have morphed into a whole suite of management and monitoring tools now.

In Linux circles, the traditional solution for this problem has been Nagios. It has a reputation for being stable and reliable and has a huge community of users. On the other hand (based on my experiences while evaluating a number of different tools), it is configured with a series of text files which take some getting to grips with, and a lot of functionality (like graphing and database storage) is provided through plugins (which themselves require installation and configuration). I found the default configuration to be ugly and a little unfriendly, and while there is a large community to help you – the core documentation is not great. There is a fork of Nagios called Icinga, which set out to address some of those problems – I haven’t checked how it’s progressing (but a quick look at their website suggests they have made a few releases). Kris Buytaert has a nice presentation about some of the main open source system monitoring tools from 2008 (which still seems pretty relevant).

After evaluating a few different systems, I settled on Zabbix as one which seemed to meet most of my requirements. It is a GPL licensed network management system. One of the main reasons I went with Zabbix is that it includes a very nice, fully functional web interface. The agents for Zabbix (the part of Zabbix that sits on the system being monitored) are included in most common distributions (and while the distributions don’t always include the most recent release of the Zabbix agent, newer releases of Zabbix work well with older releases of the agents). Also, Zabbix is backed by a commercial/support entity which continues to make regular releases, which is a good sign. For those with really large infrastructures, Zabbix also seems to include a nicely scalable architecture. I only plan on using it to monitor about 100 systems so this functionality isn’t particularly important to me yet.

While our chosen distributions (Ubuntu and Debian) include recent Zabbix releases, I opted to install the latest stable release by hand directly from Zabbix – as some of the most recent functionality and performance improvements were of interest to me. We configured Zabbix to work with our MySQL database but it should work with Postgres or Oracle equally well. It does put a reasonable load on your database but that can be tuned depending on how much data you want to store, for how long and so on.

I’ve been using Zabbix for about 18 months now in production mode. As of this morning, it tells me it is monitoring 112 servers and 7070 specific parameters from those servers. The servers are mainly Linux servers, although Zabbix does have support for monitoring Windows systems too and we do have one token Windows system (to make fun of). Zabbix also allows us to monitor system health outside of the operating system level if a server supports the Intelligent Platform Management Interface (IPMI). We’re using this to closely monitor the temperature, power and fan performance on one of our more critical systems (a 24TB NAS from Scalable Informatics). Finally, as well as monitoring OS and system health parameters, Zabbix includes Web monitoring functionality which allows you to monitor the availability and performance of web based services over time. This functionality allows Zabbix to periodically log into a web application and run through a series of typical steps that a customer would perform. We’ve found this really useful for monitoring the availability and behaviour of our web apps over time (we’re monitoring 20 different web applications with a bunch of different scenarios).
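To make the idea of a web scenario a bit more concrete, here is a rough Python sketch of a series of steps, each checked for an expected status code and a piece of expected text, with the response time recorded. This is only an illustration of the concept and not how Zabbix implements it: the names Step, http_fetch and run_scenario are made up for this example, and stopping at the first failing step is simply a choice I made for the sketch. It also doesn’t keep cookies between steps, which a real login flow would need.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One step of a hypothetical scenario (names are illustrative only)."""
    name: str
    url: str
    expected_status: int = 200
    required_text: str = ""  # empty string means "don't check the body"


def http_fetch(url: str) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body); status 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""   # server answered, but with an error status
    except urllib.error.URLError:
        return 0, ""          # no answer at all (DNS failure, refused, ...)


def run_scenario(
    steps: List[Step],
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
) -> List[Tuple[str, bool, float]]:
    """Run steps in order; return (step name, passed, seconds) per step run."""
    results = []
    for step in steps:
        started = time.monotonic()
        status, body = fetch(step.url)
        elapsed = time.monotonic() - started
        passed = status == step.expected_status and step.required_text in body
        results.append((step.name, passed, elapsed))
        if not passed:
            break  # later steps usually depend on earlier ones succeeding
    return results
```

Passing the fetch function in as a parameter means the scenario logic can be tried out against canned responses without touching a real web application.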

As well as monitoring our systems and providing useful graphs to analyse performance over time, we are using Zabbix to send alerts when key services or systems become unavailable, or when error conditions occur, such as disks filling up or systems becoming overloaded. At the moment we are only sending email alerts but Zabbix also includes support for SMS and Jabber notifications, depending on what support arrangements your organisation has.
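The basic idea behind this kind of alerting is simple: read a value, compare it with a limit, and raise a problem if the limit is reached. Here is a minimal Python sketch of that idea. It is not Zabbix’s trigger mechanism, and the names Check, disk_used_fraction and evaluate, along with the 0.9 limit, are assumptions I’ve made up for the example.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """A hypothetical threshold check (illustrative, not a Zabbix object)."""
    name: str                         # label used in the alert text
    read_value: Callable[[], float]   # returns the current measurement
    limit: float                      # alert when the value reaches this


def disk_used_fraction(path: str = "/") -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free (bytes)
    return usage.used / usage.total


def evaluate(checks: List[Check]) -> List[str]:
    """Return one alert message for every check at or above its limit."""
    alerts = []
    for check in checks:
        value = check.read_value()
        if value >= check.limit:
            alerts.append(
                f"PROBLEM: {check.name} = {value:.2f} (limit {check.limit:.2f})"
            )
    return alerts


if __name__ == "__main__":
    # Assumed limit: complain once the root filesystem is 90% full.
    for message in evaluate([Check("root disk used", disk_used_fraction, 0.9)]):
        print(message)
```

A real monitoring system adds the parts that make this useful at scale: collecting values from many machines, storing history, and deciding who gets notified and how.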

On the downside, Zabbix’s best feature (from my perspective) is also the source of a few of its biggest problems – the web interface makes it really easy to begin using Zabbix – but it does have limitations and can make configuring large numbers of systems a little tiresome (although Zabbix does include a templating system to apply a series of checks or tests to a group of similar systems). While Zabbix comes with excellent documentation, some things can take a while to figure out (the part of Zabbix for sending alerts can be confusing to configure). To be fair to the Zabbix team, they are receptive to bugs and suggestions and are continuously improving the interface and addressing these limitations.

At the end of the day, it doesn’t matter so much what software you are using to monitor your systems. What is important is that you have basic monitoring functionality in place. There are a number of very good free and commercial solutions available. While it can take time to put monitoring in place for everything in your infrastructure, even tracking the availability of your main production servers can reap huge benefits – and may allow you to rectify many problems before your customers (or indeed management) notice that a service has gone down. Personally, I’d recommend Zabbix – it has done a great job for us – but there are many great alternatives out there too. For those of you reading this and already using a monitoring system – what are you using and are you happy with it?
