Monitoring your infrastructure – Zabbix

Tuesday, September 14th, 2010 | hardware, linux, useful tools | 2 Comments

Hi there – I’m afraid I’ve neglected the blog for the last few months – it’s been a busy spring and summer. I’ll try to post more regular articles now that the evenings are closing in again!

As you grow your infrastructure – one of the growing pains you’ll encounter is how to keep an eye on how your systems are running. Sure, you can come in every morning and login to each of your servers, maybe scan the logs and run a few commands like htop and dstat to verify that things are working ok. This approach doesn’t scale very well (it might work for 2 or 3 machines, but will be problematic with 50 machines). So you need something to monitor your infrastructure, ideally something that will do all of the following,

  1. Monitor all of your systems and notify you if there is a “problem” on any system.
  2. Store historical data for some key performance parameters on the system (it is useful to understand what kind of loads your systems normally run and whether these loads are increasing over time).
  3. Provide 1. and 2. via an easy to configure and use interface.
  4. Provide a nice graphical display of this data – for scanning for performance problems.
  5. Automatically perform actions on systems being monitored in response to certain events.

There are many open source and commercial applications for monitoring systems which meet the above requirements – see Wikipedia’s Comparison of network monitoring systems for a partial list of so called Network Monitoring / Management Systems.HP OpenView is the 800 lb gorilla of commercial network/system management tools but seems to have morphed into a whole suite of management and monitoring tools now. In Linux circles, the traditional solution for this problem has been Nagios. It has a reputation for being stable and reliable and has a huge community of users. On the other hand (based on my experiences while evaluating a number of different tools), it is configured with a series of text files which take some getting to grips with and a lot of functionality (like graphing and database storage) is provided through plugins (which themselves require installation and configuration). I found the default configuration to ugly and a little unfriendly and while there is a large community to help you – the core documentation is not great. There was a fork of Nagios, called Icinga which set out to address some of those problems – I haven’t checked how it’s progressing (but a quick look at their website suggests they have made a few releases). Kris Buytaert has a nice presentation about some of the main open source system monitoring tools from 2008 (which still seems pretty relevant).

After evaluating a few different systems, I settled on Zabbix as one which seemed to meet most of my requirements.  It is a GPL licensed network management system. One of the main reasons I went with Zabbix is because it includes a very nice, fully functional web interface. The agents for Zabbix (the part of Zabbix that sits on the system being monitored) are included in most common distributions (and while the distributions don’t always include the most recent release of the Zabbix agent, newer releases of Zabbix work well with older releases of the agents). Also, Zabbix is backed by a commercial/support entity which continues to make regular releases, which is a good sign. For those with really large infrastructures, Zabbix also seems to include a nicely scalable architecture. I only plan on using it to monitor about 100 systems so this functionality isn’t particularly important to me yet.

While our chosen distributions (Ubuntu and Debian) include recent Zabbix releases, I opted to install the latest stable release by hand directly from Zabbix – as some of the most recent functionality and performance improvements were of interest to me. We configured Zabbix to work with our MySQL database but it should work with Postgres or Oracle equally well. It does put a reasonable load on your database but that can be tuned depending on how much data you want to store, for how long and so on.

I’ve been using Zabbix for about 18 months now in production mode. As of this morning, it tells me it is monitoring 112 servers and 7070 specific parameters from those servers. The servers are mainly Linux servers although Zabbix does have support for monitoring Windows systems also and we do have one token Windows system (to make fun of ). Zabbix also allows us to monitor system health outside of the operating system level if a server supports the Intelligent Platform Management Interface (IPMI). We’re using this to closely monitor the temperature, power and fan performance on one of our more critical systems (a 24TB NAS from Scalable Informatics). Finally, as well as monitoring OS and system health parameters, Zabbix includes Web monitoring functionality which allows you to monitor the availability and performance of web based services over time. This functionality allows Zabbix to periodically log into a web application and run through a series of typical steps that a customer would perform. We’ve found this really useful for monitoring the availability and behaviour of our web apps over time (we’re monitoring 20 different web applications with a bunch of different scenarios).

As well as monitoring our systems and providing useful graphs to analyse performance over time, we are using Zabbix to send alerts when key services or systems become unavailable or error conditions like disks filling up or systems becoming overloaded occur. At the moment we are only sending email alerts but Zabbix also includes support for SMS and Jabber notifications depending on what support arrangements your organisation has.

On the downside, Zabbix’s best feature (from my perspective) is also the source of a few of it’s biggest problems – the web interface makes it really easy to begin using Zabbix – but it does have limitations and can make configuring large numbers of systems a little tiresome (although Zabbix does include a templating system to apply a series of checks or tests to a group of similar systems). While Zabbix comes with excellent documentation, some things can take a while to figure out (the part of Zabbix for sending alerts can be confusing to configure). To be fair to the Zabbix team, they are receptive to bugs and suggestions and are continuously improving the interface and addressing these limitations.

At the end of the day, I doesn’t matter so much what software you are using to monitor your systems. What is important is that you have basic monitoring functionality in place. There are a number of very good free and commercial solutions in place. While it can take time to put monitoring in place for everything in your infrastructure, even tracking the availability of your main production servers can reap huge benefits – and may allow you to rectify many problems before your customers (or indeed management) notice that a service has gone down. Personally, I’d recommend Zabbix – it has done a great job for us – but there are many great alternatives out there too. For those of you reading this and already using a monitoring system – what you are using and are you happy with it?

Tags: , , , ,

Debian 5.0 (Lenny) install on Software RAID

Monday, September 28th, 2009 | linux | 10 Comments

As mentioned in previous posts, I’m a big fan of Linux Software RAID. Most of the Ubuntu servers I install these days are configured with two disks in a RAID1 configuration. Contrary to recommendations you’ll find elsewhere, I put all partitions on RAID1, not some (that includes swap, /boot and / – in fact I don’t normally create a separate /boot partition, leaving it on the same partition as /). I guess if you’re using RAID1, I think you should get the advantage of it for all of your data, not just the really, really important stuff on a single RAIDed partition.

When installing Ubuntu (certainly recent releases including 8.10 and 9.04) you can configure all of this through the standard installation process – creating your partitions first, flagged for use in RAID and then configuring software RAID and creating a number of software RAID volumes.

I was recently installing a Debian 5.0 server and wanted to go with a similar config to the following,

Physical device Size Software RAID device Filesystem Description
/dev/sda1 6GB /dev/md0 swap Double the system physical memory
/dev/sdb1 6GB
/dev/sda2 10GB /dev/md1 ext3, / You can split this into multiple partitions for /var, /home and so on
/dev/sdb2 10GB
/dev/sda3 40GB /dev/md2 ext3, /data Used for critical application data on this server
/dev/sdb3 40GB

When I followed a standard install of Debian using the above configuration, when it came to installing GRUB, it failed with an error. The error seemed to be related to the use of Software RAID. Searching the web for possible solutions mostly turned up suggestions to create a non-RAIDed /boot partition but since this works on Ubuntu I figured it should also work on Debian (from which Ubuntu is largely derived).

First, a little background to GRUB and Linux Software RAID. It seems that GRUB cannot read Linux software RAID devices (which it needs to do to start the boot process). What it can do, is read standard Linux partitions. Given that Linux software RAID1 places a standard copy of a Linux partition on each RAID device, you can simply configure GRUB against the Linux partition and, at a GRUB level, ignore the software RAID volume. This seems to be how the Ubuntu GRUB installer works. A GRUB configuration stanza like the following should thus work without problems,

title           Debian GNU/Linux, kernel 2.6.26-2-amd64
root            (hd0,1)
kernel          /boot/vmlinuz-2.6.26-2-amd64 root=/dev/md1 ro
initrd          /boot/initrd.img-2.6.26-2-amd64

When I tried a configuration like this on my first install of Debian on the new server, it failed with the aforementioned error. Comparing a similarly configured Ubuntu server with the newly installed Debian server, the only obvious difference I could see is that the partition table on the Ubuntu server uses the old msdos format while the partition table on the Debian server seems to be in GPT format. I can’t find any documentation on when this change was made in Debian (or indeed whether it was something in my configuration that specifically triggered the use of GPT) but it seems like this was the source of the problems for GRUB.

To circumvent the creation of a GPT partition table on both disks, I restarted the Debian installer in Expert mode and installed the optional parted partitioning module when prompted. Before proceeding to the partitioning disks stage of the Debian installation, I moved to a second virtual console (Alt-F2) and started parted against each disk and ran the mklabel command to create a new partition table. When prompted for the partition table type, I input msdos.

I then returned to the Debian installer (Alt-F1) and continued the installation in the normal way – the partitioner picks up that the disks already have an partition table and uses that rather than recreating it.

This time, when it came to the GRUB bootloader installation step, it proceeded without any errors and I completed the installation of a fully RAIDed system.

Tags: , , , , , , , ,