Linux RAID solutions (Part I)
We have a customer that expects to generate a few hundred gigabytes of data a week. They have a small Linux cluster which we helped to build (I will blog about it soon, honest!) for oceanographic modelling – on a good week they expect to generate maybe 500GB of raw data. At the moment, they have 2 problems:
- How to store it reliably.
- How to back it up.
The cluster itself consists of a bunch of diskless dual-core Opteron compute nodes and a head node with a few hundred GB of local storage. We don’t want to turn that into a general storage server because it might impact the performance of a modelling run on the cluster, and besides, it’s a 1U box so it doesn’t have room for many disks.
It’s been a while since I looked closely at storage. When I worked in HP, I got the chance to play around with some of their SAN technology, including the EVA storage arrays which can store up to 84TB in a single cabinet. That kinda hardware, hooked up to each of your systems with fibre, is nice – you can scale your storage up as your needs increase, I/O speeds are the same as (or close to) local storage, and you can add a tape autoloader so the SAN practically manages its own backups.
Of course convenience comes at a price – as does scalability. In the case of our customer, I’m not sure it makes sense at this stage of their project to burn their budget on an enterprise-class storage solution like this. Sure, their data is important (maybe not quite as critical as payroll, but even if the model can be rerun to regenerate the data, it could still represent a week’s work), but maybe not important enough to justify those prices.
I’ve been investigating an alternative solution for them using off-the-shelf hardware plugged into a dedicated Linux storage server. It’s been a while since I looked at RAID hardware on Linux – SCSI RAID was the only option back then.
The RAID Controller
These days SATA RAID is looking pretty attractive from a price point of view. My research indicates that SATA RAID controllers are generally pretty well supported on Linux, although you have to watch out for host-based or software RAID (also sometimes known as fake RAID), where the RAID controller driver actually does a lot of the work in software rather than letting the RAID controller take care of all the work (dubbed hardware RAID). The 2 main issues with fake RAID from a Linux perspective are:
- Linux kernel drivers may not be sophisticated enough to implement all the functionality that the RAID controller vendor delivers in their Windows driver (in other words your RAID controller may not actually work as RAID controller under Linux).
- Even if the driver is capable, this means that the processor on your system, not the dedicated processor on the RAID controller, will have to do a lot of the work. This may not be an issue if, like the average desktop user, your processor is actually idle most of the time, but if you have a busy storage server handling lots of network requests, it may lead to problems (there’s a rough sketch of one way to check for this after the list).
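As a rough sanity check on whether a controller (or its driver) is leaning on the host CPU, something like the sketch below can help: it samples /proc/stat around a large sequential write and reports how busy the host processor was. The mount point and write size are just placeholders for the example – this is a back-of-the-envelope check, not a proper benchmark.

```python
#!/usr/bin/env python3
# Rough check: sample /proc/stat around a large sequential write and report how
# much of the elapsed time the host CPU spent busy. A controller doing real
# hardware RAID should leave the host largely idle; a fake-RAID driver won't.
# TEST_FILE and WRITE_MB are placeholder values for the example.

import os
import time

TEST_FILE = "/mnt/array/cpu_check.tmp"   # hypothetical mount point on the array
WRITE_MB = 1024                          # amount of data to write, in MB

def cpu_jiffies():
    """Return (busy, total) jiffies summed across all CPUs from /proc/stat."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]         # idle + iowait don't count as busy
    return sum(fields) - idle, sum(fields)

busy0, total0 = cpu_jiffies()
start = time.time()

block = b"\0" * (1024 * 1024)
with open(TEST_FILE, "wb") as f:
    for _ in range(WRITE_MB):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())                 # make sure the data actually goes out

elapsed = time.time() - start
busy1, total1 = cpu_jiffies()
os.unlink(TEST_FILE)

print("Wrote %dMB in %.1fs, host CPU busy for %.1f%% of the interval"
      % (WRITE_MB, elapsed, 100.0 * (busy1 - busy0) / (total1 - total0)))
```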
I guess I like my RAID storage to be a black box that’s not subject to the system running out of memory, system crashes or other operating system level failures.
Various research indicates that both the Adaptec 2820-SA and 3Ware 9550SX-8LP are true hardware RAID controllers that are well supported on Linux. I’m opting for 8-port cards to give us plenty of room for expansion, although initially I don’t expect to use more than 4 ports. I’m also opting for SATA-II cards to avail of the increased performance (we’re expecting this storage system to be in use for quite a while so it would be nice to be reasonably future-proof). I would normally lean towards Adaptec gear because I’ve had good experiences with them in the past, but it’s not obvious from their support site how well Linux distributions other than Red Hat and SuSE are supported. There’s no good technical reason not to make an effort to support your hardware on all Linux distributions, and since our customer is using Debian on all of their other systems, I’m inclined to favour a hardware vendor that supports Linux in a general way.
3Ware give the impression of being interested in supporting you regardless of what distribution you are using. This makes sense to me for a hardware vendor and gives me hope that if I do experience a problem down the road I’m less likely to get a response from their support people telling me that the distribution I’m using isn’t supported.
The Enclosure
I’ve been burned by cheap cases and enclosures in the past so I’m inclined to go with reliable vendors for cases. Problems with cheap, unreliable power supplies can cause endless headaches and can be very hard to track down. Both Supermicro and Chenbro seem to have a good reputation with at least some of the people on the Beowulf mailing list. A final decision hasn’t been made on this yet, but both the Supermicro SC833 and the Chenbro RM215 look good. The main thing is to ensure that the motherboard includes 133MHz PCI-X support.
The drives
There is considerable debate about the reliability of SATA drives versus SCSI. Certainly, SCSI drives have traditionally been intended for enterprise use, so manufacturers have worked on producing more reliable drives (at a higher price tag) than traditional consumer IDE drives. SATA drives are being used in both enterprise and consumer markets. The bottom line is that there is nothing inherently unreliable about SATA technology, but cheaper consumer drives certainly do have lower MTBF figures than enterprise SCSI drives. When it comes to purchasing SATA drives for our array, we’ll be looking to the more expensive SATA drives that come with a 3 or 5 year manufacturer’s warranty and higher MTBF figures.
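To put those MTBF figures in some context, here’s a rough back-of-the-envelope conversion from MTBF to an approximate annualised failure rate. The MTBF numbers in the sketch are purely illustrative – they’re not taken from any particular drive’s datasheet.

```python
# Back-of-the-envelope: what an MTBF figure means over a year of continuous use.
# The MTBF values are illustrative only, not taken from any particular datasheet.

HOURS_PER_YEAR = 24 * 365

def annual_failure_rate(mtbf_hours):
    """Approximate probability of a single drive failing within one year
    (a reasonable approximation while MTBF is much larger than a year)."""
    return HOURS_PER_YEAR / float(mtbf_hours)

for label, mtbf in [("consumer-class", 600000), ("enterprise-class", 1200000)]:
    afr = annual_failure_rate(mtbf)
    print("%s drive (MTBF %d hours): AFR ~%.1f%%, ~%.2f expected failures/year "
          "across a 4-drive array" % (label, mtbf, afr * 100, afr * 4))
```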
The configuration
We’re initially looking at logical storage of about 1TB and reliability is key, so I’m opting for a RAID 1 configuration. Later on, we may investigate RAID 0+1 for increased performance. Lots of people seem to like RAID 5 – it’s certainly cheap in terms of the number of disks given over to redundancy – but its write performance is poor. We’re going to start with 4 x 500GB drives and expand from there. This gives us 1TB of logical storage and room to expand.
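For anyone wondering where the numbers come from, here’s a quick sketch of the usable capacity you’d get from 4 x 500GB drives under the RAID levels mentioned above, using the simple textbook formulas (a real controller will reserve a little space for metadata on top of this).

```python
# Usable capacity for the RAID levels discussed above, using the simple
# textbook formulas -- a real controller will reserve a little space on top.

def usable_gb(level, disks, size_gb):
    if level == "RAID 0":                 # striping only, no redundancy
        return disks * size_gb
    if level in ("RAID 1", "RAID 0+1"):   # everything is mirrored
        return disks * size_gb // 2
    if level == "RAID 5":                 # one disk's worth of parity
        return (disks - 1) * size_gb
    raise ValueError("unknown RAID level: %s" % level)

for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 0+1"):
    print("%-8s with 4 x 500GB drives: %4dGB usable"
          % (level, usable_gb(level, 4, 500)))
```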
In the second part of this article I’ll look at our options for backups and maybe discuss the implementation if we get it rolled out by then.
Analysing web traffic
Phew, I was gonna say it’s amazing how fast a month disappears and then I notice that it’s more like 2 months since I posted. Good to see Jim, Albert, Rob and William have kept the blog entries coming.
One of the things I’ve been meaning to do for some time is take a look at the statistics for our website and the blog to get an idea of what kind of traffic we’re seeing and maybe divine how to improve the site to attract more of the right kind of traffic (people interested in our services). I’m using the word divine intentionally – sometimes analysing webserver statistics feels a bit like reading tea-leaves (or tasseography to the tea-leaf reading crowd).
Being something of a data packrat, I do of course have a whole year of webserver access logs stored away and ready for analysis. I’ve been using analog on and off for about 10 years to analyse webstats. Back in the day, it was very fast and produced nice detailed reports which were easily customised. I fired that up on my 12 months of web data first and it performed as well as usual.
When we were reviewing the statistics afterwards and trying to identify trends, Albert pointed out that it doesn’t provide any information about visitors, focusing rather on hits or pageviews. I guess I hadn’t been keeping an eye on the state of the art in web server statistics analysis and as he said it, I realised it was true. Some of the current crop of webserver statistic tools give you more detailed information on visitors rather than focusing solely on hits. This lets you glean information such as how long people are staying on your site, what route they are following through the site and where they exit from the site.
A quick trawl through Debian’s package list turned up another likely candidate for web log analysis in awstats, which sounded like it provides similar functionality to analog plus details of unique visitors (this comparison is quite detailed). This sounded like just the ticket so I went ahead and installed it.
The usage model for awstats is a little different to that of analog. awstats expects to be run periodically (either as a cgi-bin or from the command line) and to analyse both cached data from previous runs and new data from the latest logfile. It took a bit more effort to configure awstats to first analyse all of my archived weblogs and then parse the most recent one (you can’t throw the whole lot into the config file and let awstats work out how to read gzipped versus plain files) – analog takes whatever you throw at it and does the right thing. To be fair to awstats, it does document how to do this, but the ramp-up time to generating the desired reports is a bit longer than with analog. When it did finish chugging through my data, it produced a pretty decent report.
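One way to sidestep the gzipped-versus-plain issue is to merge everything into a single chronological stream yourself and hand that to the analyser. Something along the lines of this sketch would do it – the paths and naming scheme are made up for the example, and the merged output could be piped into whichever tool you’re using.

```python
#!/usr/bin/env python3
# Stream a year of rotated access logs (gzipped) followed by the current plain
# logfile to stdout as one continuous stream, oldest first, so a log analyser
# only has to deal with a single input. Paths and the naming scheme are made
# up for the example -- adjust the glob (and the ordering) to suit your own
# log rotation setup; logrotate's numeric suffixes don't sort chronologically
# by name, for instance.

import glob
import gzip
import sys

ARCHIVED = sorted(glob.glob("/var/log/apache/access.log.*.gz"))
CURRENT = "/var/log/apache/access.log"

def stream(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            sys.stdout.write(line)

for path in ARCHIVED + [CURRENT]:
    stream(path)
```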
On balance, the output from awstats is an improvement over analog in that it provides some visitor statistics. On the other hand, the default analog report reads better to me and the awstats default is possibly a little too long (you feel you have all of this information in front of you but aren’t really sure how to digest it). So I guess I’d use awstats if I need the visitor information, but as a tool I’d still have a preference for analog (especially if it ever supports visitor information).
Google threw a spanner in the works a week ago when they finally sent me an invite for Google Analytics. I requested an invitation for Google Analytics a few months ago, but it sounds like they were initially swamped with demand and have only opened things up again lately. Google Analytics takes an entirely different approach to web statistics from tools like analog and awstats. It requires you to insert a small piece of javascript into any page you want to track (a technique called page tagging) – any time the page is loaded, the javascript passes details of the page visitor back to Google Analytics for analysis. The wikipedia web analytics entry discusses the advantages of each approach. I guess the command-line Linux hacker in me is concerned that only visitors using a browser that supports Javascript and has it enabled will show up in the statistics. In practice, I guess this is a small minority for most sites but it’s a nagging concern all the same. The privacy advocate in me is a little concerned at how much data we’re shovelling to Google – pretty soon they’ll know everything about you! But hey, the Google Analytics interface and reports look neat and I’m a sucker for new Google software anyways 🙂
We’ve only been running it for 2 weeks, so the amount of data in the Google Analytics reports is still quite minimal, but I’m definitely impressed with the interface. It gives you lots of different views of the data and includes some nice toys like a display of where in the world visitors to your site are coming from (I don’t know how accurate this can be – I haven’t gone looking at how it’s done yet). It also beats awstats in terms of how much visitor-focused information it gives you, down to where visitors are entering your site, how long they are staying around and what page they are exiting from – which is really the information I should be thinking about when it comes to redesigning our website or adding new material.
I haven’t made up my mind yet as to which tool I’ll be using in the future. The Google Analytics one is easy to check on once every few days and gives nice information at a glance. I’ll probably still run awstats every few months for the moment, and if analog starts supporting visitor patterns I’ll probably go back to using that.
Now to figure out exactly what I want to know about our website 🙂
Stress testing a PC
We’ve been working with one of our customers to roll out a small compute cluster for oceanographic modelling. The cluster consists of a series of dual processor, dual core systems. The cluster is up and running and has been generating useful data for our customer, but they have been experiencing occasional problems on one member of the cluster.
In order to investigate further and isolate the cause of the problem, I’ve been looking at ways to stress test the system. I’ve found memtest86+ to be pretty good at identifying problems with memory, so I’ve gone and run that for a few hours and noticed no significant memory problems. After that, I’ve been looking at ways of stress testing the processors. StressCPU is one example of a program that runs some code in a loop and should show up problems when run for a period of time.
Prime95 is another interesting piece of software for stress-testing processors. The software is written for the purpose of finding Mersenne prime numbers but turns out to be a good exerciser for the processor by virtue of the calculations it performs. It is widely used by overclockers and others interested in testing the stability of their systems, and includes a mode specifically for stress testing which runs the calculations and compares the results with known good answers. The software is available for Linux and Windows and can also be found on the Ultimate Boot CD, allowing it to be run on a system at boot-up.
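Just to illustrate the principle (and nothing more – this is nowhere near as demanding as Prime95 itself), the toy sketch below runs a deterministic calculation over and over in one worker process per core and compares each result against a reference answer; a flaky CPU or memory subsystem should eventually produce a mismatch.

```python
#!/usr/bin/env python3
# Toy illustration of the Prime95 idea: run a deterministic calculation over
# and over in as many worker processes as there are cores and compare each
# result against a reference answer -- a flaky CPU or memory subsystem should
# eventually produce a mismatch. Unlike Prime95, which ships with precomputed
# known-good answers, this just computes its reference once at startup, so it
# only demonstrates the principle.

import multiprocessing

def workload():
    """A deterministic calculation that always yields the same result."""
    total = 0
    for i in range(1, 200000):
        total += (i * i) % 97
    return total

REFERENCE = workload()

def stress(worker, iterations=1000):
    for n in range(iterations):
        if workload() != REFERENCE:
            return "worker %d: MISMATCH on iteration %d" % (worker, n)
    return "worker %d: %d iterations OK" % (worker, iterations)

if __name__ == "__main__":
    cores = multiprocessing.cpu_count()
    with multiprocessing.Pool(cores) as pool:
        for result in pool.starmap(stress, [(w,) for w in range(cores)]):
            print(result)
```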
I’ll be trying this on the system in the next few weeks and will report back on the results.