Thoughts on running an Irish Linux business

I/O error, dev hdb, sector 92813943

Friday, September 28th, 2007 | hardware, linux | No Comments

That message in your Linux server’s logs is a bad start to any day. Those of you who’ve read my previous posts on Linux and storage will know that I’m a big fan of RAID. In our case, we’re in the middle of migrating our main server from straight disks to a proper hardware RAID configuration. Of course, with Murphy (or possibly even O’Toole) being the patron saint of system administrators everywhere, it was inevitable that we’d suffer the kind of storage error I’ve been dreading just weeks before we move to our new configuration. We do have backups, but I’d prefer not to have to face a restore if possible.

We’ve been noticing periodic spikes in load on our main server – without any accompanying CPU activity as observed in top or htop (I’ve been using htop a little more lately; it has some nice subtle improvements over top, including a better default graphical summary of CPU and memory usage). Some further digging with vmstat revealed that the CPU was spending lots of time in I/O wait. This could be quite normal if your system is doing a lot of I/O, but I/O wait of 70-90% for minutes at a time suggested something else was up, particularly given that the system, while acting as our main server, isn’t that busy consistently (unless all of our developers have decided to check out all of their CVS and Subversion trees simultaneously!).
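
For anyone who wants to watch I/O wait without vmstat, here is a minimal sketch that works from two samples of the summary "cpu" line in /proc/stat. It is not what we ran at the time; the function name iowait_pct and the 5-second interval in the usage comment are made up for illustration.

```shell
# iowait_pct BEFORE AFTER
# BEFORE and AFTER are two "cpu ..." summary lines taken from /proc/stat.
# Prints the share of elapsed CPU time spent in I/O wait, as a whole percentage.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
            # Older kernels print fewer fields, hence the NF guard.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6          # iowait is the fifth counter, i.e. field 6
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print 0; exit }
            printf "%d\n", (w[2] - w[1]) * 100 / dt
        }'
}

# Usage on a live system (the interval is an arbitrary choice):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$before" "$after"
```

A figure that stays in the 70-90% range across several samples on an otherwise quiet machine is the pattern described above.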

The next step was to take a look at the system log files and see if there were any clues there. Unfortunately there were, in the form of this rather unwelcome message,

Aug 27 22:48:45 duck kernel: hdb: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=92813943, high=5, low=8927863, sector=92813943
Aug 27 22:48:45 duck kernel: ide: failed opcode was: unknown
Aug 27 22:48:45 duck kernel: end_request: I/O error, dev hdb, sector 92813943

A quick Google on “bad sectors” will tell you that a bad sector or two isn’t all that bad. In fact, the occasional bad sector happens naturally on a hard drive – and the electronics in the hard drive manage these in the background. If you’re seeing bad sectors at the operating system level – things may not be quite right with your hard drive, regardless of the job the electronics are doing at managing bad sectors. Anthony Ciani has a very nice description of what’s going on in the drive when you get a bad sector. From my perspective, any time I’ve seen bad sectors on a drive in the past – the drive hasn’t lasted long after that, so my initial reaction was to save what I could from the drive.

Thankfully, this is one of a number of disks in our main server (one of the newer ones, curiously enough) so I had room on one of the other disks to move off any important data. We did see a few more bad sector messages but they were the same sectors suggesting that some of the data we were copying off resided on those sectors. I didn’t want to do any extensive checking or any further writing to the drive until we had moved off any data – the less work you do on a failing drive the better.
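
To check whether repeated errors really do point at the same sectors, something like the following sketch can summarise the log. It assumes the kernel messages have the same "end_request: I/O error, dev X, sector N" shape as the ones quoted above; the function name failing_sectors is invented for this example.

```shell
# failing_sectors: read kernel log lines on stdin and print one line per
# distinct device/sector pair, in the form "DEVICE SECTOR COUNT".
failing_sectors() {
    sed -n 's/.*I\/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Usage (reads the log only, never touches the drive):
#   failing_sectors < /var/log/messages
```

A short list with high counts suggests a few damaged sectors being hit repeatedly; a list that keeps growing suggests the damage is spreading.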

After recovering all of the data off of the drive, it was time to move it to one of our test servers with a view to determining what files had been affected by the bad sectors, the extent of the damage and whether the drive was heading towards becoming a paperweight or a piece of hard drive art.

I haven’t dug around filesystems at the level of mapping individual blocks to files since working with AdvFS on Tru64 systems. Daniel Alvarez’s blog pointed me to a document written by Bruce Allen, BadBlockHowTo.txt, which details how to identify the file associated with an unreadable disk sector. Using the notes from there, I prepared a basic OpenOffice.org Calc spreadsheet which quickly let me identify the filesystem block number of the failing sector. Identifying the file affected by the failing sector requires some intermediate steps, as described by Bruce.

The operating system logs an error in relation to what disk sector is failing (the disk sector is a physical location on the drive). You must first map this back to the filesystem block number (the filesystem block is a logical location). Only at this point is it possible to map the filesystem block back to an actual file in the filesystem. Bruce uses the following formula,

b = (int)((L-S)*512/B)

where,

b = File System block number (what we want)
B = File system block size in bytes
L = LBA of bad sector (what we have from /var/log/messages)
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.
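
The same calculation can be sketched in the shell instead of a spreadsheet. The function name fs_block is invented here, and the partition start (63) and block size (4096) in the example call are assumed values that are not stated in the post; they were common defaults, and with them the formula happens to give the block number used in the debugfs session that follows.

```shell
# fs_block LBA START BLOCKSIZE
# Implements b = (int)((L-S)*512/B). Shell integer arithmetic truncates,
# which matches (int) for the positive values involved here.
# Needs a shell with 64-bit arithmetic, since (L-S)*512 exceeds 32 bits.
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# L is the sector from the log; S=63 and B=4096 are assumptions.
fs_block 92813943 63 4096    # prints 11601735
```

Take the real S from fdisk -lu and the real B from the filesystem itself before trusting the result.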

The spreadsheet for doing this is available to download. You just fill in the values and it outputs the filesystem block number. Once you have that you can use debugfs to identify the file using the following steps,

Start debugfs.
star:~# debugfs
debugfs 1.40-WIP (14-Nov-2006)
Run debugfs against the partition with the bad sectors.
debugfs: open /dev/hdb1
Identify the inode from the filesystem block number.
debugfs: icheck 11601735
Block Inode number
11601735 5783564
Identify the file from the inode.
debugfs: ncheck 5783564
Inode Pathname
5783564 /mobyrne/.spamassassin/auto-whitelist
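
The interactive session above can also be scripted, since debugfs accepts a single request through its -R option. The sketch below only parses the output; the helper names inode_for_block and path_for_inode are made up, and the device and block number in the usage comment are the ones from this post.

```shell
# inode_for_block: read `icheck` output on stdin and print the inode number.
# Prints nothing if the block is not allocated to any inode.
inode_for_block() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# path_for_inode: read `ncheck` output on stdin and print the pathname,
# skipping the header line and anything else that does not start with an inode.
path_for_inode() {
    awk '$1 ~ /^[0-9]+$/ { sub(/^[0-9]+[ \t]+/, ""); print }'
}

# Intended use (debugfs opens the filesystem read-only unless told otherwise):
#   inode=$(debugfs -R "icheck 11601735" /dev/hdb1 2>/dev/null | inode_for_block)
#   debugfs -R "ncheck $inode" /dev/hdb1 2>/dev/null | path_for_inode
```

An empty result from the first step would mean the bad sector sits in free space or filesystem metadata rather than in a file.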

In our case – the file turned out to be one automatically generated by SpamAssassin, so its loss was inconsequential.

After this piece of forensic work – I tried running an fsck -c -c on the drive to identify any bad blocks and other errors. It showed up various errors and after fixing them, another fsck showed some more errors suggesting the drive is slowly failing. I verified that there were errors on the drive using the manufacturer’s own disk-checking tool and since it’s still under warranty, I’ll be sending it back for a replacement later today (after securely erasing it using the excellently named Darik’s Boot and Nuke tool).

Conclusions from this exercise?

Hard drives fail – plan for that eventuality.
RAID 1 is a good idea – it will at least protect you from this kind of failure.
Backups are a very good idea – make sure you perform them regularly, and make sure you test them regularly.

Congratulations to the Debian project

Monday, April 9th, 2007 | linux | No Comments

Hi,

Just a quickie today – I’m still enjoying the Easter holiday but I noticed that the Debian project have made their release of Debian GNU/Linux 4.0. I’d like to extend my congratulations and thanks to the Debian team – this is a huge achievement. They missed their original release date by a few months but given the size and scope of a Linux distribution I think this is still an amazing achievement for a community of volunteers (there are plenty of commercial software projects with a much smaller scope which experience slippage of more than 4 months).

We’ve been using Debian for our office servers and developer desktops since we opened for business. I’m looking forward to standardising our infrastructure on Debian 4.0 for at least the next few months (we’ve been using a pre-release of 4.0 in the form of Debian testing for some time now and have found it extremely stable and reliable).

So thanks again to the folks at Debian for some great work – for those that like living on the bleeding edge the next few months should be pretty interesting if you’re using Debian testing as new packages start showing up again.

-stephen

Simple Samba Printserver on Debian GNU/Linux 4.0 (Etch)

Friday, March 30th, 2007 | linux | 3 Comments

[Updated to fix some strings that got mangled by wordpress – the instructions should make more sense now]

This is a follow-on to my previous blog post, Simple Samba PDC on Debian GNU/Linux 4.0 (Etch), where we looked at how to configure Samba as a simple PDC for your network. One of the benefits of having a domain controller like this is that it can simplify the configuration of printers for your users.

Rather than each user having to install their own drivers, configure the printer settings and generally struggle with such administration tasks, you can configure your Samba server to push these configuration details to the users.

In the previous article you saw how we had the following lines in our smb.conf

load printers = yes
printing = cups
printcap name = cups

These ensure that printers configured in our CUPS server are available for use through Samba.

We also have the following special shares defined in our smb.conf

[printers]
comment = All Printers
browseable = no
path = /var/spool/samba
printable = yes
public = yes
writable = no
create mode = 0700

[print$]
comment = Printer Drivers
path = /var/lib/samba/printers
write list = root, @ntadmin
printer admin = root, @ntadmin

The first one is the share used for spooling print-jobs and the second is a special share used to serve printer drivers for automatic installation when users add a new printer from the Samba server.

I’ve tried to set this up in the past and had problems with permissions. Thanks to the helpful people on the Samba mailing list, particularly Martin Zielinski and Dale Schroeder, I’ve managed to surmount those problems and now have a working configuration.

Summarising the tips from Martin and Dale here is the procedure for configuring printing with Samba,

Check your permissions

On the Samba server, note that /var/lib/samba/printers needs to be writeable by users in the ntadmin group (unless you are going to do all your print management as the root user, which is inadvisable).
```
chgrp -R ntadmin /var/lib/samba/printers
```
and
```
chmod -R g+w /var/lib/samba/printers
```
should ensure permissions are correct on the share.
On Windows, log in as a user in the ntadmin group and verify the permissions on the print$ share by clicking on Start and then Run… and executing \\sambaserver\print$. At this point, if you’re using the Debian Samba packages, you should see 2 subdirectories: W32X86 and WIN40.
Try adding a new folder here as a test. If you can’t, you’ll need to check your permissions (is the user you are logged in as in the ntadmin group and does the ntadmin group have write permissions in that folder?).
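
If the Windows test fails, the server-side permissions can be double-checked with a small sketch like this one. It assumes a find that understands symbolic -perm modes (GNU find does); the function name not_group_writable is invented for illustration.

```shell
# not_group_writable DIR GROUP
# Lists every entry under DIR that is not owned by GROUP or that lacks the
# group-write bit. No output means the chgrp and chmod steps took effect.
not_group_writable() {
    find "$1" \( ! -group "$2" -o ! -perm -g+w \) -print
}

# Usage on the Samba server:
#   not_group_writable /var/lib/samba/printers ntadmin
```

This only covers the filesystem side; whether the Windows user is actually a member of ntadmin still needs checking separately.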

Install the printer driver on the server

Firstly, on the Samba server, ensure the user you are connecting with on Windows has the rights to add new printer drivers. You may need to run the following to grant those privileges,
```
net rpc rights grant smulcahy SePrintOperatorPrivilege
```
On Windows, open the printer properties window by clicking on Start and then Run… and executing \\sambaserver.
Change into the Printers and Faxes folder.
Right click on the blank/white area of the resulting window and select Server Properties from the drop-down menu.
Select the middle tab – Drivers.
Click Add and follow the instructions to add the printer driver to your system. In my case I was installing a driver for a HP Color LaserJet 2500l. I initially downloaded the driver from HP’s website and ran the installer. It placed the driver files in a subdirectory of C:\Program Files. When adding the printer driver, I selected the Have Disk option and then provided the path to this subdirectory.

Configure your printer

On Windows, again change into the Printers and Faxes folder.
Right-click on a printer, select Properties and click the Advanced tab.
Select the driver you installed previously from the drop-down box.
You can also configure various other properties (including the default paper size) for the printer at this point by selecting the various tabs.
When finished click Ok.

Your printer is now configured with an automatically installing driver and your desired default settings! These same settings will be used on any new clients added to your network and connected to that printer.

Atlantic Linux Blog