Analysing web traffic

Wednesday, June 14th, 2006 | linux, web

Phew, I was gonna say it’s amazing how fast a month disappears and then I noticed that it’s more like 2 months since I posted. Good to see Jim, Albert, Rob and William have kept the blog entries coming.

One of the things I’ve been meaning to do for some time is take a look at the statistics for our website and the blog, to get an idea of what kind of traffic we’re seeing and maybe divine how to improve the site to attract more of the right kind of traffic (people interested in our services). I’m using the word divine intentionally: sometimes analysing webserver statistics feels a bit like reading tea-leaves (or tasseography to the tea-leaf reading crowd).

Being something of a data packrat, I do of course have a whole year of webserver access logs stored away and ready for analysis. I’ve been using analog on and off for about 10 years to analyse webstats. Back in the day, it was very fast and produced nice detailed reports which were easily customised. I fired that up on my 12 months of web data first and it performed as well as usual.

When we were reviewing the statistics afterwards and trying to identify trends, Albert pointed out that analog doesn’t provide any information about visitors, focusing rather on hits or pageviews. I guess I hadn’t been keeping an eye on the state of the art in web server statistics analysis and as he said it, I realised it was true. Some of the current crop of webserver statistics tools give you more detailed information on visitors rather than focusing solely on hits. This lets you glean information such as how long people are staying on your site, what route they are following through the site and where they exit from the site.
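To make the hits-versus-visitors distinction concrete, here is a rough Python sketch of how hits in an access log could be grouped into visits. It is only an illustration, not how analog, awstats or any other tool actually does it. It assumes the log is in Common or Combined Log Format, treats the client host alone as the visitor (a crude proxy, since proxies and shared addresses blur this), and uses an assumed 30 minute gap to decide when one visit ends and the next begins. The names parse_line, group_into_visits and VISIT_TIMEOUT are made up for the example.

```python
import re
from datetime import datetime, timedelta

# Matches the start of a Common/Combined Log Format line:
# host ident user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')

# Assumption: a gap of more than 30 minutes starts a new visit.
VISIT_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, path), or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    host, stamp, path = match.groups()
    return host, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"), path


def group_into_visits(lines):
    """Group hits into visits, keyed on client host alone (a crude proxy)."""
    hits = sorted(
        (hit for hit in map(parse_line, lines) if hit is not None),
        key=lambda hit: hit[1],
    )
    open_visits = {}  # host -> the visit currently in progress for that host
    visits = []
    for host, when, path in hits:
        visit = open_visits.get(host)
        if visit is None or when - visit["end"] > VISIT_TIMEOUT:
            visit = {"host": host, "start": when, "end": when, "pages": []}
            open_visits[host] = visit
            visits.append(visit)
        visit["end"] = when
        visit["pages"].append(path)
    return visits


# Made-up sample lines: four hits from two hosts.
sample = [
    '10.0.0.1 - - [14/Jun/2006:10:00:00 +0100] "GET / HTTP/1.1" 200 5120',
    '10.0.0.1 - - [14/Jun/2006:10:02:30 +0100] "GET /services/ HTTP/1.1" 200 7301',
    '10.0.0.2 - - [14/Jun/2006:10:05:00 +0100] "GET /blog/ HTTP/1.1" 200 9000',
    '10.0.0.1 - - [14/Jun/2006:15:00:00 +0100] "GET /blog/ HTTP/1.1" 200 9000',
]

for visit in group_into_visits(sample):
    # Entry page, exit page and time on site for each visit.
    print(visit["host"], visit["pages"][0], visit["pages"][-1],
          visit["end"] - visit["start"])
```

The four sample hits collapse into three visits, because the last request from 10.0.0.1 arrives hours after its earlier ones. The entry page, exit page and time on site then fall out of each visit for free, which is exactly the kind of information a pure hit counter can’t give you.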

A quick trawl through Debian’s package list turned up another likely candidate for web log analysis in awstats, which appeared to provide similar functionality to analog plus details of unique visitors (this comparison is quite detailed). This sounded like just the ticket so I went ahead and installed it.

The usage model for awstats is a little different to that of analog. awstats expects to be run periodically (either as a cgi-bin or from the command-line) and to analyse both cached data from previous runs and new data from the latest logfile. It took a bit more effort to configure awstats to first analyse all of my archived weblogs and then parse the most recent one (you can’t throw the whole lot into the config file and let awstats decide which way to read gzipped files versus normal files). analog, by contrast, takes whatever you throw at it and does the right thing. To be fair to awstats, it does document how to do this but the ramp-up time to generating the desired reports is a bit longer than with analog. When it did finish chugging through my data, it produced a pretty decent report.
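For what it’s worth, the "read a mix of gzipped and plain logs, oldest first" chore can be sketched in a few lines of Python. This is not the procedure awstats documents, just an illustration of the idea. It assumes Debian logrotate-style names (access.log, access.log.1, access.log.2.gz and so on, where a higher number means an older file), and the function names and the access.log prefix are made up for the example.

```python
import gzip
import re
from pathlib import Path


def rotation_index(path):
    """Rotation number of a logrotate-style name; the live log counts as 0."""
    match = re.search(r"\.(\d+)(?:\.gz)?$", path.name)
    return int(match.group(1)) if match else 0


def open_log(path):
    """Open a log as text, decompressing on the fly if it ends in .gz."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def read_logs_oldest_first(directory, prefix="access.log"):
    """Yield every line from the rotated logs, oldest file first."""
    # Higher rotation numbers are older, so sort in descending order.
    paths = sorted(Path(directory).glob(prefix + "*"),
                   key=rotation_index, reverse=True)
    for path in paths:
        with open_log(path) as handle:
            yield from handle
```

Feeding the lines to an analyser in chronological order matters for a tool that keeps cached state between runs, since it typically expects each new record to be later than the last one it saw.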

On balance, the output from awstats is an improvement over analog in that it provides some visitor statistics. On the other hand, the default analog report reads better to me and the awstats default is possibly a little too long (you feel you have all of this information in front of you but aren’t really sure how to digest it). So I guess I’d use awstats if I need the visitor information, but as a tool I’d still have a preference for analog (especially if it ever supports visitor information).

Google threw a spanner in the works a week ago when they finally sent me an invite for Google Analytics. I requested an invitation a few months ago but it sounds like they were initially swamped with demand and have only opened things up again lately. Google Analytics takes an entirely different approach to web statistics than tools like analog and awstats. It requires you to insert a small piece of JavaScript into any page you want to track (a technique called page tagging). Any time the page is loaded, the JavaScript passes details of the page visitor back to Google Analytics for analysis. The Wikipedia web analytics entry discusses the advantages of each approach. I guess the command-line Linux hacker in me is concerned that only visitors using a browser that supports JavaScript and has it enabled will show up in the statistics. In practice, I guess the visitors this misses are a small minority for most sites but it’s a nagging concern all the same. The privacy advocate in me is a little concerned at how much data we’re shovelling to Google (pretty soon they’ll know everything about you!). But hey, the Google Analytics interface and reports look neat and I’m a sucker for new Google software anyways 🙂

We’ve only been running it for 2 weeks, so the amount of data in the Google Analytics reports is still quite minimal but I’m definitely impressed with the interface. It gives you lots of different views of the data and includes some nice toys like a display of where in the world visitors to your site are coming from (I don’t know how accurate this can be; I haven’t gone looking at how it’s done yet). It also beats awstats in terms of how much visitor-focused information it gives you, down to where visitors are entering your site, how long they are staying around and what page they are exiting from, which is really the information I should be thinking about when it comes to redesigning our website or adding new material.

I haven’t made up my mind yet as to which tool I’ll be using in the future. The Google Analytics one is easy to check on once every few days and gives nice information at a glance. I’ll probably still run awstats every few months for the moment, and if analog starts supporting visitor patterns I’ll probably go back to using that.

Now to figure out exactly what I want to know about our website 🙂