Who/what is doing all this file I/O ? (aka stupid userspace)

Posted: November 5th, 2006 | Filed under: Uncategorized | No Comments »

So I had a little spare time the other day and decided to do some experimentation with SystemTAP again. Previously I’ve looked at which processes were doing at boottime; this needs updating and re-visiting to see how things have changed since FC5, however, I decided to leave that task for later. Instead I decided to look at what’s going on in my ‘idle’ GNOME desktop session over a 5 minute period. The SystemTap script I wrote for bootprobes was trivially adapted to serve this new task. Most of the work was in writing a new perl routine to post-process / analyse the data.

The systemtap script simply output a line every time a process closed a file detailing how many bytes it read & how many it wrote. A very short sample is shown below, the first field is milliseconds since start of the script, then the pid, then command line, then filename & bytes read & written:

60 2224 (pcscd) access /dev/bus/usb/005/001 read: 0 write: 0
60 2224 (pcscd) access /dev/bus/usb/005/006 read: 0 write: 0
112 2588 (hald-addon-stor) access /dev/hda read: 0 write: 0
120 1888 (ipw3945d) access /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd read: 3008 write: 0
223 1888 (ipw3945d) access /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd read: 3008 write: 0
324 1888 (ipw3945d) access /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd read: 3008 write: 0
428 2837 (python) access /proc/xen/privcmd read: 0 write: 0
428 1888 (ipw3945d) access /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd read: 3008 write: 0
428 2831 (xenstored) access /var/lib/xenstored/tdb.0x9e1e9b8 read: 0 write: 73728

Processing this I correlated parent-child proesses to see who’s spawning who. More importantly for each unique (pid, filename) pair the I tracked how many times the file was opened, the time of each open & bytes read/written. For files opened more than once, a crude frequency of opening was calculated & average I/O size. This produces output like:

 Pid 2060 irqbalance run 292374+ ms...
   reads /proc/interrupts 30 times, once every 9667 ms avg 8192 bytes
   writes /proc/irq/12/smp_affinity once only avg 1 bytes
   writes /proc/irq/23/smp_affinity 10 times, once every 27002 ms avg 1 bytes

Hall of Shame

Anyway, so looking through the output a few examples really stand out for awards in the hall of shame

1. pcscd – smart card daeon

Top of the hall of shame is pcscd a daemon for dealing with smart card readers. Once per second, it scans all USB devices on the system checking to see if any of them are newly plugged in smartcard readers. For the 99.99999% of people who will never see a smartcard reader this is top kwalitee crack. Perhaps someone might like to help pcscd learn to love HAL which is perfectly capable of notifying it when a device is plugged in…

 Pid 2224 pcscd run 301599+ ms...
   reads /dev/bus/usb/001/001 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/002/001 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/003/001 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/004/001 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/004/002 301 times, once every 1000 ms avg 57 bytes
   reads /dev/bus/usb/005/001 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/005/002 301 times, once every 1000 ms avg 43 bytes
   reads /dev/bus/usb/005/005 301 times, once every 1000 ms avg 52 bytes
   reads /dev/bus/usb/005/006 301 times, once every 1000 ms avg 52 bytes
   reads /dev/bus/usb/005/008 301 times, once every 1000 ms avg 50 bytes

2. ipw3945d – binary regulatory daemon for ipw3945 wifi

Next up in the hall of shame is Intel’s closed source, binary regulatory daemon for the ipw3945 series of wifi cards. Reading data from a sysfs node corresponding to the wifi PCI device, on average every 100 ms

 Pid 1888 ipw3945d run 301538+ ms...
   reads /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd 2954 times, once every 102 ms avg 3008 bytes
   writes /sys/bus/pci/drivers/ipw3945/0000:03:00.0/cmd 9 times, once every 29654 ms avg 20.3333333333333 bytes

This activity doesn’t appear to be at all related to level of network I/O, pretty much a constant rate of access regardless of what the system is doing. Please, pretty please, just move this stuff back into the kernel module or firmware & kill this stupid daemon.

3. multiload-applet – GNOME system monitor applet

It is perhaps a little unfair to put this in the hall of shame because I did make an explicit choice to turn this daemon on, and its entire raison d’etre is to sample system state every ‘n’ seconds. The reason its here, however, is because its default out-of-the-box sampling frequency is a mere 500 ms and because of the sheer number of files it feels the need to poll:

 Pid 3290 multiload-apple run 301191+ ms...
   reads /etc/mtab 603 times, once every 499 ms avg 8192 bytes
   reads /proc/loadavg 603 times, once every 499 ms avg 8191 bytes
   reads /proc/meminfo 1206 times, once every 249 ms avg 8191 bytes
   reads /proc/mounts 3618 times, once every 83 ms avg 4096 bytes
   reads /proc/net/dev 603 times, once every 499 ms avg 8192 bytes
   reads /proc/partitions 3618 times, once every 83 ms avg 5461.33333333333 bytes
   reads /proc/stat 603 times, once every 499 ms avg 8191 bytes
   reads /proc/vmstat 603 times, once every 499 ms avg 8191 bytes
   reads /sys/block/dm-0/stat 603 times, once every 499 ms avg 8191 bytes
   reads /sys/block/dm-1/stat 603 times, once every 499 ms avg 8191 bytes
   reads /sys/block/dm-5/stat 603 times, once every 499 ms avg 8191 bytes
   reads /sys/block/sda/sda1/stat 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/collisions 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/rx_bytes 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/rx_errors 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/rx_packets 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/tx_bytes 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/tx_errors 603 times, once every 499 ms avg 8191 bytes
   reads /sys/class/net/eth0/statistics/tx_packets 603 times, once every 499 ms avg 8191 bytes
[snip another 100 files for other network devices vif0, vif1, etc]

This machine is running Xen so there are a number of virtual interfaces – most of them are not active though. This doesn’t stop the applet for reading their tx & rx statistics once every 500ms. It is not entirely clear why it has to read the tx & rx data for each individual network device, since it only ever displays the aggregate total network traffic for the machine – which it has already obtained from /proc/net/dev.

4. xenstored – Xen metadata database

The xenstored process is a ‘special’ program – and I don’t mean that in a good way.
A few weeks back I was wondering why merely monitoring the runtime state of a handful of domains was causing Xenstored to consume so much CPU time. The long answer is archived, but the short story is that its implementation of transactions requires it to take a complete copy of the database backend file everytime it starts & ends a transaction. With XenD performing about 16 transactions every time you ask for info about a domain, if you monitor 20 guest domains once a second, this translates into xenstored doing 80 MB of disk I/O every second.

 Pid 2831 xenstored run 301230+ ms...
   reads /var/lib/xenstored/tdb.0x9e1e9b8 404 times, once every 742 ms avg 196884 bytes
   reads /var/lib/xenstored/tdb.0x9e1f708 403 times, once every 744 ms avg 196884 bytes
   writes /var/lib/xenstored/tdb.0x9e1e9b8 404 times, once every 742 ms avg 73728 bytes
   writes /var/lib/xenstored/tdb.0x9e1f708 404 times, once every 742 ms avg 73728 bytes

Future…

Next task is to get this setup to run at boot again so we see just what files are contributing to I/O during startup. Also be useful to extend this to track ioctl() calls too, because just tracking read/write is certainly missing some interesting stuff.

Test-AutoBuild release 1.2.0 / virt-manager release 0.2.0

Posted: August 24th, 2006 | Filed under: Uncategorized | No Comments »

It has been a busy start to the week, both on work & non-work related projects. In the former case we made a new snapshot of virt-manager available – version 0.2.0. The major milestone in this release is the wizard for easy creation of Xen guest virtual machines. RPMs are not yet built into rawhide, but in the meantime there are screenshots of the installation process.

Even more important to me than virt-manager though, is the (long delayed) availability of the new Test-AutoBuild stable release series. Test-AutoBuild is a build & test automation harness aimed at upstream developmers, enabling “continuous integration” over a codebase. ie 24×7 the build harness checks latest code out of SCM repository, builds it, tests it, publishes the results, and starts process again. If you work in the Java world you may be aware of a project called CruiseControl – well, Test-AutoBuild is like CruiseControl on steriods – rather than just building & testing a single code module, it can be setup to perform full regression testing across a suite of dependant software packages.

For example, consider how it might apply to the virt-manager application – this depends on libvirt & xeninst; xeninst in turn depends on libvirt; libvirt in turn depends on xen. Because it is bleeding edge development, virt-manager is tracking latest upstream development branches of libvirt, xeninst & xen. Test-AutoBuild would enable build & test of the entire stack – checking out & building xen, then libvirt, then xeninst & finally virt-manager. Doing this 24×7 mean we can get feedback on any changes/breakage in the dependancy stack in a matter of hours if not sooner. If one weren’t building & testing the against the full stack the breakage might not come to light until months later, by which time resolving it could become a much more time consuming issue. The sooner you know about a problem, the sooner it can be resolved, and the less time is wasted debugging in the long term.

Anyway back to my original point – the new release. Its hard to believe we first started work on this release branch about 2 years back with major re-factoring of the build workflow engine. Shortly after we started work to support caching an arbitrary number of historic biulds, extraction of SCM repository changelogs, and a whole host of other features. We could have released the code much sooner, with so much code change it seemed only prudent to invest an equally large amount of time ensuring the code is stable / reliable. So I spent as much time as possible over the following year basically just writing unit tests for the code & fixing the issues they uncovered.

The upshot is that this new release will provide a very good foundation for further development of Test-AutoBuild – with such an increased level of test coverage it ought to be possible to get new releases cut on a much more frequent basis – back to the classic open source mantra of release early, release often. With recent switch of DBus bindings from CVS to GIT, one of my highest priority development tasks is going to be the addition of GIT support (Test-AutoBuild currently supports Subversion, CVS, Mercurial, GNU Arch, Perforce & local filesystem).

Oh and if you’re thinking, I’m only one person I don’t need such automation of my code, think again. I develop on Fedora normally, but my autobuild server runs on Debian – I’ve lost count of number of occassions it has highlighted some incompatibility I mistakenly introduced – for example using a feature only supported in a very new version of a particular library or app, not commonly available.

Why aren’t more applications using gnome-keyring?

Posted: August 22nd, 2006 | Filed under: Uncategorized | No Comments »

Why aren’t more applications using gnome-keyring? is the question asked on the main page of the pam-keyring module. Their posited answer is that it is inconvenient for users / lacks integration with the authentication proess. Now this is indeed true – it would be very desirable from a user’s POV if they didn’t have to separately unlock the keyring – it should just be unlocked when logging in. This is not, however, the reason why more applications don’t use gnome-keyring – it is merely a user inconvenience. Having just spent some time adding support for gnome-keyring to virt-manager, the real answer is crystal clear: There is zero documentation about either the C or Python APIs for gnome-keyring. Even having read the source from existing apps using it like NetworkManager, many aspects of the API are far from clear – particularly all the asynchronous methods. Its no wonder all apps I can find use the synchronous methods instead – you can at least take a reasonable guess at their API contract in most cases. That said when the API is all about securely managing passwords, it seems very wrong to be guessing about the best way to use it. If someone were to document gnome-keyring, the barrier to use for application developers would be dramatically lower.

Out with SpamAssassin; In with a multi-pronged “total annihilation” of spam

Posted: August 20th, 2006 | Filed under: Uncategorized | No Comments »

For several years now I’ve been using SpamAssassin as my primary tool to fight the spam arriving at my mail server. The performance of SpamAssassin has never been great (and using spamd/spamc is not a real fix – it merely avoids the startup overhead, doing nothing for the speed of the classification engine). Thus, about 6 months ago, to lower the overhead on my machine (a UserModeLinux host from bytemark with a mere 64 MB of RAM), I moved the RBL checks from SpamAssassin into Exim. This was a big help, but detecting and filtering spam from blacklisted domains only stops approx 30-40% of spam – the rest still goes through the heavy SA.

To compound problems, spammers have increasingly been finding ways to get around the Bayesian DB I’ve trained with SpamAssassin, which means I’ve spent a large amount of time re-training SpamAssassin with mails it got wrong. Well SpamAssassin is a really slow learner and today I’ve had enough, thrown it away and decided to pursue a new strategy of “total annihilation”. What this means in practice is that I’m now using 3 bayesian classifiers at once – BMF, QSF and BogoFilter – all of which are written in C, so crazy fast (training on a few thousand mails takes seconds instead of minutes). Every incoming message is fed through all three of these engines and if 2 or more of them agree on it being spam it gets sent to a spam folder. If one of them didn’t agree, but the other two gave it a 100% spam rating, that dissenting one will be auto-trained. In addition I’ve also added Rik’s PSBL blacklist to my exim config, to backup my existing use of SpamHaus.

NB. I can’t really take credit for having come up with this plan – I’ve used the procmail recipes from acme.com‘s bayesian mail filtering pages with minimal change. Hopefully this will prove to be a winning formula – certainly early indications are very promising, but of course only time will tell.

Maintaining a ‘little britain’ in the US

Posted: August 12th, 2006 | Filed under: Uncategorized | No Comments »

So at the start of the year I moved from London over to Boston. In that time I’ve slowly figured out the way of life over here, but of course at the same time I’ve been trying to keep in touch with British culture. So what does my ‘little britain’ consist of…. The first stage is to find a house in a place with a solid English name – in my case Cambridge. Next up is finding a good source of news (ie not Fox) – the excellant BBC news is great for online news, and more recently I got the BBC World Service streaming straight to my SqueezeBox. On that topic, SqueezeBox is setup out of the box to receive 100’s of radio stations from all over the world, so I can still listen to favourites like Virgin Radio – although it tends to lead to timezone confusion – from about 7 in the evening US time, you’re listening to the UK graveyard shift of DJ’s – oh, and you get all the UK advertising still which is a little amuzing. Back to news though – I discovered you can also get a The Guardian Weekly edition delivered anywhere in the world – a high density / condensed round-up of the weeks happenings.

While back in the UK a couple of weeks ago I paid a visit to the Plymouth Gin Distillery which is England’s oldest Gin distiller having been in operation since at least 1793. Even today every Royal Navy ship commissioned is provided with a wooden casket containing two bottles of the special “Navy Strength” (100% proof – 57% by volume) gin – so named because it is the concentration at which gin can be spilt on gunpowder and still be able to ignite the powder. Well you can’t get the navy strength version over here, but a number of liquor stores carry the regular Plymouth Gin. For a summer’s afternoon though, one should really be drinking Pimm’s – the drink of choice at garden parties, or events like Royal Ascot & Wimbledon – its not well known over here, but its surprisingly easily available none the less.

Back to Fedora related news, we’ve been working hard developing the capabilities in libvirt and virt-manager, racing to be ready with a really useful tool for managing Xen VMs in Fedora Core 6. Virt-manager didn’t make it into FC6-test2 (primarily thanks to an inconveniently timed 2-week vacation on my part ;-), but I’m hopeful that we’ll be able to slip it in before test3. Since the 0.1.5 release 2 weeks back, we’ve got the serial console embedded directly in the app, added ability to tune memory allocation & number of CPUs on the fly, and designed & mostly implemented a wizard for creating new VMs (a GUI equivalent of the xenguest-install.py script), so things are looking very promising on the features front.