Posts Tagged update

Sometimes, Gentoo bites

I had a failure of my Cacti system over the weekend, entirely caused by bad Gentoo emerges. Two different problems, both caused by bad upgrades of packages brought in from ~amd64 or ~x86, made Cacti colourfully dysfunctional for a couple of days.

The first was an update to the spine resource poller, part of the Cacti project but installed separately (it used to be called cactid). Turns out that somewhere between 0.8.7a and 0.8.7b, bugs were introduced that made spine unreliable on 64-bit systems. The update brought in an SVN version of spine which, while still labelled 0.8.7a, must have been taken from somewhere after one or more of those bugs came in. The symptom was that every data value obtained via SNMP was garbage and ignored.

The second issue was strange — graphs were getting generated (even those for which there was no data) but there was no text on them! Titles, margins, legend, axes, all were blank. Some posts pointed to a problem accessing the TTF font file provided with rrdtool, but the actual problem turned out to be the upgrade to rrdtool 1.2.28 which introduced different parameters for the configuration of text attributes in graphs — and a corresponding “feature” that suppressed any text output if the new parameters were missing.

So what does “~” have to do with this? The software on your system is built according to the architecture of your machine. In Gentoo, this is called your “arch” (for architecture) and is usually “x86” or “amd64”. Gentoo implements a “testing branch” in an arch which starts with “~”; if a pre-release version of a package exists in portage you can bring it in with the “~x86” keyword. The nice thing about this is that you don’t have to enable a testing repository across your whole system — you can enable the ~ keyword for specific packages on your system, and everything else stays stable.
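To make the keyword idea concrete, here is a rough sketch of how I think about it. This is a deliberately simplified model and not Portage's actual code: the function names, the accepted-keyword sets and the package entries are all made up for illustration. A version of a package is installable if at least one keyword it carries is one your system accepts, and a per-package "~" entry just widens what is accepted for that one package.

```python
# Simplified model of keyword acceptance; NOT Portage's real implementation.
# All names and data below are hypothetical, for illustration only.

def is_visible(ebuild_keywords, accepted):
    """Return True if any keyword on the ebuild is in the accepted set."""
    return bool(set(ebuild_keywords) & set(accepted))

# What the whole system accepts (stable amd64 only).
system_accept = {"amd64"}

# Extra keywords accepted for individual packages (hypothetical entry).
per_package_accept = {"net-analyzer/rrdtool": {"~amd64"}}

def accepted_for(package):
    """System-wide keywords plus any per-package additions."""
    return system_accept | per_package_accept.get(package, set())

# A testing version that is stable on x86 but only "~" on amd64.
testing_ebuild = ["x86", "~amd64"]

print(is_visible(testing_ebuild, accepted_for("net-analyzer/rrdtool")))  # True
print(is_visible(testing_ebuild, accepted_for("net-analyzer/cacti")))    # False
```

The second call is False because nothing extra has been accepted for that package, so only stable amd64 versions would be considered for it.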

Unfortunately, this flexibility has a cost. The “amd64” arch seems to lag a bit behind “x86” in terms of packages being marked stable or just simply having packages available. This means that just to get things installed, it’s necessary to flag packages with “x86”, “~amd64” or even “~x86”. This flagging is easily done (almost too easily, in fact), and that creates a problem later on: when the package you actually set the keyword for eventually becomes stable, you don’t need the keyword any more, but it stays set until you remove it. It’s a manual process to revisit the keywords you’ve set and verify that they are still needed (and you know how well manual processes work).

Some time ago I started adding comments to the Portage config file where keywords are set, trying to explain why I set the flag: “to bring in version 1.2.34” for example. That way, if I ever do get around to manually auditing the package.keywords file, I’ll be able to check if some of the keywords are still needed. Still a manual review though.
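A small script can at least take some of the drudgery out of that review. The following is a minimal sketch, assuming a keywords file with one package atom per line followed by its keywords, and with the reason written as a comment on the line(s) directly above. The sample entries and the function names are my own illustration; it does not ask Portage anything, it only lists each entry next to its recorded reason and points out entries that have none.

```python
def parse_keywords(text):
    """Return a list of (atom, keywords, comment) tuples.

    The comment is whatever comment lines sit directly above the entry;
    a blank line breaks the association.
    """
    entries = []
    pending = []  # comment lines seen since the last entry or blank line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            pending = []
            continue
        if line.startswith("#"):
            pending.append(line.lstrip("#").strip())
            continue
        atom, *keywords = line.split()
        entries.append((atom, keywords, " ".join(pending)))
        pending = []
    return entries


def undocumented(entries):
    """Atoms that have no reason recorded above them."""
    return [atom for atom, _, comment in entries if not comment]


# Hypothetical file contents, for illustration only.
sample = """\
# to bring in version 1.2.34
net-analyzer/rrdtool ~amd64

net-analyzer/cacti-spine ~amd64
"""

for atom, keywords, comment in parse_keywords(sample):
    print(atom, keywords, comment or "(no reason recorded)")
```

It is still a manual review, but the entries with no recorded reason are the obvious place to start.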

So in the case of rrdtool and spine, I had set the “~” keyword some time in the past for some reason, possibly to get early access to a bug-fix ebuild. With no established method to revisit the keywords, I continued to pull in unstable versions of packages long after the packages I really needed had been marked stable. Eventually, it bit me.

The pre- and post-upgrade checklist grows some more…  🙂


When Upgrades Go Wrong

I’m running Debian on a Linksys NSLU2 storage device, and it works really well in general. So well in fact that a lot of the time I forget the thing is even there! It’s sitting in the garage minding its own business, serving out video and music files, and storing backups of the other systems in the house. Just occasionally, however, the thought pops into my head to run a system update over it — a habit I’ve gotten into for the Gentoo systems in the house, but “the Slug” usually misses out. About a fortnight ago however I decided to do the “apt-get shuffle”. Timing, as they say in sport and comedy, is everything.

I’ve become fairly complacent about system updates. All the distros I use now have got excellent tools for keeping everything up-to-date, and for making sure that things don’t go wrong in the process. It’s all just software, however, and it’s all too easy for something to get missed or for a bug to creep in. One such bug that did exactly that is this one. Unreported at the time I did my update, it rendered my Slug unbootable after the update I gave it.

It took me a day to realise that the Slug was off the network. The failure of the nightly backups was my first clue. Next was the inability to stream any of the media files stored on it. For the next week, on-and-off, I tried a dozen things in an attempt to get it working again. I finally arrived at a process that used the Debian Installer firmware image as a way to get a running system onto the device, allowing me to then access the hard disk and try and reflash earlier kernel and initrd images to it.

I started trying to work on the boot disk, but I couldn’t see it for some reason. Then I discovered that the power supply of the USB2 disk enclosure that holds it was playing up! Now, I had two problems–was one related to the other? Was my boot problem just a hard disk problem all along? Turns out that the power supply failure was a coincidence–replacing the power supply got the disk working again but made no improvement in the bootup scenario.

The NSLU2 boots differently to a PC. On a PC, the BIOS locates some boot code on a storage device and executes that, which usually is a program like LILO or GRUB that has more intelligence and (in the case of GRUB) a way to interact with it. These boot loader programs then load in the kernel and start executing it. With the NSLU2, however, the kernel and the “initial root device” are written into the flash memory of the device–they more-or-less are the BIOS.

On a PC, if there’s a problem with the kernel or initrd you can generally select another one from a list. Worst-case would have you installing the hard disk in a different PC and fixing the problem from there. On an NSLU2, however, a problem with the kernel or initrd can’t be fixed by changing the hard disk, because the kernel and initrd aren’t read from the hard disk but from the flash memory instead. There’s also no option for selecting another kernel, since the NSLU2 is a “headless” device with no console (besides, there’d be no room in the flash memory for two copies of kernel and initrd).

Once I’d been able to get my Slug booting (by writing out a previous version of a kernel and initrd) I was going to leave it alone… but curiosity got the better of me. I’d suspected a bad update to the utility that generates the initrd, and sure enough an “apt-get update && apt-get upgrade” revealed a pending update to the initramfs-tools package. Google led me then to the above bug report. With fingers crossed I did the update, reflashed, and rebooted… successfully!
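Since the fix came down to having an earlier kernel and initrd to write back, one thing worth doing before an upgrade is to keep dated copies of the current images somewhere off the boot path. Here is a minimal sketch of that idea. The function name and directory layout are hypothetical, and it only copies files and records checksums; it does not touch the flash.

```python
import hashlib
import shutil
import time
from pathlib import Path


def snapshot_images(image_paths, backup_root, stamp=None):
    """Copy the given image files into backup_root/<stamp>/.

    Returns the target directory and a {filename: sha256} mapping so a
    copy can be verified before it is ever written back to the device.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / stamp
    target.mkdir(parents=True, exist_ok=False)  # never overwrite a snapshot
    digests = {}
    for src in map(Path, image_paths):
        dest = target / src.name
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        digests[src.name] = hashlib.sha256(dest.read_bytes()).hexdigest()
    return target, digests


if __name__ == "__main__":
    import tempfile

    # Demo with a stand-in file; the real image paths depend on the system.
    with tempfile.TemporaryDirectory() as tmp:
        kernel = Path(tmp) / "vmlinuz"
        kernel.write_bytes(b"stand-in kernel image")
        target, digests = snapshot_images([kernel], Path(tmp) / "backups")
        print(target.name, digests)
```

The snapshots need to live somewhere you can still reach when the Slug won't boot, such as another machine, otherwise they are no help at the moment you need them.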

The Slug is now back in its usual place, quietly going about its business of entertaining us and keeping critical data safe. I might at least think twice before doing a kernel update on the poor beast in future though!


Gentoo Linux wastes a bit more of my life

I like Gentoo Linux, but sometimes I find it’s not really suitable for some of what I’m using it for.  Like my main server.  This machine is one of the two machines at my place that just HAVE to work (the firewall/phone server is the other), and there have been a few instances recently where Gentoo has let me down a bit…

First a bit of history: this machine is a dual-processor Opteron system, and as far as free (as in beer) Linux distros went, Gentoo was about the only one that had an x86_64 port available at the time.  Over time it’s grown to have a lot of stuff on it (applications, not just the data), so changing to a different distro will be FAR from trivial.  I know that Gentoo isn’t really a server distro, but this install has a lot of momentum behind it now…

Where was I?  That’s right: VNC.  Something else I really like is VNC.  I had a neat setup on my box that worked like a terminal server: you connect using your VNC client, get a login window from X, do some work, then log out when you’re done.  No having to set up a permanently-running X desktop for every user that might want to connect!  This was set up and working really well, until I just went to use it (after having not used it for a while) and found it broken.  Seems that some other change I’d made since last using it caused the Xvnc process to start segfaulting.  Rebuilding it made no difference.

This led me on a wild ride through Google searches, fora and mailing list archives (with a detour through the LKML, which I’ll mention later) to discover that in current versions TightVNC doesn’t play well on 64-bit distributions and that it’s been a known problem for months with no real end in sight.  On someone’s recommendation I removed TightVNC and switched to the RealVNC package, and things started working again (once I fixed a different problem in KDM caused by Gentoo’s configuration file management).

I’m finding more and more that I have less and less time to frig around with this stuff.  I need this kit to JUST WORK, and a bleeding edge distro like Gentoo isn’t helping me.  Perhaps I need to change to using the Gentoo Reference Platform (GRP), which is a pre-built-binary version of Gentoo.  But with the GRP, much of the advantage of Gentoo (custom-built packages, flexibility) is lost.

I guess I’ve been wanting to have my cake and eat it too — I want nicely-tuned custom-built packages, but I want stability and proven integration as well!  I’m going to have to give something up, and I think that stability is going to win.

I’m attracted to CentOS, the respin of Red Hat Enterprise Linux.  I guess I could have a play with that on some other kit and see how it goes…
