Posts Tagged problem

Another round of Gentoo fun

A little while back I did an “emerge system” on my VPS and didn’t think much more about it.  Today was my first time back on the box to emerge something else, and I was greeted with this:

>>> Unpacking source…
>>> Unpacking traceroute-2.0.15.tar.gz to /var/tmp/portage/net-analyzer/traceroute-2.0.15/work
touch: setting times of `/var/tmp/portage/net-analyzer/traceroute-2.0.15/.unpacked': No such file or directory

…and the emerge error output.  It took me a little while to find the answer, but the failure was (of course) caused by a new version of something that came in with the system update.  This bug comment had the crude hack I needed to get things working again, but longer-term I obviously need to fix the mismatch between the version of linux-headers and the kernel my VPS is actually running (it’s Xen on RHEL5).
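For anyone chasing the same thing, a quick way to see the mismatch is to compare the running kernel with what portage has installed (equery comes from the gentoolkit package; plain emerge works too):

uname -r                                              # the kernel the VPS is actually booted with
equery list sys-kernel/linux-headers                  # linux-headers version(s) currently installed
emerge --pretend --verbose sys-kernel/linux-headers   # what portage would install next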


Nagios service check for IAX

I’ve been using Nagios for ages to monitor the Crossed Wires campus network, but it’s fallen into a little disrepair.  Nothing worse than your monitoring needing monitoring…  so I set about tidying it up: network topology changes, removal of old kit, and fixes to service checks that were no longer working correctly.

One of the problems I needed to fix was the service check for IAX connections into my Asterisk box.  The script (the standard check_asterisk.pl from the Nagios Plugins package) was set up correctly, but it would fail with a “Got no reply” message.

I started doing traces and “iax2 debug” in Asterisk, but got nowhere — Asterisk was rejecting the packet from the check script.  Finally I decided to JFGI, and eventually I found this page with the explanation and the fix.  Basically, sometime in the 1.6 stream Asterisk toughened up security on the control message the Nagios service check had been using.  Thankfully, a new control message designed specifically for availability checking was implemented at the same time, and the fix is simply to update the script to use it.  Easy!

BTW, while on the subject of Nagios, I got burned by the so-called “vconfig patch”, which broke the check_ping script.  I’ve had to mask versions 1.4.14-r2 and above of the nagios-plugins package until the issue is fixed.
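For the record, the mask itself is a one-liner (assuming the usual single-file /etc/portage/package.mask layout rather than the directory form):

echo ">=net-analyzer/nagios-plugins-1.4.14-r2" >> /etc/portage/package.mask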


Comments and Downtime

Observant readers will notice that they are no longer able to respond to posts. The blog-spammers have won the battle but, as they say in the classics, they will not win the war…

I've turned off comments until I can get something in place to bring the rubbish under control (a recent update to PolarBlog helped a bit, in that the crap doesn't display on the site any more, but when I log on I still get to see the mess). I'm thinking of a new site, where I can discuss technical stuff a bit more thoroughly while keeping the private stuff separate if I need to.

The site has had a bit of downtime recently, due to my non-existent monitoring of what's happening on my hosted server. This will change shortly, and I'm looking forward to things returning to the stability they had when I was self-hosting.


The difference between pipe and redirection

Newcomers to UNIX-like operating systems are often confused by the difference between the shell operations pipe and redirection. The difference is easily explained with an example, in the context of web development. The shell command echo "st=1" | ./lifeswork.pl shows how a pipe is used to supply command line input to a script usually invoked via CGI in a web server. This allows the script to be more easily debugged by testing at the command line. The shell command echo "st=1" > ./lifeswork.pl shows how redirection uses command line input to overwrite a script file, destroying the file and the web developer's sanity. Hopefully this example illustrates the difference between pipe and redirect, and helps you avoid the idiotic mistake I just made.
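Spelled out with comments, using the same long-suffering script:

# Pipe: echo's output becomes the script's standard input, so lifeswork.pl
# runs and reads "st=1" much as it would read POSTed form data under CGI.
echo "st=1" | ./lifeswork.pl

# Redirection: the shell opens ./lifeswork.pl for writing and truncates it
# before anything runs, so the script is replaced by the text "st=1".
echo "st=1" > ./lifeswork.pl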


Security blows

I was about to post about how pleased I was with Synergy in helping me tidy up my desktop clutter (by removing a keyboard and mouse from the surface). Ironically, I’m instead posting about a problem with the configuration that will cause me to throw it out and look for something else. Why the title? Because the default configuration of a Linux distribution nowadays has given me no way to fix this ridiculously simple problem without powering off the running PC, VMware guests and all.

The problem is that Synergy and the VMware console don’t play well together (I could have sworn that when I first started using Synergy I had no trouble with it, but there are a few hits around that describe problems like the one I’ve now hit). What people are reporting is that keys like Shift and Ctrl are not passed to the VM (some described here and here).

My problem is slightly different: the screen of my Synergy client (the one that’s running VMware) locked while a VMware guest had focus. Now, the Shift and Ctrl keys are not picked up by gnome-screensaver to unlock the screen. Even the real keyboard attached directly via USB doesn’t work. Big problem, for the following reasons:

* Thanks to password strength rules enforced on the Linux build I use, my password now has a Shift-obtained punctuation character.
* I can’t switch to a virtual console, since that requires Ctrl (e.g. Ctrl-Alt-F1).

Okay, so the keyboard doesn’t work. This client machine just happens to be a tablet PC, and I had hacked gnome-screensaver (to display the onscreen keyboard to allow the screen to be unlocked in tablet mode). I grabbed the pen and tapped out my password, but it *still* didn’t work: even the output of the virtual keyboard gets the Shift modifier dropped. Hmm… Starting to fume now.

Never mind, I’ll connect via the network…

* Fedora does not start SSH by default (okay, yes, and I didn’t make sure it got started after I’d finished the install).
* There is no remote desktop (VNC server, XDMCP) configured.
* The shiny web-based management interface on VMware Server 2.0 only listens on 127.0.0.1 (or is being blocked by the Fedora firewall).

So with no way to get at the machine to try and fix it, a power-off was the only solution. Some readers are probably thinking “boo-hoo, diddums had to kill-switch his widdle poota, how tewwible,” but I hate having to do that; not because the system doesn’t recover, but because it’s “problem resolution, Windows-style”.

Even though the real problem was between Synergy and VMware, I’m blaming the (perceived) need for security, since without it I wouldn’t have a cryptic password I can’t enter without Shift, or a system I can’t administer over the network. Red Hat and Fedora are doing everything in their power to ensure I don’t fall prey to nasty Internet fiends (rich analogies to governmental nannying, but that’s probably over-thinking things).

So in summary: Synergy is great, just as long as you’re not using the VMware console with a password that needs punctuation or uppercase… Remember to have SSH or some other network access enabled before you play!
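For my own future reference, the thirty seconds of preparation that would have saved the power-off looks something like this (Fedora still used the SysV init tools at the time; the iptables lines are only needed if the default firewall is blocking port 22):

chkconfig sshd on                               # start the OpenSSH daemon at boot
service sshd start                              # and start it right now
iptables -I INPUT -p tcp --dport 22 -j ACCEPT   # open SSH through the firewall
service iptables save                           # persist the firewall rule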


Scourges of the Universe: Blog Spam, and ISPs

If you can read this, it means that Round 3 of my fight with my ISP is over and my ADSL is back up, which is a good thing because now I can tell you why my ClustrMaps image suddenly has so many red dots on it…

Every so often I found that some random junk would show up in comments to my blog posts. When I saw it I’d just delete it, and it didn’t occur often so I didn’t really think much of it.

That was until I spied a comment that I actually needed to reply to, and found I couldn’t. I started looking into why the record number of the comment was so high, and found that my blog of little more than 100 entries had become home to over 13000 items of blog-spam. 🙁

I blame myself, obviously, as the software I use had introduced spam-filtering techniques a couple of versions ago and I hadn’t kept up.

While cleaning up the garbage, behind the red mist of rage at having my blog violated so, I noticed something interesting. The spamming had been going on for some time, and I realised that in front of me, in my humble little blog, I had a snapshot of the evolution of blog-spam.

The early stuff was primitive, and easily identified by querying for the names of erectile dysfunction drugs and other medications. The later stuff was harder and harder to detect, until I was virtually picking it record-by-record out of the database. Some of it made absolutely no sense to me at all: strings of random letters with not even a URL in sight; maybe it was a worm just looking for the kudos of a DoS.

The thought occurred to me that perhaps I should have kept it, in much the same way as someone I know keeps copies of PC viruses and worms in a little (hopefully isolated) folder. Then I realised two things:

* Preserving something, or putting it in a museum, gives it some legitimacy. I don’t want to legitimise blog-spam; and

* The art (if any) in blog-spam is in the code that generates it, not in the crap it leaves behind.

As for all the hits on my ClustrMap, I figure 80% are the spambots infesting the blog and about 19% are the poor folk who got drawn to my site as a result of the spam. I had been thinking of a different blog platform; perhaps this episode shows that I need something a little more hardened.

Of course another way to fight blog-spam is to get your network disconnected from the ‘Net, and my ever-so-unfriendly ISP went out of their way to do that for me this weekend. Unsolicited, of course, which is even better. On a Friday afternoon, too — better still, since if you do actually manage to get someone on the phone it’s too late for them to find anyone who can do anything about it (apparently).

Recommendations of a good ADSL ISP accepted: although keep it to yourself if your ISP’s called wwkjukhkkjlpuggh or qjkdfsdfaksjkulkfhg… 🙂


Sometimes, Gentoo bites

I had a failure of my Cacti system over the weekend, caused entirely by bad Gentoo emerges. Two different problems, both stemming from upgrades of packages brought in from ~amd64 or ~x86, left Cacti colourfully dysfunctional for a couple of days.

The first was an update to the spine resource poller, part of the Cacti project but installed separately (it used to be called cactid). It turns out that somewhere between 0.8.7a and 0.8.7b, bugs were introduced that made spine unreliable on 64-bit systems. The update brought in an SVN version of spine which, while still labelled 0.8.7a, must have been cut after one or more of the bugs came in. The symptom was that every data value obtained via SNMP came back as garbage and was ignored.

The second issue was strange — graphs were getting generated (even those for which there was no data) but there was no text on them! Titles, margins, legend, axes, all were blank. Some posts pointed to a problem accessing the TTF font file provided with rrdtool, but the actual problem turned out to be the upgrade to rrdtool 1.2.28, which introduced different parameters for the configuration of text attributes in graphs — and a corresponding “feature” that suppressed any text output if the new parameters were missing.
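To see what the newer text parameters look like, here’s a throwaway example of the per-element --font arguments that rrdtool 1.2 understands (Cacti normally builds the graph command itself; the RRD file and values here are made up purely so the command has something to draw):

rrdtool create test.rrd --step 300 DS:load:GAUGE:600:0:U RRA:AVERAGE:0.5:1:100
rrdtool update test.rrd N:0.42
rrdtool graph test.png --start -1h --title "font check" \
    --font TITLE:12: --font AXIS:7: --font UNIT:7: --font LEGEND:8: \
    DEF:l=test.rrd:load:AVERAGE LINE2:l#FF0000:load

(A font file or name can follow the trailing colon if you don’t want the built-in default.)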

So what does “~” have to do with this? The software on your system is built according to the architecture of your machine. In Gentoo this is called your “arch”, and is usually “x86” or “amd64”. Gentoo implements a “testing branch” for each arch, denoted by a leading “~”; if a pre-release version of a package exists in portage you can bring it in with the “~x86” (or “~amd64”) keyword. The nice thing about this is that you don’t have to enable a testing repository across your whole system — you can set the ~ keyword for specific packages, and everything else stays stable.

Unfortunately, this flexibility has a cost. The “amd64” arch seems to lag a bit behind “x86” in terms of packages being marked stable, or simply having packages available at all. This means that just to get things installed, it’s sometimes necessary to flag packages with “x86”, “~amd64” or even “~x86”. This flagging is easily done — almost too easily, in fact, as it creates a problem later on: when the version you set the keyword for eventually goes stable, you no longer need the keyword at all. It’s a manual process to revisit the keywords you’ve set and verify that they are still needed (and you know how well manual processes work).

Some time ago I started adding comments to the Portage config file where keywords are set, trying to explain why I set the flag: “to bring in version 1.2.34” for example. That way, if I ever do get around to manually auditing the package.keywords file, I’ll be able to check if some of the keywords are still needed. Still a manual review though.
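So these days the entries end up looking something like this (the atom and the wording are purely illustrative):

cat >> /etc/portage/package.keywords <<'EOF'
# pulled in for an early bug-fix ebuild; recheck once a stable version lands
net-analyzer/rrdtool ~amd64
EOF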

So in the case of rrdtool and spine, I had set the “~” keyword some time in the past for some reason, possibly to get early access to a bug-fix ebuild. With no established method for revisiting the keywords, I continued to pull in unstable versions long after the versions I really needed had been marked stable. Eventually, it bit me.

The pre- and post-upgrade checklist grows some more…  🙂


Ubuntu 8.04 Wireless Weirdness

Over the last fortnight I finally got the wriggle-on to upgrade all my (K)Ubuntu systems to Hardy Heron. Various issues occurred with each of them, but overall the entire exercise went smoothly (my wife’s little old Fujitsu Lifebook was probably smoothest of the lot). I had one rather vexing issue however, on my old (I’m tempted to say “ancient”) Vaio laptop.

The onboard wireless on this thing is an ipw2100, hence only 802.11b, and I had a PCMCIA 802.11g NIC lying around (actually it came from the Lifebook, liberated from there after I bought it a Mini-PCI 802.11g card on eBay). On Gutsy, I used the hardware kill-switch to disable the onboard adapter to make double-sure that it wouldn’t try and drag the network down to 11Mbps.

This laptop was the last machine I upgraded to Hardy, and since I’d been playing with KDE 4 on it I was looking forward to seeing what KDE4-ness had made it into Hardy. While the upgrade was taking place the wi-fi connection dropped out, but I didn’t think anything of it, since Ubuntu upgrades try to restart the new versions of things and I figured NetworkManager had fallen and couldn’t get up. After the reboot, however, KNetworkManager (still the KDE3 version, don’t get me started there) could find no networks — could find no adapters, in fact.

I logged back into KDE3 and poked. Still no wireless (as if the desktop would make a difference, but I had to make *some* start on pruning the fault tree). The Hardware Drivers Manager was reporting that the Atheros driver was active (for the PCMCIA card), and an unplug-plug cycle generated all kinds of good kernel messages.

On a whim, I flicked the hardware kill-switch for the onboard wifi[1]. Almost instantly, KNetworkManager prompted to get my wallet unlocked — it had found my network and wanted the WPA passphrase. I provided it, and got a connection: via the PCMCIA NIC.

“That’s odd”, I thought, and flicked the switch back off. A few seconds passed, and the link dropped. Flicked the switch on, and the link came back. Flicked the switch off again: this time a few minutes went past, but again the link failed. I tried it several more times, and the same thing happened. The state of the kill-switch for the onboard NIC was influencing the other NIC too!

It seems that this is altered behaviour in NetworkManager, which now applies the state of the hardware switch to all wi-fi adapters. If it annoys me enough I’d like to think I’ll trawl the changelogs, or better still lodge something on Launchpad… more likely, though, I’ll forget all about it now that I’ve found a kludgy workaround.

I’ve now added ipw2100 to the module blacklist and things work okay (presumably because the state of the onboard switch can’t be reported any more). I’ll also have a think about whether a few dollars for another g-capable Mini-PCI NIC will be throwing good money after bad, as this laptop really is quite long-in-the-tooth.
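The kludge itself, for anyone else bitten (Hardy keeps its list in /etc/modprobe.d/blacklist; newer releases want the file name to end in .conf):

echo "blacklist ipw2100" | sudo tee -a /etc/modprobe.d/blacklist
sudo rmmod ipw2100    # unload it now rather than waiting for the next reboot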

Oh yes, that’s right… KDE 4. Next time perhaps. 🙂

[1] I can’t think why I did this. I knew that I’d disabled 802.11b in my access point, to make triple-sure an 802.11b device wouldn’t slow my network down… The onboard 802.11b NIC would never successfully get a connection.


When Upgrades Go Wrong

I’m running Debian on a Linksys NSLU2 storage device, and it works really well in general. So well in fact that a lot of the time I forget the thing is even there! It’s sitting in the garage minding its own business, serving out video and music files, and storing backups of the other systems in the house. Just occasionally, however, the thought pops into my head to run a system update over it — a habit I’ve gotten into for the Gentoo systems in the house, but “the Slug” usually misses out. About a fortnight ago I decided to do the “apt-get shuffle”. Timing, as they say in sport and comedy, is everything.

I’ve become fairly complacent about system updates. All the distros I use now have excellent tools for keeping everything up-to-date, and for making sure that things don’t go wrong in the process. It’s all just software, however, and it’s all too easy for something to get missed or for a bug to creep in. One bug that did exactly that is this one. Unreported at the time, it rendered my Slug unbootable after the update I gave it.

It took me a day to realise that the Slug was off the network. The failure of the nightly backups was my first clue. Next was the inability to stream any of the media files stored on it. For the next week, on-and-off, I tried a dozen things in an attempt to get it working again. I finally arrived at a process that used the Debian Installer firmware image as a way to get a running system onto the device, allowing me to then access the hard disk and try and reflash earlier kernel and initrd images to it.

I started trying to work on the boot disk, but for some reason I couldn’t see it at all. Then I discovered that the power supply of the USB2 disk enclosure that holds it was playing up! Now I had two problems: was one related to the other? Was my boot problem just a hard disk problem all along? It turns out the power supply failure was a coincidence; replacing it got the disk working again but made no difference to the boot problem.

The NSLU2 boots differently to a PC. On a PC, the BIOS locates some boot code on a storage device and executes it; that code is usually a boot loader like LILO or GRUB, which has more intelligence and (in the case of GRUB) a way to interact with it. The boot loader then loads the kernel and starts executing it. With the NSLU2, however, the kernel and the “initial root device” (the initrd) are written into the flash memory of the device; they more or less are the BIOS.

On a PC, if there’s a problem with the kernel or initrd you can generally select another one from a list. Worst case, you install the hard disk in a different PC and fix the problem from there. On an NSLU2, however, a problem with the kernel or initrd can’t be fixed by changing the hard disk, because they aren’t read from the hard disk at all but from the flash memory. There’s also no option for selecting another kernel, since the NSLU2 is a “headless” device with no console (besides, there’d be no room in the flash memory for two copies of the kernel and initrd).

Once I’d been able to get my Slug booting (by writing out a previous version of a kernel and initrd) I was going to leave it alone… but curiosity got the better of me. I’d suspected a bad update to the utility that generates the initrd, and sure enough an “apt-get update && apt-get upgrade” revealed a pending update to the initramfs-tools package. Google then led me to the above bug report. With fingers crossed I did the update, reflashed, and rebooted… successfully!
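In case I ever need it again, the successful round boiled down to something like this, run on the Slug itself (flash-kernel is the Debian utility that writes the kernel and initramfs into the NSLU2’s flash):

apt-get update && apt-get upgrade   # picks up the fixed initramfs-tools
flash-kernel                        # rewrite the current kernel and initrd into flash
reboot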

The Slug is now back in its usual place, quietly going about its business of entertaining us and keeping critical data safe. I might at least think twice before doing a kernel update on the poor beast in future though!


Gentoo Linux wastes a bit more of my life

I like Gentoo Linux, but sometimes I find it’s not really suited to what I’m using it for.  Like my main server.  This machine is one of the two at my place that just HAVE to work (the firewall/phone server is the other), and there have been a few instances recently where Gentoo has let me down a bit…

First a bit of history: this machine is a dual-processor Opteron system, and as far as free (as-in-beer) Linux distros go, Gentoo was about the only one that had an x86_64 port available at the time.  Over time it’s grown to have a lot of stuff on it (applications, not just data), so changing to a different distro will be FAR from trivial.  I know that Gentoo isn’t really a server distro, but this install has a lot of momentum behind it now…

Where was I?  That’s right: VNC, something else I really like.  I had a neat setup on my box that worked like a terminal server: you connect using your VNC client, get a login window from X, do some work, then log out when you’re done.  No need to set up a permanently-running X desktop for every user who might want to connect!  This was set up and working really well, until I went to use it (after not having used it for a while) and found it broken.  It seems that some other change I’d made since last using it caused the Xvnc process to start segfaulting.  Rebuilding it made no difference.

This led me on a wild ride through Google searches, fora and mailing list archives (with a detour through the LKML, which I’ll mention later) to discover that current versions of TightVNC don’t play well on 64-bit distributions, and that it’s been a known problem for months with no real end in sight.  On someone’s recommendation I removed TightVNC and switched to the RealVNC package, and things started working again (once I fixed a different problem in KDM caused by Gentoo’s configuration file management).
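The swap itself was painless, something along these lines (package names are from memory of the portage tree of the day, so check emerge --search vnc before copying them):

emerge --unmerge net-misc/tightvnc   # remove the segfaulting TightVNC build
emerge net-misc/vnc                  # the RealVNC-based package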

I’m finding more and more that I have less and less time to frig around with this stuff.  I need this kit to JUST WORK, and a bleeding-edge distro like Gentoo isn’t helping me.  Perhaps I need to change to the Gentoo Reference Platform (GRP), which is a pre-built binary version of Gentoo.  But with the GRP, much of the advantage of Gentoo (custom-built packages, flexibility) is lost.

I guess I’ve been wanting to have my cake and eat it too — I want nicely-tuned custom-built packages, but I want stability and proven integration as well!  I’m going to have to give something up, and I think that stability is going to win.

I’m attracted to CentOS, the respin of Red Hat Enterprise Linux.  I guess I could have a play with that on some other kit and see how it goes…
