Posts Tagged gentoo

Programming decisions

My Linux-based Large-Scale Cloning Grid experiment, which I’ve implemented four times now (don’t ask, unless you are ready for a few looooooong stories), has three main components.  The first, which provides the magic by which the experiment can even exist, is the z Systems hypervisor z/VM.  The second is the grist to the mill, both the evidence that the thing is working and the workload that drives it: the cluster monitoring tool Ganglia.  The third, the component that keeps it alive, starts it, stops it, checks on it, and more, is a Perl script of my design and construction.  This script is truly multifunctional, and at just under 1000 lines could be the second-most complex piece of code [1] I’ve written since my university days [2].

I think I might have written about this before — it’s the Perl script that provides an IRC bot which manages the operation of the grid.  It schedules the starting and stopping of guests in the grid, and provides stats on resource usage.  The bot sends grid status updates to Twitter (follow @BODSzBot!), and recently I added generation of files used by some HTML code to render a web page with gauges summarising the status of the grid.  I wrote the first version of the script in Perl simply because the first programming example of an IRC bot I found was a Perl script; I had no special affinity for Perl as a language and, despite the many useful modules that have helped me develop the code, Perl doesn’t really lend itself to the task especially well.  So it probably comes as little surprise to some readers that I’m having some trouble with my choice.

Initially the script had no real concern for the status of the guests.  It would fire off commands to start guests or shut them down, or check the z/VM paging rate.  The most intensive thing the script did was to get the list of all users logged on to z/VM so it could determine how many clone guests were actually logged on, and even this takes a fraction of a second now.  It was when I decided that the bot should handle this a bit more effectively, keeping track of the status of each guest and taking more ownership of maintaining them in the requested up or down state, that things started to come unstuck.

In the various iterations of this IRC bot, I have used Perl job queueing modules to keep track of the commands the bot issues.  I deal with sets of guests 16 or 240 at a time, and I don’t want to issue 2000 commands in IRC to start 2000 guests.  That’s what the bot is for: I tell it “start 240 guests” and it says “okay, boss” and keeps track of the 240 individual commands that achieve that.  This time around I’m using POE::Component::JobQueue, the main reason being that the IRC module I started to use was the PoCo (the short name for POE::Component) one.  It made sense to use a job queueing mechanism that used the same multitasking infrastructure that my IRC component was using.  (I used a very key word in that last sentence; guess which one it was, and see if you’re right by the end.)

With PoCo::JobQueue, you define queues which process work depending on the type of queue and how it’s defined (number of workers, etc).  In my case my queues are passive queues, which wait for work to be enqueued to them (the alternative is active queues, which poll for work).  The how-to examples for PoCo::JobQueue show that the usual method of use is a function that is called when work is enqueued, and that function then starts a PoCo session (PoCo terminology for a separate task) to actually handle the processing.  For the separate tasks that have to be done, I have five queues that each at various times will create sessions that run for the duration of the item of work (expected to be quite short), one session running the IRC interaction, and one long-running “main thread” session.
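The real thing is Perl and PoCo, but the passive-queue shape maps onto any event loop.  Here’s a rough analogy in Python’s asyncio — this is not the PoCo API, and the guest names are invented — where each item of work gets its own short-lived task, much like a PoCo session:

```python
import asyncio

async def worker(item):
    # Short-lived "session": one per enqueued item of work.
    await asyncio.sleep(0)              # stand-in for issuing a z/VM command
    return f"done: {item}"

async def dispatcher(queue, results):
    # Passive queue: sits idle until work is enqueued, then spawns
    # a task (the PoCo analogue would be a session) per work item.
    while True:
        item = await queue.get()
        if item is None:                # sentinel: drain complete
            break
        task = asyncio.create_task(worker(item))
        results.append(await task)

async def main():
    queue, results = asyncio.Queue(), []
    dispatch = asyncio.create_task(dispatcher(queue, results))
    for guest in ("CLONE001", "CLONE002", "CLONE003"):
        queue.put_nowait(guest)         # "start guest" commands
    queue.put_nowait(None)
    await dispatch
    return results

results = asyncio.run(main())
print(results)
```

The point of the pattern is that the dispatcher itself never does the work; it only hands each item off and goes back to waiting.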

The problems I have experienced recently include commands being queued but never actioned, and the statistics files (for the HTML gauges) not being updated.  There also seems to be a problem with the code that updates the IRC topic (which happens every minute if changes to the grid have occurred, such as guests being started or stopped, or every hour if the grid is in steady state) whereby the topic gets updated even though no guest changes have occurred.  When this starts to happen, the bot drops off IRC because it fails to respond to a server PING.  While it seems like the PoCo engine stops, some functions are actually still operating — before the IRC timeout happens the bot will respond to commands.  So it’s more like a “graceful degradation” than a total stop.

I noticed these problems because I fired a set of commands to the bot that would have resulted in 64 commands being enqueued to the command processing queue.  I have a “governor” that delays the actual processing of commands depending on the load on the system, so the queue would have grown to 64 items and over time the items would be processed.  What I found is that only 33 of the commands that should have been queued were actually run, and soon after that the updating of stats started to go wrong.  As I started to look into how to debug this I was reminded about a very important thing about Perl and threads.

Basically, there aren’t any.

Well that’s not entirely true — there is an implementation of threads for Perl, but it is known to cause issues with many modules.  Threading and Perl have a very chequered history, and even though the original Perl::Thread module has been replaced by ithreads there are many lingering issues.  On Gentoo, the ithreads USE flag is disabled by default and carries the warning “has some compatibility problems” (which is an understatement, as I understand it).

With my code, I was expecting that PoCo sessions were separate threads, or processes, or something, that would isolate potentially long-running tasks like the process I had implemented to update the hash that maintains guest status.  I thought I was getting full multitasking (there’s the word, did you get it?), the pre-emptive kind, but it turns out that PoCo implements “only” cooperative multitasking.  My long-running task never gives PoCo a chance to switch sessions (the “cooperative” part), and that is starving the other sessions and queues.
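A tiny asyncio demo (an analogy only — PoCo’s internals differ) shows how one uncooperative task starves everything else on a cooperative scheduler:

```python
import asyncio
import time

async def heartbeat(ticks):
    # Stands in for the IRC session that answers server PINGs.
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)

async def long_task():
    # A long-running piece of work that never yields to the loop --
    # analogous to my guest-status update hogging the engine.
    time.sleep(0.5)                     # blocking call: no await, no cooperation

async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.12)           # let the heartbeat get going
    await long_task()                   # heartbeat starves while this runs
    await hb
    return ticks

ticks = asyncio.run(main())
gaps = [b - a for a, b in zip(ticks, ticks[1:])]
print(max(gaps))                        # one gap is ~0.5s instead of ~0.05s
```

The heartbeat ticks every 50ms until the blocking call runs, at which point it goes silent for the full half-second — which is exactly the shape of a bot that stops answering PINGs while a status update grinds away.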

So I find myself having to look at redesigning it.  Maybe cramming all this function into something that also has to manage real-time interaction with an IRC server is too much, and I need to have a separate program handling the status management and some kind of IPC between them [3].  Maybe the logic is sound, and I need to switch from Perl and PoCo to a different language and runtime that supports pre-emptive multitasking (or, maybe I recompile my Perl with ithreads enabled and try my luck).  I may even find that this diagnosis is wrong, and that some other issue is actually the cause!

I continue to be amazed that this experiment, now in its fourth edition, has been more of an education in the ancillary components than the core technology in z/VM that makes it possible.  I’ve probably spent only 5% of the total effort involved in this project actually doing things in z/VM — the rest has been Perl programming for the bot, and getting Ganglia to work at scale (the most time-consuming part!).  If you extend the z/VM time to include directory management issues, 5% becomes probably 10% — still almost a rounding error compared to the effort spent on the other parts.  z Systems and z/VM are incredible — every day I value being able to work with systems and software that Just Work.  I wonder where the next twist in this journey will take me…  Wish me luck.


[1] When I was working at Queensland Rail I worked with a guy who, when telling a story, always used to refer to the subject of the story as “the second-biggest”, or “the second-tallest”, or “the second-whatever”.  Seemed he wanted to make his story that little bit unique, rather than just echoing the usual stories about “biggest”, “tallest”, or “whatever”.  I quizzed him one day, when he said that I was wearing the second-ugliest tie he’d ever seen, what was the ugliest…  Turned out he had never anticipated anyone actually asking him, and he had no answer.  Shoutout to Peter (his Internet pseudonym) if you’re watching 😉

[2] Despite referencing [1], and being funny, it’s actually an honest and factual statement.  I did write one more complex piece of code since Uni, being the system I’d written to automatically generate and activate VTAM DRDS decks (also while I was at QR) for dynamically updating NCP definitions.  Between the ISPF panels (written in DTL) and the actual program logic (written in REXX) it was well over 1000 lines.  The other major coding efforts that I’ve done for z/VM cloning have probably been similarly complex to this project but had fewer actual lines of code.  Thinking about it, those other coding efforts drew on comparatively few external modules while this Perl script uses a ton of Perl modules; if you added the cumulative LOC of the modules along with my code, the effective LOC of this script would be even larger.

[3] The previous instantiation of this rig had the main IRC bot and another bot, written in REXX and running in a CMS machine, for doing stuff in DirMaint on z/VM.  The Perl bot and the REXX bot sent messages to each other over IRC as their IPC mechanism.  It was weird, but cute at the same time!  This time around I’m using SMAPI for the DirMaint interface, so no need for the REXX bot.


Another round of Gentoo fun

A little while back I did an “emerge system” on my VPS and didn’t think much more about it.  First time back to the box today to emerge something else, and was greeted with this:

>>> Unpacking source…
>>> Unpacking traceroute-2.0.15.tar.gz to /var/tmp/portage/net-analyzer/traceroute-2.0.15/work
touch: setting times of `/var/tmp/portage/net-analyzer/traceroute-2.0.15/.unpacked': No such file or directory

…and the emerge error output.  Took me a little while to get the answer, but it was (of course) caused by a new version of something that came in with the system update.  This bug comment had the crude hack I needed to get back working again, but longer-term I obviously need to fix the mismatch between the version of linux-headers and the kernel version my VPS is using (it’s Xen on RHEL5).



Two of the four keynotes at LCA 2011 referenced the depletion of the IPv4 address space (and I reckon if I looked back through the other two I could find some reference in them as well).  I think there’s a good chance Geoff Huston was lobbying his APNIC colleagues to lodge the “final request” (for the two /8s that triggered the final allocation of the remaining five, officially exhausting the IANA free pool) a week earlier than they did, as it would have made the message of his LCA keynote a bit stronger.  Not that it was a soft message: we went from Vint Cerf the day before, who said “I’m the guy who said that a 32-bit address would be enough, so, sorry ’bout that”, to Geoff Huston saying “Vint Cerf is a professional optimist.  I’m not.”.  But I digress…

I did a bit of playing with IPv6 over the years, but it was too early and too broken when I did (by “too broken” I refer to the immaturity of dual-stack implementations and the lack of anything actually reachable on the IPv6 net).  However, with the bell of IPv4 exhaustion tolling, I had another go.

Freenet6, which now goes alternatively by gogonet or gogo6, was my first port of call.  I had looked at Gogo6 most recently, and still had an account.  It was just a matter of deciding whether or not I needed to make a new account (hint: I did) and reconfiguring the gw6c process on my router box.  Easy as that, I had a tunnel — better still, my IPv6-capable systems on the LAN also had connectivity thanks to radvd.  From Firefox (and Safari, and Chrome) on the Mac I could score both 10/10 scores on the IPv6 test site!
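For anyone trying the same thing, the radvd side really is tiny.  Mine looked roughly like this (the prefix here is a documentation placeholder, not a real allocation):

```
# /etc/radvd.conf
interface eth0
{
    AdvSendAdvert on;
    prefix 2001:db8:1234:1::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
    };
};
```

With that in place, every IPv6-capable host on the LAN auto-configures itself an address in the advertised /64.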

My joy was short-lived, however.  gw6c was proving to be about as stable as a one-legged tripod, and on top of that Gogo6 had changed the address range they allocated me.  That wouldn’t have been too bad, except that all my IPv6-capable systems still had the old address and were trying to use it — it looks like IPv6 auto-configuration doesn’t un-configure an address that’s no longer valid (at least not by default).  I started to look for possible alternatives.

Like many who’ve looked at IPv6 I had come across Hurricane Electric — in the countdown to IPv4 exhaustion I used their iOS app “ByeBye v4”.  They offer free v6-over-v4 tunneling, and the configuration in Gentoo is very simple.  I also get a static allocation of an IPv6 address range that I can see in the web interface.  The only downside I can see is that I had to nominate which of their locations I wanted to terminate my tunnel; they have no presence in Australia, the geographically nearest location being Singapore.  I went for Los Angeles, thinking that would probably be closest network-wise.  The performance has been quite good, and it has been quite reliable (although I do need to set up some kind of monitoring over the link, since everything that can talk IPv6 is now doing so).
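By “very simple” I mean a handful of lines in /etc/conf.d/net for a sit tunnel.  From memory it was something along these lines (interface name and all addresses are placeholders — check the Gentoo handbook for the exact syntax your baselayout version wants):

```
# /etc/conf.d/net
# 6in4 tunnel to the tunnel broker's endpoint
iptunnel_sit1="mode sit remote 203.0.113.1 local 198.51.100.2 ttl 255"
config_sit1=( "2001:db8:aa::2/64" )
routes_sit1=( "default via 2001:db8:aa::1" )
```

Add the interface to the default runlevel and the tunnel comes up at boot.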

In typical style, after I’d set up a stable tunnel and got everything working, I decided to learn more about what I’d done.  What is IPv6 anyways?  Is there substance to the anecdotes flying around that are saying that “every blade of grass on the planet can have an IPv6 address” and similar?  Well, a 128-bit address provides for an enormous range of addresses.  The ZFS guys are on the same track — ZFS uses 128-bit counters for blocks and inodes, and there have been ridiculous statements made about how much data could theoretically be stored in a filesystem that uses 128-bit block counters.  To quote the Hitchhiker’s Guide to the Galaxy:

Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.

The Guide, The Hitchhiker’s Guide To The Galaxy, Douglas Adams, Pan Books 1979

Substitute IPv6 (or ZFS) for space.  To try and put into context just how big the IPv6 address range is, let’s use an example: the smallest common subnetwork.

When IPv4 was first developed, there were three address classes, named, somewhat unimaginatively, A, B and C.  Class A was all the networks from 1.x.x.x to 127.x.x.x, and each had about 16 million addresses.  Class B was all the networks from 128.0.x.x to 191.255.x.x, each network with 65 534 usable addresses.  Class C went from 192.0.0.x to 223.255.255.x, and each had 254 usable addresses.  Other areas, such as 0.x.x.x and the networks after 224.x.x.x, were reserved.  So, in the early days, the smallest network of hosts you could have was a network of 254 hosts.  After a while IP introduced something called Classless Inter-Domain Routing (CIDR), which eliminated the fixed class boundaries and made it possible to “subnet” or “supernet” networks — divide or combine networks to make them just the right size for the number of hosts (and, with careful planning, grow or shrink them as plans changed).  With CIDR, since the size of the network was now variable, addresses had to be written with the subnet mask — a format known as “CIDR notation” came into use, where an address has the number of network bits written after it, like this: 192.0.2.0/24 (24 bits of network, the same size as an old Class C).

Fast-forward to today, with IPv6…  IPv4’s CIDR notation is used in IPv6 (mostly because the masks are so huge).  In IPv6, the smallest network that can be allocated is what is called a “/64”.  This means that out of the total 128-bit address range, 64 bits represent what network the address belongs to.  Let’s think about that for a second.  There are 32 bits in an IPv4 address — that means that the entire IPv4 Internet would fit in an IPv6 network with a /96 mask (128-32=96).  But the default smallest IPv6 subnet is /64 — the size of the existing IPv4 Internet squared!

Wait a second though, it gets better…  When I got my account with Gogo6, they offered me up to a /56 — a range that covers 256 /64s, or 256 Internet-squareds!  Better still, the Hurricane Electric tunnel-broker account gave me one /64 and one /48.  Sixty-five thousand networks (65 536, to be exact), each the size of the IPv4 Internet squared!  And how much did I pay for any of these allocations?  Nothing!
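The arithmetic behind those claims is easy to check — a few lines of Python make the sizes concrete:

```python
# Checking the subnet arithmetic from the prefix lengths.
ipv4_internet = 2 ** 32                 # every possible IPv4 address

hosts_per_64 = 2 ** (128 - 64)          # host addresses in a single /64
print(hosts_per_64 == ipv4_internet ** 2)   # one /64 = the IPv4 Internet, squared

print(2 ** (64 - 56))                   # /64 networks in a /56
print(2 ** (64 - 48))                   # /64 networks in a /48
```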

I can’t help but think that folks are repeating similar mistakes from the early days of IPv4.  A seemingly limitless address range (Vint said that 32 bits would be enough, right?) was given away in vast chunks.  In the early days of IPv4 we had networks with two or three hosts on them using up a Class C because of the limitations of addressing — in IPv6 we have LANs of maybe no more than a hundred or so machines taking up an entire /64 because of the way we designed auto-configuration.  IPv6 implementations now will be characterised not by how well their dual-stack implementations work, or how much more secure transactions have become thanks to the elimination of NAT, but by how much of the addressable range they are wasting.  So, is IPv6 just Same Sh*t, Different Millennium?

Like the early days of IPv4 though, things will surely change as IPv6 matures.  I guess I’m just hoping that the folks in charge are thinking about it, and not just high on the amount of space they have to play with now.  Because one day all those blades of grass will want their IP addresses, and the Internet had better be ready.

Update 16 May 2011: I just listened to Episode 297 of the Security Now program…  Steve Gibson relates some of his experience getting IPv6 allocation from his upstream providers (he says he got a /48).  In describing how much address space that is, he made the same point (about the “wasteful” allocation of IPv6).  At about 44:51, he starts talking about the current “sky is falling” attitude regarding IPv4, and states “you’d think, maybe they’d learn the lesson, and be a little more parsimonious with these IPs…”.  He goes on to give the impression that the 128-bit range of IPv6 is so big that there’s just no need to worry about it.  I hope you’re right, Steve!


Nagios service check for IAX

I’ve been using Nagios for ages to monitor the Crossed Wires campus network, but it’s fallen into a little disrepair.  Nothing worse than your monitoring needing monitoring…  so I set about tidying it up: network topology changes, removal of old kit, and fixes for service checks that no longer worked correctly.

One of the problems I needed to fix was the service check for IAX connections into my Asterisk box.  The check script (the standard one from the Nagios Plugins package) was set up correctly, but it would fail with a “Got no reply” message.

I started doing traces and “iax2 debug” in Asterisk, but got nowhere — Asterisk was rejecting the packet from the check script.  Finally I decided to JFGI, and eventually I found this page with the explanation and the fix.  Basically, sometime in the 1.6 stream Asterisk toughened up security on the control message the Nagios service check used to use.  Thankfully, at the same time a new control message specifically designed for availability checking was implemented, and the fix is to update the script to use the new control message.  Easy!

BTW, while on Nagios, I got burned by the so-called “vconfig patch” which broke the check_ping script.  I’ve had to mask version 1.4.14-r2 and above of the nagios-plugins package until the issue is fixed.


Asterisk and a Patton SmartNode

It’s been ages since I did an update on the main network machine here, and I bit the bullet over the weekend. 250+ packages emerged with surprisingly little trouble, and all that was left to do was build the updated kernel and reboot.
I usually end up with something that doesn’t restart after the reboot, usually because of a kernel module that needs to be rebuilt after the kernel (because I forget to remerge the package before the reboot, oops). This time the culprit was Asterisk (the phone system), which I also often have trouble with after an update due to a couple of codec modules external to the Asterisk build. This time however the problem ended up being due to the Asterisk CAPI channel driver failing.
Thinking it was the usual didn’t-rebuild-the-module problem, I went looking for the package I had to rebuild… only to find it was masked. Turns out the driver for the ISDN card in the box, a FritzCard PCI, is no longer maintained and doesn’t build on modern kernels, which has resulted in the Gentoo folks hard-masking the entire set of AVM’s out-of-tree drivers.
Help was at hand in the form of a Patton SmartNode 4552 ISDN VoIP router I’d bought months ago to replace the Fritz card. Even though there isn’t much information about how to configure the SmartNode for Asterisk around, I managed to get the setup working in only a couple of hours. I even managed to get the outgoing routing for the work line set up right!
Eventually I’ll get something posted here that goes into a bit more detail about the configuration. Let me know in a comment if you need to hurry me up! 🙂


ppc Linux on the PowerMac G5

With Apple’s abandonment of PPC as of Snow Leopard, I began wondering what to do with the old PowerMac. It’s annoying that so (comparatively) recent a piece of equipment should be given up by its manufacturer, but that’s a rant for another day. Yes, we can still run Leopard until it goes out of support, but with S and I both on MacBook Pros with current OS I know that we would both become frustrated with a widening functionality gap between the systems.

I had always resisted running Linux on the PowerMac, thinking that the last thing I needed was yet another Linux box in the house. I had tried a couple of times, but it was in the early days of support for the liquid cooling system in the dual-2.5GHz model and those attempts failed dismally. I figured that by now those issues would be resolved and I would have a much better time.

I assumed that Yellow Dog was still the ‘benchmark’ PPC Linux distro, so I went to their site. I saw a lot of data there about PS3 and Cell; it seems that YDL is transitioning to the cluster and/or research market by focussing on Cell.

The next thing I discovered is the lack of distributions that have a PPC version, even as a secondary platform. My old standby Gentoo still supports PPC, as does Fedora (I think: I saw a reference to downloading a PPC install disk, but didn’t follow it), but every other major distro has dropped it — openSUSE, for example, with their very latest release (their download page still has a picture of a disc labelled “ppc”, but no such download exists, oops). I guess that since the major producer of desktop PPC systems stopped doing so, the distros saw their potential install base disappear. Unfortunately for those distros, I can see the reverse happening: now that Apple has fully left PPC behind, plenty of folks like me who have moderately recent G4 and G5 hardware and who still want to run a current OS will come to Linux looking for an alternative… I guess time will tell who is right on this one.

So I went to install Gentoo, and to cut a long story short I had exactly the same problem as before: critical temperature condition leading to emergency system power-off. I found that if I capped the CPU speed to 2GHz I could stay up long enough to get things built, but then the system refused to boot because it couldn’t find the root filesystem. Probably something to do with yaboot, SATA drives and OpenFirmware. So again I’m putting it aside.

My next plan was to treat it as a file server. Surely a BSD would support my G5 hardware: after all, Mac OS X is BSD at heart… Well, no. FreeBSD has no support for SATA on ppc, OpenBSD specifically mentioned liquid-cooled G5s as having no support, and I don’t think I saw any ppc support on NetBSD more recent than G3 [1].

This is one of the things that annoys me about the computer industry: that somehow it’s okay to so completely disregard your older releases. What if the automotive industry worked that way?

So I may yet try Fedora, or give the game away for another year or so and see what the situation looks like then.

[1] I may have mixed up a couple of these details.

Edit: Gentoo’s yaboot has managed to make it so that I can’t boot Mac OS X on the machine any more.  Oh dear.


Sometimes, Gentoo bites

I had a failure of my Cacti system over the weekend, entirely down to bad Gentoo emerges. Two different problems, both caused by bad upgrades of packages brought in from ~amd64 or ~x86, made Cacti colourfully dysfunctional for a couple of days.

The first was an update to the spine resource poller, part of the Cacti project but installed separately (it used to be called cactid). Turns out that somewhere between 0.8.7a and 0.8.7b, bugs were introduced that made spine unreliable on 64-bit systems. The update brought in an SVN version of spine which, while still labelled 0.8.7a, must have been somewhere after one or more of the bugs came in. The symptom was that every data value obtained via SNMP was garbage and ignored.

The second issue was strange — graphs were getting generated (even those for which there was no data) but there was no text on them! Titles, margins, legend, axes, all were blank. Some posts pointed to a problem accessing the TTF font file provided with rrdtool, but the actual problem turned out to be the upgrade to rrdtool 1.2.28 which introduced different parameters for the configuration of text attributes in graphs — and a corresponding “feature” that suppressed any text output if the new parameters were missing.
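For anyone hitting the same wall, the shape of the change: rrdtool 1.2 moved graph text styling to per-element font parameters.  A sketch only, from memory — check your build’s rrdgraph documentation for the exact form:

```
# rrdtool 1.2-style graph text attributes (sketch, not a complete command)
rrdtool graph load.png \
    --title "Load average" \
    --font TITLE:12: \
    --font AXIS:7: \
    --font LEGEND:8: \
    ...
```

Graph templates generated for the 1.0-era options simply lack these parameters, hence the silently blank text.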

So what does “~” have to do with this? The software on your system is built according to the architecture of your machine. In Gentoo, this is called your “arch” (for architecture) and is usually “x86” or “amd64”. Gentoo implements a “testing branch” within each arch, keyworded as the arch name prefixed with “~”; if a pre-release version of a package exists in Portage you can bring it in with the “~x86” keyword. The nice thing about this is that you don’t have to enable a testing repository across your whole system — you can enable the ~ keyword for specific packages, and everything else stays stable.

Unfortunately, this flexibility has a cost. The “amd64” arch seems to lag a bit behind “x86” in terms of packages being marked stable or just simply having packages available. This means that just to get things installed, it’s necessary to flag packages with “x86”, “~amd64” or even “~x86”. This flagging is easily done — almost too easy in fact, as it creates a problem later on when the package you actually set the keyword for eventually becomes stable and you don’t need the keyword set any more. It’s a manual process to revisit the keywords you’ve set and verify that they are still needed (and you know how well manual processes work).

Some time ago I started adding comments to the Portage config file where keywords are set, trying to explain why I set the flag: “to bring in version 1.2.34” for example. That way, if I ever do get around to manually auditing the package.keywords file, I’ll be able to check if some of the keywords are still needed. Still a manual review though.
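To make the convention concrete, here’s the sort of thing I mean — the package atoms, dates and reasons below are illustrative only:

```
# /etc/portage/package.keywords
# 2008-03: ~amd64 for the spine fix on 64-bit -- drop once marked stable
net-analyzer/cacti-spine ~amd64
# 2007-11: ~x86 to bring in rrdtool 1.2.x -- recheck after the next stable bump
net-analyzer/rrdtool ~x86
```

A dated one-line reason per entry at least gives a future audit something to work with.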

So in the case of rrdtool and spine, I had set the “~” keyword some time in the past for some reason, possibly to get early access to a bug-fix ebuild. With no established method to revisit the keywords, I continued to pull in unstable versions of packages long after the packages I really needed had been marked stable. Eventually, it bit me.

The pre- and post-upgrade checklist grows some more…  🙂


Thinking of a Gentoo desktop

I know I’m going to cop a beating on the Planet for this post, but here goes…
For a long time I ran a desktop system built on Gentoo Linux. A while back I tried Ubuntu, and I’ve been running that as my desktop ever since. Every now and then, though, I feel an inclination to pop back to Gentoo — usually it will be because of some package I want to be able to install, or later versions of packages that don’t make it into the usual binary-distro world without introducing “dependency hell” (I’m having this problem at work, with a distro based on RHEL 5.1 and hardware that’s just too new for it… Even if I wanted to build drivers from source, the libraries the drivers link against are too old as supplied, meaning I’d have to rebuild the libraries too, which probably means something else will be too old…).

I run Gentoo on both my “servers” at home. At the time I got my dual-Opteron, Gentoo was the only “free” distro around that had a x86_64 version ready to roll. When it came time to build my phone-and-TV server, it got Gentoo as well because it was the only way I could get the right combination of all the versions of code (Apache, PHP, Asterisk, MySQL, MythTV, ccxstream, etc) that I needed and have them all maintained in the distribution’s package management system (Debian has no ccxstream package, for instance). I don’t run Gentoo because I’m a ricer. Portage has the right package mix for me, and its ability to control the configuration of packages through USE flags gives me an opportunity to control the options that are enabled in the packages I install.

I have blogged previously about some hardware I bought that I haven’t been able to put to good use. I decided to give it another try by building a Gentoo system on it, because an ebuild for the bleeding-edge ATI driver that is supposed to support the graphics chipset in this clunker is in Portage.

Let me say, it’s been a while since I built a Gentoo system from scratch. You don’t even do it truly from scratch anymore either — the days of starting with a stage-1 tarball are over apparently, and stage-3 is always the way to go. Even so, this system took a whole weekend to get to the stage where I could log on and get a KDE desktop (to be fair though, there was a lot of kicking off an emerge, coming back to it a couple of hours later to find it had died ten minutes in, fixing the issue and restarting… so it wasn’t 48 hours solid time spent).

Unfortunately the ATI driver still doesn’t support XVideo on this chipset, so I still can’t use this board for its intended use as a MythTV frontend (I do have an old PCI nVidia 5200 card that, even though it’s at least three years old, I’m sure will run rings around this stinking ATI 1250). So the point of the whole exercise was, unfortunately, lost. But I did get a refresher in the amount of effort a Gentoo build would take.

After that weekend’s effort, I was a bit put off by the thought of building up an entire desktop system from scratch. When I thought about it though, my concerns were for nothing. The compiling? The kind of systems I’m building on (modern dual-core chips) will chew through compiling most software in a snap — heck, for simple packages I can install on a Gentoo system quicker than yumex can initialise its repositories. I’ve got running systems I can use as a model to get USE flags right, and my NFS-shared Portage tree means that I sync once and use everywhere (even downloading source packages happens only once).

Plus, now, I know Gentoo. Sure, APT on a Debian-based distro is nice, but I’m still lost when it comes to the right dpkg command to locate what package provides a certain file, for instance. I get frustrated when something fails to build on Gentoo because some other package wasn’t built with the right USE flag, but I know how to fix that, and it’s fixed in a flash. Likewise for rebuilding some system library that causes a bunch of other packages to fail without warning, and likewise for the strange b0rkedness that happens in Portage sometimes when packages change versions (gnupg is a recent example). I know how to fix Gentoo when it breaks — I can’t say that with much confidence for other distros.

Some might say “use a distro that doesn’t break in the first place”, which is a fair comment. But if I have to choose between an occasional hiccup and missing functionality, then hand me the Eno (Pepto-Bismol, Tums, etc)… 😉

Which brings me to my dilemma — apart from the fact that I have crappy unaccelerated non-video graphics and I haven’t been able to run Compiz for ages (a problem that Gentoo wouldn’t solve for me anyway), Ubuntu isn’t really broken for me. There’s not a compelling reason for me to throw Gutsy out, and with Hardy around the corner there’s even less reason to switch right now.

So, I’ll wait. And watch. Having to work on more Red Hat systems at work is reacquainting me with their particular mojo, perhaps even enough to try Fedora. Also, I’ve just scraped together some parts to make an openSUSE 10.3 build for something work-related so I’ll catch up with things there (since I haven’t really seen a SUSE system as a desktop since SuSE Linux 7).

I love this about Linux — freedom to choose!


Gentoo + jabberd = aargh

I’ve been running jabberd2 from ~x86 for ages. Tonight I went to make some config changes, and stopped and started jabberd using the init script like usual. Things were different though, as the init script didn’t shut down all the Jabber tasks and I had to stop them manually. When I went to restart it, only two processes were shown and not all the separate processes I was used to.

Nothing was being logged either, which made it hard to work out what was going on and why the processes weren’t starting. It was as if it was suddenly ignoring all my configuration files!

Careful inspection of some output from eix showed the problem: Jabberd 2 has been moved to its own ebuild (jabberd2), and the highest version in the jabberd ebuild is now a 1.4.4-something. Not only that, they’ve hard-masked jabberd2:

# Krzysiek Pawlik  (08 Oct 2007)
# Masked untill the split from net-im/jabberd is complete.
# See bug #178055 and bug #195091

Looks like the last time I emerged, I downgraded my Jabberd 2 to 1.4. No wonder the thing was not responding to me.
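Had I wanted to stay on the 2.x series despite the hard mask, the usual escape hatch is a couple of /etc/portage entries. A sketch only; the exact atom to use depends on how the split ebuilds end up looking:

```shell
# /etc/portage/package.unmask -- override the hard mask
net-im/jabberd2

# /etc/portage/package.keywords -- keep accepting the ~x86 versions
net-im/jabberd2 ~x86
```

Given the mask comment says the split is still in progress, waiting it out is probably the saner option.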

This is the kind of thing that happens on Gentoo from time to time. It’s why I started a regular routine of syncing Portage and emailing myself the output of emerge --pretend world: so that I don’t get too far behind and end up with a heap of these things to sort out. This one caught me off guard though.
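The routine itself is nothing fancy, something along these lines run nightly from cron (the mail recipient is a placeholder):

```shell
#!/bin/sh
# Sync the tree, then mail what an update *would* do.
# Nothing is actually merged; --pretend only prints the plan.
emerge --sync --quiet
emerge --pretend --update --deep --verbose world 2>&1 \
    | mail -s "pending updates on $(hostname)" root
```

Which, of course, only helps if you actually read the mail carefully, as this episode proves.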

Note to self: pay closer attention to emerge output in future!


Gentoo “hardened” multilib?

I had some system problems yesterday.  My VMware guests just stopped.  Middle of the day and they just died.  I tried to run the management console and even the command-line programs, but they all failed with the infamous “VMware is installed but is not configured for this system…” message and the prompt to run  I re-emerged vmware-server and vmware-modules with no luck. itself was failing when it tried to run vmware-vmx for the serial number check; the error was “No such file or directory”.  But there it was, right where it was supposed to be, permissions correct and everything…

Knowing that generic error can apply to a missing file that the program is trying to execute, I checked what type of file I was looking at: file reported a dynamically linked program.  Great, run ldd to find out what it wants: ldd reports “not a dynamic executable”.  Oh dear.  It was starting to look like a long night was ahead.
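That contradictory pair of answers is the classic signature of a 32-bit binary on a 64-bit system that has lost its 32-bit loader. Roughly what the session looked like (the vmware-vmx path is from memory):

```shell
# file understands the binary just fine:
file /opt/vmware/server/bin/vmware-vmx
#   reports a 32-bit, dynamically linked ELF executable

# ...but ldd throws up its hands:
ldd /opt/vmware/server/bin/vmware-vmx
#   not a dynamic executable

# The giveaway: the 32-bit dynamic loader the binary names in its
# ELF header is gone on a non-multilib 64-bit system, so execve()
# fails with ENOENT -- hence "No such file or directory" for a
# file that is plainly there
ls /lib/ld-linux.so.2
```

The “No such file or directory” refers to the missing loader, not the binary itself, which is why the error is so misleading.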

I jumped on the Googleweb and discovered that others had encountered the problem I was seeing, but the hits were all a couple of years old.  Their problems seemed to be caused by missing 32-bit libraries on a 64-bit system.  How could this happen?  In older Gentoo releases you had to choose multilib, but according to most of the doco all profiles are multilib unless you choose a “non-multilib” profile (this explained the fact there were few-to-no recent hits for the issue).

Recently I had switched to the hardened profile…  I had a look, and there is a separate “multilib” profile in hardened.  So is the doco wrong: are all profiles multilib except ones called “non-multilib” AND except hardened because they have a different rule?

I had two choices then: try out the hardened multilib profile, or switch back to the previous profile I used.  Considering I hadn’t enabled any Hardened features and don’t really have time to figure it all out at the moment anyway (I only did it to get rid of the “unsupported profile” warning I got every time I merged a package), I copped out and switched back to the old profile.
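For anyone playing along, the profile juggling itself is just eselect; the list numbers and the hardened multilib path shown here are illustrative and will differ per system:

```shell
eselect profile list     # show available profiles, current one marked
eselect profile set 5    # e.g. pick the hardened multilib entry

# Equivalently, by hand: the profile is only a symlink
ln -sfn /usr/portage/profiles/hardened/x86_64/multilib /etc/make.profile
```

Switching the symlink is instant; it’s the rebuild of the toolchain afterwards that hurts, as I was about to find out.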

Then I had the next issue: I couldn’t use the non-multilib gcc and glibc to build multilib versions of gcc and glibc.  The gcc build complained about a missing 32-bit header (which should have been part of glibc) and the glibc build complained that cpp failed its sanity check.  Again the Googleweb came to the rescue, pointing me to a Gentoo repository containing binary packages of gcc and glibc that I could install, which then let me rebuild my own gcc and glibc.
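Bootstrapping out of that chicken-and-egg with prebuilt packages looks roughly like this; the binhost URL is a placeholder for whatever the forum post pointed at:

```shell
# Pull prebuilt toolchain packages instead of compiling with
# the broken one (-K/--usepkgonly, -g/--getbinpkg)
PORTAGE_BINHOST="http://example.org/gentoo-binpkgs" \
    emerge --usepkgonly --getbinpkg sys-devel/gcc sys-libs/glibc

# With a working multilib toolchain in place, rebuild them properly
emerge --oneshot sys-devel/gcc sys-libs/glibc
```

The --oneshot at the end just keeps the rebuilds out of the world file, since both packages are already in the system set.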

At this point I found that would run again.  I was BACK!  I started the VMware services, ran the management console, and started my VMs.

I think I get a bit complacent with my home gear sometimes; switching profile to hardened was something I almost did on a whim, and it’s bitten me fairly badly.  Lesson learned.