Posts Tagged Linux

Programming decisions

My Linux-based Large-Scale Cloning Grid experiment, which I’ve implemented four times now (don’t ask, unless you are ready for a few looooooong stories), has three main components to it.  The first, which provides the magic by which the experiment can even exist, is the z Systems hypervisor z/VM.  The second is the grist to the mill, both the evidence that the thing is working and the workload that drives it: the cluster monitoring tool Ganglia.  The third, the component that keeps it alive, starts it, stops it, checks on it, and more, is a Perl script of my design and construction.  This script is truly multifunctional, and at just under 1000 lines could be the second-most complex piece of code [1] I’ve written since my university days [2].

I think I might have written about this before — it’s the Perl script that provides an IRC bot which manages the operation of the grid.  It schedules the starting and stopping of guests in the grid, along with providing stats on resource usage.  The bot sends grid status updates to Twitter (follow @BODSzBot!), and recently I added generation of files which are used by some HTML code that generates a web page with gauges summarising the status of the grid.  I wrote the first version of the script in Perl simply because the first programming example of an IRC bot I found happened to be a Perl script; I had no special affinity for Perl as a language and, despite my finding a lot of useful modules that have helped me develop the code, Perl doesn’t really lend itself to the task especially well.  So it probably comes as little surprise to some readers that I’m having some trouble with my choice.

Initially the script had no real concern over the status of the guests.  It would fire off commands to start guests or shut them down, or check the z/VM paging rate.  The most intensive thing the script did was to get the list of all users logged on to z/VM so it could determine how many clone guests were actually logged on, and even that takes a fraction of a second now.  It was when I decided that the bot should handle this a bit more effectively, keeping track of the status of each guest and taking some more ownership over maintaining them in the requested up or down state, that things started to come unstuck.

In the various iterations of this IRC bot, I have used Perl job queueing modules to keep track of the commands the bot issues.  I deal with sets of guests 16 or 240 at a time, and I don’t want to issue 2000 commands in IRC to start 2000 guests.  That’s what the bot is for: I tell it “start 240 guests” and it says “okay, boss” and keeps track of the 240 individual commands that achieve that.  This time around I’m using POE::Component::JobQueue, the main reason being that the IRC module I started to use was the PoCo (the short name for POE::Component) one.  It made sense to use a job queueing mechanism that used the same multitasking infrastructure that my IRC component was using.  (I used a very key word in that last sentence; guess which one it was, and see if you’re right by the end.)

With PoCo::JobQueue, you define queues which process work depending on the type of queue and how it’s defined (number of workers, etc).  In my case my queues are passive queues, which wait for work to be enqueued to them (the alternative is active queues, which poll for work).  The how-to examples for PoCo::JobQueue show that the usual method of use is a function that is called when work is enqueued, and that function then starts a PoCo session (PoCo terminology for a separate task) to actually handle the processing.  To cover the separate tasks that have to be done, I have five queues, each of which at various times creates sessions that run for the duration of an item of work (expected to be quite short), plus one session running the IRC interaction and one long-running “main thread” session.  The pattern looks roughly like the sketch below.
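
This is a minimal sketch of that pattern rather than the real bot; the queue alias, worker limit, guest names, and the “issue a z/VM command” step are all placeholders:

use strict;
use warnings;
use POE qw(Component::JobQueue);

# Passive queue: items wait until something enqueues them, and each item
# is handed to the Worker callback, which spawns a short-lived session.
POE::Component::JobQueue->spawn(
    Alias       => 'cmdqueue',
    WorkerLimit => 2,              # hypothetical: two commands in flight at once
    Worker      => \&spawn_worker,
    Passive     => {},             # plain FIFO, no custom prioritizer
);

sub spawn_worker {
    my ($postback, $guest, $action) = @_;
    POE::Session->create(
        inline_states => {
            _start => sub {
                # The real script would issue the z/VM command here
                # (XAUTOLOG, SIGNAL SHUTDOWN, ...); this just pretends.
                print "worker: $action $guest\n";
                $postback->('ok');
            },
        },
    );
}

# A "main" session that enqueues work and receives the postbacks.
POE::Session->create(
    inline_states => {
        _start => sub {
            my $kernel = $_[KERNEL];
            # hypothetical guest names
            $kernel->post( cmdqueue => enqueue => cmd_done => $_, 'start' )
                for qw(CLONE001 CLONE002 CLONE003);
        },
        cmd_done => sub {
            my ($job, $result) = @_[ARG0, ARG1];
            print "done: @$job -> @$result\n";
        },
    },
);

POE::Kernel->run();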

The problems I have experienced recently include commands being queued but never actioned, and the statistics files (for the HTML gauges) not being updated.  There also seems to be a problem with the code that updates the IRC topic (which happens every minute if changes to the grid have occurred, such as guests being started or stopped, or every hour if the grid is steady-state), whereby the topic gets updated even though no guest changes have occurred.  When this starts to happen, the bot drops off IRC because it fails to respond to a server PING.  While it seems like the PoCo engine stops, some functions are actually still operating — before the IRC timeout happens the bot will respond to commands.  So it’s more like a “graceful degradation” than a total stop.

I noticed these problems because I fired a set of commands to the bot that would have resulted in 64 commands being enqueued to the command processing queue.  I have a “governor” that delays the actual processing of commands depending on the load on the system, so the queue would have grown to 64 items and over time the items would be processed.  What I found is that only 33 of the commands that should have been queued were actually run, and soon after that the updating of stats started to go wrong.  As I started to look into how to debug this I was reminded of a very important thing about Perl and threads.

Basically, there aren’t any.

Well that’s not entirely true — there is an implementation of threads for Perl, but it is known to cause issues with many modules.  Basically threading and Perl have a very chequered history, and even though the original 5.005-style Thread module has been replaced by ithreads there are many lingering issues.  On Gentoo, the ithreads USE flag is disabled by default and carries the warning “has some compatibility problems” (which is an understatement as I understand it).

With my code, I was expecting that PoCo sessions were separate threads, or processes, or something, that would isolate potentially long-running tasks like the process I had implemented to update the hash that maintains guest status.  I thought I was getting full multitasking (there’s the word, did you get it?), the pre-emptive kind, but it turns out that PoCo implements “only” cooperative multitasking.  POE only gets the chance to switch between sessions when an event handler returns control (the “cooperative” part), so my long-running task is probably hogging the event loop and starving the other sessions and queues.  A sketch of the kind of restructuring POE expects is below.
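
This is the general POE idiom rather than my actual code (the guest list and the status check are stand-ins).  The point is that the work gets done a slice at a time, with the session re-posting an event to itself, so every other session, IRC pings included, gets a turn in between:

use strict;
use warnings;
use POE;

my @guests = map { sprintf 'CLONE%04d', $_ } 1 .. 2000;   # stand-in guest list

POE::Session->create(
    inline_states => {
        _start => sub { $_[KERNEL]->yield( refresh_slice => 0 ) },

        # What I effectively had: one handler that walks every guest before
        # returning, so nothing else runs until it finishes.
        #   refresh_all => sub { check_guest($_) for @guests },

        # The cooperative version: one slice per event, then hand control back.
        refresh_slice => sub {
            my ($kernel, $start) = @_[KERNEL, ARG0];
            my $end = $start + 49;
            $end = $#guests if $end > $#guests;
            check_guest($_) for @guests[ $start .. $end ];
            $kernel->yield( refresh_slice => $end + 1 ) if $end < $#guests;
        },
    },
);

sub check_guest {
    my ($guest) = @_;
    # Stand-in for the real status check (SMAPI query, CP command, etc.).
}

POE::Kernel->run();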

So I find myself having to look at redesigning it.  Maybe cramming all this function into something that also has to manage real-time interaction with an IRC server is too much, and I need to have a separate program handling the status management and some kind of IPC between them [3].  Maybe the logic is sound, and I need to switch from Perl and PoCo to a different language and runtime that supports pre-emptive multitasking (or, maybe I recompile my Perl with ithreads enabled and try my luck).  I may even find that this diagnosis is wrong, and that some other issue is actually the cause!

I continue to be amazed that this experiment, now in its fourth edition, has been more of an education in the ancillary components than in the core technology in z/VM that makes it possible.  I’ve probably spent only 5% of the total effort involved in this project actually doing things in z/VM — the rest has been Perl programming for the bot, and getting Ganglia to work at scale (the most time-consuming part!).  If you extend the z/VM time to include directory management issues, 5% becomes probably 10% — still almost a rounding error compared to the effort spent on the other parts.  z Systems and z/VM are incredible — every day I value being able to work with systems and software that Just Work.  I wonder where the next twist in this journey will take me…  Wish me luck.

—-

[1] When I was working at Queensland Rail I worked with a guy who, when telling a story, always used to refer to the subject of the story as “the second-biggest”, or “the second-tallest”, or “the second-whatever”.  Seemed he wanted to make his story that little bit unique, rather than just echoing the usual stories about “biggest”, “tallest”, or “whatever”.  I quizzed him one day, when he said that I was wearing the second-ugliest tie he’d ever seen, what was the ugliest…  Turned out he had never anticipated anyone actually asking him, and he had no answer.  Shoutout to Peter (his Internet pseudonym) if you’re watching 😉

[2] Despite referencing [1], and being funny, it’s actually an honest and factual statement.  I did write one more complex piece of code since Uni, being the system I’d written to automatically generate and activate VTAM DRDS decks (also while I was at QR) for dynamically updating NCP definitions.  Between the ISPF panels (written in DTL) and the actual program logic (written in REXX) it was well over 1000 lines.  The other major coding efforts that I’ve done for z/VM cloning have probably been similarly complex to this project but had fewer actual lines of code.  Thinking about it, those other coding efforts drew on comparatively few external modules while this Perl script uses a ton of Perl modules; if you added the cumulative LOC of the modules along with my code, the effective LOC of this script would be even larger.

[3] The previous instantiation of this rig had the main IRC bot and another bot, written in REXX and running in a CMS machine, for doing stuff in DirMaint on z/VM.  The Perl bot and the REXX bot sent messages to each other over IRC as their IPC mechanism.  It was weird, but cute at the same time!  This time around I’m using SMAPI for the DirMaint interface, so no need for the REXX bot.


Confidence

Among the coffee mugs in my cupboard at home is one I’ve had for over 20 years.  It was a gift; if I remember right, a semi-joke gift in an office “Secret Santa”.

[Photo: the mug]

“Works and plays well with others”. O RLY?

The slogan on it reads “Works and plays well with others”, and it’s a reference to one of the standard phrases seen on children’s school report cards.  It’s one of the standard mugs in my hot beverage rotation, and every time I use it I can’t help but think back to when it was new, and of how much has changed since those days.

It’s easy to treat a silly slogan on a coffee mug as little more than just a few words designed to evoke a wry grin from a slightly antisocial co-worker.  Sometimes it can take on a deeper meaning, if you let it.

For the last 6 months or more I’ve been working on transferring the function of our former demonstration facility in Brisbane to a location in Melbourne.  This has been fraught with problems and delays, not the least of which was an intermittent network fault into the network our systems are connected to.  Steady-state, things would be fine; I could have an IRC client connected to a server in our subnet for days at a time.  But when I actually tried to do anything else (SSH, HTTP, etc), within about 5 minutes all traffic to the subnet would stop for a few minutes.  When traffic passed again, it would stay up for five or so minutes and then fail.  Wash, rinse, repeat.

It looked like the problem you get when Path MTU Discovery (PMTUD) doesn’t work and you have an MTU mismatch[1].  I realised that we had a 1000BaseT network connected to a 100BaseT switch port, so I went around all my systems and changed where I was trying to use jumbo frames, but that made no difference to the network dropouts.  I found Cisco references to problems with ARP caches filling, but I couldn’t imagine that the network was so big that MAC address learning would be a problem (and if general MAC learning was constrained, why was no-one else having a problem?).

Everything I could think of was drawing blanks.  I approached the folks who run the network we uplink through, and all they said was “our network is fine”.  I was putting up with the problem, thinking that it was just something I was doing and that in time we would change over to a different uplink and we wouldn’t have to worry any more.  My frustration at having to move everything out of the wonderful environment we had in Brisbane down to Melbourne, with its non-functional network, multiplied every time an SSH connection failed.  I actually started to rationalise that it was pointless to continue with setting up the facility in Melbourne; I’d never be able to re-create what I’d built in Brisbane, it would never be as accessible and useful, and besides no-one other than me had ever made good use of the z Systems gear in the Brisbane lab anyway.  Basically, I had lost confidence in myself and my ability to get the network fixed and the Melbourne lab set up.

Confidence is a mental strength, like the muscles which provide our physical strength.  Just like muscle, confidence grows with active use and wastes away if underused.  Chemicals can boost it, and trauma can damage it.  Importantly though, confidence can be a huge barrier to a person’s ability to “work and play well with others” — too little confidence and one lacks conviction and struggles to make decisions; too much and one appears overbearing and dictatorial.

Last week I was in Singapore for the z/VM, Linux on z, and KVM “T3” event.  Whenever I go to something like this I get fired up by all of the things that I’d like to work on and have running to demo.  The motivation to get new things going in the lab overcame my pessimism about the network connection (and lack of confidence), and I got in touch with the intern in charge of the network we connect through.  All I need, I said, is to look and see what the configuration of the port we connect into looks like.  We agreed to get together when I returned from Singapore, and try to work out the problem.

We got into the meeting, and I went over the problem in terms of how we experience it — a steady state that could last for days, then activity leading to three-minute lockouts.  I asked if I could see the configuration of the port we attached to… after a little bit of discussion about which switch and port we might be on, a few lines of Cisco CatOS configuration statements appeared in our chat session.  Straight away I saw:

switchport port-security

W. T. F.

Within a few minutes I had Googled what this meant.  Sure enough, it told the switch to monitor the active MAC addresses on that port and disable the port if “unknown” MACs appear.  There were no configured MACs, so it just remembered the first one it saw.  It explained why I could have a session running to one system (the IRC server) for ages, and as soon as I connected to something else everything stopped — the default violation mode is “shutdown”.  It explained why the traffic would stay down for three minutes and then begin again — elsewhere in the switch configuration was this:

errdisable recovery cause psecure-violation 180

If the switch disabled a port due to port-security violation, it would be automatically recovered after 180 seconds.

The guys didn’t really understand what this all meant, but it made sense to me.  Encouraged by my confidence that this was indeed the problem, they gave me the passwords to log on to the switch and do what I thought was needed to remove the setting.  A couple of “no” commands later and it was gone… and our network link has functioned perfectly ever since.

The real mystery for the other network guys was: why has this suddenly become a problem?  None of them had changed the network port definition, so as far as anyone knew the port had always been configured with Port Security.  The answer to this question is, in fact, on our side.  To z/VM and Linux on z Systems people, networks come in two modes: “Layer 3” or “IP” mode, where the system only deals with IP addresses, and “Layer 2” or “Ethernet” mode, where the system works with MAC addresses.  In Layer 3 mode, all the separate IP addresses that exist within Linux and z/VM systems sit behind the single MAC address of the mainframe OSA card.  In Layer 2 mode however, each individual Linux guest or z/VM stack gets its own MAC address.  When we first set up this network link and the z/VM and Linux systems behind it, the default operating mode was Layer 3, so the network switch only saw one or two MAC addresses.  Nowadays though the default mode is Layer 2.  When I built new systems for moving everything down from Brisbane, I built them in Layer 2 mode.  Suddenly the network switch port was seeing dozens of different MAC addresses where it used to see only one or two, and Port Security was being triggered constantly.  (Checking which mode a guest is using is easy enough; see the commands below.)
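
The device address 0.0.0600 below is a placeholder, and lsqeth and znetconf come from the s390-tools package:

# 1 = Layer 2, 0 = Layer 3
lsqeth eth1
cat /sys/bus/ccwgroup/devices/0.0.0600/layer2

# bringing a new OSA interface online in Layer 2 mode
znetconf -a 0600 -o layer2=1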

This has been a learning experience for me.  Usually I don’t have any trouble pointing out where I think a problem exists and how it needs to be fixed.  Deep down I knew the issue was outside our immediate network and yet this time for some reason I lacked the ability, motivation, nerve, or whatever, to chase the folks responsible and get it fixed.  The prospect of trying to work with a group of guys who, based on their previous comments, really strongly thought that their gear was not the problem, was so daunting that it became easier to think of reasons not to bother.  Maybe it’s because I didn’t know for certain that it wasn’t something on our side — there is another problem in our network that definitely is in our gear — so I kept looking for a problem on our side that wasn’t there.

For the want of a 15-minute phone meeting, we had endured months of a flaky network connection.

On this occasion it took me too long to become sufficiently confident to turn my thoughts into actions.  Once I got into action though, it was the confidence I displayed to the network team that got the problem fixed.   For me, lesson learned: sometimes I need a prod, but I am one who “works and plays well with others”.


[1] I get that all the time when using the Linux OpenVPN client to connect to this lab, and got into the habit of changing the MTU manually.  Tunnelblick on the Mac doesn’t suffer the same problem, because it has a clever MTU monitoring feature that keeps things going.


DisplayLink and x2x brings back Zaphod mode

Ever since work issued me a Lenovo T61 and I installed Fedora on it, I have lamented the loss of something that X aficionados referred to as “Zaphod mode”.  By gluing together a few different software and hardware components I managed to get close to the old Zaphod mode days — but first some background…

Usually when you set up a multi-monitor installation you get a single desktop that spans all the screens.  This is great when you have a single desktop, but on Linux multiple desktops are the norm.  When I started using multiple screens in Linux, I loved the extra screen real estate but the fact that switching virtual desktops caused *all* the windows on all the screens to switch really bugged me.  I wanted the ability to have something — like an email program, or a web browser — to stay on one screen while I switched between desktop views on the other screen.  Or better still, the ability for both screens to have virtual desktops that were independent of each other.

Enter “Zaphod mode”, named for Zaphod Beeblebrox from The Hitchhiker’s Guide to the Galaxy by Douglas Adams.  Beeblebrox, who was President of the Galaxy before he stole the Starship Heart Of Gold, had two heads that were independent of each other.  In X server terms, multiple display devices are often referred to as “heads”.  So you can probably deduce that “Zaphod mode” refers to an operating mode of the X server where the multiple “heads” or display devices function as separate displays.

Go back far enough and you get to a point where that was the standard mode of operation of X.  The X extension “Xinerama” was developed to provide the merging of different X displays into a single screen.  NVidia also had a hardware/firmware-based equivalent called TwinView, where multiple heads on an NVidia card (and even sometimes heads on different cards) could be joined.  These extensions were not without their problems, however: it was common for windows and dialog boxes to get confused about which display to appear on.  You would almost always see dialog boxes that were meant to display in the middle of the screen being split across the two physical displays.  Also, there was the multiple-desktop “inconvenience” of not being able to switch the desktops independently.

Zaphod mode fixed these problems.  Because the screens were separate, windows and dialog boxes always appeared in the centre of the physical screen.  You could leave a web browser on one screen while you switched between an e-mail client, an IRC client, and an SSH session in the other.  It wasn’t all beer-and-skittles though, since in Zaphod mode it was not possible to move an application from one screen to the other.  Plus, some applications like Firefox could not have windows running on both screens (the second one to start could not access the user profile).

Zaphod mode largely “went away” during the transition from XFree86 to Xorg.  The servers dropped support for multiple separate displays in the one server, and only gradually added it back in (with the Intel driver being one of the last to do so, if it ever has).  Since laptops were the only place I still used multiple screens, and the laptops I used all had Intel integrated graphics, I had to do without Zaphod mode.

Today, I hardly use dual monitors at all.  I used to have a desktop system with a 21″ CRT flanked by 17″ LCDs on either side, but that all got replaced by a single 24″ LCD.  At work we don’t have assigned desks, so setting up a screen to plug the laptop into isn’t going to happen.  I guess I learned to live without Zaphod mode by just going back to a single screen.  I still remember my Zaphod-powered dual-screen days fondly though, and with almost every update to Xorg I would scan the feature list looking for something like “Support for configuration of multiple independent displays (Zaphod mode)”.

A while back I bought a DisplayLink USB to DVI adapter.  I didn’t really know what to do with it at the time, but recently I dug it out and tried setting it up.  Googling for “DisplayLink Fedora” sent me to a couple of very helpful pages and it didn’t take long to get the “green screen of life” that indicates that the DisplayLink driver was active.  It was when I was looking at how to make it work as an actual desktop — part of the process involves setting up a real xorg.conf (that’s right, something about the DisplayLink X server means it can’t be configured by the Xorg auto configuration magic) — that I realised I could do something wonderful.  Instead of making a config file that contained both my standard display and the DisplayLink device (and probably cause havoc for the 90% of times I boot without an additional screen) I would create a config file with *just* the DisplayLink device and start it as a second server.  Run a different window manager in there, and I would have two independent desktops — Zaphod mode!
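
Pending the real thing (see the end of this post), here is a guess at the minimal shape of that config file.  The driver and framebuffer node are assumptions: the udlfb kernel driver usually exposes the DisplayLink adapter as /dev/fb1, which the generic fbdev X driver can sit on.

Section "Device"
    Identifier  "DisplayLinkDevice"
    Driver      "fbdev"
    Option      "fbdev" "/dev/fb1"
EndSection

Section "Screen"
    Identifier  "DisplayLinkScreen"
    Device      "DisplayLinkDevice"
EndSection

Section "ServerLayout"
    Identifier  "DisplayLinkLayout"
    Screen      "DisplayLinkScreen"
EndSection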

I did a couple of little experiments just starting an xterm in the second X, and it worked fine, with the desktop and the xterm appearing on the second monitor (the more alert of you will realise that I’m taking a bit of artistic license with the word “fine” here, and know that three little letters in the title of this post are a clue to what wasn’t yet working…).  I installed XFCE, and configured it to start as the window manager of the second X server, which also worked well.
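
Starting the second server then looks something like this (display number and config file name are whatever you chose; depending on the distribution you may need root, or an Xwrapper tweak, to start a second server):

xinit /usr/bin/startxfce4 -- :1 -config xorg-displaylink.conf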

Something was missing though: there was no mouse input to the second screen.  In Zaphod mode, even though the two screens were separate X displays they were managed by the same server.  This meant that the input devices were shared between the two displays.  In this configuration, I was careful to exclude any mouse and keyboard devices from my second display config to avoid any conflicts.  So how was I to get input device data into the second server?  A second display is not much good if you can’t click and type on the applications that run on it…

I remembered an old program called x2x that could transfer the mouse and keyboard events to a different X server when you moved the mouse to the edge of your display (and, inexplicably, I forgot all about a much younger program called Synergy that can do the same thing).  Since x2x isn’t packaged for Fedora I found the source, built it, and started it up…  and it worked first time!  When I moved the mouse to the edge of the screen, it appeared on the other screen!  I could start apps and type into them exactly as I wanted.
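
The x2x invocation itself is a one-liner; this example assumes the DisplayLink screen sits to the right of the main one and the second server is display :1.

x2x -east -to :1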

It wasn’t perfect, however.  I found that when I returned the mouse to the primary screen, the second screen was still getting keyboard events.  I figured this would be particularly inconvenient when, for example, I was entering user and password details into an app on the primary screen while an editor or terminal program had focus on the second screen…  I checked the Xorg.1.log file, and found that even though I had not specified a “keyboard” input device Xorg was automatically defining one for me.  I turned off the udev options, but it still happened.  My initial enthusiasm was starting to fade.

What fixed it was to manually define a “dummy” keyboard device.  There must be some logic in Xorg that refuses to allow a configuration with no keyboard at all (which makes sense), so in this rather unusual case where I don’t want a keyboard I have to define one anyway and give it a dummy device definition.  Defining the dummy keyboard stopped Xorg from defining its automatic one, and everything worked as expected!  Even screensavers work more-or-less as designed (although I haven’t actually spent much time in front of the setup yet so haven’t had to unlock the screen that often).
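
My best reconstruction of that piece of the config is below.  The void driver (from xf86-input-void) is an assumption, and the exact server flags needed vary with the Xorg version; the point is simply that a keyboard exists in the layout so Xorg doesn’t add a real one of its own.

Section "ServerFlags"
    Option      "AutoAddDevices" "false"
EndSection

Section "InputDevice"
    Identifier  "DummyKeyboard"
    Driver      "void"
EndSection

# ...and in the ServerLayout section:
#     InputDevice "DummyKeyboard" "CoreKeyboard"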

I’m away from the computer in question right now, otherwise I would post configs and command lines (and even a pic of the end result).  I’ll update this post with the details — leave a comment if you think I need to hurry up!  🙂



Oracle Database 11gR2 on Linux on System z

Earlier this year (30 March, to be precise) Oracle announced that Oracle Database 11gR2 was available as a fully-supported product for Linux on IBM System z.  A while before that they had announced E-Business Suite as available for Linux on System z, but at the time the database behind it had to be 10g.  Shortly after 30 March, they followed up the 11gR2 announcement with a statement of support for the Oracle 11gR2 database on Linux on System z as a backend for E-Business Suite — the complete, up-to-date Oracle stack was now available on Linux on System z!

In April this year I attended the zSeries Special Interest Group miniconf[1], part of the greater Independent Oracle Users Group (IOUG) event COLLABORATE 11.  I was amazed to discover that there are actually Oracle employees whose job it is to work on IBM technologies — just like there are IBM employees dedicated to selling and supporting the Oracle stack.  Never have I seen (close-up) a better example of the term “coopetition”.

Since my return from the zSeries SIG and IOUG, I’ve become the local Oracle expert.  However, I’ve had no more training than the two days of workshops run at the conference!  The workshops were excellent (held at the Epcot Center at Walt Disney World, no less!) but they could not an expert make.  So I’ve been trying to build some systems and teach myself more about running Oracle.  I thought I’d gotten off to a good start too — I’d installed a standalone system, then went on to build a two-node RAC.  I communicated my success to one of my sales colleagues:

“I’ve got a two-node RAC setup running on the z9 in Brisbane!”

“Great!  Good work,” he said.  “So the two nodes are running in different LPARs, so we can demonstrate high-availability?”

” . . . ”

In my haste I’d built both virtual machines in the same LPAR.  Whoops.  (I’ve fixed that now, by the way.  The two RAC nodes are in different LPARs and seem to be performing better for it.)

Over the coming weeks, I’ll write up some of the things that have caught me out.  I still don’t really know how all this stuff works, but I’m getting better!

Links:

IBM System z: www.ibm.com/systems/z or www.ibm.com/systems/au/z

Linux on System z: www.ibm.com/systems/z/os/linux/index.html

Oracle zSeries SIG: www.zseriesoraclesig.org

Oracle Database: www.oracle.com/us/products/database/index.html

[1] Miniconf is a term I picked up from linux.conf.au — the zSeries SIG didn’t advertise its event as a miniconf, but as a convenient name for a “conference-in-a-conference” I’m using the term here.



What a difference a working resolver makes

The next phase in tidying up my user authentication environment in the lab was to enable SSL/TLS on the z/VM LDAP server I use for my Linux authentication (I’ll discuss the process on the DeveloperWorks blog, and put a link here).  Apart from being the right way to do things, LDAP authentication appears to require SSL or TLS in Fedora 15.

After I got the Fedora system working, I thought it would be a good idea to have other systems in the complex using SSL/TLS also.  The process was moderately painless on a SLES 10 system, but on the first SLES 11 system I went to, YaST froze while saving the changes.  I (foolishly) rebooted the image, and it hung during boot.  Not fun.

After a couple of attempts to fix up what I thought were the obvious problems (each attempt involving logging off the guest, connecting its disk to another guest, mounting the filesystem, making a change, unmounting and disconnecting, and re-IPLing) with no success, I went into /etc/nsswitch.conf and turned off LDAP for everything I could find.  This finally allowed the guest to complete its boot — but I had no LDAP now.  I did a test using ldapsearch, which reported it couldn’t reach the LDAP server.  I tried to ping the LDAP server by address, which worked.  I tried to look up the hostname of the LDAP server, and name resolution failed with the traditional “no servers could be reached” message.  This was odd, as I knew I’d changed it since it was pointing to the wrong DNS server before…  I could ping the DNS by address, and another system resolved fine.
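
The sequence of checks went roughly like this (hostnames and addresses here are made up):

ldapsearch -x -H ldap://ldapsrv.example.com -b "o=ibm" "(uid=viccross)"   # fails: "Can't contact LDAP server"
ping -c 3 10.1.1.20            # the LDAP server, by address: replies fine
host ldapsrv.example.com       # ";; connection timed out; no servers could be reached"
ping -c 3 10.1.1.1             # the DNS server, by address: also replies fine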

I thought it might have been a configuration problem — I had earlier had trouble with systems not being able to do recursive DNS lookups through my DNS server.  I went to YaST to configure the DNS Server, and it told me that I had to install the package “bind”.  WHAT?!?!?  How did the BIND package get uninstalled from the system…

Unless…  It’s the wrong system…

I checked /etc/resolv.conf on a working system and sure enough I had the IP address wrong.  I was pointing at a server that was NOT my DNS server.  Presumably the inability to resolve the name of the LDAP server I was trying to reach is what made the first attempt to enable TLS for LDAP fail in YaST, and whatever preload magic SLES uses to enable LDAP authentication got broken by the failure.  Setting the right DNS and re-running the LDAP Client module in YaST not only got LDAP authentication working but got me a bootable system again.

A simple fix in the end, but I’d forgotten the power of the resolver to cause untold and unpredictable havoc.  Now, pardon me while I lie in wait for the YaST-haters who will no doubt come out and sledge me…  🙂


RACF Native Authentication with z/VM

 In 2009 I was part of the team that produced the Redbook "Security for Linux on System z" (find it at http://www.redbooks.ibm.com/abstracts/sg247728.html).  Part of my contribution was a discussion about using the z/VM LDAP Server to provide Linux guests with a secure password authentication capability.  I probably went a little overboard with screenshots of phpLDAPadmin, but overall I think it was useful.

I’ve come back to implement some of what I’d put together then, and unfortunately found…  not errors as such, but things I perhaps could have discussed in a little more detail.  I’ve been using the z/VM LDAP Server on a couple of systems in my lab but had not enabled RACF.  I realised I needed to "eat my own cooking" though, so decided to implement RACF and enable the SDBM backend as well as switch to using Native Authentication in the LDBM backend.

Native Authentication provides a way for security administrators to present a standard RFC 2307 (or equivalent) directory structure to clients while at the same time taking advantage of RACF as a password or pass phrase store.  Have a look in our Redbook for more detail; basically, the usual schema is loaded into LDAP and records are created using the usual object classes like inetOrgPerson, but the records do not contain the userPassword attribute.  Instead of comparing a presented password against a field contained in LDAP, the z/VM LDAP Server (when Native Authentication is enabled) issues a RACROUTE call to RACF to have it check the password.

In my existing LDAP database, I had user records that were working quite successfully to authenticate logons to Linux.  My plan was simply to enable RACF, creating users in RACF with the same userid as the uid field in LDAP (I have access to a userid convention that fits RACF’s 8-character restriction, so no need to change it).  After going through the steps in the RACF program directory, and various follow-up tasks to make sure that various service machines would work correctly, I did the LDAP reconfiguration to get Native Authentication.

At this point I probably need to clarify my userid plan.  The documentation for Native Authentication in the TCP/IP Planning and Administration manual says that the LDAP server needs to be able to work out which RACF userid corresponds to the user record in LDAP to be able to validate the password.  It does this by either having the RACF userid explicitly specified using the ibm-nativeId attribute (the object class ibm-NativeAuthentication has to be added to the user object), or by matching the existing uid attribute with RACF.  This is what I hoped to be able to do; by using the same ID in RACF as I was already using in LDAP, I planned to not require the extra object class and attribute.  In the Redbook, because my RACF ID was different from the LDAP one I went straight to using the ibm-nativeId attribute and didn’t go back and test the uid method.
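
Concretely, a user record under Native Authentication ends up looking something like this (the DN, names and numbers are illustrative).  Note there is no userPassword attribute, and the RACF userid in ibm-nativeId is in uppercase:

dn: uid=viccross,ou=people,o=ibm,c=au
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: ibm-NativeAuthentication
uid: viccross
cn: Vic Cross
sn: Cross
uidNumber: 1001
gidNumber: 100
homeDirectory: /home/viccross
loginShell: /bin/bash
ibm-nativeId: VICCROSS

The ibm-NativeAuthentication class and ibm-nativeId attribute can be left off if the uid itself matches the RACF userid; that was my plan.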

So, I gave it a try.  I had to disable SSH public-key authentication so that my password would actually get used, and once I did that I found that I couldn’t log on.  It didn’t matter whether I tried with my password or pass phrase, neither was successful.  I read and re-read all the LDAP setup tasks and checked the setup, but it all looked fine.  In one of those "let’s just see" moments, I decided to see if it worked with the ibm-nativeId attribute specified in uppercase…  and it did!

Okay, so it appeared that the testing of uid against a RACF id was case-sensitive.  I decided to try creating a different ID, with an uppercase uid, in LDAP to double-check.  Since phpLDAPadmin wouldn’t let me create an uppercase version of my own userid (since that would be non-unique), I created a different LDAP id to test:

[viccross@laptop ~]$ ssh MAINT@zlinux1
Password:
Could not chdir to home directory /home/MAINT: No such file or directory
/usr/X11R6/bin/xauth:  error in locking authority file /home/MAINT/.Xauthority
MAINT@zlinux1:/>

My MAINT user in LDAP has no ibm-nativeId attribute, so the only operational difference is the uppercase uid (the error messages are caused by the LDAP userid not having a home directory; I use an NFS-shared home directory and I hadn’t bothered setting up the homedir for a test userid).

The final test was to change the contents of the ibm-nativeId attribute in my LDAP user record to lower-case — and it broke my login.  So that would seem to indicate that the user check against RACF is case sensitive wherever LDAP gets the userid from.  I’m going to have a look through documentation to see if there’s something I need to change, but this looks like something to be aware of when using Native Authentication.

I also noticed that I didn’t describe the LDAP Server SSL/TLS support in the Redbook, but that’s a post for another day…


OpenSSL speed revisited

 I realised I never came back and reported the results of my OpenSSL "speed" testing after our 2096 got upgraded.  For reference, here was the original chart, from when the system was sub-capacity:

[Chart: OpenSSL speed results by block size, taken while the 2096 was at sub-capacity setting F01]

… and the question was, does the CPACF run at the speed of the CP (i.e. it runs sub-capacity if the CP is sub-capacity) or does it run at full speed like an IFL, zIIP or zAAP?  If the latter, the result after the upgrade should be the same as before — that would indicate the speed of crypto operations does not change with the CP capacity, and that CPACF is always full speed.  If the former, we should see an improvement between pre- and post-upgrade, indicating that the speed of CPACF follows the speed of the CP.

Place your bets…  Okay, no more bets…  Here’s the chart:

[Chart: OpenSSL speed results by block size, before and after the capacity upgrade]
The graph compares the results from the first chart in blue (when the machine was at capacity setting F01) with the full-speed (capacity setting Z01) results in red.

Okay, so did you get it right?  If you know your z/Architecture you would have!  As the name suggests, the Central Processor Assist for Cryptographic Function (or CPACF) is pretty much an adjunct to each CP, just like any standard execution unit (like the floating point unit, say).  It is not like the Crypto Express cards, which are actually an I/O device and totally separate from the CP.  Because it is directly associated with each CP, the CPACF on a sub-capacity CP is bound to the speed of that CP.

If you look closer, further evidence that CPACF performance scales with the capacity setting can be seen in the respective growth rates of each set of data points.  To show this a little more clearly (I don’t know the right mathematical terms to describe the shape of the curve, so I’ll just show you) I drew a couple more graphs:

[Charts: left, the same results drawn as lines against block size; right, the “times-X” improvement of CPACF over software at each block size]

Looking at the left graph (which is the same as the bar graph above, just drawn as lines) you can see that in both the software and the CPACF case the lines for before and after the upgrade follow the same trend with respect to the block size.  If these lines followed different trends — for example if the Z01 CPACF line was flat across the block size range instead of a gently falling slope like the F01 line — I’d suspect something else was affecting the result.  Looked at a different way, the right-hand graph above shows the "times-X" improvement between software and CPACF.  You can see that the performance multiplier (i.e. the relative performance improvement between software and hardware; CPACF speed is 16x software at 8192-byte blocks) was the same before and after the upgrade at every block size.

Now, just to confuse things…  Although I’ve used OpenSSL on Linux as the testing platform for this experiment, most Linux customers will never see the effects I’ve demonstrated here.  Why?  Because Linux is usually run on IFLs, and the IFL always runs at full speed!  Even if there are sub-capacity CPs installed in a machine with IFLs, the IFLs run at full speed and so too does the CPACF associated with the IFLs.  I’ll say again: CPACF follows the speed of the associated CP, so if you’re running Linux on IFLs the CPACF on those IFLs will be full capacity just like the IFLs themselves.  If you have sub-capacity CPs for z/OS workload on the same machine as IFLs, the CPACF on the CPs will appear slower than CPACF on the IFLs.

As far as the actual peak number is concerned, it looks like a big number!  If I understand it right, 250MB/sec would be more than enough speed to have a server doing SSL/TLS traffic driving a Gigabit Ethernet at line speed (traffic over connected sessions, NOT the certificate exchange for connection establishment; the public key crypto for certificate verification takes more hardware than just CPACF, at least on the z9 anyway).  And that’s just one CP!  Enabling more CPs (or IFLs, of course) gives you that much more CPACF capacity again.  Keep in mind that these results are using hardware that is two generations old — I would expect z10 and z196 hardware to get higher results on any of these tests.  Regardless, these are not formal, official measurements and should not be treated as such — do NOT use any of these figures as input to system sizing estimates or other important business measurements!  Always engage IBM to work with you for sizing or performance evaluations.
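
For anyone wanting to run the same sort of comparison, OpenSSL’s built-in benchmark is all it takes.  The cipher here is just an example, and the hardware-assisted run assumes the openssl-ibmca engine (and libica) is installed and configured:

openssl speed -evp aes-128-cbc                  # software implementation
openssl speed -engine ibmca -evp aes-128-cbc    # via CPACF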


Another round of Gentoo fun

A little while back I did an “emerge system” on my VPS and didn’t think much more about it.  First time back to the box today to emerge something else, and I was greeted with this:

>>> Unpacking source…
>>> Unpacking traceroute-2.0.15.tar.gz to /var/tmp/portage/net-analyzer/traceroute-2.0.15/work
touch: setting times of `/var/tmp/portage/net-analyzer/traceroute-2.0.15/.unpacked': No such file or directory

…and the emerge error output.  Took me a little while to get the answer, but it was (of course) caused by a new version of something that came in with the system update.  This bug comment had the crude hack I needed to get back working again, but longer-term I obviously need to fix the mismatch between the version of linux-headers and the kernel version my VPS is using (it’s Xen on RHEL5).
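
Checking for that sort of mismatch is simple enough; something like:

uname -r                                              # the kernel the VPS is actually running
emerge --pretend --verbose sys-kernel/linux-headers   # the linux-headers version portage has/wants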


IPv6: SSDM?

Two of the four keynotes at LCA 2011 referenced the depletion of the IPv4 address space (and I reckon if I looked back through the other two I could find some reference in them as well).  I think there’s a good chance Geoff Huston was lobbying his APNIC colleagues to lodge the “final request” (for the two /8s that triggered the final allocation of the remaining 5, officially exhausting IANA) a week earlier than they did, as it would have made the message of his LCA keynote a bit stronger.  Not that it was a soft message: we went from Vint Cerf the day before, who said “I’m the guy who said that a 32-bit address would be enough, so, sorry ’bout that”, to Geoff Huston saying “Vint Cerf is a professional optimist.  I’m not.”.  But I digress…

I did a bit of playing with IPv6 over the years, but it was too early and too broken when I did (by “too broken” I refer to the immaturity of dual-stack implementations and the lack of anything actually reachable on the IPv6 net).  However, with the bell of IPv4 exhaustion tolling, I had another go.

Freenet6, which now goes alternatively by gogonet or gogo6, was my first port of call.  I had looked at Gogo6 most recently, and still had an account.  It was just a matter of deciding whether or not I needed to make a new account (hint: I did) and reconfiguring the gw6c process on my router box.  Easy-as, I had a tunnel — better still, my IPv6-capable systems on the LAN also had connectivity thanks to radvd.  From Firefox (and Safari, and Chrome) on the Mac I could score 10/10 on both tests at http://test-ipv6.com!

My joy was short-lived, however.  gw6c was proving to be about as stable as a one-legged tripod and, not only that, Gogo6 had changed the address range they allocated me.  That wouldn’t be too bad, except that all my IPv6-capable systems still had the old address and were trying to use it — it looks like IPv6 auto-configuration doesn’t un-configure an address that’s no longer valid (at least by default).  I started to look for possible alternatives.

Like many who’ve looked at IPv6 I had come across Hurricane Electric — in the countdown to IPv4 exhaustion I used their iOS app “ByeBye v4”.  They offer free v6-over-v4 tunneling, and the configuration in Gentoo is very simple.  I also get a static allocation of an IPv6 address range that I can see in the web interface.  The only downside I can see is that I had to nominate which of their locations I wanted to terminate my tunnel; they have no presence in Australia, the geographically-nearest location being Singapore.  I went for Los Angeles, thinking that would probably be closest network-wise.  The performance has been quite good, and it has been quite reliable (although I do need to set up some kind of monitoring over the link, since everything that can talk IPv6 is now doing so).
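
On a plain Linux router the tunnel itself boils down to a handful of iproute2 commands (Gentoo’s init scripts wrap the equivalent in /etc/conf.d/net).  The addresses below are documentation placeholders; the real values come straight off the tunnel details page:

ip tunnel add he-ipv6 mode sit remote 198.51.100.1 local 203.0.113.10 ttl 255
ip link set he-ipv6 up
ip addr add 2001:db8:1234:5678::2/64 dev he-ipv6
ip route add ::/0 dev he-ipv6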

In typical style, after I’d set up a stable tunnel and got everything working, I decided to learn more about what I’d done.  What is IPv6 anyways?  Is there substance to the anecdotes flying around that are saying that “every blade of grass on the planet can have an IPv6 address” and similar?  Well, a 128-bit address provides for an enormous range of addresses.  The ZFS guys are on the same track — ZFS uses 128-bit counters for blocks and inodes, and there have been ridiculous statements made about how much data could theoretically be stored in a filesystem that uses 128-bit block counters.  To quote the Hitchhiker’s Guide to the Galaxy:

Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.

The Guide, The Hitchhiker’s Guide To The Galaxy, Douglas Adams, Pan Books 1979

Substitute IPv6 (or ZFS) for space.  To try and put into context just how big the IPv6 address range is, let’s use an example: the smallest common subnetwork.

When IPv4 was first developed, there were three address classes, named, somewhat unimaginatively, A B and C.  Class A was all the networks from 1.x.x.x to 127.x.x.x, and each had about 16 million addresses.  Class B was all the networks from 128.0.x.x to 191.255.x.x, each network with 65 534 usable addresses.  Class C went from 192.0.0.x to 223.255.255.x, and each had 254 usable addresses.  Other areas, such as 0.x.x.x and the networks after 224.x.x.x, have been reserved.  So, in the early days, the smallest network of hosts you could have was a network of 254 hosts.  After a while IP introduced something called Classless Inter-Domain Routing (CIDR) which meant that the fixed boundaries of the classes were eliminated and it became possible to “subnet” or “supernet” networks — divide or combine the networks to make networks that were just the right size for the number of hosts in the network (and, with careful planning, could be grown or shrunk as plans changed).  With CIDR, since the size of the network was now variable, addresses had to be written with the subnet mask — a format known as “CIDR notation” came into use, where an address would have the number of bits written after the address like this: 192.168.1.42/24.

Fast-forward to today, with IPv6…  IPv4’s CIDR notation is used in IPv6 (mostly because the masks are so huge).  In IPv6, the smallest network that can be allocated is what is called a “/64”.  This means that out of the total 128-bit address range, 64 bits represent what network the address belongs to.  Let’s think about that for a second.  There are 32 bits in an IPv4 address — that means that the entire IPv4 Internet would fit in an IPv6 network with a /96 mask (128-32=96).  But the default smallest IPv6 subnet is /64 — the size of the existing IPv4 Internet squared!

Wait a second though, it gets better…  When I got my account with Gogo6, they offered me up to a /56 mask — that’s a range that covers 256 /64s, or 256 Internet-squareds!  Better still, the Hurricane Electric tunnel-broker account gave me one /64 and one /48: sixty-five thousand networks, each the size of the IPv4 Internet squared!  And how much did I pay for any of these allocations?  Nothing!

I can’t help but think that folks are repeating similar mistakes from the early days of IPv4.  A seemingly limitless address range (Vint said that 32 bits would be enough, right?) was given away in vast chunks.  In the early days of IPv4 we had networks with two or three hosts on them using up a Class C because of the limitations of addressing — in IPv6 we have LANs of maybe no more than a hundred or so machines taking up an entire /64 because of the way we designed auto-configuration.  IPv6 implementations now will be characterised not by how well their dual-stack implementations work, or how much more secure transactions have become thanks to the elimination of NAT, but by how much of the addressable range they are wasting.  So, is IPv6 just Same Sh*t, Different Millennium?

Like the early days of IPv4 though, things will surely change as IPv6 matures.  I guess I’m just hoping that the folks in charge are thinking about it, and not just high on the amount of space they have to play with now.  Because one day all those blades of grass will want their IP addresses, and the Internet had better be ready.

Update 16 May 2011: I just listened to Episode 297 of the Security Now program…  Steve Gibson relates some of his experience getting IPv6 allocation from his upstream providers (he says he got a /48).  In describing how much address space that is, he made the same point (about the “wasteful” allocation of IPv6).  At about 44:51, he starts talking about the current “sky is falling” attitude regarding IPv4, and states “you’d think, maybe they’d learn the lesson, and be a little more parsimonious with these IPs…”.  He goes on to give the impression that the 128-bit range of IPv6 is so big that there’s just no need to worry about it.  I hope you’re right, Steve!


Sharing an OSA port in Layer 2 mode

I posted on my developerWorks blog about an experience I had sharing an OSA port in Layer 2 mode.  Thrilling stuff.  What’s more thrilling is the context of where I had my OSA-port-sharing experience: my large-scale Linux on System z cloning experiment.  One of these days I’ll get around to writing that up.
