Among the coffee mugs in my cupboard at home is one I’ve had for over 20 years.  It was a gift; if I remember right, a semi-joke gift in an office “Secret Santa”.

"Works and plays well with others"

“Works and plays well with others”. O RLY?

The slogan on it reads “Works and plays well with others”, a reference to one of the standard phrases seen on children’s school report cards.  It’s one of the regulars in my hot beverage rotation, and every time I use it I can’t help but think back to when it was new, and of how much has changed since those days.

It’s easy to treat a silly slogan on a coffee mug as little more than a few words designed to evoke a wry grin from a slightly antisocial co-worker.  Sometimes, though, it can take on a deeper meaning, if you let it.

For the last six months or more I’ve been working on transferring the function of our former demonstration facility in Brisbane to a location in Melbourne.  This has been fraught with problems and delays, not the least of which was an intermittent fault in the network our systems connect through.  Steady-state, things would be fine; I could have an IRC client connected to a server in our subnet for days at a time.  But as soon as I actually tried to do anything else (SSH, HTTP, etc.), within about five minutes all traffic to the subnet would stop for a few minutes.  When traffic passed again, it would stay up for five or so minutes, then fail.  Wash, rinse, repeat.

It looked like the problem you get when Path MTU Discovery (PMTUD) doesn’t work and you have an MTU mismatch[1].  I realised that we had a 1000BaseT network connected to a 100BaseT switch port, so I went around all my systems and backed out the jumbo frame settings wherever I had been trying to use them, but that made no difference to the network dropouts.  I found Cisco references to problems with ARP caches filling, but I couldn’t imagine that the network was so big that MAC address learning would be a problem (and if general MAC learning was constrained, why was no-one else having a problem?).
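A quick aside for anyone chasing a similar fault: the easiest test I know of for a path MTU problem from a Linux box is a don’t-fragment ping.  The host name below is just a placeholder:

# 1472 bytes of ICMP payload + 28 bytes of IP/ICMP headers = a 1500-byte packet;
# if this fails where a smaller size succeeds, something in the path has a smaller MTU
ping -M do -s 1472 host.example.com

If the full-sized ping dies while a smaller one gets through, broken PMTUD is a fair suspect.  In our case, as it turned out, it wasn’t.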

Everything I tried was drawing blanks.  I approached the folks who run the network we uplink through, and all they said was “our network is fine”.  I was putting up with the problem, thinking that it was just something I was doing and that in time we would change over to a different uplink and wouldn’t have to worry any more.  My frustration at having to move everything out of the wonderful environment we had in Brisbane down to Melbourne, with its non-functional network, multiplied every time an SSH connection failed.  I actually started to rationalise that it was pointless to continue with setting up the facility in Melbourne; I’d never be able to re-create what I’d built in Brisbane, it would never be as accessible and useful, and besides, no-one other than me had ever made good use of the z Systems gear in the Brisbane lab anyway.  Basically, I had lost confidence in myself and my ability to get the network fixed and the Melbourne lab set up.

Confidence is a mental strength, like the muscles that provide our physical strength.  Just like muscle, confidence grows with active use and wastes away if underused.  Chemicals can boost it, and trauma can damage it.  Importantly though, confidence can be a huge barrier to a person’s ability to “work and play well with others” — too little and you lack conviction and decisiveness; too much and you come across as overbearing and dictatorial.

Last week I was in Singapore for the z/VM, Linux on z, and KVM “T3” event.  Whenever I go to something like this I get fired up about all the things I’d like to work on and have running to demo.  The motivation to get new things going in the lab overcame my pessimism about the network connection (and my lack of confidence), and I got in touch with the intern in charge of the network we connect through.  All I needed, I said, was a look at the configuration of the port we connect into.  We agreed to get together when I returned from Singapore and try to work out the problem.

We got into the meeting, and I went over the problem in terms of how we experience it — a steady state that could last for days, then activity leading to three-minute lockouts.  I asked if I could see the configuration of the port we attached to… after a little discussion about which switch and port we might be on, a few lines of Cisco IOS configuration statements appeared in our chat session.  Straight away I saw:

switchport port-security

W. T. F.

Within a few minutes I had Googled what this meant.  Sure enough, it told the switch to monitor the active MAC addresses on that port and disable the port if “unknown” MACs appeared.  There were no configured MACs, so it just remembered the first one it saw.  That explained why I could have a session running to one system (the IRC server) for ages, yet as soon as I connected to something else everything stopped — the default violation mode is “shutdown”.  It also explained why the traffic would stay down for three minutes and then begin again — elsewhere in the switch configuration was this:

errdisable recovery cause psecure-violation
errdisable recovery interval 180

If the switch disabled a port due to a port-security violation, it would automatically re-enable it after 180 seconds.
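Putting the pieces together: if I’ve read the Cisco documentation correctly, that bare port-security line picks up the defaults, so the effective configuration was something like this (my reconstruction, not a copy of the real config):

switchport port-security
switchport port-security maximum 1
switchport port-security violation shutdown

One dynamically-learned MAC address allowed on the port; a frame from any second MAC errdisables the port, and the recovery setting brings it back three minutes later.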

The guys didn’t really understand what this all meant, but it made sense to me.  Encouraged by my confidence that this was indeed the problem, they gave me the passwords to log on to the switch and do what I thought was needed to remove the setting.  A couple of “no” commands later and it was gone… and our network link has functioned perfectly ever since.
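For anyone who finds the same thing lurking on their switch port, the removal goes something like this from IOS configuration mode (the interface name here is invented for the example):

configure terminal
interface GigabitEthernet0/1
no switchport port-security
end

A gentler option, if the switch owners insist on keeping the feature, would be to raise the limit instead with something like “switchport port-security maximum 64”.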

The real mystery for the other network guys was: why had this suddenly become a problem?  None of them had changed the network port definition, so as far as anyone knew the port had always been configured with Port Security.  The answer to this question is, in fact, on our side.

To z/VM and Linux on z Systems people, networks come in two modes: “Layer 3” or “IP” mode, where the system only deals with IP addresses, and “Layer 2” or “Ethernet” mode, where the system works with MAC addresses.  In Layer 3 mode, all the separate IP addresses that exist within Linux and z/VM systems sit behind the single MAC address of the mainframe OSA card.  In Layer 2 mode, however, each individual Linux guest or z/VM stack gets its own MAC address.  When we first set up this network link and the z/VM and Linux systems there, the default operating mode was Layer 3, so the network switch only ever saw one or two MAC addresses.  Nowadays though the default mode is Layer 2, and when I built new systems for moving everything down from Brisbane, I built them in Layer 2 mode.  Suddenly the network switch port was seeing dozens of different MAC addresses where it used to see only one or two, and Port Security was being triggered constantly.
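If you’re ever unsure which mode a Linux on z interface is running in, the qeth driver exposes it through sysfs.  The device address below is just an example:

# 1 = Layer 2 (Ethernet) mode, 0 = Layer 3 (IP) mode
cat /sys/bus/ccwgroup/devices/0.0.0600/layer2

On the z/VM side, the transport is chosen when the virtual switch is defined, via the ETHERNET or IP operand on DEFINE VSWITCH.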

This has been a learning experience for me.  Usually I don’t have any trouble pointing out where I think a problem exists and how it needs to be fixed.  Deep down I knew the issue was outside our immediate network, and yet this time, for some reason, I lacked the ability, motivation, nerve, or whatever, to chase the folks responsible and get it fixed.  The prospect of trying to work with a group of guys who, based on their previous comments, were convinced that their gear was not the problem was so daunting that it became easier to think of reasons not to bother.  Maybe it was because I didn’t know for certain that it wasn’t something on our side — there is another problem in our network that definitely is in our gear — so I kept looking for a problem on our side that wasn’t there.

For want of a 15-minute phone meeting, we had endured months of a flaky network connection.

On this occasion it took me too long to become sufficiently confident to turn my thoughts into actions.  Once I got into action, though, it was the confidence I displayed to the network team that got the problem fixed.  For me, lesson learned: sometimes I need a prod, but I am one who “works and plays well with others”.


[1] I get that all the time when using the Linux OpenVPN client to connect to this lab, and got into the habit of changing the MTU manually.  Tunnelblick on the Mac doesn’t suffer the same problem, because it has a clever MTU monitoring feature that keeps things going.
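The manual change is nothing more exotic than something like this (the interface name and MTU value will depend on your tunnel):

# tun0 and 1300 are examples; drop the MTU until the stalls go away
sudo ip link set dev tun0 mtu 1300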
