Keeping on the analogy theme…  This time, it’s an explanation of Simultaneous Multi-Threading (SMT).  SMT was introduced to the z Systems architecture with the z13, and many technical specialists (myself included) have struggled with the standard explanations of how SMT is meant to improve overall system performance.  Here’s yet another attempt at an explanation!  Some folks might be a bit affronted at the “compare and contrast” of z Systems and a fast food drive-through, but it’s just an analogy…

So in Brisbane just about every McDonalds has a drive-through.  They used to have a single lane, with the menu boards and a speaker for the operator inside the restaurant to take your order.  As the customer, once you placed your order you then would drive forward to the “first window” where the cashier would take payment, then you’d drive to the “next window” to receive your order and proceed away.  Apologies to anyone offended by me feeling the need to explain how a drive-through works, but I don’t know how they work in your part of the world so I’m just covering how they work in mine.

Many of these drive-throughs have been redeveloped to add a second lane and order taking station — complete with a second set of menu boards.  They didn’t duplicate anything else in the process though: same payment and collection windows, even in most cases a single cashier taking orders alternately from both stations.

A dual-lane McDonalds drive thru

A dual-lane McDonalds drive thru, AKA CPUs 0x00 and 0x01

Why did McDonalds do this?  Without duplicating anything else in the whole chain, what benefit does adding a queue provide?  If two cars arrive at the stations at the same time there’s going to be contention for the cashier.  They then have contention to enter the single lane going past the windows.  Not only that, the restaurant had to give up physical space to install the second station — perhaps they lost a few parking spaces, or a garden.

had passed through this kind of drive-through a few times, and never clearly saw the benefit.  Sometimes I’d drive up to the station with the shortest queue, only to be stuck behind someone who seemed to be ordering one of everything from the menu…  Other times I’d pull up to an empty station, next to the only other car in the system (at the other station), but because the car at the occupied station was already placing their order I still had to wait the same amount of time as I would have in a single-lane system.

Then I finally realised.  The multiple queues aren’t for me as a customer — they’re for the restaurant.  Specifically, they’re for the food production stations in back-of-house.  To understand why it makes sense to have two lanes, it’s critical to realise that the drive-through is not just the speakers and the lane and the windows, it’s the method by which instructions are given to the many and various individuals that make the food and package the orders.  Each of those individuals has their own role and contribution to the overall task of serving the customer; from the grillers to the fryers to the wrappers to the packers (sorry, I’ll bet there’s McDonalds formal names for each of the team member roles but I don’t know them).

Having multiple order stations means that the orders get to the burger makers and packers faster, making them more efficient and improving their throughput.  The beverage orders go to the little automatic drink-pouring machine instantly, so that everyone’s Cokes and Fantas are ready and waiting sooner.  One car wants a Chicken McWrap, the next just wants a McFlurry?  No contention there, those orders can be getting made at the same time.

Maybe you’re asking “so what does this have to do with SMT?”  Well, the order stations are our threads.  The cashiers and the packers are the fetch-and-store units, the parts of the processor that fetch instructions from memory and store the results back.  The cashier’s terminal is the instruction decode unit.  The food preparers in the back-of-house, they are the processor execution units; the integer units, the DFP and BFP, the SIMD unit, the CPACF, and more — that’s where the real work is done.  To a large extent all of those execution units operate independently, just like our McD food preparers.  SMT, like our two drive-through lanes, makes sure that all those execution units are as busy as possible.  One thread issues an integer add instruction, the other thread is doing a crypto hash using the CPACF?  They can be happening simultaneously.

We’ve been saying all along that SMT will likely decrease the perceived speed of an individual unit of work, but overall more work will get done across all units of work.  When I’ve been in a two-lane drive-through and placed my order, and then had to wait while I merged with the cars coming from the other lane, I have to agree that it seemed like the merging delayed me.  However, if that had been a single-lane drive-through, chances are I would have been in a longer queue of cars before even reaching the order station, and that metric isn’t even measured by the queue management built into McDonalds’ terminals.  Likewise, on a busy system without SMT, it’s difficult to say how long instructions are getting queued in the operating system scheduler before even making it to the processor to be dispatched.  Basically, I’m saying that we may see OS scheduler queuing reduce, and therefore improved “performance” at the OS level over and above the actual benefit of improved processor throughput, even if our SMT ratio doesn’t get anywhere near the impossible 2:1.

If ten cars line up at the two windows and they all want a Big Mac and a Hot Apple Pie then there’s probably not going to be much gain there.  Today’s McDonalds menu is quite diverse though, which means the chances of orders having “non-intersecting overlaps” are greatly improved.  On z Systems, ensuring a variety of workloads and transaction types would help to ensure a diversity in the instruction stream that would give SMT a good opportunity to yield benefit.  This means mixing applications as much as possible, and using Large Memory and CPU Pooling support in z/VM 6.3 to bring lots of different workloads into heterogenous LPARs.

I’ll bet that McDonalds worked out that simply adding an extra entry lane meant that they can move more food items in a given time — and McDonalds business is to sell food on a massive scale.  In the same way, the goal of z Systems has never been to make one single workload run as fast as possible, but to make the hundreds or thousands of workloads across an enterprise run at highest efficiency.