Category Archives: Human Error

Who’s Afraid Of The Big Bad Wolf?

by Scott Kantner, April 20th, 2009 in Cloud Computing, Data Center, Hosting, Human Error, Network Infrastructure, Servers, Storage, Uncategorized


News of Cisco’s intent to enter the server market with its Unified Computing System (UCS) offering has set the industry pundits’ hair ablaze. “How will IBM & HP respond?”, “How much market share will be lost to Cisco?”, “Do you want a plumber building your servers?” and on it goes. The FUD truly has been flying. You would think the Big Bad Wolf had just come back to Grandma’s house.

So, what does the announcement of UCS mean to us here in the non-rarefied air of business computing? Will it help us run our shops better?

Listen to Cisco CEO Chambers closely…

We look at this as bringing virtualization to life…unleashing the power of virtualization.   We go about it catching market transitions and trying to set timing, first in the data center, but make no mistake about it [UCS will make it] all the way in the home… [emphasis added]

 

What market transitions, pray tell, is he referring to? Could it be anything other than the transition to utility-based computing? It’s fairly clear he’s not talking about our server rooms and data centers. No, it would seem Cisco has its sights on something much larger. Chambers’s message is unmistakable. If the coming world of utility-based computing were to be compared to The Matrix, Cisco would not be content with simply supplying the network plumbing – they want to be the Matrix itself. Having already tucked away the network, we now see a move into processors. Can storage be far behind? Perhaps the Big Bad Wolf already has that in the oven.

On the surface, UCS doesn’t seem intended for the typical IT shop, but let’s assume otherwise for a moment. Is there a compelling reason for us to consider (or fear) UCS? What would make us willing to try a brand new brand?

In many ways, owning server hardware is a lot like owning a vehicle. First, you make your purchase based on size, looks, performance, the features you need, reliability, serviceability, and of course the price. Sometimes you’re looking to save gas (power), but not always. Maybe you decide to lease it. If you end up with a lemon, you know that very early in the game, and you get the vehicle fixed or replaced under warranty. From that point on, if you put in decent gasoline (clean UPS power), do regular maintenance (clean the fan grids, do disk defrags), and operate it within its design limits (proper cooling), it will run well for a long time.   When it wears out, or after you simply get tired of it and want something new and sexy, you buy a new one, sell or trade the old one, or possibly keep it and run it until the wheels fall off.

In the final analysis, whether you buy Chevy, Ford, Chrysler, or a brand you’ve never tried before really doesn’t matter. You go through the same decision process and ultimately you buy what you like or what you feel comfortable with.  The care, maintenance, and disposal process is the same no matter what you buy. And statistically, the reliability is pretty much the same across the board, despite the religious fervor that surrounds each brand. They all run well on balance, and they all have an occasional breakdown. For every hardware horror story out there, there are scores of identical hardware instances that run their entire lifetimes without a glitch.

Of course, if you absolutely must be the first kid on the block with a new hardware vendor, your mileage may vary.

Early UCS adopters on the phone with Cisco Tech Support


For most of us, UCS is not going to help with the primary purpose of our infrastructure.  So what does make a difference in how well our business systems stay up and running?

If you put a good driver (software) behind the wheel of your vehicle, you can be confident it will stay on the road doing what you intend it to do.  If you put an unskilled, abusive or reckless driver behind the wheel, you can expect more mechanical breakdowns (minor outages), accidents (major outages), or worse (disaster declaration).

I resisted naming operating system names above, but ask yourself, when was the last time you had downtime because an operating system or application went off into the weeds? Do you schedule weekly or nightly reboots “just for good measure” because you can’t trust things to stay healthy? It is an alarmingly common practice in our client base.

There’s a Red Hat 7.2 system that’s been hosting workload here for years that only comes down when we take it down to replace or upgrade the hardware. We have a farm of VMware ESX servers that behave just as well. Yet we also have a number of Win32 servers running on the same hardware for which I can’t say the same.

It’s not the hardware.

Lemons notwithstanding, the brand of hardware, be it IBM, HP, Dell, and now ostensibly Cisco, really is not the key factor in maintaining uptime. In this day of clusters-everywhere and RAID-everything, it’s typically not the hardware that takes you down – it’s unreliable software, change, or human error.

As for UCS, it doesn’t look like the Big Bad Wolf is coming to our house anytime soon, but it is a good idea to keep a watchful eye on where he is going.  Cisco has cold hard cash and a big vision, but that vision seems cast for The Matrix, not our server rooms.


Buy what you’re comfortable with and put the right driver behind the wheel, or better yet, let us worry about that for you.

 

 

Floppies in the Data Center

by Scott Kantner, March 26th, 2009 in Data Center, Human Error

Reports of the floppy’s death are greatly exaggerated …


Does this message strike fear into your heart?

Normally this sends our team into crash-cart mode as we assume the worst – hard drive failure on the boot drive. Not a Good Thing on a Friday morning. Imagine our surprise and disgust when the culprit was found to be unexpected media in the server’s floppy drive.

The natural inclination of our security guy was to run to the DVR to find out who the perp was, but the real bottom line was the downtime. The box, a non-production server which had been rebooted during routine patching activity, was offline for 7 hours and 10 minutes. Had this been a production server, the monthly SLA would have been shot and someone on the data center team would have had some ‘splainin to do.
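To put a rough number on "shot", here is a back-of-the-envelope sketch in Python. The 30-day month and the 99.9% monthly availability target are my assumptions for illustration, not figures from any actual SLA, and the function names are made up for this example.

```python
# Assumptions (not from the post): a 30-day month and a
# hypothetical 99.9% monthly availability target.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Percentage of the period during which the system was up."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def allowed_downtime_minutes(target_pct, period_minutes=MINUTES_PER_MONTH):
    """Downtime budget, in minutes, implied by an availability target."""
    return period_minutes * (1 - target_pct / 100.0)


# The 7 h 10 min outage described above, in minutes.
outage = 7 * 60 + 10  # 430

print(f"availability for the month: {availability_pct(outage):.3f}%")   # 99.005%
print(f"budget at 99.9%: {allowed_downtime_minutes(99.9):.1f} min")     # 43.2 min
```

Under those assumptions, one forgotten floppy burns roughly ten times the entire month's downtime budget.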

It might be a good time to review good floppy and USB drive hygiene with your troops. We wouldn’t want anyone to get shot!

We often see this pie chart showing the causes of system (not data center) downtime. The colors of the charts change, but the percentages are basically always the same:


One of the slices is always “human error”, but we never get a breakdown of how those errors cash out in terms of preventable vs. non-preventable (or “dumb” and “not-so-dumb”, if you like) things that people did. Surely there’s got to be a Great Many Things in there – oh, say, like… floppies left in a disk drive – that we can work on to drive that 32% number down.

So for starters, proper floppy and USB drive hygiene is clearly on our list. I’ll explore some more of these human factors with you soon.

//spk