Category Archives: Human Factors

Gaining Perspective

by Scott Kantner, January 24th, 2011 in Disaster Recovery, Human Factors

Two people stand on opposite sides of a river.

First person (shouting across the river): How do I get to the other side of the river?
Second person: You are on the other side.

Clearly, perspective is key to proper understanding and subsequent action.

Applied to availability and disaster recovery, perspective is no less critical. Consider that disasters come in various flavors:

1. Catastrophic hardware failure (e.g. cooling system fails, resulting in cooked gear).
2. Network outage.
3. Power outage.
4. Fire or flood damage to your primary facility.
5. Bomb scare, leading to building evacuation.
6. Pandemic.

All of these represent real trouble, and how you view them will determine how well (or if) you can recover from them.

Most folks I’ve talked with aren’t planning to recover from a pandemic, but they are very worried about fire and flood and have plans in place to handle those. Less appreciated are brief but total power outages and extreme network latency. From a certain perspective, however, such as your customer’s, these could be just as critical.

From what perspective are you planning?

//spk

Alright Then

by Scott Kantner, January 10th, 2011 in Human Factors

This has nothing to do with our UPS project I mentioned last time, other than that it’ll be a great stress reliever if things get tense. Thanks to Michael Hyatt for pointing this one out.

//spk

An iMusic Christmas

by Scott Kantner, December 11th, 2010 in Human Factors

Ok, so this really has nothing to do with infrastructure, but it’s simply too awesome not to share… Many thanks to my friend JasonD.

Merry Christmas!

//spk

The Cost of Things Not Seen

by Scott Kantner, September 10th, 2010 in Data Center, Human Factors

With the cast and crew of Extreme Home Makeover in the neighborhood last month, literally less than 1/2 mile away, I was motivated to take stock of my own home makeover projects. It’s funny (if not frightening) how your infrastructure can sometimes be falling apart right in front of your nose and yet go unnoticed. And when you do finally notice, the damage is often deeper than meets the eye. With Ty Pennington barking through a bullhorn off in the distance, I quickly found my priority #1 infrastructure project:

After 18 years, my self-built 15×15 storage shed needed some love. A little diagnosis with a wrecking bar revealed that more than just a hinge replacement was in the offing. Besides what you see in the picture, the sheathing beneath the door frame was rotted away, as was the lower portion of the door itself. The ensuing project was an 8-hour effort in 96-degree heat to rebuild the door, restore the door frame, rehang the door, and caulk and paint the whole affair.

Dang it Jim, I’m an IT guy, not a carpenter

I’m always looking for a reason to use my Sawzall, but I did feel a bit like Dr. McCoy on this project. The patient was almost dead when I arrived on the scene, and extreme measures were required. It was way more work than it would have been had I just been a little more diligent about what I had been observing over the last few years. The door didn’t rot overnight - it happened slowly right before my eyes.

Why not take a walk through your IT shop today and use your monitoring tools to see what’s going on? Maybe take a closer look at the problems you’ve known about for a while, but haven’t done anything about because they didn’t look “too bad.” How do they look now? Do you see anything that needs attention? Is there anything that looks like it could be more evil than meets the eye? The longer you wait, the more painful fixing it is likely to be. If you need someone to help do a little diagnosis with a wrecking bar, our professional services folks can be of help. And if you don’t want to maintain your own shed anymore, we’ll gladly arrange for space in ours.

//spk

Great Expectations

by Scott Kantner, July 9th, 2010 in Data Center, Human Factors

What could possibly be more fun than standing outside in the 100+ degree heat here in southeastern PA?  Standing outside in 90+ degree heat while waiting in line at Disney World.  At least there’s the promise of something fun, and possibly something cool and wet at the end of the wait.

While recently wading through the sea of humanity and waiting in some of the infernal lines that define Disney at this time of year, I was struck by an interesting IT analogy early in the week (yes, I really did need a vacation, and by the end of the week I wasn’t thinking about IT at all).

In late June and early July, the number of baby strollers per square foot in the Magic Kingdom increases to approximately 10x the normal rate. This forces one to put up with... er, gives one many opportunities to observe other people’s children under extreme conditions. It’s amazing to watch parents expect their 5-year-olds to behave like perfect angels in subtropical queue lines for upwards of 45 minutes, or sprint from one end of a park to another on tiny, tired little legs to score a Toy Story Fast-Pass before they’re all gone. It’s amazing because you can tell these kids are perfect hellions under ideal conditions as well. Putting them under stress only intensifies the problems that already exist.

Likewise, if you’ve got poorly designed or neglected infrastructure, simply moving it to a colo facility isn’t going to improve uptime or performance significantly, if at all. Certainly you can improve environmentals, save capex, and get lower network latency with a colo move, but if application response time and reliability are sucking wind before the move because of bad design or sysadmin neglect, not much is going to change.

My point isn’t that you should avoid putting your infrastructure in a better home if you need to, but that you shouldn’t expect it to behave any differently just because you moved it. Moreover, move time is not the time to make drastic changes to your production systems. It’s not a “free” outage window. The more changes you make during a move, the higher the risk of a failed, or at minimum a very stressful, move.

On the other hand, a move can be an ideal time to upgrade to better hardware and legitimately raise your expectations. For example, you can set up new hardware next to your old, cluster the two, and then move the new half of the cluster to a better home while the old half continues to run the business. After you complete the move and let the two halves resynchronize, you can turn down the old half and all activity will automatically switch over to the new hardware. Your users will never feel a thing. Very little pain, but very much gain.

Of course that all sounds good, and there are a lot of details involved in making it happen, but that’s what we do best. If you’re interested in smoothly moving your critical IT gear to a new home and need some experienced help to get it done, give us a call. Hardware prone to temper tantrums is one of our specialties.

//spk

Keep The Change

by Scott Kantner, June 23rd, 2010 in Human Factors

Does this sound like your IT shop? Reports from the Uptime Institute consistently show that the majority of reliability and uptime woes aren’t caused by hardware, facilities, or utility failure – they’re caused by humans. And what, pray tell, are those humans doing? They’re changing things, and too much of that change isn’t planned, approved, or documented. Or there is simply too much change going on at one time.

Much like a bomb is meant to explode, technicians are meant to be technical, so it’s a bit unrealistic to assume they’re giving a lot of thought to managing change, much less that they’re fond of doing so. They just want to git ‘er done, and in large part, we pay them well to not only do that, but to do it right the first time. Hard core techies, the ones that really know how to make things work, typically aren’t also wired for sitting in management meetings. The problem with managing change is that it’s boring. It’s not technical. And explaining highly technical things to non-technical folks in a change management meeting is not always the average techie’s strong suit, nor perhaps the best use of their time. To the contrary, it can be a very frustrating experience for them, which can lead them down the Dark Side of making changes beneath the radar. Effective change management therefore becomes a bit of a balancing act. We need to know what’s going on, but we don’t want to bog everyone down in the process.

In our data center, controlling change is not optional. Reliability demands it, as do the Spanish Inquisition... er, SAS 70 auditors. But we’ve found a way to manage it without terribly burdening our technical staff. Change requests may be formally entered in the system by any authorized individual, whether or not they are technical; they are simply the person requesting the change. The request is then routed to a technician, who assesses what needs to be done, adds those details to the request, and suggests when it might be done. It’s then passed on to someone in management who can assess the risk and approve or disapprove it. If a change is of major significance, the request comes before a Change Advisory Board (CAB) for final approval. Technicians, while welcome, are not required to attend CAB meetings. When requests are properly documented, the CAB is almost always able to make a good decision without further involving the technical staff. When the CAB does need more information or defers a request for some reason (e.g. too many changes on one night), the technician in question is notified and it’s handled outside of a meeting. This saves time, money, and mental fatigue. Since the pain threshold is relatively low, this method also encourages all change activity to actually be run through the proper channels.
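For the techies who would rather read code than a process document, the routing described above can be sketched in a few lines of Python. This is only an illustration, not our actual system: the class, function, and status names, and the `major` flag that sends a request on to the CAB, are all made up for the example. CAB deferrals are left out to keep it short.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    ASSESSED = "assessed"
    PENDING_CAB = "pending CAB"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    requester: str                # anyone authorized, technical or not
    summary: str
    major: bool = False           # hypothetical flag: True routes to the CAB
    details: str = ""
    suggested_window: str = ""
    status: Status = Status.REQUESTED
    history: list = field(default_factory=list)

    def _move(self, status, actor):
        # Record every transition so there is a trail to show the auditors.
        self.status = status
        self.history.append((status, actor))


def assess(req, technician, details, suggested_window):
    """Technician adds what needs to be done and suggests a time."""
    if req.status is not Status.REQUESTED:
        raise ValueError("only a new request can be assessed")
    req.details = details
    req.suggested_window = suggested_window
    req._move(Status.ASSESSED, technician)


def management_review(req, manager, approve):
    """Manager weighs the risk; approved major changes go on to the CAB."""
    if req.status is not Status.ASSESSED:
        raise ValueError("request must be assessed before review")
    if not approve:
        req._move(Status.REJECTED, manager)
    elif req.major:
        req._move(Status.PENDING_CAB, manager)
    else:
        req._move(Status.APPROVED, manager)


def cab_decision(req, approve):
    """Final say on major changes; no technician has to be in the room."""
    if req.status is not Status.PENDING_CAB:
        raise ValueError("only major, manager-approved requests reach the CAB")
    req._move(Status.APPROVED if approve else Status.REJECTED, "CAB")
```

The point of the sketch is that the technician touches the request exactly once, in `assess`, and everything after that happens without pulling them into a meeting.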

Our process is capable of handling very high rates of change, but that doesn’t mean that we do so. On the contrary, we try to minimize the rate of change, batching things together when it makes sense to minimize outages, and spreading them out when the risk is high to maximize uptime.
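As a rough illustration of that batch-or-spread idea, here is a minimal Python sketch. It assumes each change is tagged with the system it touches and a simple high-risk yes/no, which is a simplification invented for the example; real scheduling involves a good deal more judgment than this.

```python
from collections import defaultdict


def plan_windows(changes):
    """Group changes into maintenance windows.

    changes: list of (name, system, high_risk) tuples (hypothetical shape).
    Returns a list of windows, each a list of change names.
    """
    windows = []
    batches = defaultdict(list)   # low-risk changes, grouped per system

    for name, system, high_risk in changes:
        if high_risk:
            # Spread out: a risky change gets a window all to itself.
            windows.append([name])
        else:
            # Batch: low-risk changes to one system share a single outage.
            batches[system].append(name)

    windows.extend(batches.values())
    return windows
```

Two low-risk changes to the same box end up sharing one window, while each high-risk change stands alone so that a failure is easy to pin down and back out.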

Managing change is not fun, and you may be justifiably weary of it. Let us take that burden off your shoulders.

//spk