Category Archives: Data Center

Flare Up

by Scott Kantner, June 2nd, 2011 in Data Center, Disaster Recovery

Disclaimer: The pictures you are about to see are real. Server names and rack locations have been changed to protect the innocent. No software was harmed in the making of this blog post.

Last week we discovered what happens when copper goes to 1,900 degrees Fahrenheit inside a blade chassis. As you can see, it’s not pretty. I haven’t figured out how to represent the smell in a blog post, but trust me, that wasn’t pretty either. What’s truly amazing, however, is that the whole situation wasn’t much worse.

What you’re looking at is the result of an electrical failure inside an IBM BladeCenter H filled with twelve blade servers. It’s still unclear whether the mid-plane or a blade server was the initial cause, but both were victims of a rather vicious internal short that sent an ominous cloud of smoke into the air.

Because we are a 24/7 manned facility, our staff was on top of the situation immediately and no other customer equipment was ever in danger, but it raised blood pressure levels nonetheless. What we marveled at the most was that as bad as it initially looked, the damage was essentially contained to just two components, and the customer was back up and running again within a matter of hours.

The fact that recovery happened as quickly as it did was not an accident of good fortune. This type of event (though not exactly this scenario of course) was planned for in advance by using quality hardware and good system architecture. There are several key take-aways worth noting:

1. The four power supplies inside the BladeCenter shut down their output as soon as the overload was detected, thereby preventing further damage and a possible catastrophic event. Score 1 for IBM.

2. IBM was onsite PDQ with the correct parts and someone who knew how to replace them. Score 1 more for IBM.

3. Despite the intense heat, the rest of the BladeCenter did not spontaneously combust; in fact, all of the adjacent blade servers, management modules, and network modules survived unscathed. Score yet another for IBM.

4. The customer had the wisdom, foresight, and willingness to invest in a boot-from-SAN architecture with adequate redundancy. As soon as IBM replaced the parts, return to operation occurred without the need for a restore. Smoke and drama, yes. Data loss and extended down time, no. Score 1 for the customer and the DSS design team. Note that perfect redundancy in every component was not necessary to accomplish a happy ending – just reasonably adequate redundancy.

It’s become fashionable to say that hardware is commodity and that cost is the bottom line. But is that really true? Perhaps when buying client-side hardware it’s justifiable, but I’m not ready to extend that to the business-critical server side. This event reinforces that view rather dramatically. Well-designed hardware and effective vendor support, which are rarely the cheapest option, were clearly among the keys to saving the day.

Based on this post alone, one might be led to think I bleed IBM-blue when cut, but those who know me will ROTFL and tell you how often I’m calling IBM out rather than giving them kudos. IBM gear is rarely the low-cost option, but this incident serves as a wake-up call that there is more to be considered than cost alone. Regardless of vendor, components fail, and it’s important to know that the hardware is designed to compensate when they do.

Good hardware and good system architecture go hand in hand when reliable operations (and sleeping at night) are a must. There simply are no short cuts.

//spk

Something Old, Something New, Something Borrowed, Something Blue

by Scott Kantner, January 4th, 2011 in Data Center

2010 was good to our data center business. So much so that we found ourselves starting a UPS expansion project just before Christmas, and that means an opportunity to share with you what’s involved in adding power to a data center.

Something Old

We started our data center business in 2007 using a 15,000sf raised floor that had been in mothballs for several years. Built in the late 80’s as part of a bank operations center, by today’s standards it would have fallen somewhere between Tier II and Tier III.

At 43 watts/sf, it had a varying degree of redundancy in the power and cooling infrastructure, being N in some places and N+1 in others, and was state-of-the-art for its day (at least for non-military installations). Then after various merger activities in the mid 90’s, the room was eventually taken out of service. In its place came a smaller 3,000sf area built to accommodate 153 watts/sf. This room had better redundancy, reaching a minimum of N+1 in almost every area and even 2N in some systems, moving things further across the continuum toward Tier III.

During this conversion to a smaller, higher density space, the original pair of UPS modules were decommissioned (but left in place) and upstaged by a pair of new modules. Interestingly, the original modules had been paralleled together “for capacity” at 650kW, meaning that if one of the modules failed and the load exceeded the capacity of the remaining unit, something in the data center was going to suffer. The philosophy for the two new 337kW modules was a bit more uptime-conscious, calling for the units to be paralleled “for redundancy” and for the data center loads to be fed such that the loss of one UPS module would not result in any load being dropped.
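To make the “for capacity” versus “for redundancy” distinction concrete, here is a minimal Python sketch. The 337kW ratings and the 650kW total come from the story above; the even 325/325 split of the original pair and the 330kW load are assumptions made up purely for illustration.

```python
def usable_capacity_kw(module_ratings_kw, redundant):
    """Return the load (in kW) the paralleled modules can be trusted to carry.

    redundant=False ("for capacity"): every module is needed, so the usable
    capacity is the sum of the ratings.
    redundant=True ("for redundancy"): the load must still fit after the
    largest single module is lost.
    """
    total = sum(module_ratings_kw)
    if redundant:
        return total - max(module_ratings_kw)
    return total


def survives_one_module_loss(module_ratings_kw, load_kw):
    """True if the load still fits on what is left after the largest module fails."""
    return load_kw <= sum(module_ratings_kw) - max(module_ratings_kw)


# Assumption: the original 650 kW pair is split evenly; only the total is known.
old_pair = [325, 325]
# Ratings of the two newer modules, as given in the post.
new_pair = [337, 337]
# Hypothetical load, chosen only to show the difference between the designs.
load_kw = 330

print(usable_capacity_kw(old_pair, redundant=False))  # 650
print(usable_capacity_kw(new_pair, redundant=True))   # 337
print(survives_one_module_loss(old_pair, load_kw))    # False
print(survives_one_module_loss(new_pair, load_kw))    # True
```

The point of the second function is that a “for redundancy” design only keeps its promise while the load stays under the rating of a single module.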

Unfortunately there was still one minor flaw in the design. While the upgrade planners had good intentions, the two modules were configured to share a single output bus. Perhaps this was a cost consideration and the risk was thought to be manageable, but nevertheless, all of the loads in the data center were still dependent on the health of a single component – the output bus – and more specifically, the circuit breaker keeping it alive.  A better design, though more expensive, would have been to build two completely separate power paths from two completely separate UPS systems.
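One rough way to spot this kind of flaw on paper is to list what each power path depends on and look for anything common to all of them. The sketch below does that in Python. The component names are hypothetical, and the model is deliberately simplified: it ignores everything upstream of the UPS modules and assumes every load can be fed from any listed path.

```python
def single_points_of_failure(paths):
    """Return the components that appear in every power path to the load.

    `paths` is a list of sets, each naming the components one path depends on.
    Anything common to all paths can take the load down on its own.
    """
    if not paths:
        return set()
    return set.intersection(*paths)


# Hypothetical names: two UPS modules feeding one shared output bus and breaker.
shared_bus_design = [
    {"ups_1", "output_bus", "output_breaker"},
    {"ups_2", "output_bus", "output_breaker"},
]

# Hypothetical names: two completely separate power paths.
separate_paths_design = [
    {"ups_a", "output_bus_a", "output_breaker_a"},
    {"ups_b", "output_bus_b", "output_breaker_b"},
]

print(sorted(single_points_of_failure(shared_bus_design)))
# ['output_breaker', 'output_bus']
print(sorted(single_points_of_failure(separate_paths_design)))
# []
```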

Despite having inherited a single output bus, we’ve enjoyed 100% uptime for over 3 years, yet we knew from the beginning that one day we would need to remove this single point of failure.  And with the current UPS system now nearing capacity, that day has arrived.

Something New

The original UPS modules have sat silently in their original room since being turned off in the mid 90’s.  The adjacent battery room has been empty except for the battery racks, spill containment pan, and some behemoth input filters for the modules.  Also still standing by and in good working order are the original Automatic Transfer Switch (ATS), input bus, and output bus. Over the next couple of months we will bring these two rooms back online with two new UPS modules and battery strings with space for a third, and another 1-megawatt diesel generator to  supplement the existing three. We’ll also recommission the ATS and rebuild the input and output buses (bussi?).  When the project is complete, the result will be two fully diversified  “A” and “B” power systems, doubled capacity, and a foundation for fast expansion as we grow through the 15,000sf.

And hopefully, I’ll still have hair.

//spk

P.S. The “something borrowed” is time – our output breakers have been living on it.  And as for “something blue,” well…that just goes with the saying. ;)

The Cost of Things Not Seen

by Scott Kantner, September 10th, 2010 in Data Center, Human Factors

With the cast and crew of Extreme Home Makeover in the neighborhood last month, literally less than 1/2 mile away, I was motivated to take stock of my own home makeover projects. It’s funny (if not frightening) how your infrastructure can sometimes be falling apart right in front of your nose and yet go unnoticed. And when you do finally notice, the damage is often deeper than meets the eye. With Ty Pennington barking through a bullhorn off in the distance, I quickly found my priority #1 infrastructure project:

After 18 years, my self-built 15×15 storage shed needed some love. A little diagnosis with a wrecking bar revealed that more than just a hinge replacement was in the offing. Besides what you see in the picture, the sheathing beneath the door frame was rotted away, as was the lower portion of the door itself. The ensuing project resulted in an 8 hour effort in 96-degree heat to rebuild the door, restore the door frame, rehang the door, and caulk/paint the whole affair.

Dang it Jim, I’m an IT guy, not a carpenter

I’m always looking for a reason to use my Sawzall, but I did feel a bit like Dr. McCoy on this project. The patient was almost dead when I arrived on the scene, and extreme measures were required. It was way more work than it would have been had I just been a little more diligent about what I had been observing over the last few years. The door didn’t rot overnight - it happened slowly right before my eyes.

Why not take a walk through your IT shop today and use your monitoring tools to see what’s going on? Maybe take a closer look at the problems you’ve known about for a while, but haven’t done anything about because they didn’t look “too bad.” How do they look now? Do you see anything that needs attention? Is there anything that looks like it could be more evil than meets the eye? The longer you wait, the more painful fixing it is likely to be. If you need someone to help do a little diagnosis with a wrecking bar, our professional services folks can be of help. And if you don’t want to maintain your own shed anymore, we can gladly arrange for space in ours.

//spk

Great Expectations

by Scott Kantner, July 9th, 2010 in Data Center, Human Factors

What could possibly be more fun than standing outside in the 100+ degree heat here in southeastern PA?  Standing outside in 90+ degree heat while waiting in line at Disney World.  At least there’s the promise of something fun, and possibly something cool and wet at the end of the wait.

While recently wading through the sea of humanity and waiting in some of the infernal lines that define Disney at this time of year, I was struck by an interesting IT analogy early in the week (yes, I really did need a vacation, and by the end of the week I wasn’t thinking about IT at all).

In late June and early July, the number of baby strollers per square foot in the Magic Kingdom increases to approximately 10x the normal rate. This gives one many opportunities to observe other people’s children under extreme conditions. It’s amazing to watch parents expect their 5-year-olds to behave like perfect angels in subtropical queue lines for upwards of 45 minutes, or sprint from one end of a park to another on tiny, tired little legs to score a Toy Story Fast-Pass before they’re all gone. It’s amazing because you can tell these kids are perfect hellions under ideal conditions as well. Putting them under stress only intensifies the problems that already exist.

Likewise, if you’ve got poorly designed or neglected infrastructure, simply moving it to a colo facility isn’t going to improve up-time or performance significantly, if at all. Certainly you can improve environmentals, save capex, and get lower network latency with a colo move, but if application response time and reliability are sucking wind before the move because of bad design or sysadmin neglect, not much is going to change.

My point isn’t that you should avoid putting your infrastructure in a better home if you need to, but that you shouldn’t expect it to behave any differently just because you moved it. Moreover, move time is not the time to make drastic changes to your production systems. It’s not a “free” outage window. The more changes you make during a move, the higher the risk of a failed, or at minimum a very stressful, move.

On the other hand, a move can be an ideal time to upgrade to better hardware and legitimately raise your expectations. For example, you can set up new hardware next to your old, cluster it, and then move the new half of the cluster to a better home while the old half continues to run the business. After you complete the move and let the clusters resynchronize, you can turn down the old cluster and all activity will automatically switch over to the new hardware. Your users will never feel a thing. Very little pain, but very much gain.

Of course that all sounds good, and there are a lot of details involved in making it happen, but that’s what we do best. If you’re interested in smoothly moving your critical IT gear to a new home and need some experienced help to get it done, give us a call. Hardware prone to temper tantrums is one of our specialties.

//spk

The Red Button

by Scott Kantner, April 1st, 2010 in Data Center

Don’t ever press the wed, err red one. While I laughed hysterically at this cartoon as a kid, I never thought it would become my reality one day. Yes, I have pressed the red one, but I hope to never have to again.

The “red one” is none other than the Emergency Power Off button, and here on the east coast it’s pretty hard to build a data center without one. What?! You don’t have one? Shhhh…I won’t tell. Your secret is safe with me. Here’s what a real EPO red button looks like in case you’ve never seen one.

Notice the label. I firmly believe it should also say “UPDATE YOUR RESUME BEFORE PRESSING” as pressing this is in most cases a resume-generating, if not career-ending, event. Why? When activated, this button’s job is to do one thing, and one thing only: cut the power to your data center. All of it. Let that sink in for a moment. Think through what that would mean in your shop. No power. No sound. Just deafening silence, unless of course you pressed it by accident and the silence gives way to the sound of clanging pitchforks and the smell of torches being lit over in the end-user community.

I am obviously a bit biased about this topic. I don’t think these systems are necessary, but you should do some research and draw your own conclusions. I am 100% all for safety, but from the historical evidence I’ve seen, the risk that EPO is designed to mitigate is lower than what you’re exposed to driving to work every day.  APC’s white paper #22 pretty much nails it:

EPO is a subsystem that is specifically designed to override all redundancy and fault tolerance built into the
network-critical physical infrastructure (NCPI), thereby putting the entire network at risk. EPO operation is
one of the largest causes of unplanned data center shutdown. The design of an EPO system must
therefore try to prevent any possibility of accidental operation, and it must minimize deliberate operation for
any reason other than a valid life-threatening emergency.
[Emphasis mine]

Red buttons are no panacea, but we are nevertheless forced to install them, and then make them nigh unto impossible to press unless you Really Mean It. Note in the photo above that the button is both recessed and protected by a plastic cover. Without the plastic cover, the recessed nature of the button is the only thing preventing it from being bumped accidentally, and it also hopefully slows down a would-be pusher enough to stop and ask “Do I Really Mean It?” Speaking of the cover, note also the small gray loop of wire in the upper left corner of the housing – we opted to install covers with alarms. Lifting the cover results in a piercing electronic squeal capable of penetrating 2-hour fire-rated walls and forces one once again to stop and ask “Do I Really Mean It?” Cover alarms are designed to stop non-data center savvy electricians and others from innocently doing something disastrous, such as pressing the red button before installing a new circuit breaker. Yes, it happens. Well, the label does contain the word “off”, doesn’t it? Changing the label from “Electrical Power Off” to “Emergency Power Off” tends to alter the results little. The word “off” seems to be the Pavlovian trigger.

Disarming Considerations

As I write this, our EPO system is being expanded to accommodate the growth of our operations. If you are building a new data center with EPO, make sure the designer includes a way to disable the system during maintenance and expansion activities. This seems like an obvious feature to include, but don’t take it for granted. This is also a handy feature to have if your operations are prone to having “civilians” in the data center, i.e. those who are unfamiliar with the various buttons and switches on the walls. It is very reassuring to be able to disarm the red buttons while such folks are meandering about the room. Even when escorted, such folks have been known to find ways to activate the EPO system, either accidentally by bumping a non-recessed red button, or deliberately pushing it out of curiosity when no one is watching.
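For those who think better in code, the layers of protection described so far (cover, cover alarm, and a disarm feature) can be pictured as a toy state model in Python. This is only a sketch: the class and method names are invented, it assumes the cover alarm sounds whether or not the system is armed, and it says nothing about how a real EPO panel is wired.

```python
class EpoStation:
    """Toy model of one EPO push-button station (all names are hypothetical)."""

    def __init__(self):
        self.armed = True            # False while disabled for maintenance
        self.cover_open = False
        self.alarm_sounding = False
        self.power_cut = False

    def disarm(self):
        """Disable the button, e.g. during maintenance or expansion work."""
        self.armed = False

    def arm(self):
        self.armed = True

    def lift_cover(self):
        """Lifting the cover sets off the alarm (assumed even when disarmed)."""
        self.cover_open = True
        self.alarm_sounding = True

    def close_cover(self):
        self.cover_open = False
        self.alarm_sounding = False

    def press(self):
        """Cut power only if the cover is lifted and the system is armed."""
        if self.cover_open and self.armed:
            self.power_cut = True
        return self.power_cut


station = EpoStation()
station.disarm()          # "civilians" are touring the room
station.lift_cover()      # the squeal starts
print(station.alarm_sounding, station.press())  # True False
station.arm()
print(station.press())    # True
```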

 

A Marriage Made In Hell

Once you have an EPO system in place, you will have to learn to live with it.  It is a risk that must be managed like all the others. If you’re building a new data center, you at least have the opportunity to design and build it properly, and then test it without jeopardizing your operations. Retrofitting an existing data center with EPO or expanding an existing system is a different matter entirely. You will want to engage an engineering firm and electricians that are very experienced with EPO systems, as most electricians are not familiar with the complexities involved with wiring EPO into a live data center environment. There is no second chance to get it right.

Here is a scary story that makes my point. Cutting to the chase, the article states:

About a month after opening a new facility in March 2003, Roberts, the director of data center services for Novi, Mich.-based Trinity Health, got a call. It was Easter morning, and a contractor had accidentally activated the EPO switch as he tried to replace a module connecting the button to the fire alarm system. According to Roberts, the fiasco “took the data center out.”

“We went out at 8:30 that morning,” he said. “By 11:30 that night, we were probably 95% up and going, so we were pretty lucky. But from that day forward, I tried to lessen the effect of this EPO.”

Lessen the effect indeed. This is not the kind of resurrection we want to be talking about on Easter Sunday.

Stress Relief Department

After all of this talk about outages, and with my own data center’s EPO being modified as we speak, it’s time for some needed stress relief:

Happy Easter!

//spk

P.S. I did press the red button, several times actually, but it wasn’t in a live situation.  It was during the initial testing of our system.  The lead engineer said “May as well press it now if you want, because you never will again.”  Hopefully he was a genuine prophet.

Labelmania

by Scott Kantner, March 12th, 2010 in Data Center

A discussion of labeling in the data center could go on for days and probably be done as a nine-part DVD mini-series and sold as a cure for insomnia. Nevertheless, the importance of good and proper labeling cannot be overstated, but it can be simply stated: Label Everything. Let’s take a quick look at the big items.

Racks – Label both front and back doors. We use a scheme based on the row number and position within the row, such that the first rack in row 5 would be labeled “5A”, the second “5B”, etc. Other folks use the time-honored “Battleship”-style system, based on an XY grid that maps out the room, most often based on the two-foot squares that make up a typical raised floor system.

An example would be “AJ06”, where the “X” coordinate is “AJ” and the “Y” is “06”. Neither method is necessarily superior to the other, and we happen to use both. We use the row/rack scheme for our racks and XY coordinates for infrastructure items like Air Handlers, floor PDUs, chilled water valves, etc. The reason we use row/rack rather than XY coordinates is that in a large room full of equipment, it is often hard to see the grid system on the walls and figure out where things are located. We believe it’s easier for a new sysadmin to find rack 5C (row 5, third rack) than to ask him to find rack BQ59 in a room chock-full of racks where he can’t see any latitude/longitude markers on the walls to get his bearings. Again, there is neither right nor wrong here; just a couple of different ways to approach it.
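If you keep rack locations in an inventory tool, both schemes are easy to generate and decode. Here is a small Python sketch of each. The function names are made up, and the grid decoder assumes the letters count the way spreadsheet columns do (A=1 … Z=26, AA=27), which may not match how your floor is actually marked.

```python
import string


def rack_label(row, position):
    """Row/rack label: the row number plus a letter for the position (1 -> 'A')."""
    if not 1 <= position <= 26:
        raise ValueError("this sketch only handles up to 26 racks per row")
    return f"{row}{string.ascii_uppercase[position - 1]}"


def grid_to_tile(label):
    """Split a grid label such as 'AJ06' into 1-based (column, row) tile numbers.

    Assumption: letters count like spreadsheet columns (A=1 ... Z=26, AA=27).
    """
    letters = label.rstrip(string.digits)
    digits = label[len(letters):]
    if not letters.isalpha() or not digits:
        raise ValueError(f"not a grid label: {label!r}")
    column = 0
    for ch in letters.upper():
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return column, int(digits)


print(rack_label(5, 1), rack_label(5, 3))  # 5A 5C
print(grid_to_tile("AJ06"))                # (36, 6)
print(grid_to_tile("BQ59"))                # (69, 59)
```

Seen this way, the trade-off is plain: “5C” tells a newcomer which row to walk down, while “BQ59” requires counting 69 tiles across and 59 along from wherever the grid starts.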

Servers and Network Gear – Label both the front and the back. The name is probably sufficient. Security-sensitive folks in your shop may balk at using IP addresses on server labels.

PDUs and power whips – Both floor and in-rack units. If you have an A+B redundant power distribution system, everything can be tagged with a number to identify the unit and a color to indicate to which  feed (“A” or “B”) it’s attached. Note in the pictures how this flows all the way through – even the whips are colored.

Patch Panels – We could talk about patch panel labeling schemes for days. The important thing to do is pick one system and stick with it. Here’s a look at ours. We label our panels using a “source/destination” scheme, so in the photo “1A/1D (1-6)” means that these are the first 6 ports running from rack 1A (the rack this panel is in) to rack 1D. Very easy for new sysadmins to grasp. This does not follow the ballyhooed TIA standard for labeling patch panels, but we find it to be very practical and easy for the people working in the room.

Cables – Labeling cables is a religious issue for another day also, but in our data center we typically only label the key cables in the network backbone and edges so that troubleshooting is easier at 3 AM. When we do label a cable, we label each end with a wrap-around style label that identifies where the other end of the cable can be found. You can see an example of this in the photo above. If you click on the photo to enlarge it, you can almost read the label.

Air Handler (a.k.a. “CRAC”) units – Simple to do, and very helpful when the units send alarms to systems management tools.
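A consistent scheme also means a script can read the labels. As a rough illustration, here is a Python sketch that parses a “source/destination (first-last)” label like the one in the photo. The function and field names are invented, and the pattern only covers the exact shape shown above.

```python
import re
from typing import NamedTuple


class PanelLabel(NamedTuple):
    source: str        # rack the panel lives in
    destination: str   # rack the ports run to
    first_port: int
    last_port: int


# Matches labels shaped like "1A/1D (1-6)"; other layouts are out of scope here.
_LABEL_RE = re.compile(r"^\s*(\w+)\s*/\s*(\w+)\s*\(\s*(\d+)\s*-\s*(\d+)\s*\)\s*$")


def parse_panel_label(text):
    """Parse a 'source/destination (first-last)' patch panel label."""
    match = _LABEL_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized panel label: {text!r}")
    source, destination, first, last = match.groups()
    first_port, last_port = int(first), int(last)
    if first_port > last_port:
        raise ValueError(f"port range runs backwards: {text!r}")
    return PanelLabel(source, destination, first_port, last_port)


label = parse_panel_label("1A/1D (1-6)")
port_count = label.last_port - label.first_port + 1
print(label.source, label.destination, port_count)  # 1A 1D 6
```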

Emergency Power Off (EPO) and Fire suppression controls – I actually think the EPO button should be labeled “Update Resume Before Pushing,” but that’s a topic for another day.

Mechanical Support Systems – Here you need to not only identify the control accurately, but sometimes you need to be very specific about its operation:

After all the hard work of designing and implementing a label system is done, you’ll need to put ongoing enforcement into place for which a label shouldn’t be necessary:

Take-away: What’s most important is that you pick a labeling system that works for you and is easily maintainable, because it needs to be useful, and it’s a never ending process. If it’s confusing to use or a pain to keep up to date, even Bubba and his .50 cal aren’t going to help.

//spk