Disclaimer: The pictures you are about to see are real. Server names and rack locations have been changed to protect the innocent. No software was harmed in the making of this blog post.
Last week we discovered what happens when copper goes to 1,900 degrees Fahrenheit inside a blade chassis. As you can see, it’s not pretty. I haven’t figured out how to represent the smell in a blog post, but trust me, that wasn’t pretty either. What’s truly amazing, however, is that the whole situation wasn’t much worse.
What you’re looking at is the result of an electrical failure inside an IBM BladeCenter H filled with twelve blade servers. It’s still unclear whether the mid-plane or a blade server was the initial cause, but both were victims of a rather vicious internal short that sent an ominous cloud of smoke into the air.
Because we are a 24/7 manned facility, our staff was on top of the situation immediately and no other customer equipment was ever in danger, but it raised blood pressure levels nonetheless. What we marveled at the most was that, as bad as it initially looked, the damage was essentially contained to just two components, and the customer was back up and running again within a matter of hours.
The fact that recovery happened as quickly as it did was not an accident of good fortune. This type of event (though not exactly this scenario, of course) was planned for in advance by using quality hardware and good system architecture. There are several key take-aways worth noting:
1. The four power supplies inside the BladeCenter shut down their output as soon as the overload was detected, thereby preventing further damage and a possible catastrophic event. Score 1 for IBM.
2. IBM was onsite PDQ with the correct parts and someone who knew how to replace them. Score 1 more for IBM.
3. Despite the intense heat, the rest of the BladeCenter did not spontaneously combust. In fact, all of the adjacent blade servers, management modules, and network modules survived unscathed. Score yet another for IBM.
4. The customer had the wisdom, foresight, and willingness to invest in a boot-from-SAN architecture with adequate redundancy. As soon as IBM replaced the parts, return to operation occurred without the need for a restore. Smoke and drama, yes. Data loss and extended down time, no. Score 1 for the customer and the DSS design team. Note that perfect redundancy in every component was not necessary to accomplish a happy ending – just reasonably adequate redundancy.
It’s become fashionable to say that hardware is commodity and that cost is the bottom line. But is that really true? Perhaps it’s justifiable when buying client-side hardware, but I’m not ready to extend that to the business-critical server side. This event reinforces that view rather dramatically. Well-designed hardware and effective vendor support, which are rarely the cheapest option, were clearly among the keys to saving the day.
Based on this post alone, one might be led to think I bleed IBM-blue when cut, but those who know me will ROTFL and tell you how often I’m calling IBM out rather than giving them kudos. IBM gear is rarely the low-cost option, but this incident serves as a wake-up call that there is more to be considered than cost alone. Regardless of vendor, components fail, and it’s important to know that the hardware is designed to compensate when they do.
Good hardware and good system architecture go hand in hand when reliable operations (and sleeping at night) are a must. There simply are no short cuts.
//spk























