What could possibly be more fun than standing outside in the 100+ degree heat here in southeastern PA? Standing outside in 90+ degree heat while waiting in line at Disney World. At least there’s the promise of something fun, and possibly something cool and wet at the end of the wait.
While recently wading through the sea of humanity and waiting in some of the infernal lines that define Disney at this time of year, I was struck by an interesting IT analogy early in the week (yes, I really did need a vacation, and by the end of the week I wasn’t thinking about IT at all).
In late June and early July, the number of baby-strollers per square foot in the Magic Kingdom increases to approximately 10x the normal rate. This forces one to put up with, er, gives one many opportunities to observe other people’s children under extreme conditions. It’s amazing to watch parents expect their 5-year-olds to behave like perfect angels in subtropical queue lines for upwards of 45 minutes, or sprint from one end of a park to another on tiny, tired little legs to score a Toy Story Fast-Pass before they’re all gone. It’s amazing because you can tell these kids are perfect hellions under ideal conditions as well. Putting them under stress only intensifies the problems that already exist.
Likewise, if you’ve got poorly designed or neglected infrastructure, simply moving it to a colo facility isn’t going to improve up-time or performance significantly, if at all. Certainly you can improve environmentals, save capex, and get lower network latency with a colo move, but if application response time and reliability are sucking wind before the move because of bad design or sysadmin neglect, not much is going to change.
My point isn’t that you should avoid putting your infrastructure in a better home if you need to, but that you shouldn’t expect it to behave any differently just because you moved it. Moreover, move time is not the time to make drastic changes to your production systems. It’s not a “free” outage window. The more changes you make during a move, the higher the risk of a failed, or at minimum a very stressful, move.
On the other hand, a move can be an ideal time to upgrade to better hardware and legitimately raise your expectations. For example, you can set up new hardware next to your old, cluster it, and then move the new half of the cluster to a better home while the old half continues to run the business. After you complete the move and let the clusters resynchronize, you can turn down the old cluster and all activity will automatically switch over to the new hardware. Your users will never feel a thing. Very little pain, but very much gain.
Of course that all sounds good, and there are a lot of details involved in making it happen, but that’s what we do best. If you’re interested in smoothly moving your critical IT gear to a new home and need some experienced help to get it done, give us a call. Hardware prone to temper tantrums is one of our specialties.
Don’t ever press the wed, err red one. While I laughed hysterically at this cartoon as a kid, I never thought it would become my reality one day. Yes, I have pressed the red one, but I hope to never have to again.
The “red one” is none other than the Emergency Power Off button, and here on the east coast it’s pretty hard to build a data center without one. What?! You don’t have one? Shhhh…I won’t tell. Your secret is safe with me. Here’s what a real EPO red button looks like in case you’ve never seen one.
Notice the label. I firmly believe it should also say “UPDATE YOUR RESUME BEFORE PRESSING” as pressing this is in most cases a resume-generating, if not career-ending, event. Why? When activated, this button’s job is to do one thing, and one thing only: cut the power to your data center. All of it. Let that sink in for a moment. Think through what that would mean in your shop. No power. No sound. Just deafening silence, unless of course you pressed it by accident and the silence gives way to the sound of clanging pitchforks and the smell of torches being lit over in the end-user community.
I am obviously a bit biased about this topic. I don’t think these systems are necessary, but you should do some research and draw your own conclusions. I am 100% all for safety, but from the historical evidence I’ve seen, the risk that EPO is designed to mitigate is lower than what you’re exposed to driving to work every day. APC’s white paper #22 pretty much nails it:
EPO is a subsystem that is specifically designed to override all redundancy and fault tolerance built into the network-critical physical infrastructure (NCPI), thereby putting the entire network at risk. EPO operation is one of the largest causes of unplanned data center shutdown. The design of an EPO system must therefore try to prevent any possibility of accidental operation, and it must minimize deliberate operation for any reason other than a valid life-threatening emergency. [Emphasis mine]
Red buttons are no panacea, but we are nevertheless forced to install them, and then make them nigh unto impossible to press unless you Really Mean It. Note in the photo above that the button is both recessed and protected by a plastic cover. Without the plastic cover, the recessed nature of the button is the only thing that prevents it from being bumped accidentally, and it hopefully also slows down a would-be pusher enough to stop and ask “Do I Really Mean It?” Speaking of the cover, note also the small gray loop of wire in the upper left corner of the housing – we opted to install covers with alarms. Lifting the cover results in a piercing electronic squeal capable of penetrating 2-hour fire-rated walls and forces one once again to stop and ask “Do I Really Mean It?” Cover alarms are designed to stop non-data center savvy electricians and others from innocently doing something disastrous, such as pressing the red button before installing a new circuit breaker. Yes, it happens. Well, the label does contain the word “off”, doesn’t it? Changing the label from “Electrical Power Off” to “Emergency Power Off” tends to alter the results little. The word “off” seems to be the Pavlovian trigger.
Disarming Considerations
As I write this, our EPO system is being expanded to accommodate the growth of our operations. If you are building a new data center with EPO, make sure the designer includes a way to disable the system during maintenance and expansion activities. This seems like an obvious feature to include, but don’t take it for granted. This is also a handy feature to have if your operations are prone to having “civilians” in the data center, i.e. those who are unfamiliar with the various buttons and switches on the walls. It is very reassuring to be able to disarm the red buttons while such folks are meandering about the room. Even when escorted, such folks have been known to find ways to activate the EPO system, either accidentally by bumping a non-recessed red button, or deliberately pushing it out of curiosity when no one is watching.
A Marriage Made In Hell
Once you have an EPO system in place, you will have to learn to live with it. It is a risk that must be managed like all the others. If you’re building a new data center, you at least have the opportunity to design and build it properly, and then test it without jeopardizing your operations. Retrofitting an existing data center with EPO or expanding an existing system is a different matter entirely. You will want to engage an engineering firm and electricians that are very experienced with EPO systems, as most electricians are not familiar with the complexities involved with wiring EPO into a live data center environment. There is no second chance to get it right.
About a month after opening a new facility in March 2003, Roberts, the director of data center services for Novi, Mich.-based Trinity Health, got a call. It was Easter morning, and a contractor had accidentally activated the EPO switch as he tried to replace a module connecting the button to the fire alarm system. According to Roberts, the fiasco “took the data center out.”
“We went out at 8:30 that morning,” he said. “By 11:30 that night, we were probably 95% up and going, so we were pretty lucky. But from that day forward, I tried to lessen the effect of this EPO.”
Lessen the effect indeed. This is not the kind of resurrection we want to be talking about on Easter Sunday.
Stress Relief Department
After all of this talk about outages, and with my own data center’s EPO being modified as we speak, it’s time for some needed stress relief:
Happy Easter!
//spk
P.S. I did press the red button, several times actually, but it wasn’t in a live situation. It was during the initial testing of our system. The lead engineer said “May as well press it now if you want, because you never will again.” Hopefully he was a genuine prophet.
On June 29th a cloud burst occurred at Rackspace, proving that even the mighty eventually do fall. The blow-by-blow Rackspace Twitter account of their power outage provides interesting insight into what happens during a crisis at a hosting provider.
In every industry there are dirty little secrets that customers either don’t know about, or don’t want to know about. The meat counter at the grocery store is a prime example. Those steaks and chops look really good, but did you ever watch the entire process from hoof to hamburger? It’s not pretty, and for most folks it’s Too Much Information.
So here’s Dirty Little Secret #1 of the hosting industry: While most every hosting company has to make the claim in order to be credible, no one can deliver 100% data center up time forever. No one. Not even the market leader. So why then make the claim at all? Because that’s what customers demand to hear. In talking with customers we find a widespread cross-industry sentiment, usually absent any logical rationale, that says “my business is so important that my infrastructure has to be running 24/7 without any interruptions at all.” Unless your business is keeping patients alive with sophisticated medical equipment, this seems like a rather difficult position to defend. But no one wants to be the bad guy to point that out. We know there is life beyond brief outages because they happen every day and yet nobody goes broke, but it is typically unwise to say so.
Realizing that downtime will occur, even in the elite shops of the world like Rackspace with their fleet of nine data centers, you do need to make realistic decisions about what level of up time you really need in light of the type of business you’re in. And while it may sound like heresy, you also want to make decisions about things that are much more important than up time levels. It seems to me that if downtime is inevitable, and we know that it is, then I want my equipment in the hands of people who know how to recover quickly from an outage, who will communicate with me regularly and truthfully throughout the crisis, and who will do their level best to get me back on line as quickly as possible. I want my equipment in the hands of highly competent people that I can trust. You can’t make that determination when you sign up for service via a web browser or where you do the whole transaction over the phone. The only way to make the determination is to actually meet the people who are going to become the custodians of your infrastructure.
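To put some numbers behind "what level of up time you really need," here is a minimal sketch that converts an up time percentage into the hours of downtime it permits per year. The function name and the sample percentages are my own illustration, not figures from any provider's SLA, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored for simplicity


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year that a given up time percentage permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Sample levels chosen for illustration only.
    for pct in (99.0, 99.9, 99.99, 100.0):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% up time allows {hours:.2f} hours of downtime per year")
```

Seen this way, the gap between a promise of 100% and a promise of 99.9% is a matter of hours per year, which is why how a provider behaves during those hours matters more than the number on the brochure.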
Before you put your equipment in the hands of someone else, make the effort to visit them. If they don’t allow visits, that should be a big Red Flag #1. Talk to their operations and support people, particularly the folks who will be touching your equipment. If you’re not allowed to talk to them, that should be Red Flag #2. Ask them about their up time guarantee. If they look you square in the eye and say 100%, that should be Red Flag #3. Kick the dust off your shoes and move on.
Let me cordially invite you to visit our data center hosting facility this summer. No red flags – just trustworthy, highly competent, and dependable people.
Warning, this is a long post on a controversial subject. I’d recommend you refill your coffee cup before diving in.
We get a lot of questions about “Green IT.” Is your data center Green? What’s your Green strategy? What do you think of Cap and Trade? And so on. With all the bacteria in the air about global warming and the associated hyperventilation going on in the media, it’s becoming difficult not to catch the disease and lose perspective. Do we really have to add “being Green” to our list of worries in the data center? Little did I realize a favorite childhood classic would be a relevant stress reliever so many years later.
Do you like green eggs and ham?
There are many shades of green. I was reminded of that one day when my Dad asked me whether I was waiting for the traffic signal to turn avocado before I was going to pull out. But in terms of going Green in the data center there are several flavors, such as reducing so-called greenhouse gas emissions, buying RoHS compliant products, recycling old assets properly, etc. Since power for equipment and cooling is the most critical resource for IT, the Greenhouse Gas Police are our primary concern. Over on SearchDataCenter.com, we find this in a piece entitled “Get Ready For A Carbon Tax:”
Today, the U.S. government and other countries are taking carbon emissions seriously. The Environmental Protection Agency last week formally declared carbon dioxide and five other greenhouse gases to be harmful air pollutants. It’s a move that is thought to set the stage for a carbon tax of some kind.[emphasis added]
Kai Reichardt, data center manager for UniCredit Group, an Italian bank, said his company recently built a data center over a canal [raucous laughter added] so it could use free water cooling instead of expending energy for mechanical cooling. And he said the company will also investigate other opportunities to save.
“You have to define rules and implement them,” he said. “You have to punish the polluters and push the innovators.” [more emphasis added]
Well isn’t that rich? You, Mr. Data Center owner, are a polluter that needs to be punished. Whenever the discussion turns from the issue itself to name calling, you know that rational discussion has ended. Let’s assume the title of Evil Polluter for a moment and consider how we might mend our ways.
Would you like them with your server? I would not like them with my server.
Servers are the first potential power hogs that come to mind. To Green our servers quickly, we have to replace or virtualize them. IBM claims their new x3650 M2 class Intel servers boast $100/year savings on power. Yippee skippee. Most companies are not going to be excited about spending thousands on a new server just to save $100 annually on the electric bill. But it adds up in a big data center, you say. Well yes, but if you can show me a budget in this economy that can swap out gear in numbers large enough to make these power savings seem attractive, I’ll show you a budget where that savings is round-off error. $100/yr by itself is not a compelling story, given the capital expense and upheaval to our operations that always comes with putting new boxes in place. The sensible solution here would be to buy Greener hardware as old servers naturally come to end of life. But we’d be doing that anyway without consciously trying to be Green. New hardware from IBM, for example, becomes more energy efficient over time without us asking for it.
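For the skeptics, the payback arithmetic can be sketched in a few lines. The $100/year figure is the savings claim cited above; the $3,000 purchase price is purely a hypothetical stand-in for "thousands," so plug in your own quote.

```python
def payback_years(purchase_cost: float, annual_savings: float) -> float:
    """Years needed for the annual savings to repay the purchase cost (simple payback, no discounting)."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive")
    return purchase_cost / annual_savings


ASSUMED_SERVER_COST = 3000.0   # hypothetical price, not a vendor quote
ANNUAL_POWER_SAVINGS = 100.0   # the per-server savings claim cited above

if __name__ == "__main__":
    years = payback_years(ASSUMED_SERVER_COST, ANNUAL_POWER_SAVINGS)
    print(f"Simple payback on power savings alone: {years:.0f} years")
```

Under that assumed price the server would have to stay in service for 30 years to pay for itself on power alone, and that ignores labor and the cost of the outage needed to swap it in.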
If we virtualize, we still have to spend money on product and labor to make it happen, and then we have service outages to incur, end-user politics to negotiate, and all the risks inherent in shutting down and moving healthy systems. Measuring the hard dollar savings of this maneuver from a Green perspective is like trying to weigh a chicken with a yardstick. Virtualization is a Very Good Thing, but using Green alone as the cost justification certainly isn’t going to cut it. Justification is going to be found in reduced costs realized from hardware consolidation, new hardware avoidance, and to some degree software license reductions, among other things.
So unless Green means something substantial to the bottom line, who’s actually going to be interested?
Would you like them here or there? I would not like them here or there, I would not like them anywhere.
We could also try to replace or virtualize storage to save power. Kantner’s General Theory of Storage states that:
The rate of storage growth is inversely proportional to the amount of free storage available.
In other words, we’re always going to need more storage at the most inopportune times. Once we advance beyond spinning platters for storage (e.g. SSD), perhaps storage will become more power efficient. But regardless, storage virtualization leads only to a deferral of additional storage purchases through better space utilization, not a reduction of powered up hardware. Ultimately, storage purchases are justified by business requirements, not by an appeal to better power efficiency.
What about network gear? Because of the business impact of disruptions that can occur when swapping out major network components, no one is going to dive into that pool until it’s absolutely necessary. Those big old Cisco 6500 power supplies are going to continue to glow like the sun.
Green really needs to mean something more to the bottom line.
Would you, could you, in your data center? I would not, could not in my data center.
Short of taking a major outage or starting over, what practical facility changes can one realistically expect to cost justify? Few can afford to rip/replace chiller plants, UPS systems, or generators with more efficient units. That said, we are in fact looking at ways to shut our chiller plant down during the winter and just run off the cooling tower loop, but we’re doing that to reduce costs – Green is not the driver, but rather deregulation of the electrical utility.
Sure there are some practical things we can do, but extreme measures are hard to justify. We hear of folks running their cold aisles at temps over 90°F. Anyone smell silicon burning? (Incidentally, we’ve seen equipment get toasted. We like 72° at the equipment inlet for good reason, and the Uptime Institute agrees.)
But let’s get back to our alleged title of Polluters. Here at DSS we don’t have smoke stacks towering out of the data center. Chances are, neither do you. Why? Because the vast majority of us obviously don’t generate our own power. We outsourced that to the electric utility industry a long time ago. Certainly we can turn things off and use less, but the generator plant down the street is not going to run any less and will still make power the same way. And if that way is anything other than nuclear fission, there are still nitrogen oxides, sulfur oxides, dust, and carbon dioxide wafting into the atmosphere. Greening our data centers isn’t going to change that. Clearly there is something else driving the Green monster.
I do so like green eggs and ham, Thank you, Uncle Sam-I-Am!
Business cares about the bottom line, not greenhouse gas. Uncle Sam-I-Am understands this. Could it be that the clarion call to save the planet is rooted in the discovery of green ham? My hunch is that Uncle Sam is all about the green ham, namely the cold hard cash that will come out of Cap and Trade, or any other similar program. And we get the green eggs of trying to minimize our financial exposure to those taxes…err…programs. Cap and Trade, boiled down to its essence, is about government revenue, and it looks like Uncle Sam has done his homework. My bet is that Archie Bunker would smell a rat too. (If you’re a Dem, please don’t be offended – the title should really be “Archie Bunker on Government”).
As major consumers of power, those of us with data centers are squarely in the cross-hairs. Faced with confiscatory financial punishment, we suddenly have an interest in global warming, whether it’s reality or not. If the revenue from Cap and Trade were only intended to replace legacy power plants with nuclear or other clean power, then the idea would be more palatable, but as they’re designed now, these plans look like just more sources of pork to be spent as Uncle Sam sees fit. One need look no further than the UK for confirmation.
You do not like them so you say. Try them! Try them! And you may.
While I’m not alone in my contrary view on Green, the IT jury is still out on what Greening the Data Center really means from a practical standpoint. But let me end with some positive suggestions for both camps. For those truly concerned about going Green, consider mothballing your server room or data center and moving your gear to a professional hosting facility. Take your shop off the grid and put it in a cloud somewhere that can, because of scale, do it with more power efficiency than you would be able to achieve on your own. If global warming theory rings true for you, this should have tremendous and obvious appeal. If it doesn’t, and you just want to avoid the hassles of Cap and Trade, you too might consider moving your gear to a professional data center. Let us Evil Polluters smell the sulfur of green eggs and send the ham to Uncle Sam.
Since only Robinson Crusoe had the luxury of getting everything done by Friday, the rest of us have to come up with other strategies to get all of the things done necessary to properly serve our customers. To help with this in our own data center, we have a pseudo-tradition called Infrastructure Friday. This is not to be confused with Redneck Tuesday:
On Infrastructure Fridays, members of the data center team who normally don’t work out on the floor put down email, IM and PDAs, roll up their sleeves, and step into the data center to help get some of the “real” work done. To keep IT running smoothly, we sooner or later have to stop talking about it and actually go and do something about it. That “something” we do includes taking care of ongoing operational details and implementing new functionality that maintains or improves reliability.
Like many other things, excellent performance in the data center is all about execution and details. Focus on the details and the big picture will take care of itself, or as Mel Gibson advised his young son in The Patriot, “Aim small, miss small.” What sort of details are we talking about on Infrastructure Friday?
Not just performing rack inspections, but actually correcting any problems found.
Not just noting network latency issues, but getting the right people involved to isolate and resolve them.
Not just checking that critical monitoring systems in the NOC are healthy, but verifying they are actually working by simulating failures.
Not just verifying that operational documentation is current and complete, but actually updating it if it’s not.
Not just checking parts inventories (patch cables, cable management supplies, etc.), but placing the orders to replenish supplies.
Not just validating that data center standards are being followed (equipment mounted for proper air flow, floor tile placement, etc.), but actually correcting violations.
Not just noting that wire management is shoddy, but actually making it better.
Not just complaining that critical patch cables aren’t labeled, but actually getting out the label machine and doing the labeling.
Not just finding hot spots in the electrical system, but scheduling the downtime required to avert a future disaster.
Hopefully the theme is obvious. On Infrastructure Friday, the goal isn’t to grouse about problems, it’s to fix them.
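As one illustration of "verifying by simulating failures," here is a rough sketch of the idea: feed a monitor a fake out-of-range reading and confirm an alert actually comes out the other end. Everything here is hypothetical, including the class, the function names, and the 80°F threshold; it is not our NOC tooling, just the shape of the test.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TemperatureMonitor:
    """Toy inlet-temperature monitor; names and threshold are assumptions for illustration."""
    read_sensor: Callable[[], float]
    threshold_f: float = 80.0  # assumed alert threshold, not a recommendation
    alerts: List[str] = field(default_factory=list)

    def poll(self) -> bool:
        """Read the sensor once; record an alert and return True if over threshold."""
        reading = self.read_sensor()
        if reading > self.threshold_f:
            self.alerts.append(
                f"inlet temperature {reading:.1f}F exceeds {self.threshold_f:.1f}F"
            )
            return True
        return False


def alert_path_works(monitor: TemperatureMonitor, simulated_reading: float) -> bool:
    """Inject a fake reading, poll once, and report whether exactly one new alert appeared."""
    real_sensor = monitor.read_sensor
    alerts_before = len(monitor.alerts)
    monitor.read_sensor = lambda: simulated_reading
    try:
        monitor.poll()
    finally:
        monitor.read_sensor = real_sensor  # always restore the real sensor
    return len(monitor.alerts) == alerts_before + 1


if __name__ == "__main__":
    monitor = TemperatureMonitor(read_sensor=lambda: 72.0)
    print("healthy reading raises alert:", monitor.poll())
    print("simulated failure raises alert:", alert_path_works(monitor, 95.0))
```

A green dashboard only proves that nothing has tripped. The injected reading proves that something would.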
On a happier note, what sort of cool new functionality might we install on Infrastructure Fridays to improve reliability? That’s a shorter list probably not worthy of a set of bullets, but it typically involves installing new or upgraded monitoring capabilities in the NOC, adding additional monitoring instrumentation out on the floor, improving the quality and types of information on the master dashboards, and continuing to implement automated processes to lessen the chance of unplanned downtime. But again the theme is the same: take action.
In the day-to-day blur of activity required to keep a live data center running, the Oughta List of things (we ought to do this, we ought to do that) that would improve reliability grows week by week, but never seems to get done because of the tyranny of the urgent. We find ourselves officially declared Too Busy to work on the Oughta List and before we know it, an outage occurs and the Oughta List suddenly becomes an embarrassing Shoulda List.
Infrastructure Friday is designed to overcome Oughta List inertia. With a “try me” cost of zero, it has pretty good ROI.
//spk
About DSS
DSS is a company that exemplifies the innovative spirit, delivering Data Center solutions that drive business value. From our Tier III Data Center to the consulting and support we provide, we continually strive to produce high-quality, beyond-standard solutions that exceed the expectations of our customers.