Since only Robinson Crusoe had the luxury of getting everything done by Friday, the rest of us have to come up with other strategies to get done all of the things necessary to properly serve our customers. To help with this in our own data center, we have a pseudo-tradition called Infrastructure Friday. This is not to be confused with Redneck Tuesday.
On Infrastructure Fridays, members of the data center team who normally don’t work out on the floor put down email, IM, and PDAs, roll up their sleeves, and step into the data center to help get some of the “real” work done. To keep IT running smoothly, we sooner or later have to stop talking about it and actually go and do something about it. That “something” includes taking care of ongoing operational details and implementing new functionality that maintains or improves reliability.
Like many other things, excellent performance in the data center is all about execution and details. Focus on the details and the big picture will take care of itself, or as Mel Gibson advised his young son in The Patriot, “Aim small, miss small.” What sort of details are we talking about on Infrastructure Friday?
Not just performing rack inspections, but actually correcting any problems found.
Not just noting network latency issues, but getting the right people involved to isolate and resolve them.
Not just checking that critical monitoring systems in the NOC are healthy, but verifying they are actually working by simulating failures.
Not just verifying that operational documentation is current and complete, but actually updating it if it’s not.
Not just checking parts inventories (patch cables, cable management supplies, etc.), but placing the orders to replenish supplies.
Not just validating that data center standards are being followed (equipment mounted for proper air flow, floor tile placement, etc.), but actually correcting violations.
Not just noting that wire management is shoddy, but actually making it better.
Not just complaining that critical patch cables aren’t labeled, but actually getting out the label machine and doing the labeling.
Not just finding hot spots in the electrical system, but scheduling the downtime required to avert a future disaster.
Hopefully the theme is obvious. On Infrastructure Friday, the goal isn’t to grouse about problems, it’s to fix them.
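One of the bullets above deserves special emphasis: verifying that monitoring actually works by simulating failures. A green dashboard only proves the dashboard is green. The idea can be sketched as a small canary test; all of the names here (`poll`, `alert_path_works`, the canary host) are illustrative placeholders, not the API of any real NMS.

```python
import time

# Hypothetical sketch of an Infrastructure Friday alert-path test:
# take a canary host "down" on purpose and confirm the monitoring
# raises an alarm within its polling window.

POLL_INTERVAL = 2  # seconds; assumed polling interval for the test

def poll(host, outages):
    """Stand-in for the NMS poller: a host listed in `outages` looks down."""
    return host not in outages

def alert_path_works(host="canary01", timeout=10):
    outages = {host}             # inject the simulated failure
    alarms = []
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not poll(host, outages):
            alarms.append(f"{host} DOWN")   # the alarm we hoped to see
            break
        time.sleep(POLL_INTERVAL)
    outages.discard(host)        # restore the canary
    return bool(alarms)

print(alert_path_works())  # True: the simulated outage raised an alarm
```

In a real data center the “canary” would be a sacrificial port or test node, and the verification would confirm the alert reaches the NOC console and pager, not just an in-memory list.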
On a happier note, what sort of cool new functionality might we install on Infrastructure Fridays to improve reliability? That’s a shorter list probably not worthy of a set of bullets, but it typically involves installing new or upgraded monitoring capabilities in the NOC, adding additional monitoring instrumentation out on the floor, improving the quality and types of information on the master dashboards, and continuing to implement automated processes to lessen the chance of unplanned downtime. But again the theme is the same: take action.
In the day-to-day blur of activity required to keep a live data center running, the Oughta List of things (we ought to do this, we ought to do that) that would improve reliability grows week by week, but never seems to get done because of the tyranny of the urgent. We find ourselves officially declared Too Busy to work on the Oughta List, and before we know it, an outage occurs and the Oughta List suddenly becomes an embarrassing Shoulda List.
Infrastructure Friday is designed to overcome Oughta List inertia. With a “try me” cost of zero, it has pretty good ROI.
On my way out of the office earlier this week, I found our master Jedi of monitoring standing in my office door. “You might want to sit down,” he said. In over 10 years of working together in the hellfire and brimstone of systems management, he’d never said that before. “What could possibly be that bad?” I wondered. “I just went to the Cittio support site,” he said calmly as he handed me his Blackberry. “Here’s what I got:”
For those of you unfamiliar with the world of network management systems, the name Cittio means nothing. For those of you unfamiliar with the history of systems management tools at DSS, you’re also likely thinking “Dude, get over it. It’s just another company folding.” Or, as a former MVS systems programmer colleague used to say to me, “Get over it…and like it.”
Four Times Bitten, Forever Shy?
IBM Netview. We’ve been managing customer systems with NMS tools since 1995. Being an IBM business partner, we decided to start with IBM Netview, a close but homely cousin of HP Openview. While Netview was not without its charm, it was a cruel taskmaster. We spent more time offering animal sacrifices to the tool to keep it running than we spent actually using it. Besides taking 45 minutes to begin polling after a restart, the monitoring daemon would just go off into the weeds and stop polling. We never could really trust it, and reporting left much to be desired. As we continued to struggle with Netview, IBM bought Tivoli and the product was moved over to the Tivoli side of the house for assimilation into the Tivoli Enterprise Framework. Since IBM surely wouldn’t have bought a company with bad products, and since business partners now had easy access to the Tivoli products, we naively decided to take a look at Tivoli Enterprise.
Tivoli Enterprise Distributed Monitoring (DM). After spending considerable time and money getting indoctrinated in the Tivoli Enterprise Framework and DM, we quickly realized the product was even more of a monster than Netview. More animal sacrifices and offerings of time and energy were required for less functionality and horrible reliability. We did one customer implementation and stopped. We had seen and suffered enough. While contemplating whether to shave our heads and put on sackcloth and ashes, we heard of a new NMS savior coming for the small-medium business space.
Tivoli IT Director. Enter codename “Bossman.” By divine intervention, our company was selected by Tivoli to become part of a small circle of customers and partners involved in a skunk works project to develop an NMS targeted at small shops. An all-in-one tool that could poll for availability, collect performance data, monitor thresholds, collect HW/SW inventory, and even do software distribution. A veritable Ginsu knife set for systems management (without the 50-year guarantee). But wait, there’s more… Tivoli released the product on time, and as insiders we were way ahead of the game. We began implementing it at customer sites with good results, and the sun was finally beginning to shine again. No more animal sacrifices. We had finally begun to rebuild our remote monitoring business out of the ashes of the Netview days.
IT Director did have one flaw in its armor – it couldn’t support more than a couple of hundred nodes. But the boys in Texas were on top of that, and project “California” was underway to take the number of nodes up to 5,000. Just days before we were to receive the beta code, Tivoli pulled the plug on the product. Our sources behind the curtain told us why: it was felt that California, at its dramatically lower price point, would compete against Tivoli Enterprise Distributed Monitoring, and the Mercedes Benz crowd at Tivoli were having none of that. The product was pulled from the portfolio and given to the IBM PC division in Boca Raton, where it was thoroughly lobotomized and re-released as IBM Netfinity Director. So began the Dark Times.
Time out. I realize this is a blog post, not the Chronicles of Narnia, so I’ll hasten to the point. Director was completely unusable after IBM Boca got done with it, and we had to move on. At this point, having been left at the altar by Tivoli, we decided to develop our own system, DSS Systems Manager, and over the next two years we did exactly that, with very satisfying results. Customers loved DSM, and so did we, but we had one problem – DSS was, and still is, not a software development shop. At the time we felt we couldn’t continue to develop the product and properly focus on our core business. As we moved into the data center hosting business, we realized we needed additional functionality that we felt we could no longer afford to develop ourselves. So we sought yet another commercial answer. Back to the story….
Cittio Watchtower. Watchtower essentially represented where we wanted to take DSM had we decided to continue development. We negotiated a deal, installed it in under 30 days, and were up and running. Like good old Tang, we just added water and the rest was history. We cultivated a close relationship with the CEO of Cittio and had regular contact with the VP of Development and other high-level folks who controlled the product’s destiny. We did joint marketing events with them, including speaking on their behalf on webinars, and served as a reference account when they had large deals on the table.
Only a week before the company dissolved (like Tang, perhaps), the CEO personally asked me to serve as a reference to a couple of companies that Cittio was considering for OEM relationships. Context is everything, and little did I know that OEM had been secretly redefined to mean “Our Exit Money.” A little over a week after I had happily given a glowing Watchtower review to a company named Nimsoft, my chief monitoring engineer was handing me his Blackberry with news of Cittio’s demise. We contacted Nimsoft on The Day After, and the basic message we got was: “good luck fellas, you’re pretty much on your own. The product will be no more. We can’t promise support of any kind.” Simply fabulous. To be fair, the whole situation is still in flux, and my sense during the phone call was that they hadn’t fully considered the fallout from their actions. They may very well come back with a migration plan or limited temporary support, etc., but for now we Watchtower users are out in the cold. Our new bride has packed her bags and left us with the credit card bills.
Getting Over It But Not Liking It
Thankfully, faith in God allows me to maintain my composure in situations like this, but a wise friend once taught me that buried feelings are buried alive, and when they come back, they come back as either anger or depression. So in the interests of good mental health, I’m compelled to express my feelings about this debacle and get back to business. Play this back to back 4-5 times for proper effect:
Nobody understands being jilted quite like Sam Kinison. I feel much better now.
What Does This Mean to You?
So what’s the take-away from this situation that we can apply in our shops? DSS has been on both sides of the build vs. buy decision, and there are clear advantages and risks to both positions. My opinion, while still standing here in the smoking crater, is pretty much what it’s always been: if you have the talent and can afford the time, building your own critical monitoring systems is still your best destiny. You have control of all of the variables and are forever immune to vendor adultery. There is plenty of good open source material out there to take care of the heavy lifting and serve as a good starting point.
If you don’t have the time or talent, then buying is obviously the only option. Cittio was a VC-funded company and therefore subject to the whims and wiles of the angels and VCs. If I were to buy again, rule #1 would be to limit the vendor short list to firms beyond at least the magical fourth round of funding. Translation: No fresh start-ups. Rule #2 would be to pick a product that is already firmly entrenched in a lot of Really Big Companies with big legal departments. There is safety in numbers and large legal teams. This may yet turn out to be the case with the Cittio breakup – they had some Really Big Customers, so we’ll wait and see if any major players file for damages in divorce court.
Unless IT is your core business, your best strategy is simple avoidance. Running your own infrastructure is full of headaches and horror stories that do nothing but hurt your bottom line. Let someone else highly skilled in being jilted deal with all the risks, headaches, and heartaches.
Postscript: Just as I was getting ready to publish this entry, I received a call from a former senior exec at Cittio. Though no longer on the payroll, he apologized at length for the situation, described what went down, and was genuinely troubled at the way in which former customers are now being treated. In the final analysis, the VC guys pulled the plug on a healthy company. While my contact really didn’t know why it happened, perhaps they were selling healthy assets to compensate for unhealthy ones. Who knows. In any event, it’s time to move on.
When AT&T inked its exclusive iPhone deal with Apple, they surely must have had the smile of the fabled Cheshire cat. AT&T had sole possession of a hot product with a locked-in revenue stream – what could be better? Have a look at the following analysis by Alcatel-Lucent on North American iPhone wireless usage:
The iPhone is turning out to be a bit more expensive for AT&T than perhaps they were expecting. While the air time minutes aren’t all that interesting, the bandwidth numbers do tell a story – unexpected hidden costs. The bandwidth-hungry nature of iPhone web applications is seriously taxing the AT&T wireless network, and the company will have to add cell towers and back-haul lines to support the increasing load. Who’s going to foot that bill? Since iPhone plans don’t have bandwidth limits, it won’t be existing customers, and so it turns out that the locked-in revenue stream is really a double-edged sword. AT&T can’t go back to existing customers for more money, but they need to fund the expansion to keep the service viable. Sounds a little like blackmail – infrastructure blackmail to be exact.
The underlying message of the iPhone story has a familiar ring to it (no pun intended). Do you have technology in your business that periodically puts you over a financial barrel? Are you finding yourself forced to fund infrastructure improvements without additional revenue to offset the cost?
We all have critical business applications – those applications without which we cannot transact business. Sooner or later the day comes when the system breaks and can’t be fixed without being upgraded. Then we learn that the upgrade comes with a list of pre-requisite hardware and other ancillary software that has to be upgraded as well. The total price tag is huge, but there is no choice. We swallow hard and pay the cost, or else the vendor banishes us to the land of No Technical Support with a broken application.
Insofar as unexpected server hardware costs are concerned, virtualization can be a saving grace. If you have a virtualized system with excess capacity, you can react quickly at no additional cost. Our analysis of customer systems over the past decade reveals that on average, most servers are running at under 15% processor utilization. If you still haven’t virtualized a portion of your environment, this is an easy strategy to deploy to avoid infrastructure blackmail.
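The 15% figure above is the whole consolidation argument in miniature: sample utilization over time, average it, and anything persistently idle is a virtualization candidate. A minimal sketch of that triage, using made-up sample data (the host names and numbers are illustrative, not from our customer analysis):

```python
# Hypothetical sketch: flag servers averaging under 15% CPU as
# virtualization/consolidation candidates. Sample data is invented.

samples = {
    "app01": [8, 12, 10, 9],     # percent CPU, periodic samples
    "db01":  [55, 60, 48, 70],
    "web02": [5, 7, 6, 4],
}

def consolidation_candidates(samples, threshold=15.0):
    """Return hosts whose average sampled CPU sits below the threshold."""
    return sorted(
        host for host, values in samples.items()
        if sum(values) / len(values) < threshold
    )

print(consolidation_candidates(samples))  # ['app01', 'web02']
```

In practice the samples would come from weeks of monitoring data, and peak-hour behavior matters as much as the average, but the decision logic is this simple at its core.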
Another strategy is to find a good hosting partner that can provide you with capacity on demand, either with physical or virtualized servers. By moving to a hosted environment, you effectively transfer the burden of hidden infrastructure expense from yourself to the hosting company. Large unexpected capital expenses can be transformed into low to moderate increases in operating expense. AT&T is probably wishing they could do that very thing right about now…
Describing our IT jobs to friends and family can be a bit of a challenge, and sometimes a simple answer is best.
So what do we do all day in the world of IT? In the smaller server closets of the world, very little needs to be done on a daily basis unless something breaks. However, in shops of any significant size, here is just a starter list for daily computer TCB:
Did last night’s backups run OK?
Did last night’s batch processing run OK?
Are the online applications healthy and responsive?
Did all of the scheduled change activity go well?
Are all of the servers operating nominally (e.g., acceptable response times, no CPUs hung at 100%, etc.)?
Is free disk space at acceptable levels?
Are there any trouble tickets from the overnight hours?
Is the network performing normally?
Are there any suspicious entries in the security logs?
Are any equipment alarm lights lit?
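Several of the checks above lend themselves to a scripted morning sweep that turns failed checks directly into the day’s work queue. A minimal sketch, with invented check names and thresholds (nothing here is a real monitoring API):

```python
# Hypothetical morning-sweep sketch: run each daily check and collect
# any failures into the day's TCB workload.

def disk_ok(free_pct, minimum=10):
    """Assumed policy: at least 10% free disk space."""
    return free_pct >= minimum

def backups_ok(log_lines):
    """Assumed backup log format: failed jobs contain 'FAILED'."""
    return all("FAILED" not in line for line in log_lines)

def morning_sweep(disk_free_pct, backup_log):
    todo = []
    if not disk_ok(disk_free_pct):
        todo.append("Free up disk space")
    if not backups_ok(backup_log):
        todo.append("Investigate failed backups")
    return todo

print(morning_sweep(4, ["job1 OK", "job2 FAILED"]))
# ['Free up disk space', 'Investigate failed backups']
```

A real sweep would cover every question on the list and feed a ticketing system rather than return a Python list, but the shape is the same: every wrong answer becomes a work item.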
Any wrong answers to the above questions become the day’s TCB workload. Intertwined with that comes the standing to-do list items:
Returning phone calls
Performing project work
Meeting with customers
Handling the unexpected, which usually trumps everything else.
Are you tired yet?
Regardless of the level of automation one can employ to keep the infrastructure running, there are certain activities that require human attention. There simply is no substitute. If infrastructure is not your core business, you may want to take a hard look at whether the care and feeding of your IT systems is the best use of your time and resources.
We don’t each generate our own power, but we certainly need it to run our businesses. There simply is no reason for each of us to invest in the people and equipment to generate electricity – it’s impractical and unaffordable. Likewise, you certainly need computing resources to run your business, but unless IT infrastructure is your business, the ever-increasing complexity and cost of IT is making it harder to find a compelling business reason to continue building and running it yourself.
Not to mention that good IT staff is hard to find…
It would be much better to find a good partner to handle your computing TCB.
DSS is a company that exemplifies the innovative spirit, delivering data center solutions that drive business value. From our Tier III Data Center to the consulting and support services we provide, we continually strive to produce high-quality, beyond-standard solutions that exceed the expectations of our customers.