Since only Robinson Crusoe had the luxury of getting everything done by Friday, the rest of us have to come up with other strategies to get done all of the things necessary to properly serve our customers. To help with this in our own data center, we have a pseudo-tradition called Infrastructure Friday. This is not to be confused with Redneck Tuesday.
On Infrastructure Fridays, members of the data center team who normally don’t work out on the floor put down email, IM and PDAs, roll up their sleeves, and step into the data center to help get some of the “real” work done. To keep IT running smoothly, we sooner or later have to stop talking about it and actually go and do something about it. That “something” we do includes taking care of ongoing operational details and implementing new functionality that maintains or improves reliability.
Like many other things, excellent performance in the data center is all about execution and details. Focus on the details and the big picture will take care of itself, or as Mel Gibson advised his young son in The Patriot, “Aim small, miss small.” What sort of details are we talking about on Infrastructure Friday?
- Not just performing rack inspections, but actually correcting any problems found.
- Not just noting network latency issues, but getting the right people involved to isolate and resolve them.
- Not just checking that critical monitoring systems in the NOC are healthy, but verifying they are actually working by simulating failures.
- Not just verifying that operational documentation is current and complete, but actually updating it if it’s not.
- Not just checking parts inventories (patch cables, cable management supplies, etc.), but placing the orders to replenish supplies.
- Not just validating that data center standards are being followed (equipment mounted for proper air flow, floor tile placement, etc.), but actually correcting violations.
- Not just noting that wire management is shoddy, but actually making it better.
- Not just complaining that critical patch cables aren’t labeled, but actually getting out the label machine and doing the labeling.
- Not just finding hot spots in the electrical system, but scheduling the downtime required to avert a future disaster.
Hopefully the theme is obvious. On Infrastructure Friday, the goal isn’t to grouse about problems; it’s to fix them.
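To make the parts inventory item concrete, here is a minimal sketch of turning a shelf count into an actual order instead of a note to self. Everything in it is an assumption for illustration: the `Part` fields, the `build_order` helper, the thresholds, and the counts are all made up, and a real shop would pull these numbers from whatever inventory system it already uses.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    on_hand: int      # units counted on the shelf
    reorder_at: int   # hypothetical threshold that triggers an order
    restock_to: int   # hypothetical target level once the order arrives


def build_order(parts):
    """Return {part name: quantity to order} for parts at or below their threshold."""
    order = {}
    for part in parts:
        if part.on_hand <= part.reorder_at:
            order[part.name] = part.restock_to - part.on_hand
    return order


# Made-up counts for illustration only.
inventory = [
    Part("patch cable, 2 m", on_hand=12, reorder_at=20, restock_to=60),
    Part("cable ties", on_hand=500, reorder_at=200, restock_to=1000),
    Part("label tape", on_hand=1, reorder_at=2, restock_to=6),
]

order = build_order(inventory)
for name, quantity in order.items():
    print(f"order {quantity} x {name}")
```

The point of the sketch is the last step: the count produces a list of quantities to buy, so the inspection ends with an order being placed rather than another entry on a to-do list.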
On a happier note, what sort of cool new functionality might we install on Infrastructure Fridays to improve reliability? That’s a shorter list probably not worthy of a set of bullets, but it typically involves installing new or upgraded monitoring capabilities in the NOC, adding more monitoring instrumentation out on the floor, improving the quality and types of information on the master dashboards, and continuing to implement automated processes to lessen the chance of unplanned downtime. But again the theme is the same: take action.
In the day-to-day blur of activity required to keep a live data center running, the Oughta List of things that would improve reliability (we ought to do this, we ought to do that) grows week by week, but the items on it never seem to get done because of the tyranny of the urgent. We find ourselves officially declared Too Busy to work on the Oughta List and before we know it, an outage occurs and the Oughta List suddenly becomes an embarrassing Shoulda List.
Infrastructure Friday is designed to overcome Oughta List inertia. With a “try me” cost of zero, it has pretty good ROI.