There is nothing like learning from mistakes and experience. Strategy and process design are best grounded in the reality of experience. I would like to share a real story of an incident that I often recall when looking at plans, to remind myself that there is always more than the obvious to learn when things go wrong. On the surface this story has one key learning point, but digging deeper revealed three.
A major trading organisation's data centre lost one of the two chillers in the air-con system that cools the centre. Most people, even in IT, are not aware that air con is as fundamental to a data centre as power. Put simply, 10,000 servers produce a lot of heat. In this case the temperature in the cabinets furthest from the working unit reached 120 degrees Celsius, despite emergency fans being added to circulate the air and direct the hot air towards the cooler end. 120 degrees – oven temperature! To keep temperatures down, all non-essential systems had to be shut down: live BCP workloads were taken off to reduce load, and as many MIS and other non-core transaction-processing systems as possible were powered off, since the incident constituted an emergency.
Most people plan for fail-over, but they don't plan for staged shut-down. Planning for it was never part of the BCP, but it should have been: a pre-agreed order in which systems are switched off as capacity degrades, as sketched below.
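As a thought experiment, here is a minimal sketch of what a pre-agreed shut-down order might look like in code. The system names, tiers, and heat-load figures are all hypothetical, not from the real incident; the point is that the priority order and the load arithmetic are decided in advance, not argued over at 120 degrees.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    tier: int            # 1 = core transaction processing; higher = less critical
    heat_load_kw: float  # heat contributed to the data hall

# Illustrative inventory only.
SYSTEMS = [
    System("trade-matching", tier=1, heat_load_kw=80.0),
    System("settlement",     tier=1, heat_load_kw=60.0),
    System("bcp-standby",    tier=2, heat_load_kw=40.0),
    System("mis-reporting",  tier=3, heat_load_kw=35.0),
    System("dev-test",       tier=4, heat_load_kw=50.0),
]

def staged_shutdown(systems, target_load_kw):
    """Shut down the least critical systems first until the total heat
    load fits what the surviving cooling capacity can absorb."""
    running = sorted(systems, key=lambda s: s.tier)  # most critical first
    load = sum(s.heat_load_kw for s in running)
    shut = []
    while load > target_load_kw and running:
        victim = running.pop()          # highest tier = least critical
        shut.append(victim.name)
        load -= victim.heat_load_kw
    return shut, load

if __name__ == "__main__":
    # Suppose the one remaining chiller can only absorb 150 kW.
    to_shut, remaining = staged_shutdown(SYSTEMS, target_load_kw=150.0)
    print(to_shut)     # ['dev-test', 'mis-reporting', 'bcp-standby']
    print(remaining)   # 140.0 kW - core trading stays up
```

The output mirrors the story: dev, MIS, and the BCP standby go first, and the core transaction systems stay running within the reduced cooling budget.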
The cause of the failure was a loss of power lines to one of the chillers. In theory one chiller should have been enough – it clearly was not. In theory there were two power lines to both chillers, but the redundancy promised was not there. The first moral of the story is to check everything for real before it happens – don't take the suppliers' word for it; check it and double-check it. The second moral is that partial emergencies are more common than the big disaster; planning for a partial shut-down, and knowing what can safely be switched off in what order, is a critical part of BCP that is often overlooked. The third is that everything must be redundant to guarantee live running: dual network lines, dual power feeds, dual chillers, dual power to the chillers – the list is almost endless.
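One way to catch that kind of hidden coupling before the emergency is to model the dependencies explicitly and search for single points of failure. The sketch below (hypothetical component names, again in Python) fails each component in turn and checks whether cooling survives – a software check that complements, but never replaces, physically testing the failover.

```python
# A toy dependency model: each component lists what it depends on.
# The flaw mirrors the incident: "redundant" chillers on one power line.
DEPENDS_ON = {
    "chiller_A":    ["power_feed_1"],
    "chiller_B":    ["power_feed_1"],  # should have been power_feed_2
    "power_feed_1": [],
    "power_feed_2": [],                # promised, but never actually cabled in
}

def is_up(component, failed):
    """A component is up if it has not failed and all its dependencies are up."""
    if component in failed:
        return False
    return all(is_up(dep, failed) for dep in DEPENDS_ON[component])

def single_points_of_failure(service_components):
    """Return every component whose lone failure takes down the whole service."""
    spofs = []
    for candidate in DEPENDS_ON:
        if not any(is_up(c, {candidate}) for c in service_components):
            spofs.append(candidate)
    return spofs

if __name__ == "__main__":
    # Cooling survives as long as at least one chiller is up.
    print(single_points_of_failure(["chiller_A", "chiller_B"]))
    # -> ['power_feed_1']: the dual chillers share a single power line.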
An addendum to the story: over the ensuing six months sporadic failures of the overheated equipment were common, and in the end 50% of all the servers in the centre had to be renewed. The contract with the data centre provider was not clear enough to cover this kind of event, so the bank paid, even though it is arguable that the provider was at fault. Simple steps can be taken to avoid this kind of catastrophe – don't let it happen to you!