Recently, the area I live in experienced a power outage due to a surprise storm that came through and snapped trees like matchsticks. When those trees broke, they took thousands of power lines with them. At one point, nearly 500,000 households and businesses were without electricity - including, of course, my employer.
There are quite a few things we've learned as a result of this, so I'm going to point them out as the story moves along.
As in many companies, our first line of defense is a UPS. On any normal day, according to the display on the unit, we should have about 45 minutes of runtime available at full capacity.
Now, 45 minutes is not a lot of time to shut down the number of servers we have, but we felt somewhat comfortable because we have a 2-year-old natural gas fed generator that is powerful enough to feed the UPS at full capacity (we're only at about 55% of max load) and to power the AC unit which keeps the systems happy.
Does this sound like you?
If so, read on. If not, and you're working for a company that receives electricity from multiple grids and has redundant sets of N+1 generators, feeding redundant sets of N+1 UPS units running wet cells, mini fusion reactors or whatever, well, good for you. ;)
The fun began at about 9pm on a Thursday night when I was called by one of my staff (Nick) telling me the power to the building was out, and he wasn't sure if the AC (on the generator unit) was running.
After an hour's drive in near-zero visibility to cover what is normally 20 minutes (driving around trees, power lines and transformers that had fallen into the street, and navigating countless intersections with no working traffic signals), I arrived at the office to find the lights in the datacenter on (remember, we only estimated 45 minutes of run time on the UPS) and the AC unit running.
I met our facilities manager who had been called by the alarm company 30 minutes after the power went out.
[LESSON LEARNED #1 - Do you have an SLA on the timing of your notifications of emergency situations? Yes, it was a wide outage and the poor guy manning the desk at the alarm company probably needed CPR when he saw 90% of his customers drop off the grid almost instantly, but 30 minutes? Fortunately for us, a user was still in the building when the power went off and was able to use her cell phone to call IT so we were already on the way in when the alarm company finally notified us.]
On his way to the facility, the facilities manager was in contact with the power company, who told him the outage was widespread and could last into Monday or Tuesday of the following week. We braced for the worst and went to check the generator, which was running very well at the time, and were treated to a spectacular show of blue-green lightning mixing with the bright blue flashes of exploding transformers in the distance and a symphony of tree limbs breaking off in the woods behind our office.
[LESSON LEARNED #2 - Blue-Green lightning is bad (and somewhat eerie).]
Nick had also arrived, and he and I checked the run time on the UPS. 45 minutes as expected, good. We began making plans for which servers would be shut down in what order in the event of a generator failure. Yes, I know, this should have been done long before, and I agree completely. The fact is that this area of documentation had not yet been completed for a number of reasons, none of which seemed particularly relevant in light of the issue.
[LESSON LEARNED #3 - If you don't already have it completed, find time to develop your emergency response policy and procedures as soon as possible.]
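For what it's worth, the "what gets shut down in what order" part of that list doesn't have to live only in someone's head. Below is a minimal sketch of one way to derive a shutdown order from a dependency map. The system names and dependencies are made up for illustration (they are not our actual inventory), and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems it depends on.
# These names are illustrative only.
DEPENDS_ON = {
    "web-app": {"database", "file-server"},
    "mail": {"directory"},
    "database": {"storage"},
    "file-server": {"storage", "directory"},
    "directory": set(),
    "storage": set(),
}


def shutdown_order(depends_on):
    """Return systems ordered so each one is shut down before anything it depends on."""
    # static_order() yields dependencies first, which is a valid startup order.
    startup = list(TopologicalSorter(depends_on).static_order())
    # Shutting down is the reverse: dependents go first, shared services last.
    return list(reversed(startup))


if __name__ == "__main__":
    for step, name in enumerate(shutdown_order(DEPENDS_ON), start=1):
        print(f"{step}. shut down {name}")
```

Printed out and taped inside the datacenter door, a list like this is still readable by flashlight. Reading it bottom to top gives a reasonable order for bringing things back up.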
We completed the list and began calling other staff on their cell phones to assign systems to be shut down remotely, only to find that they too were without power, internet connectivity and phone service at their homes. By this time, there were more tree limbs and transformers blocking the roads, and the town where our office was located had issued a driving ban. No one was coming in to help, and no one could connect remotely.
[LESSON LEARNED #4 - Out of band communications are a must during emergency situations.]
About 11:45 that evening, I was behind the datacenter near the AC unit when I heard the worst sound imaginable - sudden silence followed by a frantic yell of "Chris?!?!?!?". The generator had failed and Nick immediately checked the UPS. The display showed right around 24 minutes of power remaining. We got down to the business of following our list and shutting down the systems while praying that the 24 minutes was a display error. I checked the unit again about 5 minutes later and it read a time remaining that was within a minute of the last reading. About 5 minutes later a third person checked the display and again saw a time remaining within a minute of the last reading. So here we are, 10-12 minutes into the generator failure, three people have checked the unit and a time between 22 and 24 minutes has been reported each time, and what do you think happens?
Yep, less than 5 minutes after the last check, the room went dark and silent. If you have ever been in a datacenter, which is always noisy with AC units pushing air and cabinets full of servers and network equipment, when it suddenly goes silent, you know how creepy that is.
A very soft "Oh <expletive deleted>" slipped from Nick and me as we reached for the flashlights.
Our facilities manager checked the generator and began the process of getting emergency support on the phone, and if needed, here.
About three hours later, a cable in the generator that had wiggled loose was pushed back in by a maintenance tech from the company we contract out to for generator service and the generator started back up.
Long before the generator came back to life, Nick had left for home, as there really wasn't anything he could do with the power out and every indication we had was that the generator would not be fixed any time soon.
At 3:30 am, the generator came back on and seemed like it should be stable. I mean, a cable wiggling loose after only 3 hours run time is a fluke when we had recently run it for 16 hours with no problems, right?
The network equipment, servers, phone and other systems came back up, and by 7am most applications were running fine, except those hosted on one of the three servers that died as a result of the hard power down.
By 10:30 a few more issues had been reported and mostly resolved, and I was feeling pretty good despite being up over 24 hours. That is, until the generator died again. To add the proverbial icing to the cake, the batteries on the UPS hadn't charged, so the majority of the systems in the datacenter went down hard, again.
It turns out, the same cable had come loose.
[LESSON LEARNED #5 - Fix the problem right, the first time. When he returned, our facilities manager re-connected the cable and secured it with cable-ties so that it couldn't come loose again. If the maintenance tech who fixed the problem the first time had secured the cable, well, you get the idea.]
Power was restored and systems started coming online just before noon. Luckily, we had no additional system failures when the power came back on the second time.
Street power was returned late Sunday evening and the remainder of the weekend was uneventful compared with the adventures of Thursday evening and Friday morning.
So what is this all about? One simple question that need not generate a flood of e-mail but is more intended as food for thought.
Are you sure you're as prepared as you think you are?
Do you have a service level on your alarm company response? If not, did you think you would need it?
Is the display on your UPS correct?
Here's an interesting one .. is it part of your local fire company's response plan to shut off the gas supply to your area in cases of large industrial fires or other scenarios? Our facilities manager initially thought this was the problem when the generator stopped the first time, as he knows that the response plan for certain incidents in the office park we are in is to shut off the gas supply. It is just as easy to shut off gas for an individual office building under certain scenarios as well.
Does your UPS shut down systems gracefully when X minutes remain? Would it have worked if X minutes were never displayed?
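To make that last question concrete, here is a rough sketch of a shutdown trigger that does not rely on the display alone. Everything in it is an assumption for illustration: the `Reading` type, the function name and the threshold values are hypothetical, and how you actually obtain the readings depends entirely on your UPS and its vendor software. The idea is simply to start the graceful shutdown if the reported runtime drops below X, or if you have been on battery too long regardless of what the display says, or if the reported runtime has stopped falling while the clock keeps running (which is what we saw).

```python
from dataclasses import dataclass


@dataclass
class Reading:
    seconds_on_battery: float     # time since input power was lost
    reported_minutes_left: float  # whatever the UPS display claims


def should_shut_down(readings, threshold_minutes=10.0,
                     max_battery_minutes=15.0, stale_after_minutes=5.0):
    """Decide whether to begin a graceful shutdown (all thresholds are assumed values)."""
    if not readings:
        return False
    latest = readings[-1]

    # 1. The normal case: the UPS reports X minutes or less remaining.
    if latest.reported_minutes_left <= threshold_minutes:
        return True

    # 2. Hard cap: too long on battery, no matter what the display says.
    if latest.seconds_on_battery >= max_battery_minutes * 60:
        return True

    # 3. Suspicious display: the estimate has fallen by less than half
    #    of the time that has actually elapsed on battery.
    for old in readings:
        elapsed = (latest.seconds_on_battery - old.seconds_on_battery) / 60
        if elapsed >= stale_after_minutes:
            drop = old.reported_minutes_left - latest.reported_minutes_left
            if drop < elapsed / 2:
                return True
    return False


if __name__ == "__main__":
    # Roughly what we saw: about 24 minutes reported, barely moving after 5 minutes.
    stuck = [Reading(0, 24.0), Reading(300, 23.5)]
    print(should_shut_down(stuck))
```

With rules 2 and 3 in place, a display that sits at 22-24 minutes while the batteries drain would still have kicked off the shutdown, rather than waiting for an X that never shows up.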
How about your power fail phones, do they actually work? Have you tested them?
The magnetic security locks on your doors, do they fail open, or closed? Are you sure?
I've listed only some of the questions that have come up; there are many, many others that I haven't listed here but that become obvious after reading the story.
Challenge some of the assumptions you've made. You'll probably find more exposures than you knew about originally.
Hopefully they can be corrected before they become a problem.
Oct 26th 2006