Storms of June 29th 2012 in the Mid-Atlantic region of the USA

Published: 2012-07-02
Last Updated: 2012-07-06 18:01:21 UTC
by Dan Goldberg (Version: 2)
9 comment(s)
On June 29th, 2012, a severe windstorm referred to as a derecho tore through the Midwest and Mid-Atlantic regions of the US. Over 1,750,000 homes and businesses were left without electricity, datacenters supporting Amazon's AWS, Netflix, and other large organizations were taken offline, and several deaths were reported.
The story that follows offers some lessons relearned and possibly a few new ones.
 
I work for a company with a NOC and primary data center in the path of the storm, and a number of events took place. With daytime temperatures near 108F and the windstorm coming through, the battery on the backup generator powering the data center cracked and was not able to start the generator. Notifications went out, but due to hazardous road conditions no one was able to get on site to perform a clean shutdown of services. Remote access went offline because the UPS batteries provided insufficient runtime; this is a known limitation, as we rely on the backup generator to keep operating. The generator is tested weekly, the test the day prior was fine, and battery maintenance had been performed the same day as that test. But a generator that does not start when needed is no generator at all.
 
Power was restored only a few hours later, which compounded the problem: the power came back before the first admin could safely get on site. When he did, he found all systems powered on but none of them reachable. The environment is highly virtualized, with a well-designed and well-thought-out set of VM hosts and systems, and the VM hosts are connected to redundant switches with redundant connections. However, when power was restored the servers came online before the switches did, so the VM hosts deactivated their NICs and local communications were blocked. At first glance it looked like a NIC or vSwitch failure; ultimately a simple shut/no shut on the switch ports resolved it. In addition, services such as DHCP, AD, and RADIUS are all VMs, none of which were available. IP subnets were not documented in an emergency manual, nor were some key passwords needed to access the switches when RADIUS is down. Everything was documented, just not in an easy-to-locate emergency manual. A few phone calls resolved each of these situations, with each one taking additional time and delaying recovery.
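For illustration, bouncing the host-facing switch ports looks roughly like the sketch below. This is only a sketch: it assumes Cisco IOS-style switches and uses an example interface range, since the actual platform and port numbers are not part of this write-up.

    switch# configure terminal
    switch(config)# interface range GigabitEthernet1/0/1 - 8
    ! example range: the ports facing the VM hosts
    switch(config-if-range)# shutdown
    switch(config-if-range)# no shutdown
    ! bouncing the ports lets the hosts renegotiate their links
    switch(config-if-range)# end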
 
The failure boiled down to the VM hosts not bringing up their NICs properly because they were up before the switches. This was then compounded by a sysadmin assuming the problem was a VM problem and beginning to reconfigure vSwitches (at the direction of a VMware tech support technician). Once all parties were on site, resolution was fast and a complete recovery was achieved.
 
So on to old lessons learned: geographic redundancy is desirable; document everything in simple, accessible procedures; keeping some services, such as DHCP and AD, on physical servers may be desirable; and key services such as RADIUS must be available from multiple locations. Securely documenting addresses and passwords in an offline, reachable manner is essential, as is documenting system startup procedures.
 
Some new-to-me lessons learned are a little more esoteric. Complacency is a huge risk to an organization. Our company is undergoing a reorganization that is creating a lot of complacent and lackadaisical attitudes, and that is hard to fight. We are losing good people fast and hiring replacements very slowly; there is no technical solution to this problem, and it puts a lot of pressure on individuals. I had not experienced a battery exploding in the past, though I am finding that, at least on this day, it was a common event: I have learned of three or four similar events that same evening. Inter-team communication is a constant struggle. We all work well together but do not have a well-orchestrated effort to create and document our procedures across team boundaries. Lastly, having a clearly identified roster of who to call for which problem, and when, is a must, and it must not exist only in electronic form. Much of our roster was not available, calls were made to people who may not have been the correct on-call person for a group, and then personal relationships took over as the way to get things done. It worked, but it is not an ideal scenario.
 
While I am not proud to be in the company of such giants as Amazon and Netflix, I am glad that we restored service 100% in only a few hours, had no loss of data, and that the business was not hurt by the event. I am sure I will identify more specific and achievable lessons from this event.
 
Please share your stories about this event, or lessons you learned in a recent event.
 
--
Dan
Dan@MADJiC.net

Comments

Was wondering what "reffered" and "derecho" meant?
In a former life I was a diesel tech. High heat + battery offgassing = O2 + H2 buildup = explosive mixture + spark = boom, with sulfuric acid everywhere.

Especially if you're on a 24V start system and have a bad connection somewhere. I've seen battery terminals instantly erupt into a cloud of lead vapor.

1 cubic inch of lead instantly going from metallic to vapor phase. Not a good environment to be in.
@VB Derecho is Spanish and means right hand, upright and true, or direct. In its "direct" sense, it is a straight-line wind following behind a squall line, as opposed to a tornado or twisting wind.
Yesterday was the first day I've seen AT&T's cell network go down for a substantial amount of time in a major metro area with multiple towers. Got to wonder what happened there.
You might consider having at least one physical DNS and LDAP server in addition to DHCP and AD. Anything related to authentication and/or needed for all systems to operate in a timely fashion (DNS timeouts sometimes cause application timeouts) should probably have some replicas on physical hardware. I've found that the default 30 second DNS timeout is long enough that many apps give up and throw alarms before the OS finally tries the 2nd or 3rd IP in resolv.conf. Setting timeouts and retries lower generally makes things run smoother when you have some DNS failures.
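As a minimal sketch of that kind of resolver tuning (the nameserver addresses and values here are illustrative examples, not taken from the comment above):

    # /etc/resolv.conf
    nameserver 192.0.2.10
    nameserver 192.0.2.11
    # glibc resolver options: shorter per-query timeout, fewer retries,
    # and rotate queries across the listed servers
    options timeout:2 attempts:2 rotate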
As for odd generator failures... Testing regularly is a great idea. At my day-job, it was decided that regular tests of switching over to the generator were expensive and too likely to cause a network outage (umm, yeah, don't get me started) so they stopped testing failover to the generator.

When we finally did have a power failure, the generator started, but the power didn't cut over from battery to the generator. Why? Apparently the mechanical switch responsible for that was encrusted with bird poop, none of the admins on site had physical access to the generator, and the facilities guy who did was stuck in stop-and-go traffic trying to get there during rush hour.

Luckily, we were able to get most critical systems cleanly shut down since the AC wasn't on the UPS and had stopped running...
One thing to remember is that systems do fail, and you need a backup plan for when they do. On the generator note, a UPS is no doubt in the mix, and that UPS will kick in at every test to start the process. This will reduce battery life in your UPS to two or three years maximum if you do weekly tests. Make sure your support contract includes batteries :-)

Amadeus/Altea, the global airline check-in system also used by Qantas and Virgin Australia, went down, generating long queues and flight delays. My bookings disappeared from the Qantas portal for at least half a day.
'been there, done that:

Electrical circuits were not documented with startup and running load and were subsequently overloaded when everything came up, resulting in bouncing circuits.

The facility was found to have multiple 'unknown' grounds and the electrical circuits were not mapped properly.

Pass-card door locks that were designed to fail open failed closed - every single one. Only the facility manager had keys, in his pocket, at home…

Too many techs showed up without a 'go bag'. They did not have cellphone chargers, tools, multiple long console cables, extension cords, written documentation, phone books, etc. These guys said 'we have this stuff in the datacenter', but there was only enough for a few people. This made many techs look like amateurs.

Almost every bit of documentation you might need that lives on a server somewhere should also be on your laptop in an encrypted text file. Super-important stuff should actually be printed and kept in a safe, especially detailed network maps - semi-accurate is better than nothing.
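One simple way to keep such an offline encrypted copy, assuming GnuPG is installed (the file name is just an example):

    # encrypt a local copy of the runbook with a passphrase; writes emergency-runbook.txt.gpg
    gpg --symmetric --cipher-algo AES256 emergency-runbook.txt
    # decrypt it again when the servers are unreachable
    gpg --decrypt emergency-runbook.txt.gpg > emergency-runbook.txt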
