Investigation continues into 365 Main's outage
I'm continuing to investigate the story of Tuesday's outage at 365 Main's San Francisco datacenter, which brought down some of the best-known sites on the Internet. Right now, a 365 Main executive is blaming failures in 5 of its 10 generators. That's right: fully half of 365 Main's generators failed just as San Francisco experienced a power outage. More to come on this soon, but for now, here's the memo from Marcy Maxwell, 365 Main's vice president of security.
From: "Marcy Maxwell"
To: "Engineering" ; "Security"
Sent: 7/25/07 5:08 PM
Subject: UPDATE: POWER EVENT - Fourth Notice
UPDATE: 5:00 P.M., Wednesday, July 25, 2007
A complete investigation of the power incident continues, with several specialists and 365 Main employees working around the clock to address it.
Generator/Electrical Design Overview
The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility power. The electrical design is N+2, meaning the 8 primary generators (labeled 1-8) can fully power the building, with 2 generators (labeled Back-up 1 and Back-up 2) available on stand-by in case any of the primary 8 fail.
Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.
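To make that arrangement concrete, here is a minimal Python sketch of the N+2 layout as the memo describes it. The generator labels, the 2.1 MW rating, and the room-to-generator mapping come from the memo; the stand-by selection logic is an assumption for illustration only, not 365 Main's actual switchgear behavior.

# Sketch of the N+2 design described above. Labels and the 2.1 MW rating
# are from the memo; the failover selection below is an assumption.
GENERATOR_CAPACITY_MW = 2.1

# Generators 1-8 each back up the colocation room with the same number.
primary_generators = {n: f"colo room {n}" for n in range(1, 9)}

# Two shared stand-by units cover start-up failures among the primary 8.
standby_generators = ["Back-up 1", "Back-up 2"]

def failover_target(standby_ok):
    """Return the first stand-by unit that starts, or None if both fail."""
    for unit in standby_generators:
        if standby_ok.get(unit, False):
            return unit
    return None

# Example: a primary fails and Back-up 1 also fails to start, so the
# load rolls on to Back-up 2.
print(failover_target({"Back-up 1": False, "Back-up 2": True}))  # Back-up 2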
Series of Electrical Events
The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:
* When the initial surge was detected at 1:47 p.m., the building's electrical system attempted to roll all colocation rooms to diesel generator power.
* Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds. The cause of the start-up failure is still under investigation, though engineers have narrowed the list of suspected components to 2-3 items. We are testing each of these suspected components to determine whether service or replacement is the best option. Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.
* After its initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence. The exact cause of the Back-up 1 start sequence failure is also under investigation.
* After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2 which correctly accepted the load as designed.
* Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.
* Generator 4 started up and ran for 2 seconds before detecting a problem in its start sequence, passing its 900 kW load on to Back-up 2. This 900 kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit and causing it to fail (see the load arithmetic sketched after this list). Generator 4 was manually started and brought back into operation at 2:22 p.m. Generator 4 was switched to utility operation at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.
* Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.
* By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for more than 18 continuous hours, 365 Main placed the diesel engines back in standby and switched generators 2, 5, 6, 7 and 8 to utility power.
* Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with the Back-up 2 generator available.
* Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.
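For readers following the arithmetic above: the three transferred loads by themselves exceed a single generator's 2.1 MW rating, which is why Back-up 2 tripped. Here is a quick back-of-the-envelope check in Python, using only the figures quoted in the memo (the summation is plain arithmetic, not 365 Main's monitoring logic):

# Loads transferred to Back-up 2, in kW, as quoted in the memo.
transferred_kw = {
    "Generator 1 (via failed Back-up 1)": 732,
    "Generator 3": 780,
    "Generator 4": 900,
}
BACKUP_2_RATING_KW = 2100  # the 2.1 MW per-generator rating

total_kw = sum(transferred_kw.values())
print(f"Total load on Back-up 2: {total_kw} kW")              # 2412 kW, i.e. over 2.4 MW
print(f"Exceeds rating: {total_kw > BACKUP_2_RATING_KW}")     # True
print(f"Overload margin: {total_kw - BACKUP_2_RATING_KW} kW") # 312 kW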
Other Discoveries
* In addition to the previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual-power-input PDUs in colo room 7 experienced open circuits on both sources. A dedicated team of engineers is currently investigating the PDU issue.
Next Steps
* Determine the exact cause of the generator start-up failures and the PDU issue through a comprehensive testing methodology.
* Replacements for all suspected components have been ordered and are en route.
* Continue to run generators 1 & 3 on diesel power until the root cause of the automatic start-up failures is corrected.
* Continue to update customers with details of the ongoing investigation.
Regards,
Marcy
Marcy Maxwell
Vice President, Security
365 Main Inc.
"The World's Finest Data Centers"