365 Main's credibility outage
After killing most of the websites you care about on Tuesday, 365 Main, the troubled datacenter in downtown San Francisco, is back to business. The business of making excuses, that is. Cynthia Harris, the same flack who issued an immaculately timed press release Tuesday morning crowing about how RedEnvelope moved all of its Web operations to 365 Main, only to have them taken down by the outage, is going around telling everyone who will listen that nothing untoward happened. To which any user of Craigslist, Technorati, Six Apart's LiveJournal and TypePad, and AdBrite might respond, rrrrright. Data Center Knowledge has a detailed report. Here's what else I've learned — and why 365 Main's performance remains highly suspicious.
- 365 Main's facilities are supposed to be rock-solid, designed to ride out a major event like an earthquake. CEO Chris Dolan personally gave me a tour shortly after his team revamped the datacenter. Unless he was exaggerating to me then — and, one presumes, exaggerating to every customer he's since signed — a power outage shouldn't have taken 365 Main out.
- 365 Main has multiple colocation rooms, or "colos," in the center. Colos 3 and 4 — on the same floor, if memory serves — went down, while Colos 2 and 8 stayed up. Data Center Knowledge says that an additional, unspecified colo lost power. (According to a current customer, not all of 365 Main's colocation rooms are occupied, because the facility is constrained by power supply, not space.)
- Was there a drunk employee? Harris, the ever-so-believable 365 Main flack, is denying "employee misconduct." But that doesn't rule out someone else with access to the building tripping the emergency-power-off switch on the affected floor. Bad timing? Sure. Impossible coincidence? No.
- What caused the long lines outside 365 Main? Apparently 365 Main's customer-authentication system was down, forcing already-angry sysadmins to wait in line while guards checked IDs manually.
- Were customers' contracts breached? Almost certainly, if they negotiated any decent service-level agreements with 365 Main. Heard about any lawsuits filed or payments sought? Send in those tips.
Now, from commenter somafm, a highly detailed account of what he believes happened.
Here's what really went down at 365 Main:
365 Main, like all facilities built by AboveNet back in the day, doesn't have a battery backup UPS. Instead, they have these things called "CPS," or continuous power systems. What they are is very very large flywheels that sit between electric motors and generators. So the power from PG&E never directly touches 365 Main. PG&E power drives the motors which turn the flywheels which then turn the generators (or alternators, I don't remember the exact details) which in turn power the facility. There are 10 of these on their roof.
The flywheels (the CPS system) can run the generator at full load for up to 60 seconds according to the specs.
There are also 10 large diesel engines up on the roof, connected to these flywheels. If the power is out for more than 15 seconds, the diesel engines start up, clutch in, and drive the flywheels. There are no generators in the basement. (There is a large fuel store in the basement, and the fuel is pumped up to the roof. There are smaller fuel tanks on the roof as well.)
Here's what I think happened. Since there were several brief outages in a row before the power went out for good, it seems that the CPS (flywheel) systems weren't fully back up to speed when the next outage occurred. Because these grid interruptions came one after another, each shorter than the time required to trigger generator startup, the generators were never automatically started, but the CPS also never had time to get back up to full capacity. By the sixth power glitch, there wasn't enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come to speed before switching over.
Why they didn't just manually switch on the generators at that point is beyond me.
So they had a brief power outage. By our logs, it looks like it was at the most 2 minutes, but probably closer to 20 seconds or so.
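To make somafm's theory concrete, here's a back-of-the-envelope sketch in Python. The 60-second ride-through and the 15-second generator-start trigger come from his account; the diesel spin-up time, the recharge rate, and the glitch timeline are invented numbers, purely for illustration.

```python
# Toy simulation of somafm's theory: repeated short grid glitches drain the
# flywheels faster than they can respin, so by the final outage there isn't
# enough stored ride-through left to bridge a diesel start.

FULL_RIDE_THROUGH = 60.0   # seconds of full load a fully spun-up flywheel can carry (per somafm)
GEN_START_TRIGGER = 15.0   # outage must last this long before the diesels auto-start (per somafm)
GEN_SPINUP_TIME = 10.0     # assumed seconds for a diesel to start and take the load
RECHARGE_RATIO = 10.0      # assumed: 1 s of ride-through takes ~10 s of grid power to restore

def simulate(events):
    """events: list of (outage_seconds, grid_up_seconds_afterwards) tuples."""
    stored = FULL_RIDE_THROUGH
    for i, (outage, grid_up) in enumerate(events, start=1):
        if outage < GEN_START_TRIGGER:
            # Glitch too short to trigger the diesels; the flywheels eat the whole hit.
            stored -= outage
        else:
            # Diesels are triggered, but the flywheels must bridge the trigger delay plus spin-up.
            stored -= GEN_START_TRIGGER + GEN_SPINUP_TIME
        if stored <= 0:
            print(f"glitch {i}: flywheels exhausted -- load drops ({outage:.0f}s outage)")
            return
        # Partial recharge while the grid is back; flywheels respin much more slowly than they drain.
        stored = min(FULL_RIDE_THROUGH, stored + grid_up / RECHARGE_RATIO)
        print(f"glitch {i}: survived, ~{stored:.0f}s of ride-through left")

# Six closely spaced glitches; only the last one is long enough to auto-start
# the generators, and by then the depleted flywheels can no longer bridge it.
simulate([(10, 30), (12, 20), (8, 25), (11, 15), (13, 20), (20, 0)])
```

Run it and the stored ride-through ratchets down with each glitch until the sixth one, which is exactly the failure mode somafm describes.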
And, also via somafm, here's a letter that GNi, a datacenter operations firm that works inside 365 Main, sent to its customers:
This afternoon a power outage in San Francisco affected the 365 Main St. data center. During a series of six cascading outages, one outage was not protected against and reset systems in many of the colo facilities in that building. This resulted in the following:
- Some of our routers were momentarily down, causing network issues. These were resolved within minutes. Network issues would have been noticed in our San Francisco, San Jose, and Oakland facilities.
- DNS servers lost power and did not properly come back up. This was resolved after about an hour of downtime and may have caused problems for many GNi customers that would appear as network issues.
- Blades in the BC environment were reset as a result of the power loss. While all boxes seem to be back up, we are investigating issues as they come in.
- One of our SAN systems may have been affected. This is being checked on right now.
If you have been experiencing network or DNS issues, please test your connections again. Note that blades in the DVB environment were not affected.
We apologize for this inconvenience. Once the current issues at hand are resolved, we will be investigating why the redundancy in our colocation power did not work as it should have, and we will be producing a postmortem report.