On June 18th, 2010, service was degraded at our Dearborn, Michigan Data Center (DTW) for roughly 1.5 hours, between approximately 18:35 EDT and 20:05 EDT, due to a UPS malfunction at the carrier hotel (CH) in Southfield, Michigan, which we use for primary Internet access. Power was not affected in our Dearborn facility, but connectivity was interrupted for many clients.
Incident Timeline (CH)
- 17:21 EDT – A single phase of the input power supply to the UPS powering the CH’s facilities experienced a voltage surge, causing the UPS to discharge until battery power was exhausted. Because this was not a utility power outage, the CH’s on-site generator did not activate. The CH’s on-site staff were alerted immediately and engaged electricians to diagnose and fix the problem. Our internal staff were also alerted by our external monitoring and began arriving on-site at the CH at 18:00 EDT.
- 18:35 EDT – On-site electricians brought down AC power to begin UPS repairs. This cut power to our core routers, which degraded service for many of our clients.
- 20:05 EDT – Technicians completed repairs and returned power to the UPS.
Internal Changes (CH)
Given the above issues, we have diversified our power supply at the CH to include a DC power feed to our core routing gear in addition to the AC power that was already being supplied. We are also working with the carriers themselves to do the same. Our routers are already equipped with dual power supplies, but the CH cannot provide true “A/B” power via AC, so an AC/DC “A/B” setup will supply the required level of redundancy and protect against future AC interruptions.
For the long term, we are announcing this week the acquisition of our second Southeastern Michigan Data Center, which will reduce our reliance on the carrier hotel as our primary point of Internet connectivity.
Internal Changes (DTW)
Our VoIP phone system was also unavailable during the period above, which hampered communication with clients and added to the confusion for those trying to contact us. We are moving the VoIP system onto an out-of-band carrier that is independent of our internal systems, so communications won’t be interrupted by IP-level issues on our network.
Furthermore, while we had a status page at http://nexcess-status.com, few clients knew about it. We could also have used Twitter and/or Facebook to report status, but we did not do so quickly enough: we began tweeting as the UPS repairs were being completed rather than during the incident itself. For future issues we will tweet status updates more frequently and remind clients of http://nexcess-status.com via these tweets and via our initial mailings to new clients.
Weather Statement
The Southfield and Dearborn, Michigan areas were under a Severe Thunderstorm Warning from 21:29 EDT to 22:45 EDT on June 18th, 2010. While this is notable, it had no effect on either the CH issue or the power systems in our Dearborn facility. The power systems under our control worked as planned: our UPS systems did see some aberrations in power quality, but our generator was never needed to power the building.
Frequently Asked Questions
Q: Do you have UPS (battery) and Generator backup in Dearborn?
A: We do. The battery system was activated momentarily around 22:04 EDT on June 18th due to weather conditions, but Dearborn never lost power. The affected location was a carrier hotel in Southfield, Michigan.
Q: Your phones did not work during the incident period, why?
A: We use an in-band VoIP system for our phones. This is being changed to out-of-band so that any future disruption in Internet service will not disrupt our phones.
Q: I didn’t get any alerts from your monitors during the disruption, why?
A: Two reasons. 1) Our client-level monitors are on-site in Dearborn. The servers that perform the monitoring detected the degraded status but could not e-mail alerts outside because of the connectivity disruption. 2) Our on-site techs “acknowledged” the alerts (confirming that they were aware of the incident), and this suppressed further e-mails.
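The two reasons above can be pictured with a minimal sketch of a notification decision. This is an illustration only, not our monitoring system's actual code; the names `Alert`, `uplink_up`, and `should_notify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A hypothetical client-level alert raised by an on-site monitor."""
    host: str
    acknowledged: bool = False  # set once an on-site tech confirms awareness


def should_notify(alert: Alert, uplink_up: bool) -> tuple[bool, str]:
    """Decide whether an alert e-mail reaches the client, and why or why not."""
    # Reason 2: an acknowledged alert suppresses further e-mails.
    if alert.acknowledged:
        return False, "suppressed: acknowledged by on-site tech"
    # Reason 1: the monitor sits behind the failed uplink, so mail cannot leave.
    if not uplink_up:
        return False, "undeliverable: no outbound connectivity"
    return True, "sent"


if __name__ == "__main__":
    print(should_notify(Alert("client-server"), uplink_up=False))
    print(should_notify(Alert("client-server", acknowledged=True), uplink_up=True))
```

In this sketch either condition alone is enough to stop an e-mail, which is why clients received nothing even though the monitors detected the problem.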
Q: I couldn’t get to your site during the incident period, why?
A: Our main site and our support site were both affected by the disruption. We keep an independent off-site status website at http://nexcess-status.com, but awareness of it was not widespread enough. We are notifying clients of its existence and will use social media such as Twitter and Facebook more diligently in the future.
Q: Was this due to the weather?
A: No. At the time of the incident, skies were clear. Our area was under a Severe Thunderstorm Warning from 21:29 EDT to 22:45 EDT, but the weather itself did not have an adverse impact on our operations.