While they might not be as big as these top services having significant outages this week, they certainly are the leaders in the Ruby on Rails world. Our faithful, and expensive, experts at Engine Yard have hit trouble not once, but twice this week. As I am writing this, adiserver has been down for over an hour and a half. This comes just a few hours after we received a credit for an hour of downtime from earlier this week.
I find it funny that just yesterday we posted about moving our small applications to EC2 and the fear of a single point of failure there. We praised EY's redundant environment, only to have it fail on us today. Only one of our slices seems to be down, yet the whole application is unresponsive. What's up with the redundancy we pay for? Hopefully there will be an explanation in the post-mortem.
This is all to be expected though, as the wildest super rare occurrences seem to happen whenever we move a server. A day before our move to Engine Yard we got hit with Rackspace's really bad 36 hours. Maybe we should warn all network admins before we move another box...
Well, let's hope we come back online soon, because as of now my head is in the clouds.
Update: We are back up. Total downtime: 2 hours, 5 minutes, 39 seconds.


Hi,
We are very sorry for Friday's downtime on one of our clusters.
Even though you have 2 slices serving your production app, you can experience downtime if those slices lose connectivity to their SAN system. This is an uncommon incident, but it unfortunately came about yesterday on part of the cluster that houses your app.
Yesterday's incident appears to be the result of a software malfunction in 2 or more of 4 switches that handle traffic between a group of servers and their SAN shelves. The switches are from Extreme Networks, which has a good reputation, and they are helping us debug error logs.
Organizationally, we are making changes to reduce the incidence of such problems and to always improve how we deal with them in the future. We're expanding the group that focuses on infrastructure design and management. This group looks at ways to improve our technology (design and vendor choices) and human processes. For instance, one of the top people in that group visited a SAN vendor earlier this week to review their latest offerings.
Thanks for being our customer, and again, sorry for the downtime. This kind of thing really does hurt and we keep endeavoring to get as close to zero downtime as possible.
--- Lance Walley, CEO
--- Engine Yard
Posted by: Lance Walley | August 16, 2008 at 01:02 PM