After 8 hours and 49 minutes

Today, 2014-11-25, there was a network outage of about 60 minutes in the AWS network between two nodes of one of our storage clusters. Around 15% of our client nodes were affected.

The outage affected parts of the network which should never be down (i.e. the 169.254/16 network). This led to a failure in our automated fail-over: both storage nodes assumed they were master. This in turn fooled our monitoring, since both storage nodes were still available and both reported as master. We've now implemented a patch to compensate for this particular failure and will roll it out as soon as possible.
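
As an illustration only (this is not the patch mentioned above, whose details are not described here), a monitoring check can count how many nodes claim the master role instead of only checking that each node is reachable. A minimal Python sketch, where the node names, the role strings and the `roles` mapping are all hypothetical:

```python
from typing import Dict, List


def find_split_brain(roles: Dict[str, str]) -> List[str]:
    """Return the nodes claiming 'master' if more than one does, else []."""
    masters = sorted(node for node, role in roles.items() if role == "master")
    return masters if len(masters) > 1 else []


def cluster_healthy(roles: Dict[str, str]) -> bool:
    """A cluster is healthy only when exactly one node reports 'master'."""
    masters = [node for node, role in roles.items() if role == "master"]
    return len(masters) == 1


if __name__ == "__main__":
    # Hypothetical role reports collected from each storage node.
    reported = {"storage-a": "master", "storage-b": "master"}
    conflict = find_split_brain(reported)
    if conflict:
        print("ALERT: multiple masters:", ", ".join(conflict))
```

A check of this kind would flag the situation described above, where every node is up but more than one believes it is master. It also treats a cluster with no master at all as unhealthy.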

After 31 minutes

We are now sure everything is resolved. We will write a post-mortem soon.

After 26 minutes

Everything seems to be up again. We are still checking some things.

After 3 minutes

We are seeing some on/off behaviour. The issue is not yet fully identified.


We are investigating current issues, which are probably related to the storage layer. We'll keep you updated here.

Began at: