Partial Outage


The incident started at 07:04 UTC on January 1st. The culprit was a degrading storage volume. It began with seemingly random drops in I/O performance and affected services increasingly until 08:00 UTC, when most Apps were impacted. Since the storage cluster itself was healthy and the volume was still responsive, if slow, no fail-over took place. However, the server-side health checks on the HTTP servers detected the issue and kept trying to re-mount the very slowly responding storage volume over and over.

Our team started working on it at around 08:10 UTC, and once the issue was identified we were able to manually isolate the faulty volume. We then initiated the storage fail-over, which resolved the issue for most Apps at around 09:20 UTC. Some of the HTTP servers were suffering from a kernel issue resulting from the re-mount loops and stale file handles; we identified and rebooted these one by one. All our services were fully operational again at around 10:10 UTC. Most Old Apps were affected; no New Apps were.

We will once again review our fail-over mechanisms to better compensate for such conditions in the future.

Sorry for the inconvenience. Thank you for your reports and your understanding.

We believe the issue is now fully resolved, and we are continuing to monitor the situation. As mentioned before, a storage server was the root of the problem. It affected most Old Apps; New Apps were not affected. We are investigating the causes further and will publish a post-mortem in the next few days.

We believe we have identified the problem, a failed storage server, and are currently working to bring it back up.

We are investigating outages affecting a few websites/apps.

Began at: