The incident started at 07:04 UTC on January 1st. The culprit was a degrading storage volume: it began with seemingly random drops in I/O performance and affected services increasingly until 08:00 UTC, when most Apps were impacted. Because the storage cluster itself was healthy and the volume still responded, albeit slowly, no fail-over took place. The HTTP servers' health checks, however, did detect the issue and tried to re-mount the very slowly responding storage volume over and over.

Our team was on it from around 08:10 UTC, and once the issue was identified we were able to manually segregate the faulty volume. We then initiated the storage fail-over, which resolved the issue for most Apps at around 09:20 UTC. Some of the HTTP servers were left with a kernel issue, caused by the re-mount loops and stale file handles, so we identified and rebooted those one by one as well. All our services were fully operational again at around 10:10 UTC. Most Old Apps were affected; no New Apps were.
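The root cause of the delayed fail-over is that a volume which answers every request, only very slowly, passes a classic success/failure health check. A minimal sketch of a latency-aware probe that would treat such a volume as unhealthy (all names and the 250 ms threshold are illustrative assumptions, not our actual implementation):

```python
import time

# Hypothetical threshold: a read slower than this counts as unhealthy,
# even if it eventually succeeds.
LATENCY_THRESHOLD_S = 0.25

def probe_volume(read_fn) -> bool:
    """Return True only if the probe read succeeds AND completes quickly."""
    start = time.monotonic()
    try:
        read_fn()  # e.g. read a small sentinel file on the volume
    except OSError:
        return False  # hard failure: classic health checks catch this case
    latency = time.monotonic() - start
    # Slow success is still unhealthy, making the volume eligible for fail-over.
    return latency <= LATENCY_THRESHOLD_S

# Simulated probes for illustration:
fast_read = lambda: None             # healthy volume
slow_read = lambda: time.sleep(0.4)  # responsive, but degraded

print(probe_volume(fast_read))  # True
print(probe_volume(slow_read))  # False
```

A check like this, combined with an upper bound on re-mount attempts, would have both triggered the fail-over earlier and avoided the re-mount loops that left some HTTP servers with stale file handles.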
We will once again review our fail-over mechanisms to better compensate for such degraded-but-responsive conditions in the future.
Sorry for the inconvenience. Thank you for your reports and your understanding.