General service issues

Summary

On Thursday, 1 October 2020, from 14:15:00 UTC until 23:00:00 UTC, our customers were affected by downtime on our platform. The event was triggered by a system-wide update of a central component. This update interrupted web delivery and code deployments (SSH and SFTP).

Rest assured, the responsible party is painfully aware of the distress and inconvenience this type of event causes our clients. We will not fire this person just yet, and we hope that our internal discussions will lead us to improve our practices and avoid scenarios like this in the future.

Impact

More than 28% of all Universal Apps were potentially affected, and we received around 45-65 support cases during the incident. The real impact was somewhere between 2% and 15% of all Apps. For individual Apps, the outage lasted from 25 minutes to six hours. Fewer than a handful of Pro Apps were affected.

Mitigation

We booted new hosts and transferred the affected Universal Apps to them. We believe the core of the issue was an unresponsive filesystem, which caused other parts of the system to fail during the system-wide update.

Follow-up

The main takeaway from this incident is a greater emphasis on gradual deployment of system-wide updates. We could easily have avoided downtime for the majority of the Apps if the deployment had been done in incremental stages. We may introduce more rigorous maintenance of the filesystem used for Universal Apps. The kernel was in fact reporting this issue before and during the incident (visible in dmesg). We may add these specific messages to our routine monitoring.
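
As an illustration of what "incremental stages" could look like, here is a minimal Python sketch. It is not our actual deployment tooling: the function names (plan_stages, staged_rollout), the callbacks (apply_update, is_healthy), and the stage fractions are all hypothetical and chosen only for the example. The idea is to update a small share of hosts first, check their health, and stop before touching the rest if anything looks wrong.

```python
from typing import Callable, List, Sequence, Tuple


def plan_stages(hosts: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split hosts into consecutive stages.

    Each entry in `fractions` is the cumulative share of hosts that should
    be updated once that stage completes, e.g. (0.05, 0.25, 1.0).
    """
    stages: List[List[str]] = []
    start = 0
    for fraction in fractions:
        # Every stage gets at least one host, and never runs past the end.
        end = max(start + 1, round(len(hosts) * fraction))
        end = min(end, len(hosts))
        if start < end:
            stages.append(list(hosts[start:end]))
        start = end
    # Any hosts not covered by the fractions form a final stage.
    if start < len(hosts):
        stages.append(list(hosts[start:]))
    return stages


def staged_rollout(
    hosts: Sequence[str],
    fractions: Sequence[float],
    apply_update: Callable[[str], None],   # hypothetical: updates one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update check
) -> Tuple[bool, List[str]]:
    """Update hosts stage by stage; halt as soon as a stage is unhealthy.

    Returns (completed, updated_hosts).
    """
    updated: List[str] = []
    for stage in plan_stages(hosts, fractions):
        for host in stage:
            apply_update(host)
            updated.append(host)
        # Verify the whole stage before moving on to a larger one.
        if not all(is_healthy(host) for host in stage):
            return False, updated
    return True, updated
```

With 20 hosts and fractions of 5%, 25%, and 100%, this sketch would update 1 host, then 4, then the remaining 15. A failed health check in the second stage would leave those 15 hosts untouched. In practice, the health check could include the kernel messages mentioned above.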
