General service issues

Resolved
After 1 day, 4 hours, and 11 minutes

Summary

On Thursday, 1 October 2020, from 14:15:00 UTC until 23:00:00 UTC, our customers were affected by downtime on our platform. The event was triggered by a system-wide update of a central component. This update interrupted web delivery and code deployments (SSH & SFTP).

Rest assured, the responsible party is painfully aware of the distress and inconvenience this type of event causes our clients. We will not fire this person just yet, and we hope that, as a result of internal discussions, we will improve our practices and avoid scenarios like this in the future.

Impact

More than 28% of all Universal Apps were potentially affected, and we received around 45 to 65 support cases during the incident. The real impact was somewhere between 2% and 15% of all Apps. For individual Apps, the outage lasted between 25 minutes and six hours. Fewer than a handful of Pro Apps were affected.

Mitigation

We booted new hosts and transferred the affected Universal Apps to these new, unaffected hosts. We believe the core of the issue was an unresponsive filesystem, which caused other parts of the system to fail during the system-wide update.

Follow-up

The main takeaway from this incident is a greater emphasis on gradual deployment of system-wide updates. We could easily have avoided downtime for the majority of Apps if the deployment had been done in incremental stages. We may introduce more rigorous maintenance of the filesystem used for Universal Apps. The kernel was in fact reporting this issue (via dmesg) before and during the incident, and we may add these specific messages to our routine monitoring.
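
As a rough illustration of what deploying in incremental stages could look like, the Python sketch below updates a small first stage, checks its health, and only then moves on to larger stages. It stops at the first stage that contains an unhealthy host, so later hosts are never touched. All names here (`batches`, `staged_rollout`, `apply_update`, `is_healthy`) are hypothetical and are not taken from the platform's actual deployment tooling; the stage sizes are arbitrary.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def batches(hosts: Sequence[str], sizes: Iterable[int]) -> List[List[str]]:
    """Split hosts into successive stages of the given sizes.

    Any hosts left over after the listed sizes form one final stage.
    """
    result: List[List[str]] = []
    start = 0
    for size in sizes:
        if start >= len(hosts):
            break
        result.append(list(hosts[start:start + size]))
        start += size
    if start < len(hosts):
        result.append(list(hosts[start:]))
    return result


def staged_rollout(
    hosts: Sequence[str],
    stage_sizes: Iterable[int],
    apply_update: Callable[[str], None],  # hypothetical: updates one host
    is_healthy: Callable[[str], bool],    # hypothetical: post-update health check
) -> Tuple[List[str], List[str]]:
    """Update hosts stage by stage and halt on the first unhealthy stage.

    Returns (updated_hosts, failed_hosts). Hosts in stages after a failure
    are left untouched.
    """
    updated: List[str] = []
    for stage in batches(hosts, stage_sizes):
        for host in stage:
            apply_update(host)
            updated.append(host)
        failed = [host for host in stage if not is_healthy(host)]
        if failed:
            return updated, failed
    return updated, []
```

With ten hosts and stage sizes of 1 and 3, a failure in the second stage would leave the remaining six hosts on the old version, which limits the impact to the first four hosts.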

Resolved
After 4 hours and 51 minutes

We've now resolved the incident. Thanks for your patience.

Recovering
After 3 hours and 49 minutes

Most Apps are back by now, but some are still hanging. We are still monitoring the situation.

Identified
After 1 hour and 9 minutes

Several Apps are still down due to unexpected issues created by routine maintenance. We are working as fast as we can on restoring service for all Apps. The majority should already be back online. We will post more updates as we have them.

Investigating

We are seeing partial problems on all services in both the EU and US regions, including deployment and web delivery.


Affected components
  • US
    • Pro Apps
      • Web delivery
      • Object Storage
      • Memcache
      • Worker
    • Universal Apps
    • MySQL
    • Deployment
  • EU
    • Pro Apps
      • Memcache
      • Object Storage
      • Web delivery
      • Worker
    • Universal Apps
    • MySQL
    • Deployment