On Sunday, 4 November 2018, we carried out scheduled maintenance to upgrade PHP to newer minor versions. The update started at 19:00 UTC and finished at 23:00 UTC.
During this window all Apps were updated successfully (Professional & Universal, EU & US). However, some leftover processes on the deploy Node in the EU kept running in the background. Unfortunately, they did not complete overnight.
Business as usual resumed on Monday morning and clients started to deploy. The combination of normal operations and the still-running background processes caused a high load on that Node. We noticed this but decided to keep it running. At 12:00 UTC the service failed.
The file system containing all Git repositories had become read-only: it had filled up during the PHP updates, and we were unable to resize it. It took us some time to identify the cause, a bug in the underlying file system encryption.
We realised we couldn't fix the file system issue in time, so we started preparing an alternate fix. At 15:40 UTC we created a new file system and started re-initialising the Git repos there, which took some time but was finally finished by 20:16 UTC.
YOUR MINOR ACTION might be REQUIRED for Git deployment
As a consequence of the alternate fix, the contents of all Git repos were reset: each repo still exists, but it is completely empty. To continue working with Git, you need to reconnect your local Git repo with the remote one and push your local contents up. The first deployment will take a little longer, as a fresh Composer install will be triggered. When working in a team: make sure only the latest commit is pushed up.
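The steps above can be sketched roughly as follows. This is a minimal illustration, not an official procedure: the remote name `origin` is an assumption, and the sketch creates a throwaway bare repo in a temporary directory as a stand-in for the emptied remote so that it can be run safely anywhere. In your own project you would skip the setup part and use your App's actual Git remote URL instead of the placeholder path.

```shell
#!/bin/sh
set -eu

# --- Setup for illustration only (hypothetical stand-ins) ---------------
# An empty bare repo plays the role of the reset remote, and a small
# local repo with one commit plays the role of your working copy.
WORK=$(mktemp -d)
git init --quiet --bare "$WORK/remote.git"
git init --quiet "$WORK/local"
cd "$WORK/local"
git config user.email "dev@example.com"
git config user.name "Dev"
echo "hello" > index.php
git add index.php
git commit --quiet -m "Latest local state"

# --- The actual steps -----------------------------------------------------
# 1. Reconnect: point the remote named "origin" (assumed name) at the
#    remote repo. Replace the path below with your real remote URL.
if git remote get-url origin >/dev/null 2>&1; then
    git remote set-url origin "$WORK/remote.git"
else
    git remote add origin "$WORK/remote.git"
fi

# 2. Push the current branch up and set it as the upstream.
BRANCH=$(git rev-parse --abbrev-ref HEAD)
git push --quiet -u origin "$BRANCH"
```

In a team, it is probably simplest if one person whose local branch holds the latest commit does this push, and everyone else fetches afterwards.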
We have now narrowed down the causes further and are working on platform updates to prevent such issues in the future. We learned a LOT from this, the painful way.
It was one of the biggest issues we have ever faced here. Luckily it affected only deployment and only one region, but it lasted a long time. We strive to provide maximum uptime. I know you hate downtime; please be assured we do too.
Thank you for your understanding and support.