Database maintenance (with issues)

3 days and 22 hours
Complete

Post Mortem

We were performing upgrades on our database infrastructure on the 3rd of August 2022. In total this would affect about 40% of our clients across the EU and US regions.

After one batch around 00:00 UTC on August 4th, some 90 databases in EU did not come back as expected, effectively breaking the associated Apps.

Our monitoring system did not pick up this issue, and our manual checks missed the affected Apps as well. As a result, we discovered the problem only the following morning, when clients contacted us about it.

We started the recovery process immediately for Apps with backups enabled. For Apps without backups we used automated snapshots from the previous day. In hindsight we should have been better prepared for this outcome to avoid long downtimes.

Most Apps with backups were restored on the 4th of August around 14:30 UTC after roughly 12 hours of downtime, give or take a couple of hours. Most Apps without backups were restored from snapshots on the 5th of August after 23 hours of downtime. Additionally, a few Apps were affected by other issues related to this maintenance. Those Apps were restored later on the 5th of August.

We consider this all solved now. If your App still has issues, please contact us.

We are cancelling the rest of the planned maintenance and will do that part at a later point.

This has been the longest downtime in our 10-year history at fortrabbit, as far as we recall. We take that very seriously and we feel the pain that was caused by this downtime. Once again, we are sorry for the inconvenience.

We will evaluate the events internally and take actions to improve our procedures.

Updated

Restoring the databases is under way. It is taking longer than anticipated. Some Apps are already back up. The rest will still take a while longer.

Updated

We are still looking into ongoing MySQL connectivity issues affecting a number of Apps hosted in EU. We do not currently have an ETA for when they will be back. We plan to restore the databases from backups and snapshots.

In the meantime, you can deploy a static error page (with DB access) if you don't already have one in place.

We will keep you updated here. You can also contact client support about individual Apps.

Updated

Some Apps in EU have not recovered well. The database connection has been lost since 02:00 UTC early this morning. We are working on restoring operations.

Underway

The scheduled maintenance is now underway. We'll keep you updated on our progress.

Scheduled

We are rolling out some internal updates and therefore will need to take down some databases for a short while.

It affects about 30% of the Apps in both of our regions, EU and US. The updates will be run one by one in batch sequences. The expected downtime per App is 5 to 20 minutes. The maintenance will be carried out outside of peak hours.

Began at:

Affected components
  • EU
    • MySQL