One of the HTTP nodes suffered a hardware failure. It was automatically migrated to a new node.
During the bootstrap of the new Node, software upgrades were automatically applied. Among those updates was the libresolve
Linux core library (from the Debian package libc6-2.13-38+deb7u10
). The library has a not yet known bug, leading to random segmentation faults when resolving arbitrary DNS entries.
At first the problem was hard to see, so we initially just tried to replace the "faulty new Node" with another one, which lead to the same result as before. While we continued trying to identify the issue we started manually migrating multiple Apps to other existing nodes with free resources. After a couple thousand lines of strace
we finally found the culprit and downgraded the package to a working version. Everything is back to normal.
Where we can improve: Downtime communication. A lot of clients noticed 500 errors and notified us by writing support tickets — thanks for taking care BTW. At that moment we were already working on the issue. It is a tough decision to put out an incident without knowing what's going on.
This incident has been resolved.
everything should be 200 ok now again. we are monitoring it.
hardware fail, short downtime for ha. longer down for non-ha. One HTTP node. Old Apps only.
We’ll find your subscription and send you a link to login to manage your preferences.
We’ve found your existing subscription and have emailed you a secure link to manage your preferences.
We’ll use your email to save your preferences so you can update them later.
Subscribe to other services using the bell icon on the subscribe button on the status page.
You’ll no long receive any status updates from fortrabbit, are you sure?
{{ error }}
We’ll no longer send you any status updates about fortrabbit.