One of the HTTP nodes suffered a hardware failure. It was automatically migrated to a new node.
During the bootstrap of the new Node, software upgrades were automatically applied. Among those updates was the
libresolve Linux core library (from the Debian package
libc6-2.13-38+deb7u10). The library has a not yet known bug, leading to random segmentation faults when resolving arbitrary DNS entries.
At first the problem was hard to see, so we initially just tried to replace the "faulty new Node" with another one, which lead to the same result as before. While we continued trying to identify the issue we started manually migrating multiple Apps to other existing nodes with free resources. After a couple thousand lines of
strace we finally found the culprit and downgraded the package to a working version. Everything is back to normal.
Where we can improve: Downtime communication. A lot of clients noticed 500 errors and notified us by writing support tickets — thanks for taking care BTW. At that moment we were already working on the issue. It is a tough decision to put out an incident without knowing what's going on.