Partial Outage


The storage interruption took place between approximately 09:05h CEST and 10:05h CEST. It was caused by a faulty storage node that alternated between online and offline.

We identified the faulty node at around 09:20h CEST (our initial assumption that it was a network issue delayed the identification) and tried to repair it. Towards 09:50h CEST it became clear that the node could not be repaired, and we replaced it with a new one. During the whole event, the HTTP services for many Apps alternated between online and offline. About 15% of the Apps were fully offline for about 30 minutes, and the rest for between 5 and 10 minutes. New Apps were not affected at all.

We apologize for the inconvenience and will work on measures to identify faulty nodes faster.
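
One possible way to spot a node that alternates between online and offline, as this one did, is to count state changes across recent health checks instead of looking only at the latest result. The following is a minimal sketch under stated assumptions, not a description of the monitoring actually in use: the class name `FlapDetector`, its parameters `window` and `max_transitions`, and the node names are all hypothetical.

```python
from collections import deque


class FlapDetector:
    """Flags nodes whose health-check result keeps changing (hypothetical helper)."""

    def __init__(self, window=10, max_transitions=3):
        # window: how many recent health checks to remember per node (assumed value)
        # max_transitions: state changes within the window that count as flapping
        self.window = window
        self.max_transitions = max_transitions
        self.history = {}

    def record(self, node, healthy):
        """Store the result of one health check for the given node."""
        checks = self.history.setdefault(node, deque(maxlen=self.window))
        checks.append(bool(healthy))

    def transitions(self, node):
        """Count online/offline changes between consecutive recorded checks."""
        checks = list(self.history.get(node, ()))
        return sum(1 for a, b in zip(checks, checks[1:]) if a != b)

    def is_flapping(self, node):
        """True if the node changed state at least max_transitions times."""
        return self.transitions(node) >= self.max_transitions


detector = FlapDetector(window=10, max_transitions=3)

# A node that alternates between online and offline on every check.
for result in [True, False, True, False]:
    detector.record("storage-node-a", result)

# A node that went down once and stayed down.
for result in [True, True, False, False]:
    detector.record("storage-node-b", result)

print(detector.is_flapping("storage-node-a"))
print(detector.is_flapping("storage-node-b"))
```

A node that simply goes down produces one transition and would be handled by an ordinary "node down" alert. Counting transitions is aimed at the intermittent case, which can otherwise look like a network issue.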

So far the new master is working correctly. It seems that this is finally over now.

New master-storage is in place, reviewing status.

We are on it! Unfortunately, the issue is escalating. We have identified hiccups in the attached storage node system as the root of the problem. Currently ALL "Old Apps" are affected; their status keeps changing between on and off (currently ON). "New Apps" are not affected. We'll keep you updated.

Some Apps are responding with a 503 status code (first seen 07:30 CEST).

Began at: