Network: Unplugged cables cause massive Cloudflare outage


On the afternoon and evening of Wednesday, April 15th, the Cloudflare dashboard and API were down for nearly four and a half hours. As the company, which specializes in network services, writes on its blog, the outage was caused neither by a DDoS attack nor by an overload from excessive traffic, as happened to other services during the Covid-19 pandemic. Instead, several fiber optic connections that were meant to be redundant were disconnected in one of the company's two central data centers.

During planned maintenance, on-site technicians were instructed to remove all equipment from one of the company's cabinets, writes Cloudflare. The cabinet contained old, unused hardware that was due to be retired. No network traffic was going to the old servers, and they held no data. However, the same cabinet also housed a patch panel that was still in active use.

All external connections to the other Cloudflare data centers ran through this patch panel. The technician removed not only the old servers but also disconnected the patch panel, explains Cloudflare. The company's main control plane and database are located in this data center, which is why the dashboard and the API could no longer be used.

To resolve the problem, Cloudflare first initiated its failover procedures. According to the blog post, however, the team repeatedly re-evaluated whether the complex failover of the dashboard and API actually had to be carried out. It would have been necessary in the event of physical damage, for example from a natural disaster. Once the network connection to the data center had been restored, though, the services could be brought back up quickly. Moving the services to another site was therefore not necessary, writes Cloudflare.

The company concludes from the incident that the root cause of the outage was a single point of failure. In the future, the connections are therefore to be distributed across different locations. In addition, the company wants to document its physical infrastructure better. Cloudflare writes, for example, that the team lost a lot of time correctly identifying the affected cables in order to re-establish the connections. According to the company, cables and panels will therefore be labeled in the future. Finally, technicians who work on the hardware are to receive more precise descriptions of their tasks and to be warned before they unplug any cables.
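
The idea of a single point of failure can be illustrated with a small check over a cabling inventory. The following is only a minimal sketch under stated assumptions, not Cloudflare's actual tooling: the inventory format, the link and panel names, and the function `single_points_of_failure` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: each external link and the patch panel it runs through.
# All names are made up for illustration.
LINKS = [
    {"link": "transit-a", "panel": "panel-1"},
    {"link": "transit-b", "panel": "panel-1"},
    {"link": "backbone-a", "panel": "panel-1"},
]


def single_points_of_failure(links):
    """Return the panels whose removal would cut every external link."""
    by_panel = defaultdict(set)
    for entry in links:
        by_panel[entry["panel"]].add(entry["link"])
    all_links = {entry["link"] for entry in links}
    # A panel is a single point of failure if it carries all known links.
    return sorted(panel for panel, carried in by_panel.items() if carried == all_links)


if __name__ == "__main__":
    print(single_points_of_failure(LINKS))
```

In this sketch, every link runs through the same panel, so that panel is reported. If the links were spread across several panels, the function would return an empty list.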

