Vanishing power feeds, UPS batteries, failover fails... Cloudflare explains that two-day outage (www.theregister.com)

submitted 2 years ago by throws_lemy@lemmy.nz to c/technology@lemmy.world

[-] draughtcyclist@programming.dev 34 points 2 years ago

This is interesting. What I'm hearing is they didn't have proper anti-affinity rules in place, or backups for mission-critical equipment.

The data center did some dumb stuff, but that shouldn't matter if you set up your application failover properly. Poor architecture and untested failovers are the real issues here.
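
For anyone unfamiliar with anti-affinity, here is a minimal sketch of the kind of check meant here: flag any service whose replicas all sit in one facility. The service and facility names are made up for illustration and are not Cloudflare's actual layout.

```python
from collections import defaultdict

def anti_affinity_violations(placements):
    """Return services whose replicas all sit in a single facility.

    placements: iterable of (service, facility) pairs, one per replica.
    """
    facilities = defaultdict(set)
    for service, facility in placements:
        facilities[service].add(facility)
    # A service seen in fewer than two facilities has no cross-site redundancy.
    return sorted(s for s, sites in facilities.items() if len(sites) < 2)

# Hypothetical placement data, for illustration only.
placements = [
    ("api", "dc-a"), ("api", "dc-b"),
    ("logging-core", "dc-a"), ("logging-core", "dc-a"),
]
violations = anti_affinity_violations(placements)
print(violations)
```

Here `logging-core` is flagged because both of its replicas live in `dc-a`, so losing that one building takes the service down.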

[+] Mbourgon@lemmy.world 13 points 2 years ago* (last edited 3 months ago)

[deleted]

[-] towerful@programming.dev 3 points 2 years ago

That's exactly it.
https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/

Here is a quick summary, but the actual postmortem is worth reading.
Classic example of cascade failure or domino effect. Luckily their resilience meant it wasn't a full outage.

Basically, new features get developed fast and are iterated quickly. When they mature, they get integrated into the high availability cluster.
There are also some services that are deliberately not clustered. One of these is logging: logs should pile up "at the edge" when the logging core service is down.
Unfortunately, some services were too tightly coupled to the logging core. So although they were supposed to be HA, they were unable to cope with the core logging service being down.
Whilst HA failover had been tested, the core services had never been taken offline, so all this was missed.

All of which resulted in inconsistent high availability across different services and products. A lot of new features failed as expected, and some mature features that shouldn't have failed did.
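
To illustrate the "pile up at the edge" idea versus tight coupling, here is a rough sketch of a buffer that queues log lines locally and never fails the caller when the core is unreachable. `EdgeLogBuffer`, `send_to_core` and the line cap are assumptions for the example, not anything taken from Cloudflare's implementation.

```python
from collections import deque

class EdgeLogBuffer:
    """Queue log lines locally and flush them when the core is reachable."""

    def __init__(self, send, max_lines=10_000):
        self.send = send  # callable shipping one line; raises ConnectionError on failure
        # Assumed cap: once full, the oldest lines are dropped.
        self.pending = deque(maxlen=max_lines)

    def log(self, line):
        self.pending.append(line)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False  # core is down: keep the backlog, do not fail the caller
            self.pending.popleft()
        return True

# Hypothetical logging core that can be switched on and off.
state = {"up": False}
received = []

def send_to_core(line):
    if not state["up"]:
        raise ConnectionError("logging core unreachable")
    received.append(line)

buf = EdgeLogBuffer(send_to_core)
buf.log("request 1")  # core down: line is buffered, caller carries on
buf.log("request 2")
state["up"] = True
buf.flush()           # core back: backlog drains in order
```

A tightly coupled service would instead call `send_to_core` directly and fall over on the `ConnectionError`, which is roughly the failure mode described above.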

When they brought their disaster recovery site up, there were some things that needed manual configuration, and some newer features that hadn't been tested in a disaster recovery scenario.

They are now focusing significant resources on:

Remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network.
Ensure that the control plane running on the network continues to function even if all our core data centers are offline.
Require that all products and features that are designated Generally Available must rely on the high availability cluster (if they rely on any of our core data centers), without having any software dependencies on specific facilities.
Require all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested.
Test the blast radius of system failures and minimize the number of services that are impacted by a failure.
Implement more rigorous chaos testing of all data center functions including the full removal of each of our core data center facilities.
Thorough auditing of all core data centers and a plan to reaudit to ensure they comply with our standards.
Logging and analytics disaster recovery plan that ensures no logs are dropped even in the case of a failure of all our core facilities.
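
As a rough idea of what "testing the blast radius" could look like on paper, here is a small sketch that walks a dependency graph and lists everything that goes down when one facility is removed. The graph, the service names and the `blast_radius` helper are all hypothetical; a real chaos test would actually pull the facility rather than model it.

```python
def blast_radius(depends_on, hosted_in, failed_facility):
    """Return every service that stops working when one facility goes dark.

    depends_on: service -> set of services it needs (hard dependencies).
    hosted_in: service -> set of facilities it runs in.
    """
    # Services that run only in the failed facility go down directly.
    down = {s for s, sites in hosted_in.items() if sites == {failed_facility}}
    # Propagate: anything with a hard dependency on a down service is down too.
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in down and deps & down:
                down.add(service)
                changed = True
    return sorted(down)

# Hypothetical topology, for illustration only.
hosted_in = {
    "logging-core": {"dc-a"},
    "dashboard": {"dc-a", "dc-b", "dc-c"},
    "dns": {"dc-a", "dc-b", "dc-c"},
}
depends_on = {"dashboard": {"logging-core"}, "dns": set()}
impacted = blast_radius(depends_on, hosted_in, "dc-a")
print(impacted)
```

In this toy setup `dashboard` runs in three facilities but still goes down with `dc-a`, because it hard-depends on a service that only lives there.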

[-] Nighed@sffa.community 26 points 2 years ago

Surprised a company of their scale and with such a reliance on stability isn't running their own data centres. I guess they were trusting their failover process enough not to care

[-] brianorca@lemmy.world 10 points 2 years ago* (last edited 2 years ago)

They probably need to be in so many different locations, and so many different network nodes, that they don't want to consolidate like that. Their whole point of being is to be everywhere, on every backbone node, to have minimum latency to as many users as possible.

[-] Nighed@sffa.community 1 points 2 years ago

This sounds like a core datacentre though?

[-] JakenVeina@lemm.ee 13 points 2 years ago

the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.

That poor bastard.

[-] DoomBot5@lemmy.world 12 points 2 years ago

This reminds me of how AWS lost critical infra when us-east-1 went down. That's including the status dashboard that was only hosted there.

[-] kent_eh@lemmy.ca 10 points 2 years ago

I'll be curious to learn if the battery issue was due to being under-dimensioned, or just aged and at reduced capacity.

[-] Vandals_handle@lemmy.world 1 points 2 years ago

Or not properly maintained and at reduced capacity.

[-] scottmeme@sh.itjust.works 7 points 2 years ago* (last edited 2 years ago)

It isn't Flexential's year.

They got burned by DediPath.

They got burned by NextArray.

They just got ousted by Cloudflare.

[-] ninekeysdown@lemmy.world 4 points 2 years ago

If it keeps up, someone is going to be making 3 envelopes….

[-] autotldr@lemmings.world 4 points 2 years ago

This is the best summary I could come up with:

Cloudflare's main network and security duties continued as normal throughout the outage, even if customers couldn't make changes to their services at times, Prince said.

We're told by Prince that "counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power," and so didn't have a heads up that maybe things were potentially about to go south and that contingencies should be in place.

Whatever the reason, a little less than three hours later at 1140 UTC (0340 local time), a PGE step-down transformer at the datacenter – thought to be connected to the second 12.47kV utility line – experienced a ground fault.

By that, he means at 1144 UTC - four minutes after the transformer ground fault – Cloudflare's network routers in PDX-04, which connected the cloud giant's servers to the rest of the world, lost power and dropped offline, like everything else in the building.

At this point, you'd hope the servers in the other two datacenters in the Oregon trio would automatically pick up the slack, and keep critical services running in the absence of PDX-04, and that was what Cloudflare said it had designed its infrastructure to do.

The control plane services were able to return online, allowing customers to intermittently make changes, and were fully restored about four hours later from the failover, according to the cloud outfit.

The original article contains 1,302 words, the summary contains 228 words. Saved 82%. I'm a bot and I'm open source!

[-] Luisp@lemmy.dbzer0.com 3 points 2 years ago

Mr Magoo is the CEO

[-] MonkderZweite@feddit.ch 3 points 2 years ago

What does she do on the notebook?

[-] nyakojiru@lemmy.dbzer0.com 7 points 2 years ago

I think she is reading the matrix code

[-] Agent641@lemmy.world 2 points 2 years ago

Onlyfans

[-] drdabbles@lemmy.world 1 points 2 years ago

It was poor design. Poor design caused a 2 day outage. When you've got an H/A control plane designed, deployed in production, running services, and you ARE NOT actively using it for new services let alone porting old services to it, you've got piss poor management with no understanding of risk.

this post was submitted on 07 Nov 2023

202 points (98.6% liked)
