Nearly Right

Cloudflare’s database permissions change took down 20% of web traffic for six hours

A routine security update triggered cascade failures in systems designed to prevent exactly this kind of breakdown

Less than a month after Amazon’s Virginia data centre paralysed the internet with a DNS error, Cloudflare demonstrated that infrastructure concentration creates vulnerability even when the infrastructure itself works perfectly.

On 18 November 2025, Matthew Prince wrote an apology that exposed a different kind of fragility. “We’ve architected our systems to be highly resilient to failure,” Cloudflare’s CEO explained, before detailing how those systems had just failed catastrophically—not from infrastructure collapse, but from software behaving exactly as programmed under unanticipated conditions.

A database permissions change meant to improve security exposed an assumption buried in a query. That query generated a configuration file twice its expected size. The oversized file exceeded a memory limit. The bot management system panicked. The core proxy crashed. Twenty per cent of global web traffic returned error pages for six hours.

This wasn’t Amazon’s problem repeated. The infrastructure was fine. The network hummed along. The cascade originated entirely from software encountering conditions its designers never anticipated—revealing that even brilliantly engineered systems fail in ways that emerge only at planetary scale.

The cascade nobody anticipated

At 11:05 UTC, engineers updated database permissions on a ClickHouse cluster. The goal was sound: move from implicit to explicit access grants, improving security and reliability. Users would see accurate metadata about tables they could already access.

One query wasn’t filtering by database name. Before the change, it retrieved columns only from the default database. After, it returned duplicates—the same columns from both the default database and the newly visible underlying storage layer.
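
To make the mechanism concrete, here is a minimal sketch, not Cloudflare’s actual code, of how an unfiltered metadata query produces a doubled feature list. The table and database names are assumptions chosen for illustration (ClickHouse exposes column metadata through a system.columns table, and the missing filter corresponds to a WHERE database = … clause); in the real incident the duplicates pushed the count well past the limit discussed below.

```rust
/// One row of column metadata, roughly what a query against a
/// ClickHouse system.columns-style table returns. Names here are
/// illustrative, not Cloudflare's schema.
struct ColumnRow {
    database: String,
    table: String,
    name: String,
}

/// Builds the feature list from metadata rows. `filter_database`
/// stands in for the WHERE clause the real query was missing.
fn feature_list(rows: &[ColumnRow], filter_database: bool) -> Vec<String> {
    rows.iter()
        .filter(|r| r.table == "http_requests_features")
        .filter(|r| !filter_database || r.database == "default")
        .map(|r| r.name.clone())
        .collect()
}

fn main() {
    // After the permissions change, every column is visible twice:
    // once via the default database and once via the underlying
    // storage database that the change made queryable.
    let mut rows = Vec::new();
    for i in 0..60 {
        for db in ["default", "r0"] {
            rows.push(ColumnRow {
                database: db.to_string(),
                table: "http_requests_features".to_string(),
                name: format!("feature_{i}"),
            });
        }
    }

    println!("unfiltered: {} features", feature_list(&rows, false).len()); // 120
    println!("filtered:   {} features", feature_list(&rows, true).len());  //  60
}
```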

The feature file used for bot management normally contained about 60 machine learning features. The duplicate rows pushed it past 200. The file was regenerated every five minutes and propagated across Cloudflare’s global network within minutes.

Now the protection mechanisms became the problem. Cloudflare’s bot management system preallocates memory for features—a performance optimisation standard in high-traffic systems. The code had a hard limit of 200 features, comfortably above the normal 60.

When the oversized file hit servers worldwide, the limit was exceeded. The software didn’t degrade gracefully or log warnings. It panicked. Every server running the core proxy encountered the same problem and entered the same failure state.
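
A minimal sketch of how that combination plays out, assuming a hard limit and an error treated as impossible. This is not Cloudflare’s code, but it reproduces the shape of the failure: the loader refuses the oversized file, and the caller’s unwrap turns that refusal into a process-wide panic.

```rust
const MAX_FEATURES: usize = 200; // hard limit, comfortably above the usual ~60

#[derive(Debug)]
struct TooManyFeatures {
    got: usize,
}

/// Loads a feature file into memory preallocated for MAX_FEATURES
/// entries, refusing to grow past the limit.
fn load_features(names: &[String]) -> Result<Vec<String>, TooManyFeatures> {
    if names.len() > MAX_FEATURES {
        return Err(TooManyFeatures { got: names.len() });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES); // no reallocation on the hot path
    features.extend_from_slice(names);
    Ok(features)
}

fn main() {
    // An oversized file: duplicated entries push the count past the limit.
    let oversized: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();

    // Treating the error as "can't happen" turns a bad configuration file
    // into a crash of the whole process, the equivalent of the panic
    // described in the post-mortem.
    let _features = load_features(&oversized).unwrap();
}
```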

By 11:20 UTC, users worldwide saw HTTP 5xx errors. ChatGPT went silent. X became inaccessible. Spotify stopped. Banking sites failed. For the 20% of web traffic running on Cloudflare’s infrastructure, the internet had stopped working.

When permissions become problems

Where AWS’s October outage revealed how infrastructure concentration creates vulnerability (a DNS error in Virginia paralysing global services), Cloudflare’s failure exposed a different dimension of the same structural fragility. The infrastructure worked. The cascade originated entirely in software encountering unanticipated conditions.

The outage’s behaviour initially suggested an attack. Engineers were updating the ClickHouse cluster gradually, not all at once. Each five-minute file regeneration was a lottery: sometimes the query hit updated nodes and generated the problematic file, sometimes it hit old nodes and produced a correct one.
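
The fluctuation has a simple explanation under a simplifying assumption (that each regeneration’s query lands on one cluster node at random, which is not a claim about Cloudflare’s actual topology): the chance of producing the bad file in any five-minute cycle is roughly the fraction of nodes already updated, which is why failures were intermittent during the rollout and constant once it finished.

```rust
fn main() {
    // Toy model: the metadata query lands on one of `total` cluster
    // nodes at random; only already-updated nodes produce the bad file.
    let total = 10.0_f64;
    for updated in 0..=10_u32 {
        let p_bad = f64::from(updated) / total;
        println!(
            "{updated}/10 nodes updated -> {:.0}% chance a five-minute regeneration is bad",
            p_bad * 100.0
        );
    }
}
```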

The entire system failed, recovered, then failed again in an unpredictable pattern. Engineers watching monitoring systems saw what looked like a sophisticated attack with periodic strikes. Adding to the confusion, Cloudflare’s status page, hosted separately with no dependencies on their own infrastructure, coincidentally went down around the same time: an unrelated failure that reinforced suspicions of a coordinated attack.

The team pursued wrong paths. They investigated traffic patterns for DDoS attacks. They examined whether this might be another Aisuru botnet offensive, following October’s record-breaking attack on Microsoft Azure. The fluctuating failures made diagnosis exceptionally difficult.

Then the pattern stabilised. As the database update completed, every query started generating the bad file consistently. This shift from random to consistent failure actually helped. Engineers traced backwards from the panicking bot management system to the oversized file to the unfiltered database query.

The fix was straightforward once identified: stop generating new files, manually insert a known good file, force a restart. By 14:30 UTC, core traffic was flowing normally again, though services continued recovering until 17:06.

The engineering double bind

This outage matters not because of the specific bug but because it reveals how systems built for resilience at scale create their own failure modes.

Memory preallocation is standard in high-performance systems. It avoids expensive allocation operations during request processing, improving speed and reducing latency. Setting the limit at 200 features—more than three times actual usage—seemed prudent.

But protection mechanisms become constraints. The preallocation meant to ensure predictable performance became a hard limit that crashed the system rather than degrading gracefully. The rapid global propagation designed to respond quickly to evolving threats meant the problematic file reached every server within minutes. The automated distribution system built to handle configuration changes without human intervention faithfully distributed bad data as efficiently as good data.
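
For contrast, here is a sketch of the graceful-degradation alternative under the same assumptions as the earlier example: clamp the feature list at the limit, keep serving traffic, and surface a warning so operators can see that something upstream is wrong. Whether silently truncating a security model’s inputs is an acceptable trade-off is itself a judgement call, which is exactly the double bind.

```rust
const MAX_FEATURES: usize = 200;

/// Graceful-degradation variant: never fails, but reports when it had
/// to truncate so the anomaly is visible to operators.
fn load_features_clamped(names: &[String]) -> (Vec<String>, Option<String>) {
    let keep = names.len().min(MAX_FEATURES);
    let mut features = Vec::with_capacity(MAX_FEATURES);
    features.extend_from_slice(&names[..keep]);

    let warning = (names.len() > MAX_FEATURES).then(|| {
        format!("feature file has {} entries, truncated to {}", names.len(), MAX_FEATURES)
    });
    (features, warning)
}

fn main() {
    let oversized: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();
    let (features, warning) = load_features_clamped(&oversized);
    println!("serving traffic with {} features", features.len());
    if let Some(w) = warning {
        eprintln!("warning: {w}"); // degraded, but the proxy stays up
    }
}
```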

Google’s Site Reliability Engineering literature describes this pattern: in a cascading failure, one event reduces capacity, increases latency, or spikes error rates, and the way other components respond then makes the problem worse.

The feedback loop was devastating. Oversized files caused servers to panic. Panicking servers couldn’t handle traffic. Failed traffic triggered client retry logic. Retries added load to servers attempting to restart with the same bad file. The cycle continued until someone stopped the source of bad data.
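
The retry arithmetic is worth spelling out. In a toy model (not based on Cloudflare’s traffic data) where every failed request is retried a fixed number of times and every attempt fails while the proxy is down, the load offered to the struggling servers multiplies accordingly:

```rust
/// Toy model of retry amplification: every failed request is retried
/// `retries` times, and while the proxy is down every attempt fails.
fn offered_load(base_requests_per_sec: u64, retries: u64) -> u64 {
    base_requests_per_sec * (1 + retries)
}

fn main() {
    let base = 1_000_000_u64; // hypothetical client demand, requests per second
    for retries in [0_u64, 1, 3, 5] {
        println!(
            "{retries} retries -> {} attempts/s hitting servers that are trying to restart",
            offered_load(base, retries)
        );
    }
}
```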

This is emergent behaviour—outcomes arising from component interactions rather than individual component design. You can design each piece carefully, test extensively, and still encounter failure modes that only manifest when multiple systems interact in specific, unanticipated ways.

Cloudflare’s 2019 outage stemmed from a different mechanism, a regular expression that caused excessive CPU backtracking, but the underlying problem was identical. A protective system (the Web Application Firewall) was updated with what seemed like a reasonable rule, and that rule triggered unexpected behaviour at scale. June 2025 followed the same pattern: a “coreless” service with no single point of failure turned out to depend on centralised storage.

Each time, engineers implement fixes for the specific problem. Each time, they strengthen systems against that particular failure mode. Each fix adds complexity. And complexity creates new potential for emergent failures that won’t be discovered until they occur in production under conditions nobody anticipated.

The outage rate is accelerating. Not because engineers are getting worse, but because systems are getting more complex faster than our ability to understand them improves.

A pattern, not an anomaly

Three major infrastructure failures in six weeks. AWS in October took down banking, gaming, and government services with a DNS error in Virginia. Microsoft Azure suffered its own major outage at the end of October, on top of disruptions throughout the year. Now Cloudflare’s configuration cascade.

The economics driving this concentration are well understood. Three companies control two-thirds of cloud infrastructure because the capital requirements—AWS alone spending over £80 billion this year—create insurmountable barriers to competition. Market forces recreated the single point of failure the internet was designed to avoid.

But Cloudflare’s outage reveals something beyond concentration economics. This wasn’t infrastructure failing. It was software encountering an edge case that brilliant engineers designing resilient systems never anticipated. The infrastructure was globally distributed across 320 cities in 120 countries. The failure propagated anyway.

David Schwed at SovereignAI stated the obvious after the outage: “If your organisation needs to be up 24/7, you have to build your infrastructure assuming these outages will happen.” Yet businesses cannot diversify. Building genuine multi-cloud redundancy costs more than the outages. Regional providers lack global reach. The rational economic choice is accepting systemic risk.

What’s changing is the frequency. AWS suffered major outages in 2017, 2020, 2021, and 2023 before October 2025. Cloudflare’s previous worst failure was 2019—a regex pattern causing CPU exhaustion. Then June 2025—third-party storage failing. Now November. The intervals are shortening.

Professor Alan Woodward from Surrey’s Centre for Cyber Security noted that Cloudflare’s distributed architecture made external attack unlikely. True. But internal cascade failures don’t care about distribution when the same configuration file reaches every location within minutes.

Cloudflare acknowledged this tension following their June outage. Workers KV was built as a “coreless service” with no single point of failure. Except it relied on centralised storage that became exactly that single point when it failed.

The pattern suggests something fundamental. These aren’t isolated engineering mistakes that better practices will prevent. They’re symptoms of operating systems whose complexity exceeds human capacity for failure mode analysis.

When complexity exceeds comprehension

Market forces created infrastructure concentration, and those economics aren’t changing. But Cloudflare’s cascade reveals a deeper problem that would persist even with perfect competition and unlimited capital.

The failure stemmed from invisible assumptions in software. An unfiltered database query. A hard memory limit that crashed rather than degraded. Automated distribution that propagated bad configurations as efficiently as good ones. Each decision made perfect sense when implemented. The failure mode emerged only when they combined under specific conditions at global scale.

Prince’s post-mortem demonstrates engineering humility and transparency. The company’s response—detailed analysis, systemic fixes, prevention commitments—follows best practices. They’ll add safeguards against oversized files, improve testing, harden systems against similar failures.

Then the next unanticipated failure mode will emerge. Different trigger, different cascade path, same underlying problem. You cannot test every possible component interaction. You cannot document every assumption buried in code. You cannot design for every failure mode whilst meeting performance requirements for planetary-scale traffic.

The 2019 outage was a regex pattern. June 2025 was storage infrastructure. November was a configuration file. Each time, engineers fix the specific problem and strengthen systems against that failure mode. Each fix adds complexity. Complexity creates new potential for emergent failures.

AWS’s October outage cost organisations over £800 million. Last year’s CrowdStrike incident cost Fortune 500 companies £4.3 billion. These aren’t theoretical risks or acceptable costs. They’re symptoms of infrastructure reaching the limits of human design capacity.

The pattern matters more than individual incidents. Three major outages in six weeks. Cloudflare’s worst failures separated by six years, then less than six months. The intervals are shortening as systems grow more complex and interdependent.

Engineers will continue improving. Providers will learn from failures. Organisations will maintain the rational but risky choice to concentrate infrastructure with providers offering scale and sophistication no alternatives can match.

And every few months, when another cascade reminds us that planetary-scale systems can fail in ways their designers never imagined, we’ll write post-mortems, implement fixes, and wait for the next unanticipated failure mode to emerge from the complexity we cannot fully comprehend.

The distributed, resilient internet was brilliant design. Market forces recreated concentration. Now engineering is discovering that even brilliant design has limits when complexity exceeds human capacity for analysis.

Monday morning in Virginia was infrastructure failing. Tuesday afternoon globally was software failing. The common thread isn’t the failure type. It’s that we’ve built systems too complex to fully understand, then made the entire digital world depend on them.

#technology