In Defense of Rust: The Cloudflare Outage and Unfair Criticism

Alex Miller·November 21, 2025·coding, ai-generated

The Cloudflare outage sparked unfair criticism of Rust. Let's examine what actually happened, why unwrap() was reasonable, and how the crash actually prevented worse outcomes.

The Discourse Is Getting Out of Hand

Twitter has been on fire lately with hot takes about Rust following the Cloudflare outage on November 18, 2025. If you believe the discourse, Rust's error handling somehow caused a massive internet outage, proving that the language is "too hard" or "encourages bad practices" or whatever other straw man argument is trending this week.

Let me be clear from the start: I'm not here to absolve Cloudflare or pretend this wasn't a serious incident. A massive internet outage that started at 11:20 UTC and didn't fully resolve until 17:06 UTC is unacceptable by anyone's standards. But the criticism aimed at Rust as a language fundamentally misunderstands both what happened and how error handling actually works in production systems.

So let's break down what actually happened, address the straw man arguments, and talk about why—contrary to the hot takes—the panic behavior here might have actually saved Cloudflare from an even worse situation.

What Actually Happened: A Timeline

First, let's establish the facts from Cloudflare's detailed postmortem. The failure sequence was straightforward but had cascading effects:

  • A permissions change in ClickHouse caused the bot management feature file query to return duplicate rows, roughly doubling the size of the feature file and pushing it past the 200-feature limit

  • The bot management library expected at most 200 features

  • The code called .unwrap() on the result of reading this file

  • The program panicked, returning HTTP 5xx errors

Here's the actual code from Cloudflare's FL2 proxy that triggered the panic:

[Image: code excerpt from Cloudflare's FL2 proxy showing the .unwrap() call that panicked]

Now, before the pitchforks come out, let's understand the nuance here. This wasn't just some careless .unwrap() slapped onto a file read operation. The limit existed for a very specific performance reason: memory preallocation. The bot management system preallocated memory for exactly 200 features as an optimization. When the file suddenly doubled in size due to the ClickHouse query behavior change, it violated this assumption.
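To make the pattern concrete, here is a minimal sketch of a preallocated feature buffer that rejects anything over its limit, plus a caller that unwraps the result. It is an illustration under stated assumptions, not Cloudflare's code: FeatureBuffer, MAX_FEATURES, append_names, and load_features are all hypothetical names.

```rust
/// Hypothetical limit, mirroring the 200-feature cap described above.
const MAX_FEATURES: usize = 200;

/// Hypothetical fixed-capacity buffer; memory is reserved once, up front.
pub struct FeatureBuffer {
    names: Vec<String>,
}

impl FeatureBuffer {
    pub fn new() -> Self {
        Self { names: Vec::with_capacity(MAX_FEATURES) }
    }

    /// Appends feature names, refusing to grow past the preallocated capacity.
    pub fn append_names(&mut self, incoming: &[String]) -> Result<usize, String> {
        let total = self.names.len() + incoming.len();
        if total > MAX_FEATURES {
            return Err(format!("feature count {} exceeds limit {}", total, MAX_FEATURES));
        }
        self.names.extend_from_slice(incoming);
        Ok(self.names.len())
    }
}

/// Caller that treats an oversized file as a fatal invariant violation.
pub fn load_features(incoming: &[String]) -> FeatureBuffer {
    let mut buf = FeatureBuffer::new();
    buf.append_names(incoming).unwrap(); // panics if the limit is exceeded
    buf
}
```

With a file inside the limit, load_features returns normally. With one that has doubled past 200 entries, append_names returns an Err and the unwrap() turns it into a panic.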

Dissecting the Flawed Criticism

Now that we understand what actually happened, let's talk about where the discourse went off the rails. The outage sparked legitimate questions about error handling, testing practices, and operational resilience. Those are valuable conversations to have. But instead of focusing on those systemic issues, much of Twitter decided to make this about Rust itself. Let's examine some specific examples and the reasoning errors behind them:

"You should never use .unwrap() in production"

This is cargo cult programming at its finest. The Rust community sometimes propagates this as dogma, but it's wrong. .unwrap() has legitimate uses in production code when you've reasoned about invariants and determined that failure should be fatal.

Think about it: what should Cloudflare's proxy do when it encounters a feature file that fundamentally violates its memory model? Should it:

  1. Log an error and continue serving traffic with undefined behavior?

  2. Try to gracefully degrade, potentially propagating corrupt state through the system?

  3. Crash immediately and loudly, preventing the bad configuration from propagating further?

Option 3 is exactly what happened, and it was the correct choice. More on this in a moment.

And here's the kicker: if you really believe that .unwrap() should never be used in production, Rust gives you the tools to enforce that. The Clippy linter has lints specifically for detecting unwrap(), expect(), and other potentially panicking operations. You can configure your CI pipeline to fail builds that contain these patterns. You can put #![deny(clippy::unwrap_used)] at the crate level so that cargo clippy reports every unwrap() as an error instead of staying silent.
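As a small illustration, the same lints can be scoped to a single item. In this sketch parse_feature_count is a made-up helper. clippy::unwrap_used and clippy::expect_used are real Clippy lints, but they only take effect when the code is checked with cargo clippy; a plain cargo build ignores them.

```rust
use std::num::ParseIntError;

// Under `cargo clippy`, any .unwrap() or .expect() inside this function
// is reported as an error. Plain `cargo build` does not run Clippy lints.
#[deny(clippy::unwrap_used, clippy::expect_used)]
pub fn parse_feature_count(raw: &str) -> Result<usize, ParseIntError> {
    // The error is returned to the caller instead of panicking here.
    raw.trim().parse::<usize>()
}
```

A team that wants the rule everywhere would move the attribute to the crate root and run Clippy in CI.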

This is what's so frustrating about the "Rust encourages bad practices" narrative. Rust doesn't just give you powerful tools—it gives you the toolchain to enforce whatever practices you want. If your team decides that unwraps are unacceptable, you can make that a compile-time requirement. Try doing that in most other languages without building your own static analysis infrastructure.

There's a related criticism that's worth addressing: the idea that unwrap() is a "footgun" that shouldn't exist in a well-designed language. The reasoning goes: sure, maybe experienced developers know when to use it, but why give people this dangerous tool in the first place? Just remove the footgun entirely.

This argument sounds reasonable until you think about what it's actually proposing: removing the ability to assert invariants at runtime. And that's not removing a footgun—that's removing a fundamental programming tool. Runtime assertions are supposed to crash your program when assumptions are violated. That's not a bug, it's defensive programming. The alternative—silently continuing with violated invariants—is how you get data corruption, security vulnerabilities, and bugs that are impossible to reproduce.

Every production language has mechanisms for this: C has assert(), Java has assert, Python has assert, and Go, whose philosophy is "don't panic," still has panic() for unrecoverable errors. The difference with Rust is that assertions are explicit. When you call unwrap(), you're making a visible decision in your code: "I believe this will succeed, and if it doesn't, the correct behavior is to crash."
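One way to make that decision even more visible is expect(), which attaches the reasoning to the panic message. A minimal sketch, where the loopback function is purely illustrative:

```rust
use std::net::IpAddr;

/// Parses a hard-coded address. A parse failure here would mean the
/// source code itself is wrong, so crashing is the intended behavior.
pub fn loopback() -> IpAddr {
    "127.0.0.1"
        .parse()
        .expect("invariant: hard-coded loopback literal is a valid IP address")
}
```

The message states the invariant being asserted, so a reader (or a panic log) sees why the author believed failure was impossible.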

The "don't give people footguns" argument essentially says: "developers can't be trusted to make this decision, so remove the option entirely." But that's infantilizing. Good developers should be reasoning about invariants and deciding when violations should be fatal. Testing validates those decisions—if your tests exercise the code paths and the unwraps don't fire, your invariants hold. If they do fire in production, you've discovered an edge case you didn't test for, which is valuable signal for improving your test coverage.

"This wouldn't have happened in [insert favorite language]"

Oh really? Let's play this out:

  • In Go: The code would have silently allocated more memory than expected, potentially triggering OOM conditions across the fleet. Or if they checked the length and returned an error, it would bubble up and... crash the handler anyway.

  • In C++: Buffer overflow, undefined behavior, or an exception that crashes the process—pick your poison. At least Rust made it explicit.

  • In Python/Node.js: Exception thrown, potentially caught at some arbitrary layer, logged to a file no one reads, and the system continues with undefined state.

The hard truth: this was a configuration management problem, not a language problem. Any reasonable language would have failed here once the invariants were violated.

Logical Fallacies: Post Hoc and Correlation-Causation

From @LundukeJournal: "September, 26: Cloudflare rewritten in 'memory safe' Rust. The change is touted as 'faster and more secure' because of Rust. November, 18 (53 days later): Cloudflare has a massive outage, which took down large portions of the Internet, because of a..."

The reasoning error: This is a textbook post hoc ergo propter hoc fallacy—"after this, therefore because of this." The implication is that because Cloudflare rewrote their proxy in Rust and then had an outage, Rust caused the outage. By this logic, if I eat a sandwich and then it rains, sandwiches cause rain.

The outage was caused by a database configuration change that resulted in malformed input data—something that would have broken any system regardless of implementation language. The fact that it happened 53 days after the Rust rewrite is completely irrelevant.

Also from @LundukeJournal: "Over the next few days, Rust apologists will be working overtime. But one thing will remain a certainty: Cloudflare rewrote their core in Rust, then half the Internet went down."

This doubles down on the correlation-causation confusion by framing anyone who points out the logical error as "apologists." The rhetorical move here is to dismiss counterarguments preemptively rather than engage with the actual technical details. Yes, Cloudflare rewrote parts of their infrastructure in Rust. Yes, they had an outage. But the "then" in that sentence doesn't establish causation—it just notes temporal sequence. By this reasoning, we could equally say "Cloudflare runs on Linux, then half the Internet went down."

I don't think I'm an apologist, and it isn't the outage that has me working overtime to discredit bad takes on this whole incident. It's people like Lunduke.

Missing the Point on Explicitness

From @filpizlo: "There's a meme that NULL is a bad idea. That meme leads to language designs that have stuff like unwrap, which gives you the moral equivalent of NullPointerException. It's not clear that forcing programmers to say unwrap to get the NPE is any better than what Java or C do."

The reasoning error: This completely misses the point of Rust's error handling design. The entire value of unwrap() is that it forces explicitness. In Java or C, null pointer dereferences happen silently and unpredictably—any variable of reference type might be null at any time, and you won't know until runtime.

In Rust, you must explicitly opt into the possibility of panic by calling unwrap(). This makes the crash point obvious in the code. When Cloudflare's proxy panicked, they immediately knew where the problem was: right at the unwrap() call. In Java, they'd be digging through stack traces trying to figure out which of dozens of possible null dereferences actually triggered the NPE. That's not equivalent—that's strictly worse.

This gets at a broader criticism I keep seeing: that Rust's error handling is "too complex." But that's backwards. Rust's error handling is explicitly designed to be simple and transparent. Compare this to languages with exceptions where any function call might throw an error you didn't anticipate—silent failures, swallowed exceptions, and mysterious crashes three layers removed from the actual problem. Rust forces you to be explicit about error handling, which makes bugs like this easier to debug, not harder.
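A small sketch of what that explicitness looks like in practice. The names score_for and score_or_zero are hypothetical; the point is only that a missing value shows up in the type and has to be dealt with at the call site.

```rust
use std::collections::HashMap;

/// Looks up a score; the Option return type makes "missing" visible to callers.
pub fn score_for(scores: &HashMap<String, u32>, name: &str) -> Option<u32> {
    scores.get(name).copied()
}

/// Caller that chooses a default explicitly instead of unwrapping.
pub fn score_or_zero(scores: &HashMap<String, u32>, name: &str) -> u32 {
    match score_for(scores, name) {
        Some(score) => score,
        None => 0, // deliberate fallback, visible at the call site
    }
}
```

Swapping the match for score_for(scores, name).unwrap() would also compile, but the possible crash would then be spelled out on that exact line.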

The Try-Catch Red Herring

From @Samuel_Roux_: "I struggle to see why unwrap is better than an NPE or an uncaught exception. I really, really struggle to see why Rust error handling is so difficult. This could have all been caught by a try/catch block. This outage never had to happen."

The reasoning error: This assumes that a try-catch block would have somehow fixed the problem rather than just moving where the failure happens. Let's think through what a try-catch would actually do here:

try {
    features = loadFeatures();  // Throws when the file is too large
} catch (Exception e) {
    // Now what? Return an error to the user?
    // Use stale features? Use no features?
    // Each of these choices invents behavior on the fly
}

The problem isn't whether you use unwrap() or try-catch—it's that the system's invariants were violated by bad input data. A try-catch block doesn't magically solve that. You still need to decide: do you serve traffic with undefined behavior, or do you fail fast and make the problem obvious? Cloudflare chose the latter, and they were right to do so.
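The same decision shows up in Rust as a match on a Result. The sketch below picks one of the policies from the comments above (keep the stale features) purely to show that the choice has to be written down by someone. Reload, parse_features, and reload are hypothetical names, and nothing here reflects how Cloudflare's proxy is structured.

```rust
/// Hypothetical outcome of a reload attempt.
#[derive(Debug, PartialEq)]
pub enum Reload {
    Applied,
    KeptPrevious(String), // carries the reason the new file was rejected
}

/// Hypothetical loader: rejects files with more than `limit` features.
pub fn parse_features(lines: &[&str], limit: usize) -> Result<Vec<String>, String> {
    if lines.len() > limit {
        return Err(format!("{} features exceeds limit {}", lines.len(), limit));
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

/// One possible policy: keep the last known-good set when the new file is bad.
pub fn reload(current: &mut Vec<String>, lines: &[&str], limit: usize) -> Reload {
    match parse_features(lines, limit) {
        Ok(new) => {
            *current = new;
            Reload::Applied
        }
        Err(reason) => Reload::KeptPrevious(reason),
    }
}
```

Whether serving with stale features is safer than crashing depends on the system. The code only makes the trade-off explicit; it does not make it go away.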

Also, maybe this is just personal preference, but error handling in Rust is easy because it is explicit. If you want to get around handling the Result type ... then use an unwrap().

The False Equivalence Between Panics and Undefined Behavior

From @vahvarh: "'This is just a bad code, not rust problem' - then double free is a bad code, not C problem? 'Just forbid unsafe and unwrap' - okay, what to do with all libraries doing unsafe and unwrap, ex built-in std and ssl?"

The reasoning error: This creates a false equivalence between unwrap() (a logic error in handling valid program states) and double-free bugs (undefined behavior causing memory corruption). These are categorically different:

  • Double-free in C: Causes memory corruption, can lead to arbitrary code execution, creates security vulnerabilities, is often impossible to debug without specialized tools.

  • unwrap() panic in Rust: Program immediately terminates with a clear error message at the exact line of code that panicked. No memory corruption, no undefined behavior, no security implications.

But here's the deeper issue this tweet misses: panicking can be a fully justified engineering decision. The criticism assumes that any panic in production is inherently bad code, but that's cargo cult thinking. In reality:

  • Assertions in production are standard practice across the industry. C has assert(), Java has assert, Go has panic checks everywhere. The difference is Rust makes them explicit and gives you a clear stack trace.

  • Testing validates your assumptions. If you have proper test coverage that exercises your system with various inputs including edge cases, your unwrap() calls will get validated. If they panic during testing, you fix them. If they don't panic during testing but panic in production, you've discovered an assumption violation you didn't test for—which is valuable information.

  • Fail-fast is a design philosophy, not a bug. When your invariants are violated—like Cloudflare's 200-feature limit—the correct engineering decision is often to crash loudly rather than continue with undefined behavior. This is defensive programming, not sloppy coding.

The suggestion to "forbid unsafe and unwrap" reveals a fundamental misunderstanding of systems programming. Even Rust's standard library uses unsafe internally because sometimes you need low-level control. The point is that these dangerous operations are isolated and auditable, not spread invisibly throughout your codebase like in C.
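To make the panic-versus-undefined-behavior distinction concrete: a panic is a defined event that the program itself can observe. The sketch below uses std::panic::catch_unwind from the standard library and assumes the default panic = "unwind" build setting (with panic = "abort" the process would exit instead). The unwrap_panics function is illustrative only.

```rust
use std::panic;

/// Runs a closure that unwraps an Err and reports whether it panicked.
pub fn unwrap_panics() -> bool {
    let result = panic::catch_unwind(|| {
        let bad: Result<u32, String> = Err("feature file too large".to_string());
        bad.unwrap() // panics; the stack unwinds in a defined way
    });
    // Err here means the closure panicked and the panic was caught.
    result.is_err()
}
```

Nothing comparable exists for a double free in C: once it has happened, the program has no reliable way to notice or contain it.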

Why the Crash Prevented Worse Outcomes

Here's the counterintuitive truth that the Twitter critics are missing: the panic was probably the best possible outcome given the constraint violation.

Think through what would have happened if the FL2 proxy hadn't panicked:

  • The bad configuration would have propagated silently across the entire fleet

  • Memory consumption would have spiked unpredictably, potentially triggering cascading failures

  • Bot detection would have been silently broken for an unknown duration

  • The team would have been hunting for a subtle performance degradation instead of a clear crash

  • Recovery would have been harder because the problem would have been less obvious

By crashing loudly and immediately, the FL2 proxy made the problem impossible to miss. The error was unambiguous: "called Result::unwrap() on an Err value." The root cause was clear: the feature file was too large. The fix was obvious: roll back to the good configuration.

This is textbook "fail fast" engineering. When your invariants are violated, you don't try to limp along—you crash and make it someone's problem to fix. The alternative is silent corruption that compounds over time and becomes exponentially harder to debug.

The Real Lessons: Configuration Management, Not Language Choice

If we're being honest about what went wrong, this was fundamentally a configuration management problem. The real issues were:

  • A ClickHouse permissions change that had unexpected downstream effects

  • A database query that didn't filter by database name, leading to duplicate rows

  • Insufficient testing of the feature file generation pipeline under different database configurations

  • Rapid propagation of bad configuration across the global network

Notice what's not on this list? "Rust's error handling." The panic was a symptom, not the cause. In fact, Cloudflare's postmortem doesn't even criticize the use of .unwrap() here—they recognize it as a reasonable design choice given the constraints.

What Cloudflare Is Actually Fixing

From their postmortem, here's what Cloudflare is actually doing to prevent this from happening again:

  • Hardening ingestion of Cloudflare-generated configuration files in the same way they would for user-generated input—treating internal config as untrusted input

  • Enabling more global kill switches for features—better circuit breakers

  • Eliminating the ability for core dumps or other error reports to overwhelm system resources—preventing debugging systems from becoming the bottleneck

  • Reviewing failure modes for error conditions across all core proxy modules—systematic approach to error handling
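The first item, treating internally generated config as untrusted input, might look something like the following sketch. It assumes that dropping duplicate names and then enforcing the limit is an acceptable policy, which the postmortem does not specify. FEATURE_LIMIT and validate_feature_names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical cap, matching the 200-feature limit discussed earlier.
const FEATURE_LIMIT: usize = 200;

/// Validates a generated feature list before it is handed to the proxy.
/// Duplicates are dropped first, then the size limit is enforced.
pub fn validate_feature_names(raw: &[&str]) -> Result<Vec<String>, String> {
    let mut seen = BTreeSet::new();
    let mut unique = Vec::new();
    for name in raw {
        let name = name.trim();
        if name.is_empty() {
            return Err("empty feature name".to_string());
        }
        // insert() returns false when the name was already present
        if seen.insert(name) {
            unique.push(name.to_string());
        }
    }
    if unique.len() > FEATURE_LIMIT {
        return Err(format!("{} features exceeds limit {}", unique.len(), FEATURE_LIMIT));
    }
    Ok(unique)
}
```

Running a check like this where the file is generated, before it propagates, means a bad file is rejected once instead of being discovered by every proxy that loads it.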

Notice that none of these involve "stop using Rust" or "never use .unwrap() again." They're focused on configuration validation, testing, and operational resilience. This is what mature engineering organizations do—they look at systemic problems rather than blaming tools.

The Bottom Line

Rust didn't cause this outage. A bad database configuration did. The panic behavior in Rust actually helped by making the problem immediately obvious rather than letting it silently corrupt data across Cloudflare's global network.

The criticism of Rust's error handling in this context reveals a fundamental misunderstanding of what error handling is supposed to do. It's not supposed to magically fix broken assumptions or invalid states—it's supposed to make failures visible and debuggable. Rust did exactly that.

Could the code have been written differently? Sure. Could they have had better testing around the configuration file generation? Absolutely. But these are engineering process and architecture questions, not language questions.

So next time you see someone dunking on Rust because of the Cloudflare outage, ask them: what would have been better? Silent corruption? Undefined behavior? Or a clear, loud crash that made the problem impossible to ignore and forced a quick fix?

I'll take the crash every time.

A Note on AI-Assisted Content

Transparency matters to me, so I want to share how this blog post came together. The first pass was generated with the following prompt before I went and edited it down:

specifically address unfair criticism of rust and the weird straw man arguments i see popping up all over twitter!

here is the postmortem on the outage:

https://blog.cloudflare.com/18-november-2025-outage/

the flow of failure was

→ customer updated cluster

→ bad config made its way into workers

→ library for deserializing config file expected at most 200 bytes

→ but config was double that causing an error

→ the call to read the config used an unwrap, causing the program to crash

to me this is more of a programmer error in that

- the hard coded size constraint on the config file was maybe not the best choice

- relatedly, not handling the error and using an unwrap was maybe not the best idea

but that HAS to be balanced with:

- unwrapping can and often is the right choice where you are decently sure the error conditions you are working with wont be raised, maybe that means you should have thought through your design or use of the module in question, but you inherently can motivate an unwrap or expect in rust, especially when there is not much to be done with an error, or you would want to crash and fail loudly if you ran into an exception that your program cannot handle or should not have to handle

- and in this case — crashing stops bad config from propagating further through the system and causing more errors down the line that might be nastier to recover from!

look at my twitter so you can see what comments ive been making on the whole thing

you should emulate stuff we went over in Why I want to start using neverthrow and the voice and tone you implemented in Parallel Development with Claude Code

be sure to add this prompt at the bottom of the blog post! and tag it as ai-generated