5 Key Insights into the CUBIC Congestion Control Bug in QUIC


CUBIC is the Linux kernel's default congestion control algorithm for TCP, and it is also the default in Cloudflare's QUIC implementation (quiche), so when it misbehaves the ripple effects can be felt across the internet. In this listicle, we break down the story of how a seemingly harmless kernel optimization, meant to align CUBIC with RFC 9438, inadvertently triggered a persistent stall of the congestion window (cwnd) when ported into our QUIC network stack. From symptom to root cause to an elegant one-line fix, here are the five things you need to know about this bug.

  1. The Critical Role of CUBIC
  2. The Seemingly Benign Kernel Change
  3. A Test That Fails 61% of the Time
  4. Root Cause: The App-Limited Trap
  5. The Elegant One-Line Fix and Its Implications

1. The Critical Role of CUBIC

CUBIC, standardized in RFC 9438, is the default congestion controller in the Linux kernel. As a result, it governs how most TCP and many QUIC connections on the public internet probe for available bandwidth, back off when they detect loss, and recover afterward. Cloudflare’s open-source QUIC implementation, quiche, uses CUBIC as its default congestion controller, placing this code in the critical path for a significant share of the traffic we serve. Any flaw in this algorithm can therefore affect millions of connections worldwide. The congestion window (cwnd) is the central knob: a sender-side cap on how many bytes can be in flight at once. A larger cwnd means more data per round trip; a smaller cwnd throttles the sender. Every loss-based congestion control algorithm (CCA), CUBIC included, grows cwnd when the network is healthy and shrinks it when loss is detected.
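CUBIC's growth behavior comes from a single cubic function defined in RFC 9438. The sketch below (in Python for illustration; constant names follow the RFC, everything else is our own naming) shows how the target window is computed from the window size at the last congestion event:

```python
# Sketch of the CUBIC window-growth function from RFC 9438.
# w_max is the cwnd (in segments) at the last congestion event;
# t is the time in seconds since that event.
# Constants follow the RFC: C = 0.4, beta_cubic = 0.7.

C = 0.4
BETA_CUBIC = 0.7

def cubic_k(w_max: float) -> float:
    """Time (seconds) for the cubic curve to climb back to w_max."""
    return (w_max * (1 - BETA_CUBIC) / C) ** (1 / 3)

def w_cubic(t: float, w_max: float) -> float:
    """Target congestion window t seconds after a congestion event."""
    k = cubic_k(w_max)
    return C * (t - k) ** 3 + w_max
```

At t = 0 the target is beta_cubic * w_max (the multiplicative decrease), at t = K it has recovered to w_max, and beyond K it probes above the old ceiling. The bug described below prevented this curve from ever being applied.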

Source: blog.cloudflare.com

2. The Seemingly Benign Kernel Change

The story begins with a Linux kernel change aimed at bringing CUBIC into line with the app-limited exclusion described in RFC 9438 §4.2-12. The idea was to prevent CUBIC from treating idle periods, when the application has no data to send, as congestion signals. In TCP, this fix solved a real problem: temporary application pauses could artificially deflate the congestion window. When the change was ported to Cloudflare’s QUIC implementation, however, it surfaced unexpected behavior. The kernel patch modified how CUBIC tracks “app-limited” states: periods when the sender is not fully utilizing the available cwnd because the application itself has run out of data to transmit. In TCP, this is typically a short-lived condition; in QUIC, the multiplexed nature of streams can create repeated app-limited intervals that interact poorly with the new logic.
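A minimal sketch of what "app-limited" means, assuming the usual definition (the function name and parameters here are our own, not quiche's API): the sender is under-using its window because the application has nothing queued, not because congestion control is holding it back.

```python
def is_app_limited(bytes_in_flight: int, cwnd: int, pending_app_bytes: int) -> bool:
    """True when the sender is not filling its cwnd because the
    application has no more data queued (rather than because the
    congestion controller is limiting it)."""
    return pending_app_bytes == 0 and bytes_in_flight < cwnd
```

When this flag is set, the app-limited exclusion tells CUBIC not to grow cwnd, since the unused window says nothing about available network capacity.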

3. A Test That Fails 61% of the Time

Our investigation began when our ingress proxy integration test pipeline started showing erratic failures. Specifically, tests that subjected CUBIC to heavy packet loss during the early part of the connection began failing approximately 61% of the time. This was a recovery-after-congestion-collapse regime, exactly the kind of edge case a congestion controller is designed to handle. Most congestion control tests focus on steady-state and growth phases; far fewer probe what happens at the minimum cwnd after a connection has been beaten down by loss. The failures were consistent but not deterministic, pointing to a timing-dependent interaction between the app-limited logic and the loss recovery state machine. The symptom: after loss forced the cwnd to its minimum (typically 2 or 4 packets), the connection never recovered, staying pinned at that floor indefinitely.


4. Root Cause: The App-Limited Trap

Digging deeper, we found that the Linux kernel change introduced a new condition under which CUBIC would skip cwnd growth during recovery. In TCP, the app-limited flag is rarely set during recovery because the sender blocks until the retransmission succeeds. In QUIC, however, the implementation can continue to send data on other streams while waiting for retransmissions. This meant the app-limited flag could be set even when the connection had data to send, just not on the stream experiencing loss. The patched logic checked whether the sender was app-limited at the time an ACK was processed and, if so, refused to grow cwnd. This left the connection stuck: every ACK confirmed only part of the lost data or new data on another stream, and because the sender was flagged as app-limited, CUBIC never increased cwnd. The congestion window remained at its floor indefinitely.
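The trap can be sketched as a toy ACK loop (illustrative Python, not quiche's actual Rust code; the additive growth rule, packet size, and cwnd floor are stand-ins for the real algorithm):

```python
MIN_CWND = 2 * 1200  # illustrative floor: 2 packets of 1200 bytes

def on_ack_buggy(cwnd: int, acked: int, app_limited: bool) -> int:
    """Buggy ACK handling: the app-limited flag unconditionally
    suppresses growth, even while the window sits at its floor."""
    if app_limited:
        return cwnd          # growth skipped
    return cwnd + acked      # stand-in for CUBIC/slow-start growth

cwnd = MIN_CWND
for _ in range(1000):
    # In the failing scenario, every ACK arrives with the flag set,
    # because other streams keep the sender "app-limited" per stream.
    cwnd = on_ack_buggy(cwnd, 1200, app_limited=True)
# cwnd is still MIN_CWND: the window never recovers
```

A thousand ACKs later the window has not moved, which matches the observed symptom of a connection pinned at 2–4 packets forever.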

5. The Elegant One-Line Fix and Its Implications

The fix turned out to be surprisingly simple: one line of code added to the ACK processing logic. The key insight was that the app-limited flag should only suppress growth when the sender truly has no data to send anywhere (not merely no data on a particular stream), and only in the open state, never during loss recovery. By resetting the app-limited flag when the sender enters recovery, we broke the cycle. After the fix, the same test that previously failed 61% of the time passed consistently. The broader implication is that congestion control algorithms designed and tested for TCP may harbor hidden flaws when used in QUIC, due to the multiplexed nature of streams. This bug was only possible because QUIC’s architecture allowed the sender to be partially app-limited while still having data to transmit. The fix has been upstreamed to quiche, and we recommend that other QUIC implementations using CUBIC review their app-limited handling.
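Continuing the toy sketch from above (again our own simplified naming, not the upstreamed quiche code), the repaired logic honors the app-limited flag only outside of recovery:

```python
def on_ack_fixed(cwnd: int, acked: int, app_limited: bool,
                 in_recovery: bool) -> int:
    """Sketch of the fix: the app-limited flag suppresses growth
    only in the open state; while recovering from loss, growth
    proceeds so the window can climb off its floor."""
    if app_limited and not in_recovery:
        return cwnd          # genuinely idle sender: don't grow
    return cwnd + acked      # stand-in for CUBIC/slow-start growth
```

With this change, the ACKs that arrive during recovery grow cwnd even when per-stream idleness sets the flag, while a truly idle sender in the open state still gets the RFC 9438 exclusion.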

The story of this bug illustrates the importance of testing edge cases—especially the behavior after congestion collapse—and of not assuming that a TCP-tested algorithm will work identically in QUIC. What was a well-intentioned kernel optimization became a persistent defect in a different protocol context, ultimately resolved by a single line that restored the intended behavior. As QUIC continues to grow in adoption, such lessons remind us that every line of code in a congestion controller carries weight.
