6 Critical Insights into the CUBIC Congestion Control Bug That Stumped Cloudflare


At Cloudflare, our open-source QUIC implementation, quiche, relies on CUBIC as its default congestion controller. This means that nearly every connection we serve passes through CUBIC's logic for probing bandwidth and reacting to loss. When a Linux kernel patch intended to align CUBIC with RFC 9438 was ported to quiche, it unleashed a puzzling bug: after a bout of heavy loss early in a connection, the congestion window (cwnd) would lock at its minimum value and never recover. Tests showed a 61% failure rate in scenarios simulating congestion collapse. Here's the inside story of how we discovered, diagnosed, and fixed this odd bug with an elegant one-line change.

1. The Symptom: A 61% Failure Rate in Integration Tests

Our investigation began with erratic failures in our ingress proxy integration test pipeline. These tests subject CUBIC to severe packet loss at the start of a connection, an uncommon but critical regime. Most congestion control tests focus on steady-state growth; few probe what happens at minimum cwnd after a connection has been beaten down. The bug lived in that dark corner: after congestion collapse, cwnd remained stuck at its floor, preventing any recovery. Roughly three out of five test runs that simulated early heavy loss ended with a connection that never regained its speed, a serious performance hit for real-world traffic.

Source: blog.cloudflare.com

2. CUBIC's Core Logic: The Basics of Congestion Control

To understand the bug, we need a quick refresher on how CUBIC works. At the heart of any congestion control algorithm lies the congestion window (cwnd): a limit on how many bytes can be in flight without acknowledgment. CUBIC, a loss-based algorithm, follows a simple rule: if no loss occurs, increase cwnd to probe for more bandwidth; if loss is detected, assume the network is full and sharply reduce cwnd. Between loss events, cwnd grows as a cubic function of the time since the last congestion event, hence the name. This approach works well for steady-state flows, but it makes several assumptions about network behavior that can break down in edge cases, such as when the sender is idle or when loss happens very early in a connection.
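That growth curve is easy to sketch numerically. The following is an illustrative model of RFC 9438's window increase function using its recommended constants (C = 0.4, β = 0.7); it is a teaching aid, not quiche's actual code:

```python
# Illustrative sketch of CUBIC's window growth curve (RFC 9438),
# not quiche's implementation. Units are packets.
C = 0.4        # aggressiveness constant from the RFC
BETA = 0.7     # multiplicative decrease factor applied on loss

def cubic_cwnd(t, w_max):
    """Congestion window t seconds after a loss that occurred at window w_max."""
    # K is the time at which the curve climbs back to w_max.
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

w_max = 100.0  # window size when loss was detected
# Immediately after loss the curve starts at BETA * w_max...
print(round(cubic_cwnd(0.0, w_max), 1))   # 70.0
# ...and regains w_max at t = K, after which it probes beyond it.
k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
print(round(cubic_cwnd(k, w_max), 1))     # 100.0
```

The curve is deliberately flat around t = K: CUBIC is cautious near the window size where it last saw loss, and aggressive far from it.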

3. The Kernel Optimization: Aligning with RFC 9438

The story starts with a Linux kernel change intended to fix a real problem in TCP. RFC 9438 specifies an app-limited rule: when a sender is app-limited, that is, when it has too little data to fill the window it already has, it should not advance CUBIC's window-growth curve, because the sender isn't actually loading the network. Without this rule, CUBIC could keep inflating cwnd during idle periods, building up an unvalidated window. The Linux kernel patch added logic to detect app-limited states and exclude them from CUBIC's growth calculations, a sensible fix. But when we ported this logic to quiche for QUIC, it interacted poorly with our handling of idle connections.
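In essence, the rule boils down to a gating check along these lines. This is a hedged sketch; the function and parameter names are illustrative, not the kernel's or quiche's identifiers:

```python
# Hypothetical sketch of the RFC 9438 app-limited gating rule.
def may_grow_cwnd(bytes_in_flight, cwnd, app_has_data):
    # A sender is app-limited when it is not using the window it
    # already has: either the application has nothing queued, or
    # too little data is in flight to fill cwnd.
    app_limited = (not app_has_data) or (bytes_in_flight < cwnd)
    # Only a sender that is actually filling its window is probing
    # the network, so only then should the growth curve advance.
    return not app_limited

print(may_grow_cwnd(12_000, 12_000, app_has_data=True))  # True: window full
print(may_grow_cwnd(3_000, 12_000, app_has_data=True))   # False: app-limited
```

The subtlety the rest of this story turns on is the first input: deciding whether the sender "has data" is where TCP and QUIC diverge.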

4. The Porting Problem: QUIC's Unique Idle Handling

QUIC differs from TCP in how it manages idle connections. In QUIC, a connection can become idle when there's no data to send, but the sender must still maintain state for potential retransmissions. When porting the Linux kernel's app-limited exclusion to quiche, we inadvertently triggered a new condition: after a congestion collapse event (where cwnd is reduced to the minimum), the sender would be classified as app-limited due to a misaligned idle detection. This classification prevented CUBIC from ever exiting its minimum cwnd state—it kept thinking the sender was idle, so it never applied growth. The bug essentially created a feedback loop: cwnd stays at floor → sender appears idle → no growth → cwnd stays at floor forever.
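The loop can be reproduced with a toy model. The names and the single-packet-per-ack growth step here are hypothetical simplifications of quiche's real state machine, kept only to show the trap:

```python
# Toy reproduction of the feedback loop: once cwnd hits the floor,
# the buggy idle detection suppresses the growth that would lift it.
MIN_CWND_PACKETS = 2

def ack_round(cwnd):
    # Buggy classification: a window pinned at the floor looks "idle",
    # which sets the app-limited flag, which suppresses growth.
    app_limited = cwnd <= MIN_CWND_PACKETS
    return cwnd if app_limited else cwnd + 1

cwnd = MIN_CWND_PACKETS          # state right after congestion collapse
for _ in range(1_000):           # a thousand ack rounds later...
    cwnd = ack_round(cwnd)
print(cwnd)                      # still 2: cwnd is pinned forever
```

No amount of time or successful acknowledgments breaks the cycle, which is exactly what we observed in the failing tests.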


5. Root Cause Analysis: The Permanent Pinning of cwnd

After extensive debugging, we identified the exact sequence that caused cwnd to lock. During early heavy loss, CUBIC reduces cwnd to its minimum (typically 2 packets or about 2-3 KB). Normally, after a recovery period, CUBIC begins to increase cwnd again. However, the new app-limited exclusion set a flag that prevented this growth when the connection was considered idle. In QUIC, the definition of idle can be triggered even if the sender has outstanding data to retransmit—a nuance that doesn't exist in TCP. This mismatch meant that the connection was never seen as actively probing bandwidth, so the growth function was never called. The cwnd remained pinned at its minimum, and the connection could never recover its throughput.
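A minimal sketch of that semantic mismatch, using hypothetical helper names: a TCP-style idle check ignores exactly the retransmission state that a QUIC sender still carries after collapse:

```python
# Hypothetical illustration of the idle-detection mismatch; these are
# not real quiche functions.
def idle_tcp_style(has_app_data):
    # TCP-derived view: no fresh application data means idle.
    return not has_app_data

def idle_quic_aware(has_app_data, has_lost_data_to_retransmit):
    # QUIC-aware view: pending retransmissions mean the sender is
    # still actively driving load into the network.
    return not has_app_data and not has_lost_data_to_retransmit

# After collapse: no fresh app data queued, but lost packets await resend.
print(idle_tcp_style(False))           # True  -> wrongly app-limited
print(idle_quic_aware(False, True))    # False -> still actively sending
```

With the TCP-style check in place, the recovering sender was permanently misclassified, so CUBIC's growth function was never invoked.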

6. The One-Line Fix: Breaking the Cycle

The solution was surprisingly elegant: a single-line change. We realized that when a connection is recovering from congestion collapse, it should never be considered app-limited—even if it appears idle from a data-sending perspective. The fix involved resetting the app-limited flag after a congestion event, allowing CUBIC's growth function to resume normally. In practice, we added one line to clear a timer variable that triggered the false idle classification. This simple adjustment broke the feedback loop and restored CUBIC's ability to recover from deep congestion. After deploying the fix, our integration test failure rate dropped from 61% to 0%—a clear victory for careful protocol implementation.
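The shape of the fix can be shown in the same toy terms (hypothetical names, single-packet growth per ack round): once a connection in recovery is exempt from the app-limited classification, growth resumes:

```python
# Toy model of the fixed behavior: recovery exempts the sender from
# the app-limited classification. Names are illustrative only.
MIN_CWND_PACKETS = 2

def ack_round(cwnd, in_recovery):
    # The fix, paraphrased: a sender recovering from collapse is
    # never treated as app-limited, so the floor no longer masks it.
    app_limited = (cwnd <= MIN_CWND_PACKETS) and not in_recovery
    return cwnd if app_limited else cwnd + 1

cwnd = MIN_CWND_PACKETS
for _ in range(10):
    cwnd = ack_round(cwnd, in_recovery=True)
print(cwnd)   # 12: growth resumes and the connection climbs off the floor
```

The real change in quiche cleared one piece of timer state rather than passing a flag around, but the effect is the same: the false idle classification can no longer fire during recovery.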

This story highlights how subtle interactions between protocol layers can lead to surprising behavior. What seemed like a harmless optimization for TCP became a critical bug in QUIC because of differences in idle connection semantics. The fix was tiny, but the debugging journey taught us a lot about the corners of congestion control. For now, our quiche implementation is back to normal, and we're sharing this tale as a reminder that even mature algorithms like CUBIC have hidden complexities—especially when ported across transports.
