When Observability Becomes Dependency: Hyrum's Law, Restartable Sequences, and the TCMalloc Dilemma

From Tsd1588, the free encyclopedia of technology

Hyrum's Law states that any observable behavior of a system, even if undocumented, will eventually become a dependency for someone. The Linux kernel community recently encountered a stark example of this principle. In the lead-up to the 6.19 release, developers worked to optimize restartable sequences (rseq), a mechanism that lets user-space code perform short per-CPU operations without taking locks. While they preserved the documented API, Google's TCMalloc library relied on internal behavior that changed, and it broke. This clash between careful API maintenance and real-world usage has forced the kernel team to apply its no-regressions policy and find ways to accommodate TCMalloc's unexpected reliance. Below, we explore the key questions and implications of this incident.

What Is Hyrum's Law and Why Does It Matter Here?

Hyrum's Law, named after Google engineer Hyrum Wright, asserts that any observable behavior of a system, no matter how minor or accidental, will eventually be depended upon by some user. In software, this means that even undocumented features and implementation details become part of the implicit contract between a system and its consumers. The Linux kernel's recent experience with restartable sequences and TCMalloc is a textbook case. Kernel developers carefully maintained the official rseq API but overlooked the fact that TCMalloc had come to rely on a subtle side effect of how the kernel tracked nested rseq critical sections. When 6.19 changed this behavior, TCMalloc broke, confirming that Hyrum's Law is not just theoretical: it is a practical constraint on any widely used interface.
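
A minimal sketch in C may make the principle concrete. All names here (struct status, refresh_v1, refresh_v2, internal_depth) are hypothetical and do not correspond to the rseq ABI or to TCMalloc's code. The sketch only shows that two implementations can honor the same documented contract while differing in an observable side effect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical interface. Only `cpu` is documented; `internal_depth`
 * is an implementation detail that happens to be visible to callers. */
struct status {
    int cpu;            /* documented: set by refresh() */
    int internal_depth; /* undocumented: no promise is made about it */
};

typedef void (*refresh_fn)(struct status *, int);

/* Version 1 bumps the internal field as a side effect of its bookkeeping. */
void refresh_v1(struct status *s, int cpu)
{
    s->cpu = cpu;
    s->internal_depth += 1;
}

/* Version 2 is an optimized rewrite: same documented result,
 * but it no longer touches the internal field. */
void refresh_v2(struct status *s, int cpu)
{
    s->cpu = cpu;
}

/* A consumer that relies only on the documented contract. */
bool contract_consumer_ok(refresh_fn refresh)
{
    struct status s = {0, 0};
    refresh(&s, 3);
    return s.cpu == 3;
}

/* A consumer that also relies on the undocumented side effect. */
bool side_effect_consumer_ok(refresh_fn refresh)
{
    struct status s = {0, 0};
    refresh(&s, 3);
    assert(s.cpu == 3);           /* the documented part always holds */
    return s.internal_depth == 1; /* this part holds only for version 1 */
}
```

Calling contract_consumer_ok() succeeds with either version, while side_effect_consumer_ok() succeeds with refresh_v1 and fails with refresh_v2, even though version 2 never broke its documented promise.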

What Are Restartable Sequences and Why Are They Important?

Restartable sequences (rseq) are a Linux kernel feature that lets user-space applications perform per-CPU operations without expensive locks or system calls on the fast path. They work by marking a short code region as 'restartable': if the kernel preempts the thread, delivers a signal, or moves it to another CPU before the region's final committing store, the kernel diverts the thread to an abort handler, which typically retries the sequence from the start. Nothing needs to be undone, because no visible change is made until that final store. This provides lock-free, low-overhead access to per-CPU data, which is critical for high-performance memory allocators, tracing frameworks, and networking stacks. The original design allowed only one active rseq per thread, but developers later added support for nesting to increase flexibility. The 6.19 change aimed to fix performance and correctness issues in the nesting implementation without altering the official API.
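
The prepare, commit, and restart pattern can be sketched as a single-threaded simulation in portable C. This is not real rseq code: an actual critical section is written in assembly and the kernel performs the abort. Here the "kernel" is a hypothetical helper, sim_maybe_preempt(), that migrates the simulated thread a fixed number of times, and struct sim and percpu_increment() are likewise invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

/* Simulated world: one thread, a few CPUs, one counter per CPU. */
struct sim {
    int  current_cpu;        /* CPU the simulated thread is running on */
    int  pending_migrations; /* how many times the "kernel" will interrupt */
    long counters[NCPUS];    /* per-CPU data protected by the sequence */
    int  restarts;           /* how many times the sequence was aborted */
};

/* Stand-in for the kernel: runs between "prepare" and "commit".
 * Returns true if it interrupted the sequence by migrating the thread. */
static bool sim_maybe_preempt(struct sim *s)
{
    if (s->pending_migrations > 0) {
        s->pending_migrations--;
        s->current_cpu = (s->current_cpu + 1) % NCPUS;
        return true;
    }
    return false;
}

/* Increment the counter of whichever CPU the thread finishes on.
 * Returns that CPU number. */
int percpu_increment(struct sim *s)
{
    for (;;) {
        int  cpu  = s->current_cpu;        /* start: read the current CPU */
        long next = s->counters[cpu] + 1;  /* prepare: compute, no side effects */

        if (sim_maybe_preempt(s)) {        /* interrupted before the commit */
            s->restarts++;
            continue;                      /* abort path: retry from the start */
        }

        s->counters[cpu] = next;           /* commit: one final store */
        return cpu;
    }
}
```

With current_cpu set to 0 and pending_migrations set to 2, a single call restarts twice and then increments only the counter of CPU 2. The counters of CPUs 0 and 1 stay untouched because nothing was stored before the commit.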

What Changes Did Linux 6.19 Make to Restartable Sequences?

In the 6.19 kernel cycle, developers addressed a performance regression in the nested restartable sequences mechanism. Previously, the kernel tracked each nested rseq by maintaining a stack-like structure, and certain workloads triggered suboptimal behavior that slowed down context switching. The fix introduced a more efficient tracking method that removed some internal state observability: specifically, it stopped exposing the number of active nested rseqs to user space via /proc/self/stat and related internal checks. The documented API, including the rseq() system call and signal handling, remained identical. The changes were viewed as a harmless internal refactor, but they inadvertently broke a dependency in Google's TCMalloc library.

How Did TCMalloc Violate the Documented API?

TCMalloc, Google's high-performance memory allocator, uses restartable sequences in a way the documentation does not sanction. The official documentation states that applications must not rely on the ability to nest multiple rseq critical sections; the feature was intended as a kernel optimization, not a guaranteed API. TCMalloc's implementation nevertheless nested rseq critical sections and also read a now-removed indicator of the nesting depth, which was never part of the stable API. By depending on internal kernel behavior, TCMalloc effectively violated the documented contract. When 6.19 removed that internal state, TCMalloc's logic broke, causing assertion failures and allocation errors. The library's reliance on undocumented observability is a classic example of Hyrum's Law in action.

What Is the Kernel's No-Regressions Rule and How Does It Apply?

The Linux kernel community follows a strict no-regressions policy: any change that breaks previously working user-space applications must be avoided or reverted, even if the application was using the kernel in an unsupported way. The rule puts the stability of running systems ahead of API cleanliness. So although TCMalloc violated the documented rseq API, kernel developers cannot simply tell Google to fix its library, because that library works on older kernels. Under the policy, the 6.19 change that broke TCMalloc counts as a regression. As a result, the kernel team must find a way to restore backward compatibility without reintroducing the performance problems that the original fix addressed. This often leads to compromises like adding a compatibility mode or retaining old behavior under a new sysctl flag.

What Solutions Are Being Proposed to Accommodate TCMalloc?

To satisfy the no-regressions rule, kernel developers are exploring several approaches. One proposal is to keep the old rseq nesting tracking as a legacy option, accessible via a sysctl knob or a new prctl() flag. This would allow TCMalloc to opt into the old behavior while other applications benefit from the performance improvements. Another idea is to reintroduce the removed internal state but only expose it when a process explicitly requests compatibility mode. A third option is to document the previously undocumented behavior and officially support it as part of the rseq API going forward—though this risks enshrining a fragile implementation detail. The discussion mirrors similar past resolutions (e.g., for user-space RCU or BPF verifier changes) where practical compromises were reached.
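
The opt-in idea can be sketched as a toy model in C. Everything here is an assumption made for illustration: struct proc_state, proc_opt_in_legacy(), and kernel_update() are hypothetical names, no real prctl() option or sysctl is being described, and the work_done counter is only a stand-in for kernel-side cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-process state in a toy "kernel". */
struct proc_state {
    bool legacy_compat;  /* set only when the process opts in */
    int  cpu;            /* documented field, always maintained */
    int  legacy_counter; /* undocumented side effect old consumers relied on */
    long work_done;      /* rough proxy for the cost of each update */
};

/* Stand-in for an opt-in request such as a new flag or knob. */
void proc_opt_in_legacy(struct proc_state *p)
{
    p->legacy_compat = true;
}

/* Stand-in for the kernel's per-event update. The fast path does only
 * the documented work; the legacy path also recreates the old side effect. */
void kernel_update(struct proc_state *p, int new_cpu)
{
    p->cpu = new_cpu;
    p->work_done += 1;

    if (p->legacy_compat) {
        p->legacy_counter += 1; /* old observable behavior, kept on request */
        p->work_done += 1;      /* and paid for only by those who asked */
    }
}
```

In this model, a process that never opts in gets the cheaper path and sees no legacy side effect, while a process that opts in keeps the old observable behavior at extra cost. The trade-off the sketch cannot capture is the one the no-regressions rule cares about: an existing binary that was never rebuilt cannot opt in, so a purely opt-in switch does not by itself restore its behavior.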

What Broader Lessons Does This Incident Teach About API Design?

The TCMalloc episode reinforces several lessons for API designers and system developers. First, any exposed behavior will be depended upon—even internal state that you consider private. Second, when changing a popular subsystem, audit real-world consumers (like Google's allocators) for hidden assumptions. Third, the Linux kernel's no-regressions philosophy, while sometimes frustrating, protects users from silent breakage. Finally, this case highlights the tension between innovation and stability: optimizing kernel internals is good, but backwards compatibility often forces messy compromises. For developers outside the kernel, the takeaway is to avoid tapping into undocumented kernel behaviors, even if they seem stable today. Hyrum's Law is not a curse—it's a reminder that every byte you output creates a potential contract.