<h1>From Hand-Tuning to Autonomous Search: Meta’s KernelEvolve Agent Transforms AI Infrastructure Optimization</h1>

<h2>Introduction</h2><p>Meta operates a massive fleet of AI models that power billions of daily user interactions, from personalized recommendations to generative AI assistants. Behind these experiences lies a complex infrastructure of heterogeneous hardware, including NVIDIA GPUs, AMD GPUs, Meta's custom MTIA silicon chips, and CPUs. To maximize performance, every model operation must be translated into highly efficient, chip-specific <strong>kernels</strong>—the low-level code that runs on each accelerator. Traditionally, writing and optimizing these kernels required months of manual effort by expert engineers. But with the introduction of <strong>KernelEvolve</strong>, an autonomous agent built into Meta’s Ranking Engineer Agent, that process is now being transformed into a fast, scalable, and automated search.</p><figure style="margin:20px 0"><img src="https://engineering.fb.com/wp-content/uploads/2026/04/Meta-KernalEvolve-REA-Hero.png" alt="From Hand-Tuning to Autonomous Search: Meta’s KernelEvolve Agent Transforms AI Infrastructure Optimization" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.fb.com</figcaption></figure><h2 id="heterogeneous-challenge">The Challenge of Heterogeneous Hardware</h2><h3>Why Manual Kernel Optimization Doesn’t Scale</h3><p>As new chip generations and ML model architectures emerge, each combination demands custom kernel code. Vendor libraries like cuBLAS cover standard operators (GEMMs, convolutions), but production workloads often require dozens of custom operators—especially in ranking models. With the number of models multiplying across hardware types and generations, hand-tuning by kernel experts simply cannot keep pace. 
A single kernel optimization can take weeks of profiling, debugging, and iterative refinement, leaving engineers with little time for higher-level innovation.</p><h2 id="kernel-evolve-intro">Introducing KernelEvolve: An Agentic Kernel Authoring System</h2><h3 id="search-problem">Treating Optimization as a Search Problem</h3><p>KernelEvolve reframes kernel optimization as a <strong>search problem</strong>. A purpose-built job harness evaluates each candidate kernel, feeds detailed diagnostics back to a large language model (LLM), and drives a continuous search over hundreds to thousands of alternatives. This agentic approach replaces the linear, expert-driven process with an autonomous loop that can explore far more design variations in a fraction of the time. The result: kernels that often exceed the performance of manually crafted versions.</p><h3>Integration with the Ranking Engineer Agent</h3><p>KernelEvolve is a core component of Meta’s <em>Ranking Engineer Agent</em>, which autonomously designs, executes, and analyzes ranking model experiments. While the agent’s ML exploration capability handles high-level model changes, KernelEvolve optimizes the underlying infrastructure—ensuring that those models run efficiently at scale. 
This dual-layer autonomy accelerates the entire innovation pipeline, from experiment ideation to production deployment.</p><h2 id="proven-gains">Proven Performance Gains</h2><p>Real-world deployments of KernelEvolve demonstrate dramatic improvements:</p><figure style="margin:20px 0"><img src="https://engineering.fb.com/wp-content/uploads/2026/04/Meta-KernalEvolve-REA-image1.png" alt="From Hand-Tuning to Autonomous Search: Meta’s KernelEvolve Agent Transforms AI Infrastructure Optimization" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.fb.com</figcaption></figure><ul><li><strong>Faster development:</strong> Weeks of expert engineering time are compressed into hours of automated search, freeing engineers for creative problem-solving.</li><li><strong>Better performance:</strong> The Andromeda Ads model saw a <strong>60% inference throughput improvement</strong> on NVIDIA GPUs, and an ads model achieved a <strong>25% training throughput gain</strong> on Meta’s custom MTIA silicon chips.</li><li><strong>Broad applicability:</strong> KernelEvolve optimizes across public and proprietary hardware, generating kernels in high-level DSLs like <em>Triton, CuTe DSL, and FlyDSL</em>, as well as low-level languages including <em>CUDA, HIP, and MTIA C++</em>.</li></ul><p>These gains are a direct result of the agent’s ability to explore a larger design space than any human could manually cover within the same time window.</p><h2 id="broad-applicability">Broad Applicability Across Hardware and Languages</h2><p>KernelEvolve is not limited to ranking models or any single hardware platform. It has been tested and validated on NVIDIA GPUs, AMD GPUs, Meta’s MTIA chips, and standard CPUs. The agent automatically adapts its search to the target architecture and can output code in both high-level domain-specific languages and low-level native languages. 
This flexibility makes it a general-purpose tool for any AI team grappling with heterogeneous infrastructure, not just Meta’s Ads group.</p><h2>Future Directions and Research</h2><p>The kernel optimization work described here is detailed in the paper <em>“KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta”</em>, to be presented at the 53rd International Symposium on Computer Architecture (ISCA) 2026. Ongoing research focuses on extending the agent to handle multi-device orchestration, dynamic kernel selection at runtime, and integration with Meta’s wider AI infrastructure management. As AI models grow in complexity and hardware diversity increases, autonomous agents like KernelEvolve will become indispensable for maintaining peak performance.</p>
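<p>The agentic loop described earlier (propose a candidate kernel, evaluate it in a harness, feed diagnostics back to the model, keep the fastest version) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Meta’s implementation: <code>propose_kernel</code> stands in for the LLM call and <code>evaluate</code> for the job harness, and both are simulated with random numbers here, whereas the real system compiles and profiles each candidate on the target hardware.</p>

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a KernelEvolve-style search loop.
# propose_kernel() and evaluate() are simulated placeholders.

@dataclass
class Candidate:
    source: str          # kernel source code (DSL or native)
    latency_us: float    # measured latency reported by the harness

def propose_kernel(best: Candidate, diagnostics: str, rng: random.Random) -> str:
    # Placeholder for an LLM call: rewrite the best-known kernel
    # using feedback such as occupancy or memory-stall diagnostics.
    return best.source + f"  # variant {rng.randint(0, 1 << 16)}"

def evaluate(source: str, rng: random.Random) -> float:
    # Placeholder for the job harness: compile, run, and time the kernel.
    return rng.uniform(50.0, 150.0)

def search(seed_source: str, iterations: int = 200, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    best = Candidate(seed_source, evaluate(seed_source, rng))
    for _ in range(iterations):
        diagnostics = f"best latency so far: {best.latency_us:.1f} us"
        src = propose_kernel(best, diagnostics, rng)
        latency = evaluate(src, rng)
        if latency < best.latency_us:   # keep only the fastest candidate
            best = Candidate(src, latency)
    return best

best = search("def gemm_kernel(): ...")
print(f"best latency: {best.latency_us:.1f} us")
```

<p>Even in this toy form, the structure shows why the approach scales: each iteration is cheap and automated, so the loop can explore hundreds to thousands of candidates where a human engineer might try a handful, and richer diagnostics simply make each proposal step better informed.</p>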