IBM, Red Hat, and Google Donate Kubernetes LLM Inference Framework llm-d to the CNCF

Overview

IBM Research, Red Hat, and Google Cloud announced the donation of llm-d, an open-source distributed inference framework for large language models, to the Cloud Native Computing Foundation as a sandbox project. The announcement was made on March 24, 2026, at KubeCon + CloudNativeCon Europe in Amsterdam, according to the CNCF blog.

The project, originally launched in May 2025, is designed to make Kubernetes a production-ready platform for serving LLMs at scale. Its founding collaborators include NVIDIA and CoreWeave, with additional support from AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, as well as academic partners at the University of California Berkeley and the University of Chicago, according to the CNCF blog.

What llm-d Does

llm-d provides a Kubernetes-native middleware layer that sits between high-level serving control planes such as KServe and low-level inference engines such as vLLM, according to IBM Research. The framework addresses several challenges that standard Kubernetes orchestration does not handle well for LLM workloads: KV-cache locality management, prefill and decode phase balancing, multi-node coordination, and heterogeneous hardware utilization.

A central technical innovation is prefill/decode disaggregation, which separates the prompt processing (prefill) and token generation (decode) phases of inference into independently scalable pods. This allows operators to allocate different hardware profiles to each phase rather than treating inference as a monolithic unit, as described by IBM Research.

The framework also implements hierarchical KV-cache offloading across GPU, TPU, CPU, and storage tiers, and uses prefix-cache-aware routing to direct requests to the optimal replica based on cache state and traffic patterns, according to the CNCF blog.

Kubernetes Integration

llm-d integrates with emerging Kubernetes standards including the Gateway API Inference Extension and LeaderWorkerSet primitives for multi-node replicas and expert parallelism, according to the CNCF blog. Its autoscaling is traffic- and hardware-aware, adapting to real-time workload characteristics rather than relying on generic CPU or GPU utilization metrics.

The framework supports multiple accelerator types including NVIDIA GPUs, AMD GPUs, Intel accelerators, and Google TPUs, making it vendor-neutral by design, as described by IBM Research.

Performance

In v0.5 testing on a Qwen3-32B model deployed across eight vLLM pods on 16 NVIDIA H100 GPUs, llm-d demonstrated near-zero latency overhead while achieving approximately 120,000 tokens per second of throughput, significantly outperforming baseline Kubernetes services under load, according to the CNCF blog.

Industry Perspective

Brian Stevens, SVP and AI CTO at Red Hat, framed the project as a bridge between data science infrastructure and enterprise IT operations. “Eventually it’s going to be a CIO’s problem. What language do CIOs speak these days? They speak Kubernetes,” Stevens said, as reported by SiliconANGLE.

Robert Shaw, Director of Engineering at Red Hat, emphasized the operational challenges beyond raw performance. “These models are doing an amount of compute that’s hard to fathom, but users aren’t only building state-of-the-art performance systems; they’re also trying to do day-two operations,” Shaw said, according to SiliconANGLE.

Roadmap

The project’s planned development areas include multi-modal workload support, expanded inference engine integrations, multi-LoRA scheduling optimization, and advanced multi-tier cache offloading, according to IBM Research. Red Hat has identified additional priorities including multi-tenant model serving, request prioritization, and enhanced security for agentic systems running on Kubernetes, as reported by SiliconANGLE.