CNCF Seeks K8s Standards for Portable AI/ML Workloads

The New Stack

Imagine a world where your sophisticated artificial intelligence models and inferencing workloads could seamlessly migrate between any cloud environment, public or private, without a single line of code needing adjustment. This ambitious vision is precisely what the Cloud Native Computing Foundation (CNCF) is working to realize, building on its successful legacy of standardizing Kubernetes deployments.

The CNCF, the open-source body responsible for nurturing cloud-native technologies, is embarking on a new initiative to certify Kubernetes distributions specifically for their ability to run AI workloads. This effort mirrors the highly successful Kubernetes conformance program, which has already ensured interoperability across more than 100 different Kubernetes distributions. Just as a workload running on a Kubernetes-conformant environment can be effortlessly moved to another, the goal is to achieve the same fluidity for AI applications.

“We want to do the same thing for AI workloads,” explained Chris Aniszczyk, CTO of the CNCF, during KubeCon + CloudNativeCon events in China and Japan. He emphasized that achieving this will necessitate a defined set of capabilities, APIs, and configurations that a Kubernetes cluster must offer, going beyond the existing standard conformance. The ultimate aim is to establish a “baseline compatibility” that spans diverse computing environments globally. Aniszczyk reflected on the CNCF’s foundational principle: to create infrastructure that operates uniformly across every cloud, whether public or private.

The intricate task of defining these AI-specific requirements is being undertaken by a newly formed working group within Kubernetes’ SIG-Architecture, or Special Interest Group for Architecture. This group’s explicit mission is to “define a standardized set of capabilities, APIs, and configurations that a Kubernetes cluster must offer to reliably and efficiently run AI/ML [machine learning] workloads,” as detailed on its GitHub page. Beyond this immediate scope, the work will also lay the groundwork for a broader “Cloud Native AI Conformance” definition, encompassing other critical aspects of cloud-native computing, such as telemetry, storage, and security. Major industry players, including Google and Red Hat, are actively contributing resources to this pivotal project.

At its core, the initiative seeks to “commoditize” AI/ML workload platforms, making them as interchangeable and accessible as possible. Early discussions among working group contributors highlight the hope of significantly reducing the need for “do-it-yourself” custom solutions and framework-specific patches often required to deploy AI/ML workloads today. This standardization promises to streamline development and deployment, freeing up engineers to focus on innovation rather than infrastructure nuances.

The working group has already identified three primary types of AI workloads particularly well-suited for Kubernetes, each with distinct platform requirements:

- Large-scale training and fine-tuning of AI models needs access to high-performance accelerators (like GPUs), high-throughput and network-topology-aware networking, "gang scheduling" to coordinate multiple related tasks, and scalable access to vast datasets (a minimal sketch of such a job follows this list).
- High-performance inference, where trained models are used to make predictions, demands access to accelerators, sophisticated traffic management, and standardized metrics for monitoring latency and throughput.
- MLOps (Machine Learning Operations) pipelines center on a robust batch job system, a queuing system to manage resource contention, secure access to external services such as object storage and model registries, and reliable support for Custom Resource Definitions (CRDs) and operators, which extend Kubernetes' capabilities.
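To make the training requirements concrete, here is a minimal sketch of what such a workload can look like on today's Kubernetes: a batch Job that requests GPUs and is admitted all at once ("gang" style) through a queuing system. Kueue serves purely as an illustration; the draft does not mandate any particular project, and the queue name, image, and nvidia.com/gpu resource name are all assumptions rather than anything specified by the working group.

```yaml
# Illustrative sketch only -- not from the CNCF draft.
# Assumes the NVIDIA device plugin (nvidia.com/gpu) and a Kueue
# LocalQueue named "team-queue"; neither is mandated by the draft.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llm
  labels:
    kueue.x-k8s.io/queue-name: team-queue  # hands admission to the queue
spec:
  suspend: true        # Kueue unsuspends the Job once every pod can be placed
  completions: 8
  parallelism: 8       # all eight workers are admitted together ("gang" style)
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 4  # four accelerators per worker
```

The all-or-nothing admission matters because a distributed training job that starts with only some of its workers will typically stall while holding expensive accelerators idle.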

The draft document outlining these requirements already distinguishes between recommended practices and absolute necessities. Many of the mandatory features build on recent enhancements to Kubernetes designed specifically for AI applications. For instance, an AI-conformant cluster must support Dynamic Resource Allocation (DRA), a feature that reaches general availability in the upcoming Kubernetes 1.34 release. DRA offers more flexible and granular control over resources, enabling precise allocation of specialized hardware like GPUs. Similarly, support for the Gateway API Inference Extension is mandatory, as it specifies traffic-routing patterns essential for serving large language models (LLMs). Finally, the cluster autoscaler, which dynamically adjusts cluster size, must be capable of scaling node groups based on requests for specific accelerator types.
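For a sense of what DRA support looks like in practice, the sketch below requests a GPU through a ResourceClaimTemplate rather than the older counted device-plugin resources. The gpu.example.com device class and the image are placeholders, and the resource.k8s.io API group has changed shape across releases, so treat the exact fields as illustrative rather than definitive.

```yaml
# Illustrative sketch only -- "gpu.example.com" stands in for a
# DeviceClass that a real DRA driver would publish.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:latest  # placeholder image
    resources:
      claims:
      - name: gpu        # binds this container to the pod-level claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

The Gateway API Inference Extension takes a similarly declarative approach, layering CRDs such as InferencePool on top of standard Gateway routing so that traffic can be steered across model-serving backends.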

A separate, yet-to-be-named group will oversee the accreditation process. The certification program will feature a public website listing all Kubernetes distributions that successfully pass the conformance tests, which will be conducted annually. Each certified distribution will have a comprehensive, YAML-based conformance checklist publicly available. The CNCF plans to officially unveil the finalized conformance guide at KubeCon + CloudNativeCon North America 2025, scheduled for November 10-13 in Atlanta.
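The checklist schema itself has not been published, so the following is a purely hypothetical sketch of what a YAML-based entry could look like, with every field name invented for illustration:

```yaml
# Purely hypothetical -- the real conformance checklist format has
# not been released; field names here are invented for illustration.
distribution: example-k8s        # fictional distribution
kubernetesVersion: "1.34"
aiConformance:
- capability: dynamic-resource-allocation
  required: true
  status: pass
- capability: gateway-api-inference-extension
  required: true
  status: pass
- capability: accelerator-aware-autoscaling
  required: true
  status: pass
```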