MLPerf Storage v2.0: AI Checkpointing & Scalability Improvements

insideAI News

San Francisco, CA — MLCommons has released the results of its MLPerf Storage v2.0 benchmark suite, an industry standard designed to evaluate the performance of storage systems for machine learning workloads in a way that is fair, representative, and repeatable across different architectures. The v2.0 results indicate a substantial improvement in storage system capabilities, with tested systems now supporting approximately twice the number of AI accelerators compared to the v1.0 benchmark round.

A key addition in v2.0 is a set of tests specifically designed to replicate checkpointing for AI training systems. This addresses a growing challenge in large-scale AI: as models expand to billions or even trillions of parameters and clusters grow to hundreds of thousands of accelerators, system failures become more frequent. For instance, a cluster with 100,000 accelerators running at full utilization might experience a failure every half-hour, while a million-accelerator cluster could see one every three minutes. Such failures are especially disruptive in massively parallel computations where all accelerators operate in lockstep: a single failed component can bring the entire training process to a halt.
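The scaling arithmetic behind those failure rates is straightforward: if failures are independent, a cluster’s mean time between failures (MTBF) shrinks roughly linearly with component count. The short Python sketch below reproduces the article’s figures under an assumed per-accelerator MTBF of about 50,000 hours; that value and the linear-scaling model are illustrative assumptions, not MLCommons data.

```python
# Illustrative sketch: cluster-level MTBF under independent failures.
# The per-accelerator MTBF below (~50,000 hours) is an assumed value chosen
# to reproduce the failure rates quoted in the article, not MLCommons data.

PER_ACCELERATOR_MTBF_HOURS = 50_000  # assumption: roughly 5.7 years per device

def cluster_mtbf_minutes(num_accelerators: int) -> float:
    """Cluster MTBF in minutes, assuming independent, identically reliable parts."""
    return PER_ACCELERATOR_MTBF_HOURS / num_accelerators * 60

for n in (100_000, 1_000_000):
    print(f"{n:>9,} accelerators -> failure roughly every "
          f"{cluster_mtbf_minutes(n):.0f} minutes")
# 100,000 accelerators -> ~30 minutes; 1,000,000 accelerators -> ~3 minutes
```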

Checkpointing, the practice of saving intermediate training results at regular intervals, is widely accepted as essential for mitigating these disruptions while maintaining high performance. The AI community has developed mathematical models that optimize cluster performance by balancing the overhead of regular checkpoints against the cost and frequency of recovering from failures. These models, however, require precise data on the performance and scale of the underlying storage systems used for checkpointing. The MLPerf Storage v2.0 checkpoint tests provide exactly this data, underscoring the need for stakeholders to carefully select storage systems that can store and retrieve checkpoints efficiently without impeding system speed.
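One widely cited first-order model of this trade-off is the Young/Daly approximation, which sets the checkpoint interval to roughly the square root of twice the checkpoint write time multiplied by the cluster MTBF. The sketch below applies it with hypothetical numbers to show why checkpoint write speed, and therefore storage performance, directly determines how much training time is lost to checkpoint overhead; the specific times are assumptions for illustration only.

```python
import math

# Young/Daly first-order approximation of the optimal checkpoint interval:
#   interval ~= sqrt(2 * checkpoint_write_time * MTBF)
# All numbers below are hypothetical and only illustrate the trade-off.

def optimal_interval_s(checkpoint_write_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * checkpoint_write_s * mtbf_s)

MTBF_S = 180.0  # e.g. the ~3-minute failure interval of a million-accelerator cluster

for write_s in (10.0, 1.0):  # slower vs. faster checkpoint storage
    interval = optimal_interval_s(write_s, MTBF_S)
    overhead = write_s / interval  # fraction of wall-clock time spent writing checkpoints
    print(f"checkpoint write {write_s:>4.1f}s -> checkpoint every {interval:5.1f}s, "
          f"~{overhead:.1%} of time spent checkpointing")
```

With these assumed numbers, cutting the checkpoint write time from 10 seconds to 1 second reduces the time lost to checkpointing from roughly 17% to about 5%, which is the kind of sensitivity the new checkpoint tests are meant to quantify.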

Curtis Anderson, MLPerf Storage working group co-chair, emphasized the inevitability of component failures in large-scale AI training. “Checkpointing is now a standard practice in these systems to mitigate failures, and we are proud to be providing critical benchmark data on storage systems to allow stakeholders to optimize their training performance,” he stated. Anderson also noted that the initial checkpoint benchmark results reveal a wide range of performance specifications among current storage systems, suggesting that not all systems are optimally suited for every checkpointing scenario. He further pointed out the vital role of software frameworks like PyTorch and TensorFlow in coordinating training and recovery, and the potential for enhancing these frameworks.

Beyond checkpointing, the v2.0 benchmark suite continues to measure storage performance across diverse ML training scenarios, simulating the storage demands of various accelerator configurations, models, and workloads. By emulating the “think time” of accelerators, the benchmark generates realistic storage access patterns without requiring actual training runs, making it widely accessible. The core question it answers is whether a storage system can sustain enough throughput to keep the simulated accelerators at a required utilization level.
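As a rough illustration of that measurement approach, the sketch below emulates a training loop in the way the article describes: each step performs real reads against storage but replaces the accelerator’s compute phase with a sleep (the “think time”), and utilization is the fraction of wall-clock time the simulated accelerator would have spent computing. The file layout, think-time value, and utilization threshold here are hypothetical; the official benchmark harness defines these per workload and differs in detail.

```python
import time
from pathlib import Path

# Hypothetical parameters; the real MLPerf Storage harness defines these per workload.
THINK_TIME_S = 0.05          # emulated per-batch accelerator compute time
UTILIZATION_TARGET = 0.90    # example pass/fail threshold
DATA_FILES = sorted(Path("dataset/").glob("*.npz"))  # pre-generated sample files (assumed)

def run_emulated_epoch() -> float:
    """Read each sample from storage, 'compute' by sleeping, return utilization."""
    start = time.perf_counter()
    busy = 0.0
    for path in DATA_FILES:
        _ = path.read_bytes()      # real storage I/O being measured
        time.sleep(THINK_TIME_S)   # stand-in for accelerator compute
        busy += THINK_TIME_S
    elapsed = time.perf_counter() - start
    return busy / elapsed          # simulated accelerator utilization

if __name__ == "__main__":
    au = run_emulated_epoch()
    print(f"simulated accelerator utilization: {au:.1%} "
          f"({'PASS' if au >= UTILIZATION_TARGET else 'FAIL'})")
```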

The v2.0 submissions showcased significant innovation and a diverse array of technical approaches to delivering high-performance storage for AI training. These included six local storage solutions, two solutions utilizing in-storage accelerators, thirteen software-defined solutions, twelve block systems, sixteen on-prem shared storage solutions, and two object stores. Oana Balmau, MLPerf Storage working group co-chair, remarked, “Everything is scaling up: models, parameters, training datasets, clusters, and accelerators. It’s no surprise to see that storage system providers are innovating to support ever larger scale systems.” She added, “Faced with the need to deliver storage solutions that are both high-performance and at unprecedented scale, the technical community has stepped up once again and is innovating at a furious pace.”

The MLPerf Storage benchmark is the result of a three-year collaborative engineering effort involving 35 leading storage solution providers and academic research groups. Its open-source and peer-reviewed nature fosters a fair competitive environment that drives innovation, performance, and energy efficiency across the industry, while also providing crucial technical information for customers deploying and fine-tuning AI training systems.

The broad participation in v2.0 underscores the industry’s recognition of how important high-performance storage is to AI training. MLPerf Storage v2.0 includes over 200 performance results from 26 submitting organizations across seven different countries. These organizations include Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, KIOXIA, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, WDC, and YanRong.

David Kanter, Head of MLPerf at MLCommons, noted that this round set new records for MLPerf benchmarks in terms of participating organizations and total submissions. “The AI community clearly sees the importance of our work in publishing accurate, reliable, unbiased performance data on storage systems, and it has stepped up globally to be a part of it,” Kanter stated. He welcomed the numerous first-time submitters, including Alluxio, ExponTech, FarmGPU, H3C, Kingston, KIOXIA, Oracle, Quanta Computer, Samsung, Sandisk, TTA, UBIX, IBM, and WDC. Kanter concluded that this level of participation is a “game-changer for benchmarking,” enabling the publication of more accurate and representative data on real-world systems, and empowering stakeholders with the information needed to optimize their operations.