NVIDIA Blackwell Boosts MLPerf v5.0 Training Scores by Up to 2.6x
NVIDIA's Blackwell-powered platform recently demonstrated exceptional performance in MLPerf Training v5.0, achieving the fastest times across all seven benchmarks: LLM pretraining, LLM fine-tuning, text-to-image generation, graph neural networks, recommender systems, natural language processing, and object detection. These results highlight NVIDIA's advanced hardware and software capabilities, particularly in handling large-scale AI models.

LLM Fine-Tuning (Llama 2 70B LoRA)

NVIDIA's submission for Llama 2 70B LoRA fine-tuning involved several key steps:

1. Prerequisites: Ensure the system has a properly set up cluster with 72 Blackwell GPUs connected via NVLink and managed by NVIDIA Base Command Manager (BCM).
2. Cluster Setup: Docker is not used on NVIDIA's submission clusters, where BCM manages the cluster and controls the SLURM job launcher.
3. Building the Container: Build a Docker container to download and preprocess the dataset and checkpoint. This can be done on any machine with Docker, as long as the processed data ends up accessible from the compute nodes (see the container sketch after these instructions).
4. Launching the Benchmark:
   - Source the configuration file (config_*.sh), which includes the hyperparameters and system-specific optimizations.
   - Run the sbatch command to start the training, ensuring the dataset, checkpoint, and configuration files are correctly referenced (see the launch sketch below).

LLM Pretraining (Llama 3.1 405B)

For Llama 3.1 405B pretraining, the process is similar but involves a larger GPU configuration:

1. Prerequisites: As with the fine-tuning benchmark, the system needs a cluster of racks with 72 Blackwell GPUs each, for 512 GPUs in total.
2. Cluster Setup: Use BCM to manage the cluster and ensure SLURM commands are accessible.
3. Building the Container: Build a Docker container to download the dataset and tokenizer. The dataset should be preprocessed and stored in the PREPROCESSED_PATH directory, and the tokenizer in the TOKENIZER_PATH directory.
4. Checkpoint: Resume training from Meta's official Hugging Face checkpoint, which requires significant storage (more than 1.5 TB).
5. Launching the Benchmark:
   - Use a configuration file (config_*.sh) that specifies the hyperparameters, including the global batch size and the parallelization scheme.
   - Run the sbatch command to start the training, ensuring the dataset, checkpoint, and configuration files are correctly referenced.
6. Log Parsing:
   - Look for MLPerf-relevant lines prefixed with :::MLLOG.
   - Track the initialization, training start, evaluation accuracy, and training stop markers to compute the final score. The score is the time difference between the run_stop and run_start markers (see the score-computation sketch below).
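As a rough illustration of the container-based data preparation step, the commands below build the benchmark image and run it with a host directory mounted so the downloaded and preprocessed artifacts land on shared storage visible to the compute nodes. The image tag, mount path, and preprocessing script name are placeholders; the actual Dockerfile and download/preprocessing scripts are documented in the corresponding benchmark directory of the NVIDIA submission repository.

    # Build the benchmark image from the benchmark directory's Dockerfile.
    # Can be run on any machine with Docker; the tag is a placeholder.
    docker build -t mlperf-llama2-70b-lora:latest .

    # Run the download/preprocessing step with shared storage mounted at /data,
    # so the resulting dataset and checkpoint are visible to the compute nodes.
    # The script name below is a placeholder for the repository's actual scripts.
    docker run --rm \
        -v /shared/mlperf/llama2_70b_lora:/data \
        mlperf-llama2-70b-lora:latest \
        python scripts/download_and_preprocess.py --output-dir /data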
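Launching either benchmark follows the same pattern: source the system-specific config_*.sh, export the locations of the container image, data, and logs, then submit the SLURM job with sbatch. The sketch below shows that pattern for the fine-tuning run; the config filename, the run.sub batch script, and the CONT/DATADIR/LOGDIR/DGXNNODES/WALLTIME variable names follow the conventions of NVIDIA's MLPerf submission repositories but should be treated as assumptions here and checked against the benchmark's README.

    # Load hyperparameters and system-specific tuning for the target system
    # (the filename is illustrative; pick the config matching your cluster).
    source config_GB200_NVL72.sh

    # Point the job at the container image, the preprocessed data/checkpoint,
    # and a directory for the result logs (all paths are placeholders).
    export CONT=/shared/mlperf/containers/llama2_70b_lora.sqsh
    export DATADIR=/shared/mlperf/llama2_70b_lora
    export LOGDIR=/shared/mlperf/logs

    # Submit the training job; DGXNNODES and WALLTIME come from the sourced config.
    sbatch -N "${DGXNNODES}" --time="${WALLTIME}" run.sub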
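The final score can be recovered directly from the result log by locating the :::MLLOG lines whose key is run_start and run_stop and taking the difference of their time_ms values. A minimal sketch, assuming the standard MLPerf logging format (each :::MLLOG line carries a JSON payload with time_ms and key fields) and a log file named result_0.txt:

    LOG=result_0.txt  # path to the benchmark's result log (placeholder name)

    # Extract the millisecond timestamps of the run_start and run_stop markers.
    start_ms=$(grep ':::MLLOG' "$LOG" | grep '"key": "run_start"' \
               | sed -E 's/.*"time_ms": ?([0-9]+).*/\1/' | head -n 1)
    stop_ms=$(grep ':::MLLOG' "$LOG" | grep '"key": "run_stop"' \
              | sed -E 's/.*"time_ms": ?([0-9]+).*/\1/' | tail -n 1)

    # The score is the elapsed time between run_start and run_stop, in minutes.
    awk -v s="$start_ms" -v e="$stop_ms" \
        'BEGIN { printf "score: %.3f minutes\n", (e - s) / 60000 }'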
Key Optimizations

Architecture Innovations: Blackwell introduces a second-generation Transformer Engine, faster NVLink interconnects, and higher-bandwidth HBM3e memory, enabling a 2.2x speedup over Hopper for Llama 3.1 405B pretraining.

Software Stack: The cuBLAS library was optimized for Blackwell, and CUDA Graphs were used to reduce the memory footprint and minimize host CPU overhead.

Parallel Mappings: The TP-CP-DP-PP (DP-Last) parallelism mapping was tuned for GB200 NVL72, maximizing training throughput.

Memory Management: Enhanced Flash Attention kernels and storing the SwiGLU input in FP8 reduced memory usage, allowing the Llama 2 70B fine-tuning workload to fit on a single GPU.

Additional Benchmarks

Text-to-Image (Stable Diffusion v2): GB200 NVL72 achieved a 2.6x performance improvement over Hopper, driven by an optimized Apex GroupNorm kernel and pipelined data-parallel communications.

Graph Neural Network (R-GAT): A 2.25x performance gain over Hopper was realized through extended CUDA Graphs and fused copy operations.

Industry Impact and Evaluation

These results from MLPerf Training v5.0 underscore NVIDIA's leadership in AI training hardware and software. The Blackwell platform's ability to deliver up to 2.6x higher performance per GPU than Hopper translates into significant reductions in training time and cost. Industry insiders commend NVIDIA for pushing the boundaries of what is possible in large-scale AI training, noting that these advances can accelerate the development and deployment of more sophisticated and capable AI models.

The robust performance at scale, especially in LLM pretraining, is a testament to the platform's scalability and efficiency. As AI continues to evolve, such performance gains could prove crucial for organizations looking to leverage LLMs for applications ranging from enterprise customization to real-time text-to-image generation. NVIDIA's continuous optimization efforts, reflected in the latest improvements for both Hopper and Blackwell, demonstrate a commitment to ensuring the latest hardware advances are fully utilized. Companies and researchers interested in reproducing these results can find detailed guides and scripts in the NVIDIA submission repositories, making it easier to achieve high-performance training on their own systems.