Cost-Efficient and Scalable GPU Scheduling Strategies in Multi-Tenant Cloud Environments for AI Workloads
Abstract
The recent dramatic growth in artificial intelligence (AI) workloads has intensified contention for GPU resources in cloud computing environments, leading to inefficient utilization and rising operational costs. Multi-tenant platforms, where heterogeneous workloads share resources, are particularly sensitive to such resource bottlenecks. In this paper we present a versatile GPU scheduling framework that balances cost-efficiency, performance isolation, and fairness across heterogeneous AI workloads. The proposed system dynamically optimizes GPU allocation using a multi-objective scheduling algorithm guided by machine-learning-based workload prediction. To mitigate GPU fragmentation, we combine automatic memory management, temporal multiplexing of training tasks, and spatial partitioning of inference tasks. Through large-scale evaluation on real-world workloads spanning computer vision, natural language processing, and scientific computing, we show that our framework improves GPU utilization by 65 percent and reduces average job completion time by 40 percent relative to FIFO baselines. The framework also leverages predictive scaling, preemptible instances, and intelligent provisioning to cut cloud infrastructure costs by 45 percent. Additional contributions include a fairness-aware scheduling policy, support for mixed-precision workloads, and novel GPU defragmentation strategies. These results demonstrate a practical path toward scalable, cost-efficient, and tenant-aware GPU scheduling in cloud-native AI platforms, laying the groundwork for next-generation high-performance AI infrastructure accessible to a broad range of users and organizations.