Machine learning models are growing at an unprecedented rate, and training them requires large distributed GPU systems with costly collective communication primitives such as AllReduce or AllToAll. Recent trends in large language models suggest that 30-70% of both training and inference time is spent on communication. Despite major leaps in hardware innovation for GPU communication, significant software performance is still left on the table. This gap is largely due to the lack of a performant point-to-point (P2P) communication abstraction on GPUs. HPC applications running on CPUs use performant implementations of the MPI abstraction to control communication at a fine-grained level; no comparably performant abstraction exists on GPUs.
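To make the connection between collectives and fine-grained P2P concrete, here is a minimal pure-Python simulation (not from the talk, and not a real GPU implementation) of a ring AllReduce: the collective decomposes into 2(P-1) steps in which each rank performs a single P2P send to its neighbor. The function name and cost-free model are illustrative assumptions.

```python
# Hypothetical sketch: a sum-AllReduce over P simulated ranks, built
# entirely from per-step P2P sends around a ring (reduce-scatter phase
# followed by an allgather phase).

def ring_allreduce(buffers):
    """AllReduce (sum) across `buffers`, one flat list per rank."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "assume the buffer divides evenly into P chunks"
    c = n // p
    # Split each rank's buffer into P chunks.
    chunks = [[list(buf[i*c:(i+1)*c]) for i in range(p)] for buf in buffers]

    # Reduce-scatter: P-1 steps; in step s, rank r sends chunk (r-s) mod P
    # to rank (r+1) mod P, which accumulates it. Afterwards, rank r fully
    # owns the reduced chunk (r+1) mod P.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, list(chunks[r][(r - s) % p])) for r in range(p)]
        for r, idx, data in sends:  # all sends in a step are concurrent
            dst = (r + 1) % p
            chunks[dst][idx] = [a + b for a, b in zip(chunks[dst][idx], data)]

    # Allgather: P-1 more steps circulating the fully reduced chunks.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, list(chunks[r][(r + 1 - s) % p])) for r in range(p)]
        for r, idx, data in sends:
            chunks[(r + 1) % p][idx] = data

    return [[x for ch in chunks[r] for x in ch] for r in range(p)]
```

With four ranks each holding four elements, every rank ends with the elementwise sum; the point is that exposing the underlying P2P steps is what allows fine-grained control (scheduling, chunking, routing) that a monolithic AllReduce call hides.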
In this talk, we will discuss how much performance headroom remains for collective communication on GPUs and the challenges in realizing it. We will also examine existing communication-primitive abstractions and how they limit the space of parallelization configurations for AI workloads. Next, we will look at the challenges of overlapping communication and computation on GPUs. We will conclude by proposing an abstraction for GPU communication and showing how effective it is in closing these performance gaps.
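A simple cost model (my own illustration, not the talk's analysis) shows why overlapping communication with computation matters: if a tensor is split into chunks, computation on chunk i can proceed while chunk i+1 is still in flight, so total time approaches max(comm, compute) rather than their sum. The function names and uniform per-chunk costs are assumptions for the sketch.

```python
# Hypothetical cost model for chunked communication/computation overlap.

def serial_time(comm, comp):
    # No overlap: finish all communication, then do all computation.
    return sum(comm) + sum(comp)

def pipelined_time(comm, comp):
    # Overlap: chunk i's compute starts once chunk i has arrived, while
    # the link keeps transferring later chunks concurrently.
    t_comm = 0.0  # time at which the link finishes the current chunk
    t_comp = 0.0  # time at which the GPU finishes the current chunk
    for cm, cp in zip(comm, comp):
        t_comm += cm                       # chunk finishes arriving
        t_comp = max(t_comp, t_comm) + cp  # compute once it has arrived
    return t_comp
```

For four chunks with unit communication and computation costs, the serial schedule takes 8 time units while the pipelined one takes 5; the hard part on real GPUs, which the talk addresses, is that kernels and communication contend for the same SMs and memory bandwidth, so this idealized overlap is difficult to achieve.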
Tue 28 Feb (Eastern Time, US & Canada)
08:30 - 09:30
GPU Communication Requires Rethinking Abstractions
Saeed Maleki, Microsoft Research