This program shows the CGO keynote talk on Monday and the HPCA keynote on Wednesday, a swap from the preliminary program posted earlier due to speaker travel issues.
Speaker: Peng Wu
Title: PyTorch 2.0 — the Journey to Bringing Compiler Technologies to the Core of PyTorch
Speaker: Saeed Maleki
Title: GPU Communication Requires Rethinking Abstractions
Abstract: Machine learning models are growing at an unprecedented pace, and training such models requires large distributed GPU systems with costly collective communication primitives such as AllReduce or AllToAll. Recent trends in large language models suggest that 30-70% of both training and inference time is spent on communication. Despite major leaps in hardware innovation for GPU communication, significant software performance is still left on the table. This gap is largely due to the lack of a performant point-to-point (P2P) communication abstraction on GPUs: HPC applications running on CPUs rely on performant implementations of the MPI abstraction to control communication at a fine-grained level, but no comparably performant abstraction exists on GPUs.
In this talk, we will discuss how much performance headroom remains for collective communication on GPUs and the challenges of realizing it. We will also examine how existing communication-primitive abstractions limit the space of parallelization configurations for AI workloads, and look at the challenges of overlapping communication with computation on GPUs. We will conclude by proposing an abstraction for GPU communication and showing how effective it is in closing these performance gaps.
Bio: Saeed Maleki is a Principal Research SDE at Microsoft Research, working in the RiSE team. Prior to Microsoft, he was a PhD student at the University of Illinois at Urbana-Champaign, working with Professor David Padua.
Saeed’s main research is in optimizing AI infrastructure, with a focus on the communication software stack for accelerators. He has published in this area at venues including ASPLOS’23, ASPLOS’22, and NSDI’23, and won a best paper award at PPoPP’21. His research agenda is to create a powerful abstraction for accelerator communication that enables performance optimizations for AI applications, including custom collective communication algorithms, communication-computation overlapping, and resiliency support.
Speaker: Daniel A. Jiménez
Title: Addressing Challenges of Core Microarchitecture Research
Abstract: Core microarchitecture has been studied for decades but remains a crucial research area due to the evolving demands of modern computing workloads. Growing instruction footprints, the influx of massive data into the processor, the overhead of modern programming languages, and the emphasis on productivity over performance all require innovative approaches. As Moore’s Law reaches its end, the onus of improving performance and efficiency falls on microarchitecture research. Additionally, with more and more companies opting to design their own processors, academia is tasked not only with developing new processing technologies but also with training the workforce that will design these new chips.
In this talk, I will motivate the need for continued core microarchitecture research, give some recent examples of topics we study, such as instruction fetch, address translation, and cache management, and offer some insight into the challenges we face in this kind of work. For example, branch prediction has been well studied for decades, but recent trends in software design have caused huge growth in instruction footprints, putting pressure on other areas of instruction fetch, overwhelming the capacity of modern branch predictors, and ultimately degrading performance.
Bio: Daniel A. Jiménez is a Professor in the Department of Computer Science and Engineering at Texas A&M University. Jiménez received his Ph.D. in Computer Sciences from the University of Texas at Austin. Jiménez’s research is in microarchitecture, including microarchitectural prediction and cache management. He pioneered the development of neural-inspired branch predictors that have been implemented in millions of processors sold by IBM, AMD, Oracle, and Samsung.
He designed the neural branch predictors used in the popular Samsung Galaxy S7/8/9/10/20. His 2001 paper on perceptron-based branch prediction won the HPCA Test of Time Award in 2019. Jiménez won the 2021 IEEE CS B. Ramakrishna Rau Award for contributions to neural branch prediction. Jiménez is an IEEE Fellow, an ACM Distinguished Scientist, an NSF CAREER award winner, and a member of the ISCA, MICRO, and HPCA halls of fame. He is the Chair of the IEEE Computer Society Technical Committee on Computer Architecture (TCCA) and co-Chair of the ISCA Steering Committee. He was General Chair of IEEE HPCA in 2011, Program Chair for IEEE HPCA in 2017, and Selection Committee Chair for IEEE Micro “Top Picks” 2020.
Mon 27 Feb
08:30 - 09:30
|PyTorch 2.0 — the Journey to Bringing Compiler Technologies to the Core of PyTorch|
Peng Wu Facebook
Tue 28 Feb
08:30 - 09:30
|GPU Communication Requires Rethinking Abstractions|
Saeed Maleki Microsoft Research
Wed 1 Mar
08:30 - 09:30
|Addressing Challenges of Core Microarchitecture Research|
Daniel A. Jiménez Texas A&M University