DSP: Efficient GNN Training with Multiple GPUs (PPoPP 2023 - Main Conference)

Who

Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis

Track

PPoPP 2023 Main Conference

Time Zone

The program is currently displayed in (GMT-05:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 1 Mar 2023 11:00 - 11:20 at Montreal 4 - Session 7: Machine Learning Chair(s): Milind Kulkarni

Abstract

Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout that stores the graph topology and popular node features in GPU memory to utilize the fast NVLink connections among the GPUs. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes the sampling tasks to the data to reduce communication. We also design a producer-consumer-based pipeline, which allows tasks from different mini-batches to run concurrently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines under different datasets, GNN models, and GPU counts. The speedup of DSP can reach an order of magnitude and is over 2x in most cases.
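
The producer-consumer pipeline mentioned in the abstract can be illustrated with a minimal sketch. This is not DSP's implementation: the stage functions (sample, load_features, train), the run_pipeline helper, and the queue capacity are all hypothetical names and values chosen for illustration, and the stages are toy CPU functions rather than GPU kernels. The sketch only shows the general pattern in which each stage consumes from one bounded queue and produces into the next, so that work from different mini-batches can be in flight at the same time.

```python
import queue
import threading

SENTINEL = object()  # unique marker for the end of the mini-batch stream


def stage(work, inbox, outbox):
    """Consume items from inbox, apply work, and produce results to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(batch_ids, stages, capacity=2):
    """Run every stage in its own thread, linked by bounded queues."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed():
        # Producer for the first stage: blocks when the queue is full.
        for batch_id in batch_ids:
            queues[0].put(batch_id)
        queues[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    # The caller drains the last queue so no stage blocks forever.
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in workers + [feeder]:
        t.join()
    return results


# Hypothetical toy stages standing in for sampling, feature loading, training.
def sample(batch_id):
    return {"batch": batch_id, "nodes": [batch_id, batch_id + 1]}


def load_features(task):
    task["features"] = [float(n) for n in task["nodes"]]
    return task


def train(task):
    return (task["batch"], sum(task["features"]))


if __name__ == "__main__":
    print(run_pipeline(range(3), [sample, load_features, train]))
    # prints [(0, 1.0), (1, 3.0), (2, 5.0)]
```

The bounded queues apply backpressure: a fast stage stalls once it is `capacity` items ahead of its consumer, which keeps the amount of buffered work small. Because each stage runs in a single thread and the queues are first-in, first-out, results come out in mini-batch order.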

Zhenkun Cai

The Chinese University of Hong Kong

Qihui Zhou

The Chinese University of Hong Kong

Xiao Yan

Southern University of Science and Technology

Da Zheng

Amazon Web Services

Xiang Song

Amazon Web Services

Chenguang Zheng

The Chinese University of Hong Kong

James Cheng

The Chinese University of Hong Kong

George Karypis

Amazon Web Services

Session Program

Wed 1 Mar
Displayed time zone: Eastern Time (US & Canada)

10:00 - 11:40	Session 7: Machine Learning (Main Conference) at Montreal 4. Chair(s): Milind Kulkarni (Purdue University)

10:00 20m Talk	TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks (Main Conference). Yufeng Wang (University of Illinois at Urbana-Champaign), Charith Mendis (University of Illinois at Urbana-Champaign)
10:20 20m Talk	Dynamic N:M Fine-grained Structured Sparse Attention Mechanism (Main Conference). Zhaodong Chen (University of California, Santa Barbara), Zheng Qu (University of California, Santa Barbara), Yuying Quan (University of California, Santa Barbara), Liu Liu, Yufei Ding (University of California, Santa Barbara), Yuan Xie (University of California, Santa Barbara)
10:40 20m Talk	Elastic Averaging for Efficient Pipelined DNN Training (Main Conference). Zihao Chen (East China Normal University), Chen Xu (East China Normal University), Weining Qian (East China Normal University), Aoying Zhou (East China Normal University)
11:00 20m Talk	DSP: Efficient GNN Training with Multiple GPUs (Main Conference). Zhenkun Cai (The Chinese University of Hong Kong), Qihui Zhou (The Chinese University of Hong Kong), Xiao Yan (Southern University of Science and Technology), Da Zheng (Amazon Web Services), Xiang Song (Amazon Web Services), Chenguang Zheng (The Chinese University of Hong Kong), James Cheng (The Chinese University of Hong Kong), George Karypis (Amazon Web Services)
11:20 20m Talk	PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs (Main Conference). Chunyang Wang (Beihang University), Desen Sun (Beihang University), Yuebin Bai (Beihang University)