Dynamic N:M Fine-grained Structured Sparse Attention Mechanism
Transformers are becoming the mainstream solution for various tasks such as NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is to leverage the sparsity in the attention weight matrix. However, due to the dilemma between “dynamic” and “fine-grained”, previous studies fail to achieve speedup on GPUs at moderate sequence lengths, and they require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to an N:M fine-grained structured sparse pattern. Our key insight is that, on the dynamic side, N:M sparsity is friendly to pruning and encoding the sparse matrix on the GPU; on the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication kernel, the first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedup at arbitrary sequence lengths and needs only a few fine-tuning epochs to reach accuracy on par with the full attention mechanism. We provide both theoretical and empirical evidence that DFSS is a good approximation of the full attention mechanism. We evaluate 1:2 and 2:4 sparsity under different settings and achieve $1.27\sim 1.89\times$ speedups over full attention on an A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, DFSS matches the accuracy of full attention after only a couple of fine-tuning epochs starting from the dense pre-trained model.
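The pruning step behind the abstract can be sketched in a few lines. The snippet below is a minimal PyTorch illustration of dynamic 2:4 pruning applied to the attention score matrix, not the fused SDDMM CUDA kernel described in the paper; the function name `prune_n_m`, the tensor shapes, and the use of masking (rather than emitting a compressed sparse encoding) are illustrative assumptions.

```python
import torch

def prune_n_m(scores: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest entries in every group of m consecutive entries
    along the last dimension and set the rest to -inf, so they become zero
    after softmax. Assumes the last dimension is a multiple of m."""
    *lead, cols = scores.shape
    groups = scores.reshape(*lead, cols // m, m)
    # Indices of the top-n entries inside each group of m.
    top_idx = groups.topk(n, dim=-1).indices
    keep = torch.zeros_like(groups, dtype=torch.bool)
    keep.scatter_(-1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
    return groups.masked_fill(~keep, float("-inf")).reshape(*lead, cols)

# Usage: approximate full attention with dynamically pruned scores.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))  # (batch, heads, seq, dim)
scores = (q @ k.transpose(-1, -2)) / 64 ** 0.5
probs = torch.softmax(prune_n_m(scores, n=2, m=4), dim=-1)
out = probs @ v  # same shape as v
```

In DFSS itself, this top-N-of-M selection and the sparse encoding are fused into the query-key multiplication kernel, which is what removes the pruning overhead on the GPU.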
Wed 1 Mar. Displayed time zone: Eastern Time (US & Canada).

10:00 - 11:40 | Session 7: Machine Learning (Main Conference) at Montreal 4. Chair(s): Milind Kulkarni, Purdue University.

10:00, 20m talk | TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks (Main Conference). Yufeng Wang, University of Illinois at Urbana-Champaign; Charith Mendis, University of Illinois at Urbana-Champaign.

10:20, 20m talk | Dynamic N:M Fine-grained Structured Sparse Attention Mechanism (Main Conference). Zhaodong Chen, University of California, Santa Barbara; Zheng Qu, University of California, Santa Barbara; Yuying Quan, University of California, Santa Barbara; Liu Liu; Yufei Ding, UC Santa Barbara; Yuan Xie, UCSB.

10:40, 20m talk | Elastic Averaging for Efficient Pipelined DNN Training (Main Conference). Zihao Chen, East China Normal University; Chen Xu, East China Normal University; Weining Qian, East China Normal University; Aoying Zhou, East China Normal University.

11:00, 20m talk | DSP: Efficient GNN Training with Multiple GPUs (Main Conference). Zhenkun Cai, The Chinese University of Hong Kong; Qihui Zhou, The Chinese University of Hong Kong; Xiao Yan, Southern University of Science and Technology; Da Zheng, Amazon Web Services; Xiang Song, Amazon Web Services; Chenguang Zheng, The Chinese University of Hong Kong; James Cheng, The Chinese University of Hong Kong; George Karypis, Amazon Web Services.

11:20, 20m talk | PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs (Main Conference).