TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition
Tucker decomposition is one of the state-of-the-art CNN model compression techniques. However, although it substantially reduces FLOPs, we observe very limited reduction in inference time when Tucker-compressed models are run with existing GPU software such as cuDNN. To address this gap, we propose an efficient end-to-end framework that generates highly accurate and compact CNN models via Tucker decomposition, together with optimized inference code for GPUs. Specifically, we propose an ADMM-based training algorithm that produces highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions, along with analytical performance models that guide the selection of execution parameters. We further propose a co-design framework that determines the Tucker ranks based on practical inference time rather than FLOPs. Our evaluation on five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 2.21$\times$ speedup over cuDNN, 1.12$\times$ speedup over TVM, and 3.27$\times$ speedup over the original models using cuDNN, with at most 0.05% accuracy loss.
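To make the FLOPs argument concrete, the following minimal sketch counts parameters and multiply-accumulates (MACs) for a dense convolution and for a Tucker-format replacement. It assumes the common Tucker-2 form, in which a K x K convolution is replaced by a 1 x 1 convolution, a smaller K x K core convolution, and another 1 x 1 convolution; the abstract does not specify the exact factorization. The layer shape, ranks, and function names are hypothetical and are not taken from the paper.

```python
# Sketch only: assumes a Tucker-2 factorization, stride 1, "same" padding.
# All sizes and ranks below are illustrative, not values from the paper.

def dense_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def tucker2_conv_params(c_in, c_out, k, r_in, r_out):
    """Weights in the three-stage Tucker-2 replacement (bias ignored)."""
    first_1x1 = c_in * r_in           # 1 x 1 conv: c_in -> r_in channels
    core_kxk = r_in * r_out * k * k   # k x k core conv: r_in -> r_out channels
    last_1x1 = r_out * c_out          # 1 x 1 conv: r_out -> c_out channels
    return first_1x1 + core_kxk + last_1x1

def macs(params, height, width):
    """MACs when every weight is applied once per output pixel."""
    return params * height * width

# Hypothetical layer: 256 -> 256 channels, 3 x 3 kernel, 14 x 14 feature map.
c_in, c_out, k, h, w = 256, 256, 3, 14, 14
r_in, r_out = 64, 64  # hypothetical Tucker ranks

dense = dense_conv_params(c_in, c_out, k)
tucker = tucker2_conv_params(c_in, c_out, k, r_in, r_out)
reduction = macs(dense, h, w) / macs(tucker, h, w)

print(f"dense params:  {dense}")
print(f"tucker params: {tucker}")
print(f"MAC reduction: {reduction:.2f}x")
```

With these assumed sizes the MAC count drops by roughly 8.5x. The abstract's point is that a reduction of this kind on paper does not by itself yield a comparable reduction in measured inference time, which is why the ranks are chosen from measured latency rather than FLOPs.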
Tue 28 Feb. Displayed time zone: Eastern Time (US & Canada)
13:50 - 15:10 | Session 5: Decompositions | Main Conference at Montreal 4 | Chair(s): Milind Chabbi (Uber Technologies Inc.)
13:50 | 20m | Talk | TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition | Main Conference | Lizhi Xiang (University of Utah); Miao Yin (Rutgers University); Chengming Zhang (Indiana University); Aravind Sukumaran-Rajam (Meta); Saday Sadayappan (University of Utah, USA); Bo Yuan (Rutgers University); Dingwen Tao (Indiana University)
14:10 | 20m | Talk | Improving Energy Saving of One-sided Matrix Decompositions on CPU-GPU Heterogeneous Systems | Main Conference | Jieyang Chen (University of Alabama at Birmingham); Xin Liang (University of Kentucky); Kai Zhao (University of Alabama at Birmingham); Hadi Zamani Sabzi (University of California, Riverside); Laxmi Bhuyan (University of California, Riverside); Zizhong Chen (University of California, Riverside)
14:30 | 20m | Talk | End-to-End LU Factorization of Large Matrices on GPUs | Main Conference | Yang Xia; Peng Jiang (The University of Iowa); Rajiv Ramnath (The Ohio State University); Gagan Agrawal (Augusta University)
14:50 | 20m | Talk | Fast Eigenvalue Decomposition via WY Representation on Tensor Core | Main Conference | Shaoshuai Zhang (University of Houston); Ruchi Shah (University of Houston); Hiroyuki Ootomo (Tokyo Institute of Technology); Rio Yokota (Tokyo Institute of Technology); Panruo Wu (University of Houston)