LU factorization of sparse matrices is an important computational step in many engineering and scientific problems, such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, including recent efforts targeting GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a load-balanced implementation (through dynamic workload allocation) of Kahn’s algorithm for topological sort on GPUs. Finally, for the numeric factorization phase, we increase the degree of parallelism by removing the memory limits for large matrices, compared with existing implementation approaches. Experimental results show that, compared with a recent multi-core GPU implementation, GLU 3.0, our out-of-core version achieves speedups of 1.13-32.65X. Further, our out-of-core implementation achieves a speedup of 1.2-2.2X over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization are effective.
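To illustrate the topological sort step mentioned in the abstract, here is a minimal sequential sketch of Kahn's algorithm that groups the nodes of a dependency graph into levels, so that every node in a level depends only on nodes in earlier levels. It is not the authors' GPU implementation and includes none of their load balancing. The function name `kahn_levels` and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict


def kahn_levels(num_nodes, edges):
    """Group the nodes of a DAG into levels using Kahn's algorithm.

    edges is an iterable of (u, v) pairs meaning "v depends on u".
    Nodes within one level have no dependencies on each other, so they
    could in principle be processed in parallel.
    """
    successors = defaultdict(list)
    in_degree = [0] * num_nodes
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Start with every node that has no unresolved dependency.
    frontier = [n for n in range(num_nodes) if in_degree[n] == 0]
    levels = []
    visited = 0
    while frontier:
        levels.append(frontier)
        visited += len(frontier)
        next_frontier = []
        for u in frontier:
            for v in successors[u]:
                in_degree[v] -= 1
                if in_degree[v] == 0:
                    next_frontier.append(v)
        frontier = next_frontier

    if visited != num_nodes:
        raise ValueError("dependency graph contains a cycle")
    return levels


if __name__ == "__main__":
    # Hypothetical 5-node dependency graph, for illustration only.
    print(kahn_levels(5, [(0, 2), (1, 2), (2, 3), (1, 4)]))
```

In a parallel setting each level's frontier is the unit of work that can be processed concurrently. The amount of work per frontier node varies with its number of successors, which is the kind of imbalance that dynamic workload allocation is meant to address.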
Tue 28 Feb (displayed time zone: Eastern Time, US & Canada)
13:50 - 15:10 | Session 5: Decompositions | Main Conference at Montreal 4 | Chair(s): Milind Chabbi (Uber Technologies Inc.)
13:50 | 20m | Talk | TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition | Main Conference | Lizhi Xiang (University of Utah), Miao Yin (Rutgers University), Chengming Zhang (Indiana University), Aravind Sukumaran-Rajam (Meta), Saday Sadayappan (University of Utah, USA), Bo Yuan (Rutgers University), Dingwen Tao (Indiana University)
14:10 | 20m | Talk | Improving Energy Saving of One-sided Matrix Decompositions on CPU-GPU Heterogeneous Systems | Main Conference | Jieyang Chen (University of Alabama at Birmingham), Xin Liang (University of Kentucky), Kai Zhao (University of Alabama at Birmingham), Hadi Zamani Sabzi (University of California, Riverside), Laxmi Bhuyan (University of California, Riverside), Zizhong Chen (University of California, Riverside)
14:30 | 20m | Talk | End-to-End LU Factorization of Large Matrices on GPUs | Main Conference | Yang Xia, Peng Jiang (The University of Iowa), Rajiv Ramnath (The Ohio State University), Gagan Agrawal (Augusta University)
14:50 | 20m | Talk | Fast Eigenvalue Decomposition via WY Representation on Tensor Core | Main Conference | Shaoshuai Zhang (University of Houston), Ruchi Shah (University of Houston), Hiroyuki Ootomo (Tokyo Institute of Technology), Rio Yokota (Tokyo Institute of Technology), Panruo Wu (University of Houston)