End-to-End LU Factorization of Large Matrices on GPUs (PPoPP 2023 - Main Conference)

Who

Yang Xia, Peng Jiang, Rajiv Ramnath, Gagan Agrawal

Track

PPoPP 2023 Main Conference

When

Tue 28 Feb 2023 14:30 - 14:50 (GMT-05:00, Eastern Time) at Montreal 4 - Session 5: Decompositions. Chair(s): Milind Chabbi

Abstract

LU factorization of sparse matrices is an important computational step in many engineering and scientific problems, such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, including recent efforts targeting GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we first propose an out-of-core implementation of the symbolic execution phase, which removes the bottleneck caused by large intermediate data structures. Next, we propose a load-balanced implementation (through dynamic workload allocation) of Kahn’s algorithm for topological sort on GPUs. Finally, for the numeric factorization phase, we increase the degree of parallelism by removing the memory limits for large matrices that constrain existing implementations. Experimental results show that, compared with a recent multi-core GPU implementation, GLU 3.0, our out-of-core version achieves speedups of 1.13-32.65X. Further, our out-of-core implementation achieves a speedup of 1.2-2.2X over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization are effective.
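
For readers unfamiliar with Kahn’s algorithm, the following is a minimal sequential sketch in Python, not the authors' GPU implementation. It groups the nodes of a dependency graph into levels so that every node in a level depends only on nodes in earlier levels, which is the property that lets the members of one level be processed in parallel. The function name kahn_levels and the edge-list input format are assumptions made for illustration; the dynamic workload allocation described in the abstract is not modeled here.

```python
from collections import defaultdict


def kahn_levels(num_nodes, edges):
    """Group nodes 0..num_nodes-1 into topological levels.

    edges is a list of (src, dst) pairs meaning dst depends on src.
    Raises ValueError if the graph contains a cycle.
    """
    successors = defaultdict(list)
    in_degree = [0] * num_nodes
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Nodes with no unresolved dependencies form the first level.
    frontier = [v for v in range(num_nodes) if in_degree[v] == 0]
    levels = []
    visited = 0
    while frontier:
        levels.append(frontier)
        visited += len(frontier)
        next_frontier = []
        for v in frontier:
            for w in successors[v]:
                # Retiring v resolves one dependency of w.
                in_degree[w] -= 1
                if in_degree[w] == 0:
                    next_frontier.append(w)
        frontier = next_frontier

    if visited != num_nodes:
        raise ValueError("dependency graph contains a cycle")
    return levels


if __name__ == "__main__":
    # Hypothetical 5-node dependency graph.
    print(kahn_levels(5, [(0, 2), (1, 2), (2, 3), (1, 4)]))
```

In a GPU setting, each frontier would typically be expanded by many threads at once, and the amount of work per node can vary widely, which is presumably where load balancing becomes important.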

Yang Xia

Peng Jiang

The University of Iowa

United States

Rajiv Ramnath

The Ohio State University

Gagan Agrawal

Augusta University