High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs (PPoPP 2023 - Main Conference)

Who

William S. Moses, Ivan Radanov Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko

Track

PPoPP 2023 Main Conference

When

Mon 27 Feb 2023, 14:50 - 15:10 (GMT-05:00, Eastern Time) at Montreal 4. Session 2: Programming. Chair(s): Michael Scott

Abstract

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model.

We propose an alternative approach, based on Polygeist/MLIR, that automatically translates programs written in one programming model (CUDA) into another (CPU threads). Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and that enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which uses transpiled CUDA PyTorch kernels, outperforms the native PyTorch CPU backend by 2.7×.
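To make the idea of running a CUDA-style kernel on a CPU concrete, the following is a minimal hand-written sketch, not the authors' Polygeist/MLIR pipeline. It treats a kernel launch as a loop nest over block and thread indices. The names `saxpy_kernel`, `launch_on_cpu`, and `saxpy` are hypothetical and chosen only for illustration, and the loops run sequentially here, whereas a real CPU backend would distribute them across threads.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y):
    """Body of a CUDA-style kernel: one logical thread updates one element."""
    # Mirrors the usual CUDA index computation:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < n:  # guard for the last, partially filled block
        y[i] = a * x[i] + y[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    """Hypothetical helper: emulate a 1-D kernel launch as a loop nest.

    The outer loop plays the role of the grid of blocks and the inner loop
    the threads within a block. Both run sequentially in this sketch.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def saxpy(a, x, y, block_dim=4):
    """Compute a * x + y element-wise via the emulated launch."""
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    out = list(y)  # copy so the caller's list is not modified
    launch_on_cpu(saxpy_kernel, grid_dim, block_dim, n, a, x, out)
    return out
```

This direct loop nest is only valid because the kernel has no synchronization between threads of a block; kernels that use barriers or shared memory need more than a plain loop over thread indices.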

William S. Moses

Massachusetts Institute of Technology

Ivan Radanov Ivanov

Tokyo Institute of Technology

Jens Domke

RIKEN Center for Computational Science

Toshio Endo

Tokyo Institute of Technology

Johannes Doerfert

Lawrence Livermore National Laboratory

Oleksandr Zinenko

Google

France