SUPR
Machine Learning for Compilers: A Domain-Specific Language for FFT
Dnr:

NAISS 2023/22-536

Type:

NAISS Small Compute

Principal Investigator:

Yifei He

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2023-06-01

End Date:

2024-06-01

Primary Classification:

10201: Computer Sciences

Webpage:

Abstract

We propose a domain-specific language (DSL) and compute library for the fast Fourier transform (FFT). The compute library generates high-performance code for different hardware targets, such as CPUs and GPUs, and trains a machine learning model at runtime to select the best plan from the auto-generated candidates. We therefore need the GPUs in Alvis to train this prediction model, as well as the NVIDIA A100 and Ice Lake server CPUs as code generation targets.

The FFT is an efficient algorithm for computing the discrete Fourier transform (DFT). The DFT can be represented as a matrix-vector product, and FFT algorithms can be described as factorizations of the DFT transform matrix (the general-radix Cooley-Tukey factorization is written out below). The proposed DSL is a tensor-based language whose notation closely follows the mathematics: it lets the user define functions (e.g., generators for permutation and twiddle-factor matrices), perform computations on them (dot products, Kronecker products), and print the results. The computation is exactly what the language states.

We build on the MLIR and LLVM infrastructure to apply optimizations at several levels of abstraction and to generate high-performance code. First, we apply high-level, language-specific analyses and transformations in MLIR; such transformations are easy to match on MLIR's representations but would be hard for LLVM to recover. The tensor operations are then lowered to affine loop nests and later to LLVM IR, and we apply the loop optimization passes available in the MLIR and LLVM pipelines. We also take advantage of LLVM's vectorizer and code generator, fine-tuning the pipeline and its passes specifically for our FFT code, with hardware-specific optimizations targeting the NVIDIA A100 and Ice Lake server CPUs.

For the evaluation, we will compare our DSL against the state-of-the-art FFT library FFTW, which is itself a compiler. We evaluate against FFTW's codelets, since we do not apply divide-and-conquer in the planning stage as FFTW does. We also plan to develop a complete FFT library on top of our DSL. As in FFTW, a general DFT problem admits a space of valid plans (particular compositions of algorithmic steps in a specific order of execution), from which one must be selected: first, the multi-dimensional DFT is reduced to a sequence of rank-1 problems; second, each one-dimensional DFT is solved with an FFT algorithm such as Cooley-Tukey. We want to introduce machine learning into this auto-tuning and design-space exploration (a minimal sketch of a learned plan selector is given below); machine learning for compilers and program optimization is a direction the compiler community has been pursuing, and it could be a game-changer.
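
To make "FFT as a factorization of the DFT matrix" concrete, the general-radix Cooley-Tukey rule can be written in the Kronecker-product notation the DSL mirrors (this is standard notation from the FFT literature, e.g. Van Loan and SPIRAL, not text from the proposal itself):

    \mathrm{DFT}_{n} = (\mathrm{DFT}_{k} \otimes I_{m}) \, T_{m}^{n} \, (I_{k} \otimes \mathrm{DFT}_{m}) \, L_{k}^{n}, \qquad n = km,

where T_m^n is the diagonal twiddle-factor matrix and L_k^n is the stride-by-k permutation. Applying the rule recursively to DFT_k and DFT_m yields the O(n log n) FFT.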
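
The following minimal numpy sketch checks this factorization numerically; it assumes nothing about the actual DSL syntax, and all function names are illustrative. It builds the permutation, twiddle, and Kronecker factors as dense matrices and verifies that their product equals the full DFT matrix.

    import numpy as np

    def dft(n):
        # Dense DFT matrix: entry (i, j) is w^(i*j) with w = exp(-2*pi*1j/n).
        i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        return np.exp(-2j * np.pi * i * j / n)

    def twiddle(n, m):
        # Diagonal twiddle matrix T_m^n, n = k*m: block a scales entry r by w_n^(a*r).
        k = n // m
        d = np.concatenate([np.exp(-2j * np.pi * a * np.arange(m) / n) for a in range(k)])
        return np.diag(d)

    def stride_perm(n, k):
        # Stride permutation L_k^n: output position i reads input element k*(i % m) + i//m,
        # i.e. the input gathered at stride k.
        m = n // k
        P = np.zeros((n, n))
        for i in range(n):
            P[i, k * (i % m) + i // m] = 1.0
        return P

    k, m = 4, 8
    n = k * m
    fft_mat = (np.kron(dft(k), np.eye(m)) @ twiddle(n, m)
               @ np.kron(np.eye(k), dft(m)) @ stride_perm(n, k))
    assert np.allclose(fft_mat, dft(n))  # the factorization reproduces DFT_n exactly

A real implementation would never materialize these dense factors; the point of the DSL is to map each factor to loops, permuted accesses, and scalings that MLIR and LLVM can then optimize.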
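
For the runtime plan selection, the sketch below illustrates the idea in a deliberately simplified form; the plan space (Cooley-Tukey splits), the feature vector, the stand-in kernel, and the nearest-neighbor cost model are all assumptions made for illustration, not the proposal's actual design.

    import numpy as np
    import timeit

    def candidate_splits(n):
        # The plan space: every non-trivial factorization n = k * m.
        return [(k, n // k) for k in range(2, n) if n % k == 0]

    def features(n, k):
        # Toy feature vector for a plan; a real system would use richer features.
        return np.array([np.log2(n), np.log2(k), np.log2(n // k)])

    def measure(n, k):
        # Time a stand-in two-stage kernel for plan (k, m); not a real FFT plan.
        x = np.random.rand(n) + 1j * np.random.rand(n)
        stage = lambda: np.fft.fft(x.reshape(k, n // k), axis=1)
        return min(timeit.repeat(stage, number=50, repeat=3))

    # Training: exhaustively time the plans for a few small sizes.
    X_train, y_train = [], []
    for n in (64, 128, 256, 512):
        for k, m in candidate_splits(n):
            X_train.append(features(n, k))
            y_train.append(measure(n, k))
    X_train, y_train = np.array(X_train), np.array(y_train)

    def predict_time(f):
        # 1-nearest-neighbor cost model: the timing of the closest training plan.
        return y_train[np.argmin(np.linalg.norm(X_train - f, axis=1))]

    # Inference: pick the plan with the lowest predicted runtime, without timing it.
    n = 1024
    best_k, best_m = min(candidate_splits(n),
                         key=lambda km: predict_time(features(n, km[0])))
    print(f"predicted best split for n={n}: {best_k} x {best_m}")

This replaces a measure-everything planner in the style of FFTW with a learned predictor: plans for unseen sizes are ranked by the model, and only the predicted winner needs to be instantiated.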