Modern machine learning methods are increasingly powerful at capturing complex patterns in high-dimensional data. However, many state-of-the-art models, especially those relying on deep learning or kernel methods, remain difficult to interpret or to analyze from a statistical perspective. This project aims to develop and test scalable training pipelines for explainable machine learning models, combining recent advances in kernel approximation, generalized additive modeling, and representation learning.
The core idea is to use random Fourier features (RFF) to approximate shift-invariant kernels efficiently, and to let these features guide the training of interpretable regression models such as Generalized Additive Models (GAMs). By combining kernel-based insights with a sparse additive structure, we aim to build models that not only predict well but also admit an intuitive reading of individual feature effects. A key focus is on mixture-of-GAMs frameworks, where separate local GAMs are trained on clusters identified in a learned feature space, improving flexibility while preserving interpretability; both ingredients are sketched below.
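To make the kernel-approximation step concrete, the following minimal sketch uses scikit-learn's RBFSampler, a random Fourier feature map for the Gaussian (RBF) kernel: the inner product of the random features approximates the exact kernel value, with error shrinking as the number of random features grows. The bandwidth gamma and the feature count are illustrative choices, not values from this project.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Random Fourier feature map z(x) with z(x) . z(y) ~ k(x, y).
rff = RBFSampler(gamma=0.5, n_components=2000, random_state=0)
Z = rff.fit_transform(X)

K_exact = rbf_kernel(X, gamma=0.5)         # exact shift-invariant kernel
K_approx = Z @ Z.T                         # RFF approximation
print(np.max(np.abs(K_exact - K_approx)))  # small for large n_components
```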
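And a hedged sketch of the mixture-of-GAMs idea on top of the same RFF feature space: points are clustered in the random feature space, and one additive model is fitted per cluster. Here a SplineTransformer-plus-Ridge pipeline stands in for a full GAM fit (a penalized per-feature spline basis with linear coefficients), and the synthetic data, cluster count, and spline settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

# Cluster in the learned (random Fourier) feature space, not in raw inputs.
rff = RBFSampler(gamma=0.5, n_components=300, random_state=0)
Z = rff.fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)

# One local additive model per cluster: per-feature spline basis + linear fit.
local_gams = {}
for c in range(km.n_clusters):
    mask = km.labels_ == c
    model = make_pipeline(SplineTransformer(n_knots=8, degree=3), Ridge(alpha=1.0))
    local_gams[c] = model.fit(X[mask], y[mask])

def predict(X_new):
    """Route each point to its nearest cluster and apply that cluster's GAM."""
    labels = km.predict(rff.transform(X_new))
    return np.array([local_gams[c].predict(x[None, :])[0]
                     for c, x in zip(labels, X_new)])

print(predict(X[:5]), y[:5])
```

Because each local model remains additive, per-feature shape functions can still be plotted and read off within each cluster, which is what preserves interpretability in the mixture.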
The project also investigates data augmentation strategies, in particular using Variational Autoencoders (VAEs) to synthesize new covariate data in the vicinity of the original data manifold. This is especially valuable for real-world datasets where sample sizes are limited and covariates exhibit structure or sparsity, as in the NASA airfoil self-noise benchmark. By perturbing the learned latent-space representations, we generate synthetic covariates that enrich the training set and improve generalization.
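A minimal sketch of this augmentation step follows, assuming a small fully connected VAE over standardized covariates; the architecture, latent dimension, noise scale sigma, and training settings are illustrative assumptions rather than the project's actual configuration.

```python
import torch
import torch.nn as nn

class CovariateVAE(nn.Module):
    """Small VAE over covariate rows; sizes are illustrative assumptions."""
    def __init__(self, n_features, latent_dim=2, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kl

def augment(model, x, sigma=0.1, n_copies=3):
    """Perturb latent means with Gaussian noise and decode synthetic rows."""
    with torch.no_grad():
        mu = model.mu(model.encoder(x))
        samples = [model.decoder(mu + sigma * torch.randn_like(mu))
                   for _ in range(n_copies)]
    return torch.cat(samples)

# Usage on standardized covariates X (e.g., the five airfoil inputs):
# model = CovariateVAE(n_features=X.shape[1])
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for _ in range(500):
#     x_hat, mu, logvar = model(X)
#     loss = vae_loss(x_hat, X, mu, logvar)
#     opt.zero_grad(); loss.backward(); opt.step()
# X_synth = augment(model, X)
```

The noise scale sigma controls how far the synthetic rows stray from the data manifold: small values yield conservative near-duplicates, larger values trade fidelity for coverage.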