Modern machine learning methods are increasingly powerful at capturing complex patterns in high-dimensional data. However, many state-of-the-art models, especially those relying on deep learning or kernel methods, remain difficult to interpret or to analyze from a statistical perspective. This project aims to develop and test scalable training pipelines for explainable machine learning models, combining recent advances in kernel approximation, generalized additive modeling, and representation learning.
The core idea is to use random Fourier features (RFF) to approximate shift-invariant kernels efficiently, and to let these features guide the training of interpretable regression models such as Generalized Additive Models (GAMs). By combining kernel-based insights with a sparse additive structure, we aim to build models that not only predict well but also admit an intuitive reading of individual feature effects. A key focus is on mixture-of-GAMs frameworks, where separate local GAMs are trained on clusters identified in a learned feature space, improving flexibility while preserving interpretability; both ingredients are sketched below.
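To make the kernel-approximation step concrete, the following minimal sketch uses scikit-learn's RBFSampler, a random Fourier feature map for the Gaussian (RBF) kernel: the inner product of the random features approximates the exact kernel value, with error shrinking as the number of random features grows. The bandwidth gamma and the feature count are illustrative choices, not values from this project.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Random Fourier feature map z(x) with z(x) . z(y) ~ k(x, y).
rff = RBFSampler(gamma=0.5, n_components=2000, random_state=0)
Z = rff.fit_transform(X)

K_exact = rbf_kernel(X, gamma=0.5)         # exact shift-invariant kernel
K_approx = Z @ Z.T                         # RFF approximation
print(np.max(np.abs(K_exact - K_approx)))  # small for large n_components
```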
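And a hedged sketch of the mixture-of-GAMs idea on top of the same RFF feature space: points are clustered in the random feature space, and one additive model is fitted per cluster. Here a SplineTransformer-plus-Ridge pipeline stands in for a full GAM fit (a penalized per-feature spline basis with linear coefficients), and the synthetic data, cluster count, and spline settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

# Cluster in the learned (random Fourier) feature space, not in raw inputs.
rff = RBFSampler(gamma=0.5, n_components=300, random_state=0)
Z = rff.fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)

# One local additive model per cluster: per-feature spline basis + linear fit.
local_gams = {}
for c in range(km.n_clusters):
    mask = km.labels_ == c
    model = make_pipeline(SplineTransformer(n_knots=8, degree=3), Ridge(alpha=1.0))
    local_gams[c] = model.fit(X[mask], y[mask])

def predict(X_new):
    """Route each point to its nearest cluster and apply that cluster's GAM."""
    labels = km.predict(rff.transform(X_new))
    return np.array([local_gams[c].predict(x[None, :])[0]
                     for c, x in zip(labels, X_new)])

print(predict(X[:5]), y[:5])
```

Because each local model remains additive, per-feature shape functions can still be plotted and read off within each cluster, which is what preserves interpretability in the mixture.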
The project also investigates data augmentation strategies, in particular using Variational Autoencoders (VAEs) to synthesize new covariate data in the vicinity of the original data manifold. This is especially valuable for real-world datasets where sample sizes are limited and covariates exhibit structure or sparsity, as in the NASA airfoil self-noise benchmark. By perturbing the learned latent-space representations, we generate synthetic covariates that enrich the training set and improve generalization.
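A minimal sketch of this augmentation step follows, assuming a small fully connected VAE over standardized covariates; the architecture, latent dimension, noise scale sigma, and training settings are illustrative assumptions rather than the project's actual configuration.

```python
import torch
import torch.nn as nn

class CovariateVAE(nn.Module):
    """Small VAE over covariate rows; sizes are illustrative assumptions."""
    def __init__(self, n_features, latent_dim=2, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kl

def augment(model, x, sigma=0.1, n_copies=3):
    """Perturb latent means with Gaussian noise and decode synthetic rows."""
    with torch.no_grad():
        mu = model.mu(model.encoder(x))
        samples = [model.decoder(mu + sigma * torch.randn_like(mu))
                   for _ in range(n_copies)]
    return torch.cat(samples)

# Usage on standardized covariates X (e.g., the five airfoil inputs):
# model = CovariateVAE(n_features=X.shape[1])
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for _ in range(500):
#     x_hat, mu, logvar = model(X)
#     loss = vae_loss(x_hat, X, mu, logvar)
#     opt.zero_grad(); loss.backward(); opt.step()
# X_synth = augment(model, X)
```

The noise scale sigma controls how far the synthetic rows stray from the data manifold: small values yield conservative near-duplicates, larger values trade fidelity for coverage.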