This project aims to develop, optimize, and evaluate advanced machine learning (ML) models for large-scale scientific and technical data analysis, including training optimizations such as overlapping computation with communication, leveraging the GPU-accelerated infrastructure provided by the Alvis cluster at C3SE within the NAISS ecosystem.
Modern scientific and engineering applications increasingly rely on deep learning and data-driven methods for tasks such as pattern recognition, surrogate modeling, anomaly detection, and predictive analysis. However, training state-of-the-art models on high-dimensional and heterogeneous datasets requires substantial computational resources, particularly in terms of GPU capacity and memory bandwidth. This project addresses these challenges by combining scalable ML architectures with efficient training and optimization strategies on heterogeneous GPU platforms.
The research will focus on designing and training neural network models, including deep convolutional networks, transformer-based architectures, and hybrid physics-informed learning methods. Particular emphasis will be placed on improving training efficiency, model robustness, and generalization performance through techniques such as distributed data-parallel training, mixed-precision computation, hyperparameter optimization, and adaptive learning strategies.
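As a small illustration of why mixed-precision computation requires care, the sketch below (a hypothetical NumPy example, not part of the planned codebase) shows the "master weights" idea commonly used in mixed-precision training: small gradient updates underflow when accumulated directly in float16, but are preserved when a float32 master copy of the weights accumulates them.

```python
import numpy as np

# Hypothetical illustration: near 1.0, the spacing between adjacent float16
# values is about 9.8e-4, so adding a step of 1e-4 rounds back to 1.0.
# A float32 master copy of the same weight accumulates the steps correctly.

w16 = np.float16(1.0)      # low-precision weight
w32 = np.float32(1.0)      # float32 master copy
step = np.float16(1e-4)    # small update, representable in float16 on its own

for _ in range(100):
    w16 = np.float16(w16 + step)                # rounds back to 1.0 every time
    w32 = np.float32(w32 + np.float32(step))    # accumulates in float32

print(w16)  # 1.0 -- the updates vanished
print(w32)  # approximately 1.01 -- the updates survived
```

In practice, frameworks such as PyTorch automate this pattern (e.g., via automatic mixed precision with loss scaling); the example only motivates why the float32 master copy is kept.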
Training data will primarily be generated and preprocessed on external national resources, in accordance with Alvis usage guidelines. The Alvis system will be used mainly for large-scale model training, fine-tuning, and experimental evaluation. The availability of NVIDIA T4, V100, A40, and A100 GPUs enables systematic performance comparisons across different accelerator generations and supports experimentation with memory-intensive and compute-intensive workloads.
By using the Alvis infrastructure, this project will enable computational experiments that are not feasible on local resources and will strengthen national competence in AI-driven scientific computing. The results will support ongoing research collaborations and contribute to the long-term development of sustainable, high-performance AI methodologies within the Swedish and international research community.