AI-driven Fault Inference and Self-healing in Distributed Computing Environments

Dnr:

NAISS 2026/4-822

Type:

NAISS Small

Principal Investigator:

Praveen Kumar Donta

Affiliation:

Stockholms universitet

Start Date:

2026-04-28

End Date:

2027-05-01

Primary Classification:

10210: Artificial Intelligence

Webpage:

Allocation

Alvis at C3SE: 750 GPU-h/month
Mimer at C3SE: 500 GiB
Arrhenius GPU at NAISS: 300 GPU-h/month
Arrhenius Disk at NAISS: 250 GiB

Abstract

The growing deployment of AI services across heterogeneous distributed infrastructures, spanning edge devices, fog nodes, and cloud data centers, introduces significant operational challenges. As system complexity grows, failures become more frequent, harder to predict, and more prone to cascading across layers. Ensuring resilience in such environments requires moving beyond static fault tolerance toward intelligent, adaptive mechanisms capable of reasoning about uncertainty, identifying root causes, and autonomously executing recovery actions. This project develops and empirically validates AI-driven approaches for autonomous fault detection, root-cause analysis, and self-healing across distributed computing environments. Our research explores how probabilistic inference, causal discovery, and neuro-symbolic reasoning can be combined with modern machine learning to build systems that are both resource-aware and robust under noisy, heterogeneous runtime conditions. Key research interests include learning causal structures from system logs, uncertainty-aware fault diagnosis, and adaptive recovery decision-making, with a focus on deployability across the full computing continuum, from resource-constrained edge nodes to cloud infrastructure. Realizing these goals requires substantial empirical work: training and evaluating a range of models for fault inference and decision-making, benchmarking causal discovery methods under varying noise conditions, and assessing end-to-end self-healing pipelines through realistic distributed system simulations with controlled fault injection. The Alvis cluster will support the associated training runs, hyperparameter optimization, multi-seed experiments, and ablation studies.