The growing deployment of AI services across heterogeneous distributed infrastructures, spanning edge devices, fog nodes, and cloud data centers, introduces significant operational challenges. As system complexity grows, failures become more frequent, harder to predict, and more likely to cascade across layers. Ensuring resilience in such environments requires moving beyond static fault tolerance toward intelligent, adaptive mechanisms capable of reasoning about uncertainty, identifying root causes, and autonomously executing recovery actions.
This project develops and empirically validates AI-driven approaches for autonomous fault detection, root-cause analysis, and self-healing across distributed computing environments. Our research explores how techniques such as probabilistic inference, causal discovery, and neuro-symbolic reasoning can be combined with modern machine learning to build systems that are both resource-aware and robust under noisy, heterogeneous runtime conditions. Key research interests include learning causal structures from system logs, uncertainty-aware fault diagnosis, and adaptive recovery decision-making, with a focus on deployability across the full computing continuum from resource-constrained edge nodes to cloud infrastructure.
Realizing these goals requires substantial empirical work, including training and evaluating a range of models for fault inference and decision-making, benchmarking causal discovery methods under varying noise conditions, and assessing end-to-end self-healing pipelines through realistic distributed system simulations with controlled fault injection. The Alvis cluster will provide the compute for this work: model training runs, hyperparameter optimization, multi-seed experiments, and ablation studies.