The rapid adoption of AI services, particularly large language model (LLM) inference, has significantly increased the demand for scalable computation, low tail latency, and cost-efficient resource allocation. While cloud computing remains the dominant platform for AI deployment, it faces escalating financial costs, energy demands, and growing latency concerns under heavy workloads. Edge and fog computing aim to mitigate these issues by moving computation closer to users, yet large-scale deployments remain limited due to economic and management complexity. Meanwhile, a massive pool of underutilized compute resources already exists in the form of end-user devices: desktops, laptops, and mobile phones that remain idle for much of the day. Volunteer computing leverages this potential but has historically failed to provide Quality-of-Service (QoS) guarantees due to resource heterogeneity and intermittent availability.
To address these challenges, we propose a novel user-assisted distributed inference architecture for QoS-aware autoscaling. The system combines a small pool of dedicated inference resources with client resources volunteered by the users themselves under controlled resource constraints. Unlike traditional volunteer systems, our architecture introduces a probabilistic autoscaling model that dynamically partitions inference workloads between dedicated and volunteered resources, adapting to fluctuating resource availability while maintaining response-time guarantees. This hybrid design reduces reliance on costly centralized accelerators while preserving QoS through reciprocal participation incentives and adaptive scheduling policies. We will evaluate the architecture on real inference workloads, demonstrating how volunteered computation can significantly extend AI serving capacity without sacrificing latency, throughput, or reliability.
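To make the probabilistic partitioning idea concrete, the sketch below shows one possible shape of such a controller: each request is routed to the volunteer pool with probability p, and p is adjusted from observed tail latency against a response-time target. The class name ProbabilisticAutoscaler, the single dispatch probability, the fixed adjustment step, and the use of a 95th-percentile latency SLO as the feedback signal are all illustrative assumptions, not the system's actual policy.

```python
import random


class ProbabilisticAutoscaler:
    """Illustrative controller (not the paper's implementation): routes each
    inference request to the dedicated or volunteer pool with probability p,
    and adapts p from observed tail latency against a response-time target."""

    def __init__(self, slo_ms: float = 500.0, step: float = 0.05, p_init: float = 0.2):
        self.slo_ms = slo_ms  # assumed response-time target (SLO) in milliseconds
        self.step = step      # per-interval adjustment of the dispatch probability
        self.p = p_init       # probability of routing a request to volunteers

    def route(self) -> str:
        """Pick a target pool for the next incoming request."""
        return "volunteer" if random.random() < self.p else "dedicated"

    def update(self, observed_p95_ms: float, volunteer_capacity: int) -> None:
        """End-of-interval feedback: shed volunteer traffic when the observed
        95th-percentile latency violates the SLO or no volunteers are online;
        otherwise cautiously shift more load onto volunteered resources."""
        if volunteer_capacity == 0 or observed_p95_ms > self.slo_ms:
            self.p = max(0.0, self.p - self.step)
        else:
            self.p = min(1.0, self.p + self.step)


if __name__ == "__main__":
    scaler = ProbabilisticAutoscaler(slo_ms=400.0)
    print(scaler.route())                 # e.g. 'dedicated'
    scaler.update(observed_p95_ms=350.0,  # latency headroom, so p grows
                  volunteer_capacity=12)
    print(round(scaler.p, 2))             # 0.25
```

In this simplified form the controller converges toward the largest volunteer share that still respects the latency target; the actual system would additionally account for incentive accounting, per-device constraints, and scheduling across heterogeneous volunteers.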