The rapid adoption of AI services, particularly large language model (LLM) inference, has significantly increased the demand for scalable computation, low tail latency, and cost-efficient resource allocation. While cloud computing remains the dominant platform for AI deployment, it faces escalating financial costs, energy demands, and growing latency concerns under heavy workloads. Edge and fog computing aim to mitigate these issues by moving computation closer to users, yet large-scale deployments remain limited due to economic and management complexity. Meanwhile, a massive pool of underutilized compute resources already exists in the form of end-user devices: desktops, laptops, and mobile phones that remain idle for much of the day. Volunteer computing leverages this potential but has historically failed to provide Quality-of-Service (QoS) guarantees due to resource heterogeneity and intermittent availability.
To address these challenges, we propose a novel user-assisted distributed inference architecture for QoS-aware autoscaling. The system combines a small pool of dedicated inference resources with client resources volunteered by the users themselves under controlled resource constraints. Unlike traditional volunteer systems, our architecture introduces a probabilistic autoscaling model that dynamically partitions inference workloads between dedicated and volunteered resources, adapting to fluctuating resource availability while maintaining response-time guarantees. This hybrid design reduces reliance on costly centralized accelerators while preserving QoS through reciprocal participation incentives and adaptive scheduling policies. We will evaluate the architecture on real inference workloads, demonstrating how volunteered computation can significantly extend AI serving capacity without sacrificing latency, throughput, or reliability.
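To make the probabilistic partitioning idea concrete, the sketch below shows one possible shape of such a controller: each request is routed to the volunteer pool with probability p, and p is adjusted from observed tail latency against a response-time target. The class name ProbabilisticAutoscaler, the single dispatch probability, the fixed adjustment step, and the use of a 95th-percentile latency SLO as the feedback signal are all illustrative assumptions, not the system's actual policy.

```python
import random


class ProbabilisticAutoscaler:
    """Illustrative controller (not the paper's implementation): routes each
    inference request to the dedicated or volunteer pool with probability p,
    and adapts p from observed tail latency against a response-time target."""

    def __init__(self, slo_ms: float = 500.0, step: float = 0.05, p_init: float = 0.2):
        self.slo_ms = slo_ms  # assumed response-time target (SLO) in milliseconds
        self.step = step      # per-interval adjustment of the dispatch probability
        self.p = p_init       # probability of routing a request to volunteers

    def route(self) -> str:
        """Pick a target pool for the next incoming request."""
        return "volunteer" if random.random() < self.p else "dedicated"

    def update(self, observed_p95_ms: float, volunteer_capacity: int) -> None:
        """End-of-interval feedback: shed volunteer traffic when the observed
        95th-percentile latency violates the SLO or no volunteers are online;
        otherwise cautiously shift more load onto volunteered resources."""
        if volunteer_capacity == 0 or observed_p95_ms > self.slo_ms:
            self.p = max(0.0, self.p - self.step)
        else:
            self.p = min(1.0, self.p + self.step)


if __name__ == "__main__":
    scaler = ProbabilisticAutoscaler(slo_ms=400.0)
    print(scaler.route())                 # e.g. 'dedicated'
    scaler.update(observed_p95_ms=350.0,  # latency headroom, so p grows
                  volunteer_capacity=12)
    print(round(scaler.p, 2))             # 0.25
```

In this simplified form the controller converges toward the largest volunteer share that still respects the latency target; the actual system would additionally account for incentive accounting, per-device constraints, and scheduling across heterogeneous volunteers.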