Specialized deployment of large language models (LLMs) requires balancing strong reasoning performance with low inference cost. Proprietary reasoning models handle complex tasks robustly, but their high inference cost and latency make them unsuitable for high-volume use. Smaller open-weight models are cost-effective and easy to deploy, but often fail to adapt to specialized downstream domains. This project addresses this tradeoff via a Test-Time Online Active Distillation framework for LLMs. The studied system consists of a deployed open-weight student model (e.g., Qwen3-1.7B) and a high-performance proprietary teacher model (e.g., GPT-5.2). The goal is to approach the teacher's accuracy over time while minimizing teacher queries: the teacher is queried selectively when the semantic entropy of the student's samples is high (indicating low confidence), and the teacher's output is then used to update the student's weights via policy gradient updates.
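The gating mechanism above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names (`semantic_entropy`, `should_query_teacher`) and the threshold value are assumptions, and semantic clustering is approximated here by normalized exact-match grouping, whereas a real system would cluster samples with an entailment model or embedding similarity before computing the entropy.

```python
import math
from collections import Counter


def semantic_entropy(samples: list[str]) -> float:
    """Entropy over semantic clusters of sampled student answers.

    Clusters are approximated by normalized exact match (an assumption
    for illustration); identical answers fall into one cluster, so
    agreement among samples drives the entropy toward zero.
    """
    clusters = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


def should_query_teacher(samples: list[str], threshold: float = 0.5) -> bool:
    """Escalate to the teacher only when the student looks uncertain.

    High semantic entropy across the sampled answers suggests low
    confidence; the threshold is a hypothetical tunable parameter.
    """
    return semantic_entropy(samples) > threshold
```

When `should_query_teacher` fires, the teacher's answer would serve as the supervision signal for a policy gradient update to the student; when it does not, the student's own majority answer is returned at no teacher cost.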