WASP-AD-LLM
Dnr:

NAISS 2026/4-1002

Type:

NAISS Small

Principal Investigator:

Anindya Sundar Das

Affiliation:

Umeå universitet

Start Date:

2026-06-03

End Date:

2027-07-01

Primary Classification:

10201: Computer Sciences

Allocation

Abstract

This project aims to improve the robustness, interpretability, and generalization of anomaly detection (AD) for large language model (LLM) systems in realistic environments. LLMs and other pre-trained language models have become core components of modern artificial intelligence systems, powering capabilities such as instruction following, conversational interaction, and complex reasoning. Despite their strong performance, recent studies have shown that these models remain susceptible to covert behavioral manipulation, including backdoor attacks and jailbreak-oriented exploits in which malicious behaviors are secretly embedded during training or fine-tuning. Unlike conventional data poisoning attacks, which depend primarily on explicit trigger tokens in the input, these attacks often manifest during response generation itself: the model produces harmful, policy-violating, or attacker-controlled outputs while appearing benign under standard evaluation settings.
To address this challenge, the project introduces an inference-time anomaly detection framework that identifies and mitigates hidden backdoor and jailbreak behaviors during text generation. Motivated by recent defense strategies such as CleanGen and JBShield, the approach focuses on abnormal generation dynamics rather than relying solely on suspicious input patterns. Specifically, the framework analyzes token-level generation behaviors, including shifts in model confidence, abnormal attention concentration, gradient-based saliency patterns, and response anchoring effects, to detect anomalous outputs that diverge from the behavior of clean models under similar prompts. The method does not require prior knowledge of trigger phrases, attack strategies, or reference responses, which makes it suitable for realistic deployment scenarios involving both encoder-based and decoder-based architectures.
Beyond anomaly detection, the framework also emphasizes explainable and localized mitigation strategies, enabling selective suppression, rewriting, or regeneration of suspicious portions of generated responses while preserving overall model functionality and utility. By integrating anomaly detection, interpretability, and inference-time defense into a unified framework, this project aims to strengthen the practical security and trustworthiness of language models deployed in open-ended and safety-critical environments.
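As an illustration of one of the token-level signals named in the abstract (shifts in model confidence), the sketch below flags generated tokens whose log-probability deviates sharply from the rest of the response and groups them into spans that a localized mitigation step could target. It is a minimal sketch under stated assumptions, not the project's actual method: the function names `flag_confidence_shifts` and `to_spans`, the robust z-score criterion, and the threshold of 3.5 are all hypothetical choices made for this example, and the input is assumed to be the per-token log-probabilities that a model assigned to its own output.

```python
from statistics import median


def flag_confidence_shifts(token_logprobs, threshold=3.5):
    """Return indices of tokens whose log-probability is an outlier.

    Assumption: outliers are defined by a robust z-score based on the
    median and the median absolute deviation (MAD) of the response.
    The threshold value is illustrative, not taken from the project.
    """
    med = median(token_logprobs)
    mad = median([abs(lp - med) for lp in token_logprobs])
    # 1.4826 rescales the MAD to match a standard deviation under normality;
    # fall back to a tiny scale when the MAD is zero.
    scale = 1.4826 * mad or 1e-9
    return [
        i for i, lp in enumerate(token_logprobs)
        if abs(lp - med) / scale > threshold
    ]


def to_spans(indices):
    """Merge consecutive token indices into inclusive (start, end) spans."""
    spans = []
    for i in indices:
        if spans and i == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], i)
        else:
            spans.append((i, i))
    return spans


# Hypothetical log-probabilities for an 8-token response; tokens 4 and 5
# are generated with unusually high confidence compared with the rest.
logprobs = [-1.0, -1.2, -0.9, -1.1, -0.01, -0.02, -1.0, -1.3]
flagged = flag_confidence_shifts(logprobs)
suspicious_spans = to_spans(flagged)
```

In a fuller pipeline, spans like these would be only one input among several (attention concentration, saliency, and anchoring signals are not modeled here), and the flagged spans would be handed to whatever suppression, rewriting, or regeneration step the framework uses.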