In this project, we build on our recently published work (article: https://doi.org/10.1007/978-3-031-78977-9_3, preprint: https://arxiv.org/abs/2407.14487) on the self-explanatory capabilities of large language models (LLMs). Our prior study assessed extractive and counterfactual self-explanations using models ranging from 2B to 8B parameters across both objective and subjective classification tasks. We found that while these self-explanations often align with human judgment, they do not always faithfully reflect the model's actual decision-making processes, revealing a gap between perceived and genuine model reasoning. Notably, prompting LLMs for counterfactual explanations yielded faithful, informative, and easily verifiable results, positioning this approach as a promising alternative to traditional explainability methods.
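For concreteness, the sketch below illustrates the kind of prompt-and-verify loop used for counterfactual self-explanations: the model classifies an input, is asked to minimally edit it so that its own label would flip, and the edit is validated by re-classifying the result. The model name, prompt wording, and sentiment labels are illustrative assumptions, not the exact setup of the published study.

```python
# Minimal sketch of counterfactual self-explanation prompting with verification.
# Model checkpoint, prompts, and labels are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

text = "The hotel room was spotless and the staff were friendly."

# Step 1: ask the model to classify the input (in practice the raw output
# would be parsed more robustly than a simple strip()).
classify_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    f"Review: {text}\nLabel:"
)
label = generator(classify_prompt, max_new_tokens=5,
                  return_full_text=False)[0]["generated_text"].strip()

# Step 2: ask the same model for a minimal edit that would flip its own label.
cf_prompt = (
    f"Review: {text}\n"
    f"You labelled this review as {label}. Rewrite it with minimal changes "
    "so that you would assign the opposite label.\nRewritten review:"
)
counterfactual = generator(cf_prompt, max_new_tokens=80,
                           return_full_text=False)[0]["generated_text"].strip()

# Step 3 (verification): re-classify the counterfactual; the explanation is
# considered valid only if the predicted label actually flips.
verify_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    f"Review: {counterfactual}\nLabel:"
)
flipped = generator(verify_prompt, max_new_tokens=5,
                    return_full_text=False)[0]["generated_text"].strip()

print(label, "->", flipped, "| valid flip:", flipped.lower() != label.lower())
```

This prompt-and-verify loop requires only generation access, which is what makes it cheap to run and easy to check compared with attribution methods that need the model's internals.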
As LLMs become increasingly integrated into high-stakes applications such as legal analysis, medical diagnostics, and scientific research, the ability to interpret and trust their outputs is crucial. However, current explainability techniques, such as SHAP, are often resource- and time-intensive. Our previous research suggested that self-counterfactuals could be a viable alternative, but further investigation is needed to understand how they scale to larger LLMs and how to ensure the reliable generation of valid counterfactuals. By expanding our study to models such as LLaMA 3.1 70B and Gemma 2 27B, we aim to assess whether self-explanation and counterfactual reasoning remain effective at larger scales, and whether traditional gradient- and attention-based methods are still practical there. Understanding these dynamics will help bridge the gap between theoretical explainability and practical deployment, contributing to more reliable and interpretable AI systems.
Beyond self-explanations, we also evaluate gradient- and attention-based explanation methods, which require access to internal model variables and autograd functionality, neither of which is available in API-restricted LLMs. So far, our experiments have been conducted on models up to LLaMA 3.1 8B on local servers. To determine whether our findings generalize to larger models beyond our current computational capacity (specifically LLaMA 3.1 70B and Gemma 2 27B), we seek access to NAISS resources.
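As an illustration of why white-box access matters, the sketch below computes a simple input-times-gradient attribution through the model's embedding layer. It is not our full evaluation pipeline; the model name and the choice of target token are assumptions made for the example, and running it at 70B scale is precisely what exceeds our local hardware.

```python
# Minimal sketch of a gradient-based (input x gradient) attribution.
# It relies on the embedding layer and autograd, which API-only LLMs do not expose.
# Model name and target-token choice are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Review: The plot was predictable and the acting wooden.\nSentiment:"
inputs = tokenizer(prompt, return_tensors="pt")

# Work at the embedding level so gradients can flow back to individual tokens.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"])

# Score of an assumed label token (" negative") at the final position.
target_id = tokenizer(" negative", add_special_tokens=False)["input_ids"][0]
score = outputs.logits[0, -1, target_id]
score.backward()

# Input x gradient saliency per token, summed over the embedding dimension.
saliency = (embeddings.grad * embeddings).sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()),
                  saliency):
    print(f"{tok:>15s}  {s.item():+.4f}")
```

The backward pass roughly doubles the memory footprint of a forward pass, which is why evaluating these attribution methods on LLaMA 3.1 70B and Gemma 2 27B requires the requested NAISS allocation rather than our local servers.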