Automating test oracle generation is one of the most challenging aspects of software testing, yet it has received significantly less attention than automated test input generation. A test oracle determines whether a program behaves correctly for a given input, distinguishing correct from faulty behavior. Existing neural approaches to oracle generation often suffer from high false positive rates and weak bug-detection capability.
Large Language Models (LLMs) have demonstrated remarkable effectiveness in a variety of software engineering tasks, such as code generation, test case creation, and bug fixing. However, there is a lack of large-scale, systematic studies on their effectiveness in generating high-quality test oracles.
This project proposes SEERLL, a learning-based framework built on SEER, to explore and enhance the capability of LLMs to generate correct, diverse, and strong test oracles that detect a large number of unique bugs. Specifically, I will fine-tune seven code LLMs (including Code Llama, CodeGPT, CodeParrot, CodeGen, PolyCoder, and Phi-1) with multiple prompt strategies on a large dataset. The most effective fine-tuned LLM–prompt pair will then be used to predict whether unit tests pass or fail. To evaluate generalizability, SEERLL will be tested on 50 previously unseen large-scale Java projects.
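To make the prediction step concrete, the following is a minimal sketch of how a fine-tuned code LLM could classify a unit test as passing or failing. It assumes a binary sequence-classification head on top of a causal code LLM; the checkpoint name, prompt template, and label encoding are illustrative placeholders rather than SEERLL's actual design.

    # Sketch only: placeholder model, prompt format, and labels.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "codeparrot/codeparrot-small"  # placeholder; any candidate code LLM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    model.config.pad_token_id = tokenizer.pad_token_id

    def predict_oracle(method_under_test: str, unit_test: str) -> str:
        """Classify whether the given unit test is expected to pass or fail."""
        prompt = (
            "// Method under test:\n" + method_under_test
            + "\n// Unit test:\n" + unit_test
        )
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        return "fail" if logits.argmax(dim=-1).item() == 1 else "pass"  # 1 = failing test

In practice, this classifier would be fine-tuned on labeled (method, test, pass/fail) examples before being applied to the 50 unseen projects.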
Additionally, I will leverage LLM capabilities to enhance the explainability of generated oracles, aiming to make them more readable, interpretable, and trustworthy for developers and testers.
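As a rough illustration of this explainability step, the sketch below prompts an LLM to justify a generated oracle in plain English. The model choice, prompt wording, and decoding settings are assumptions for illustration only.

    # Sketch only: placeholder model and prompt wording.
    from transformers import pipeline

    explainer = pipeline("text-generation", model="codeparrot/codeparrot-small")  # placeholder model

    def explain_oracle(method_under_test: str, assertion: str) -> str:
        """Ask the LLM for a natural-language rationale for a generated oracle."""
        prompt = (
            "// Method under test:\n" + method_under_test + "\n"
            + "// Generated oracle:\n" + assertion + "\n"
            + "// Why this oracle checks the intended behavior:\n// "
        )
        out = explainer(prompt, max_new_tokens=80, do_sample=False)[0]["generated_text"]
        return out[len(prompt):].strip()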