Large-language-model-based analysis of code revision histories for refactoring detection
Dnr:

NAISS 2025/22-1642

Type:

NAISS Small Compute

Principal Investigator:

Daniel Strüber

Affiliation:

Göteborgs universitet

Start Date:

2025-12-01

End Date:

2026-12-01

Primary Classification:

10205: Software Engineering

Webpage:

Allocation

Abstract

This project aims to classify software refactorings in machine learning projects using state-of-the-art large language models (LLMs) and to analyze common patterns of refactoring types in these projects. In the earlier stage of the project, we extracted refactoring information from 173 selected machine learning repositories and used GPU resources to run Llama-2-7B and GPT-4 for binary classification on 129 manually labeled refactoring instances. Because manual labeling is time-intensive and category boundaries in human annotation are fuzzy, we shifted from pursuing binary accuracy to a multi-dimensional characterization of refactoring activities. Following the typical workflow of machine learning projects, we proposed six categories: Data Processing, Model Development, Evaluation, Deployment & Serving, Visualization & UI, and Monitoring & Logging. For each project, we generated a radar chart illustrating the distribution of its refactoring types.

In the next stage of this project, we will first continue the classification with several LLMs on a small sample of 30 machine learning projects, analyze the classification results, and visualize how each category is distributed across the projects. After this initial statistical analysis, we will extend the classification to a larger set of machine learning projects. We then plan to apply K-means clustering to group projects by their refactoring profiles and identify the most common clustering patterns, generating normalized radar charts for these clusters to illustrate the characteristic refactoring patterns observed in typical machine learning projects. We plan to use Llama 2, Llama 3, and GPT-4 for the classification task and will use the GPU resources to run and analyze these large language models. By leveraging multiple models, we can compare the classification results of Llama 2, Llama 3, and GPT-4 for validation.
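As a sketch of the planned clustering step, the following pure-Python example shows how projects could be grouped by their refactoring profiles: each project is summarized as raw refactoring counts over the six categories, normalized to proportions, and clustered with Lloyd's K-means algorithm. The toy counts and the plain K-means implementation are illustrative assumptions only; the actual analysis may use different data and a library implementation.

```python
import random

# The six workflow-based refactoring categories proposed in the project.
CATEGORIES = [
    "Data Processing", "Model Development", "Evaluation",
    "Deployment & Serving", "Visualization & UI", "Monitoring & Logging",
]

def normalize(counts):
    """Convert raw per-category refactoring counts to proportions,
    so projects of different sizes become comparable profiles."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def kmeans(profiles, k, iters=100, seed=0):
    """Plain Lloyd's K-means over normalized 6-d refactoring profiles."""
    rng = random.Random(seed)
    centroids = rng.sample(profiles, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each profile to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in profiles:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Recompute centroids as cluster means; keep old one if empty.
        new = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

# Hypothetical counts: two data-processing-heavy and two
# model-development-heavy projects.
raw_counts = [
    [8, 1, 1, 0, 0, 0],
    [7, 2, 1, 0, 0, 0],
    [1, 8, 1, 0, 0, 0],
    [2, 7, 1, 0, 0, 0],
]
profiles = [normalize(c) for c in raw_counts]
centroids, clusters = kmeans(profiles, k=2)
```

The resulting cluster centroids are exactly the normalized vectors that would feed the per-cluster radar charts, one axis per category.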
Ultimately, this project targets a research paper.