Large-language-model-based analysis of code revision histories for refactoring detection

Dnr:

NAISS 2023/22-1194

Type:

NAISS Small Compute

Principal Investigator:

Daniel Strüber

Affiliation:

Göteborgs universitet

Start Date:

2023-11-09

End Date:

2024-12-01

Primary Classification:

10205: Software Engineering

Allocation

Alvis at C3SE: 250 GPU-h/month
Mimer at C3SE: 250 GiB

Abstract

This project aims to detect machine-learning-specific refactorings in machine learning projects using state-of-the-art large language models. We plan to use Llama 2 and Code Llama to extract information from code commits and identify refactoring categories. For this text classification task, the input data will be the refactoring information and the relevant commits. After fine-tuning, the models should output the category of each refactoring: either a general refactoring or an ML-specific refactoring. We will use the GPU resources to run and fine-tune the large language models. GPU demand will depend on model size and data size, and we expect to use multiple GPUs to accelerate the process.
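
The classification setup described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the field names, the prompt wording, the two label strings, and the character-based truncation are all hypothetical. It shows how refactoring information and a commit might be turned into a prompt/completion pair for fine-tuning, and how a generated answer might be mapped back to one of the two categories.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label strings; the abstract only names the two categories.
LABELS = ("general", "ml-specific")


@dataclass
class RefactoringExample:
    """One classification example. All field names are hypothetical."""
    refactoring_info: str  # description of the refactoring
    commit_message: str    # message of the relevant commit
    commit_diff: str       # textual diff of the relevant commit
    label: str             # expected to be one of LABELS


def build_prompt(example: RefactoringExample, max_diff_chars: int = 2000) -> str:
    """Format the input side of an example as a plain-text prompt."""
    # Crude character-based truncation to bound the input length (an assumption).
    diff = example.commit_diff[:max_diff_chars]
    return (
        "Classify the refactoring as 'general' or 'ml-specific'.\n"
        f"Refactoring: {example.refactoring_info}\n"
        f"Commit message: {example.commit_message}\n"
        f"Diff:\n{diff}\n"
        "Category:"
    )


def to_training_record(example: RefactoringExample) -> dict:
    """Build a prompt/completion pair that a fine-tuning script could consume."""
    if example.label not in LABELS:
        raise ValueError(f"unknown label: {example.label!r}")
    return {"prompt": build_prompt(example), "completion": " " + example.label}


def parse_prediction(generated: str) -> Optional[str]:
    """Map generated text back to a label, or None if no label is recognised."""
    text = generated.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return None
```

Records in this shape would still need to be tokenized and passed to whichever fine-tuning framework is used; that step depends on the chosen tooling and is left out here.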