SUPR
LLMs for Code Understanding
Dnr:

NAISS 2024/22-612

Type:

NAISS Small Compute

Principal Investigator:

Simin Sun

Affiliation:

Göteborgs universitet

Start Date:

2024-04-25

End Date:

2025-05-01

Primary Classification:

10205: Software Engineering

Webpage:

Allocation

Abstract

Large Language Models (LLMs) such as GPT-4 or LLaMA-2 are increasingly used for software engineering tasks, powering tools such as GitHub Copilot. Their use for these tasks rests on the assumption that programmers express their design in a vocabulary close to the problem domain -- the so-called naturalness hypothesis. In this project, we evaluate this hypothesis by studying whether large language models recognize the domain vocabulary of a given program better than its programming language. We analyze the Rosetta Code repository with four language models in three different scenarios.
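
A minimal sketch of the kind of probe described above, assuming a Hugging Face causal language model; the model name, prompt wording, and code snippet are illustrative placeholders, not the project's actual pipeline.

```python
# Ask a causal LM (a) what problem a program solves and (b) what language it is
# written in, mirroring the domain-vs-language comparison in the abstract.
# "gpt2" is a placeholder model; the project's models and prompts may differ.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

snippet = '''
def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))
'''

# One prompt probes the domain vocabulary, the other the programming language.
prompts = {
    "domain": f"What problem does this program solve?\n{snippet}\nAnswer:",
    "language": f"What programming language is this program written in?\n{snippet}\nAnswer:",
}

for name, prompt in prompts.items():
    out = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    print(name, "->", out[len(prompt):].strip())
```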