Classification of sustainability-aware developer data on GitHub
Dnr:

NAISS 2026/4-556

Type:

NAISS Small

Principal Investigator:

Andrea Gilot

Affiliation:

Uppsala universitet

Start Date:

2026-03-26

End Date:

2027-04-01

Primary Classification:

10205: Software Engineering

Abstract

Rapid advances in Large Language Models (LLMs) have increased attention to the data used to train them. A critical step towards understanding this data is examining the social, environmental, and technical impacts of online software development. To that end, this project aims to systematically identify and analyse sustainability-related practices in public software repositories. We propose to fine-tune a language model on a curated dataset, which we have previously annotated, of commits and pull request discussions that capture trade-offs in energy usage. Our preliminary experiments demonstrate the feasibility of this approach. For evaluation, we will compare both the performance and scalability of our method against state-of-the-art LLMs. We will then apply the trained classifier to a corpus of seven million commits that we have mined and pre-filtered from GitHub, as well as to a dataset of pull request discussions that we are currently mining. The resulting dataset will enable researchers and practitioners to identify patterns, trade-offs, and opportunities for making sustainable choices in software systems. We will complement this with a qualitative analysis that provides empirical insights into current sustainable software practices. We will release both the dataset and the trained model to support reproducibility and facilitate further research. The PI's advisor is Eva Darulova and the co-PI's advisor is Sofia Ouhbi, both in the Department of Information Technology at Uppsala University.