SUPR
Natural Language Processing Research (year 4-5)
Dnr:

NAISS 2024/23-196

Type:

NAISS Small Storage

Principal Investigator:

Lovisa Hagström

Affiliation:

Chalmers tekniska högskola

Start Date:

2024-04-01

End Date:

2025-04-01

Primary Classification:

10208: Language Technology (Computational Linguistics)

Webpage:

Allocation

Abstract

During years 4-5 of my PhD studies I intend to investigate and work with retrieval-augmented language models and interpretability methods for NLP. This involves models such as Atlas (https://github.com/facebookresearch/atlas#corpora) and LLaMA (https://github.com/meta-llama/llama). Retrieval-augmented models perform passage retrieval over a corpus such as a full Wikipedia 2018 dump, encoding the passages as dense indices so that relevant passages can be retrieved efficiently. To avoid redundant computation, the precomputed indices can be cached for future use. This requires considerable storage: the precomputed indices for Wikipedia 2018 amount to about 370 GB (and take about 2 hours to precompute on 4 A40:4 nodes). Since I intend to work with different retrieval sources and wish to recompute the indices as few times as possible, I need additional storage space to accommodate them. I expect to work with up to three different retrieval sets at the same time, requiring about 400 GiB x 3 = 1200 GiB in total.