Natural Language Processing Research (year 3-4)
Dnr: NAISS 2023/23-168
Type: NAISS Small Storage
Principal Investigator: Lovisa Hagström
Affiliation: Chalmers tekniska högskola
Start Date: 2023-03-13
End Date: 2024-04-01
Primary Classification: 10208: Language Technology (Computational Linguistics)

Webpage:

Abstract

Reply to comment from Thomas Svedberg: Hi Thomas, it is correct that I only use Alvis. I see no problem with moving the project to Mimer, as long as the members of the current project can be transferred to the new one. I have updated my project application accordingly. All the best, Lovisa

______________________________________________________________________________

I intend to investigate and work with retrieval-augmented language models and neuro-symbolic models during my 3rd to 4th year of PhD studies. These models include e.g. Atlas (https://github.com/facebookresearch/atlas#corpora) and potentially TIARA (https://github.com/microsoft/KC/tree/main/papers/TIARA). Retrieval-augmented models perform passage retrieval over, e.g., a full Wikipedia 2018 dump, using dense indices to encode the information in the passages for efficient retrieval. To reduce the number of computations needed, one can cache precomputed indices for future use.

This requires considerable storage space: the precomputed indices for Wikipedia 2018 amount to about 370 GB (and take about 2 hours to compute on four A40:4 nodes). Since I intend to work with several different retrieval sources and wish to recompute the indices as rarely as possible, I need additional storage space. I expect to work with up to four retrieval sets at the same time, requiring about 4 × 400 GiB = 1600 GiB in total. Apart from the retrieval-augmented models, I also intend to work with so-called KBQA models that perform retrieval over knowledge bases such as Freebase (about 50 GiB), which also need to be stored.
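
For concreteness, the sketch below shows the precompute-once, cache-on-disk pattern that motivates the storage request: a dense passage index is built on the first run and re-loaded from disk afterwards. It is a minimal illustration only, using FAISS and a generic sentence-transformers encoder as stand-ins; the model name, cache path, and helper function are hypothetical, and Atlas has its own retriever and index format.

```python
# Minimal sketch: precompute a dense passage index once, cache it on disk,
# and re-load it on later runs. FAISS + sentence-transformers are stand-ins
# for the project's actual retriever; path and model name are hypothetical.
import os

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

INDEX_PATH = "cache/wiki2018.faiss"  # hypothetical cache location


def build_or_load_index(passages, model_name="all-MiniLM-L6-v2"):
    """Encode passages into dense vectors and cache the FAISS index."""
    if os.path.exists(INDEX_PATH):
        # Re-use the cached index instead of re-encoding the whole corpus.
        return faiss.read_index(INDEX_PATH)

    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(passages, convert_to_numpy=True)
    embeddings = np.asarray(embeddings, dtype="float32")
    faiss.normalize_L2(embeddings)  # cosine similarity via inner product

    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    os.makedirs(os.path.dirname(INDEX_PATH), exist_ok=True)
    faiss.write_index(index, INDEX_PATH)  # the cached artifact on disk
    return index


if __name__ == "__main__":
    passages = [
        "Paris is the capital of France.",
        "The Amazon is the largest rainforest on Earth.",
    ]
    index = build_or_load_index(passages)

    # Query the cached index with an encoded question.
    query = SentenceTransformer("all-MiniLM-L6-v2").encode(
        ["What is the capital of France?"], convert_to_numpy=True
    ).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 1)
    print(ids[0], scores[0])
```

The serialized index file is the artifact that dominates the per-source storage footprint at Wikipedia scale. A flat index is used above only for clarity; corpus-scale retrievers would typically use quantized or clustered FAISS index types to trade some accuracy for size.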