SUPR
Text and speech mining on GPU
Dnr:

NAISS 2025/22-660

Type:

NAISS Small Compute

Principal Investigator:

Johan Frid

Affiliation:

Lunds universitet

Start Date:

2025-05-01

End Date:

2026-05-01

Primary Classification:

10208: Natural Language Processing

Webpage:

Allocation

Abstract

The purpose of 'Text and speech mining on GPU' is to process language data in the form of text, speech recordings and images of (handwritten) text in order to 1) convert speech and images to text and 2) discover information, extract meaningful aspects of the text, and make the information contained in the language material accessible to various data mining (statistical and machine learning) algorithms. This includes clustering texts into related subgroups, classifying texts according to their content, and extracting and categorizing certain pieces of information, also called entities, such as place names, person names, temporal expressions, commodities, etc.

Speech is analysed mainly through non-commercial models such as Whisper and Pyannote. Whisper converts speech to text; Pyannote provides information on who is talking in a multi-party conversation. This combination makes them particularly useful for analysing podcasts, something users of my workplace, the Lund University Humanities Lab, increasingly request. Whisper has also been used by a project co-investigator to transcribe Syrian Arabic speech for his PhD thesis. In another project, we are working with Folklivsarkivet at Lund University. They have a large collection of handwritten documents, Manuskriptarkivet, which they want to convert to computer-readable text in order to index it and make it searchable.

Mining language-related material typically applies machine learning techniques such as clustering, classification, and predictive and generative modeling. These techniques uncover meaning and relationships in the underlying content. Text and speech mining is used in areas such as linguistics and language studies, phonetics, cognitive sciences, life sciences, archiving and documentation, and journalism. Speech and text processing involves using large pretrained models of vectorised data and then fine-tuning them on particular tasks. This work benefits greatly from GPUs through the Huggingface transformers library, which facilitates and speeds up the use of many state-of-the-art algorithms, methods and models for processing language material. We will use it for tasks such as term/feature extraction, sentence similarity calculations, document clustering and speech transcription.

The PI has a PhD in Phonetics, has been involved in numerous speech and language technology projects, and is linked to the VR-funded Språkbanken CLARIN and Huminfra national infrastructures. In summary, this project will continue to investigate the possibilities of deriving high-quality information from large collections of language-related material through state-of-the-art machine learning.
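
As a rough illustration of the Whisper + Pyannote combination described above, the following minimal sketch transcribes an audio file and attaches a speaker label to each transcribed segment. The model names, the file name and the Hugging Face access token are placeholders, and the simple overlap-based merging of transcript segments and speaker turns is an assumption for illustration, not a method prescribed by either library.

import whisper
from pyannote.audio import Pipeline

# Speech-to-text with Whisper (model size and file name are placeholders).
asr_model = whisper.load_model("large-v3")
asr_result = asr_model.transcribe("podcast_episode.wav")

# Speaker diarization with Pyannote (requires a Hugging Face access token).
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)
diarization = diarization_pipeline("podcast_episode.wav")

# Collect speaker turns as (start, end, speaker) tuples.
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

def dominant_speaker(start, end):
    """Return the speaker whose turns overlap a transcript segment the most."""
    overlaps = {}
    for t_start, t_end, speaker in turns:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > 0:
            overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
    return max(overlaps, key=overlaps.get) if overlaps else "UNKNOWN"

# Attach a speaker label to each transcribed segment.
for seg in asr_result["segments"]:
    speaker = dominant_speaker(seg["start"], seg["end"])
    print(f"[{seg['start']:7.1f}s {speaker}] {seg['text'].strip()}")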
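
The abstract does not name a specific model for converting images of handwritten text to machine-readable text. One possible GPU-based approach through the Huggingface transformers library is a TrOCR checkpoint, sketched below; the checkpoint and file path are placeholders, and recognising the Manuskriptarkivet material would require a model fine-tuned on that handwriting and language.

import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder TrOCR checkpoint for handwritten text recognition.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained(
    "microsoft/trocr-base-handwritten"
).to(device)

# One pre-segmented text line from a scanned manuscript page (placeholder path).
image = Image.open("manuscript_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

generated_ids = model.generate(pixel_values)
line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(line_text)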
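
For the text mining tasks mentioned above (sentence similarity calculations and document clustering), a minimal sketch using pretrained sentence embeddings is given below. The embedding model, the example documents and the number of clusters are placeholder assumptions; any suitable Swedish or multilingual model could be substituted.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents standing in for transcribed or digitised material.
documents = [
    "Transcribed podcast segment about local food traditions.",
    "Archival note on harvest customs.",
    "Interview excerpt about fishing villages on the west coast.",
]

# Multilingual sentence-embedding model (placeholder choice).
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(documents, normalize_embeddings=True)

# Pairwise sentence similarity.
similarity = cosine_similarity(embeddings)
print(similarity)

# Group documents into related subgroups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)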