Text and speech mining on GPU

Dnr:

NAISS 2024/22-592

Type:

NAISS Small Compute

Principal Investigator:

Johan Frid

Affiliation:

Lunds universitet

Start Date:

2024-05-01

End Date:

2025-05-01

Primary Classification:

10208: Language Technology (Computational Linguistics)

Webpage:

Allocation

Mimer at C3SE: 500 GiB
Alvis at C3SE: 250 GPU-h/month

Abstract

The purpose of text and speech mining is to process unstructured information, extract meaningful numeric indices from the text, and thus make the information contained in language material accessible to data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents, or to compute summaries for the documents based on the words they contain. Hence, one can analyze words and clusters of words used in documents, or analyze the documents themselves and determine similarities between them or how they relate to other variables of interest in the data mining project. In the most general terms, text and speech mining will "turn language into numbers", which can then be incorporated into other analyses such as predictive data mining projects or the application of unsupervised learning methods (clustering).
Mining language-related material typically applies machine learning techniques such as clustering, classification, association rules and predictive modeling. These techniques uncover meaning and relationships in the underlying content. Text and speech mining is used in areas such as competitive intelligence, life sciences, voice of the customer, media and publishing, legal and tax, law enforcement, sentiment analysis and trend-spotting.
Recent developments in both speech and text processing involve taking large pretrained models of vectorised data and fine-tuning them on particular tasks. This project is a continuation of an earlier project. Its novelty is the use of GPUs through the Hugging Face Transformers library, which facilitates the use of many state-of-the-art algorithms, methods and models for processing language material. We will use it for tasks such as term/feature extraction, sentence similarity calculation, document clustering and speech transcription.
The PI has a PhD in Phonetics, has been involved in numerous speech and language technology projects, and is linked to the Swe-Clarin national infrastructure. In summary, this project will investigate the possibilities of deriving high-quality information from large collections of language-related material through statistical pattern learning. The target language will be Swedish.
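As a rough illustration of what "turning language into numbers" can mean for a task like sentence similarity, the sketch below maps each sentence to a vector of word counts and compares the vectors with cosine similarity. This is a deliberately simplified stand-in, not the project's method: the project uses pretrained models through the Hugging Face Transformers library, whereas this example relies only on the Python standard library. The function names and the Swedish example sentences are hypothetical and were invented for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Swedish example sentences, invented for illustration.
documents = [
    "Katten sover på soffan",
    "Katten sover på mattan",
    "Regeringen presenterar en ny budget",
]

# Each document becomes a numeric representation (term -> count).
vectors = [term_counts(d) for d in documents]

sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
print(f"similar pair: {sim_close:.2f}, unrelated pair: {sim_far:.2f}")
```

Word counts capture only surface overlap between sentences. Embeddings from a pretrained model would replace `term_counts` in a more realistic setup, while the similarity step could stay the same.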