SACCADE – Semantic-Aware Compactor for Curating Data Lakes
Dnr: NAISS 2024/22-1517
Type: NAISS Small Compute
Principal Investigator: Pierre Lamart
Affiliation: Göteborgs universitet
Start Date: 2024-11-26
End Date: 2025-12-01
Primary Classification: 10202: Information Systems (Social aspects to be 50804)

Abstract

SACCADE addresses the uncontrolled growth of data lakes and aims to support their sustainable curation with a theoretical concept and a validated prototype. Over the past decade, AI/ML-enabled solutions have attracted considerable attention, but they require large amounts of good-quality data to train and evaluate new models and to reach acceptable, safe performance. While blindly collecting and storing large amounts of data takes little effort, it makes datasets heavy and expensive to use, and a dataset that grows uncontrollably becomes unsustainable and inefficient to query for relevant information. Although quantity is necessary, it is not synonymous with quality, and the need for carefully documented datasets has become apparent. Analyzing the quality and diversity within a dataset matters both for safe industrial deployment and for understanding and improving a model's output.

With this in mind, we want to evaluate the added value that new data can provide to a dataset. This would let us enrich the dataset with valuable new data while avoiding redundant data that makes the dataset heavier for minimal benefit. The basic idea is to develop an end-to-end pipeline with two main components:

• A careful and efficient analysis of the new data. This will most likely be complex multi-modal data collected from multiple sensors. One idea is to combine multi-modal fusion with embedding to obtain a rich vector representation of the new data, preserving its semantics in a compact and efficient form.

• Proper storage and retrieval of the data within the data lake. The goal is a fast and efficient way to inspect the stored information, both when the dataset is in use and when evaluating the added value of new data. One idea is to enrich the metadata layer of the data lake to better describe the stored data and the scenarios it represents (both components are illustrated in the sketch below).
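To make the pipeline concrete, here is a minimal Python sketch of how the two components could interact. It is illustrative only and not part of the proposal: embed() is a hypothetical placeholder for a real multi-modal fusion encoder, the added-value score (one minus the maximum cosine similarity to embeddings already in the lake) and the 0.15 novelty threshold are assumptions, and all names are invented for the example.

```python
import numpy as np

# Hypothetical placeholder for a multi-modal fusion encoder: a real system
# would fuse sensor streams (camera, lidar, ...) into one embedding vector.
# Here we just derive a deterministic unit vector from the sample's fields.
def embed(sample: dict, dim: int = 128) -> np.ndarray:
    seed = abs(hash(frozenset(sample.items()))) % (2**32)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def added_value(new_emb: np.ndarray, index: np.ndarray) -> float:
    """1 - max cosine similarity to stored embeddings: a high score means
    the sample covers a scenario the data lake does not yet represent."""
    if index.shape[0] == 0:
        return 1.0
    return float(1.0 - (index @ new_emb).max())  # rows are unit-normalised

def maybe_ingest(sample, index, metadata, threshold=0.15):
    """Store the sample's embedding and a metadata record only if its
    added-value score clears the (assumed) novelty threshold."""
    emb = embed(sample)
    score = added_value(emb, index)
    if score >= threshold:
        index = np.vstack([index, emb])
        metadata.append({"scenario": sample.get("scenario"),
                         "value": round(score, 3)})
    return index, metadata, score

index = np.empty((0, 128))   # embedding index of the lake's metadata layer
metadata = []                # enriched records describing stored scenarios
for s in [{"scenario": "rain"}, {"scenario": "rain"}, {"scenario": "fog"}]:
    index, metadata, _ = maybe_ingest(s, index, metadata)
print(metadata)  # the duplicate "rain" sample is rejected; "fog" is kept
```

In a real deployment the threshold and distance metric would need calibration against labelled scenario data, and the embedding index would live in the data lake's metadata layer (for instance a vector store) rather than in memory.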