The main aim of this project is to assess how major microbiome-disrupting events in early life affect the long-term gut microbiome, measured during young adulthood. In MiApp, the long-term effect of childhood appendectomy is evaluated, while in MiUro all participants were exposed to prolonged antibiotic treatment (1-2 years) during early childhood as part of the management of vesico-urethral reflux. Brusselaers is co-PI of both projects, with clinical PIs Helene Lilja (MiApp) and Tryggve Neveus/Goran Läckgren (MiUro).
The project team is implementing structured, team-wide storage policies to ensure efficient and responsible use of NAISS storage resources. All team members follow these practices from the start of the allocation period. Under these policies, only irreplaceable data (raw data, curated processed datasets, and final analysis outputs) are stored in backed-up storage, while intermediate and reproducible analysis files (e.g. temporary objects, transformed datasets, logs, and intermediate outputs) are systematically stored in no-backup storage and cleaned regularly. R libraries are documented and reinstalled on the target system; no R binaries or libraries are transferred.
The project involves the analysis of microbiome datasets (mainly MiUro and MiApp, with SweMaMi as external comparison) across three cohorts and two parallel analyses. No new primary data will be generated during the allocation period; instead, existing datasets will be integrated and re-analysed. The active dataset size is estimated at approximately 107.5 GiB, comprising approximately 300,000 files. Due to the current project phase, which involves concurrent analyses across multiple cohorts, approximately 80% of the data is expected to reside in backed-up storage, while the remaining fraction corresponds to intermediate and temporary files stored in no-backup storage.
The raw input data are stored once and reused across analyses, without duplication between workflows. However, microbiome analysis pipelines generate substantial intermediate data, which are reproducible but can be large and may need to be retained in multiple versions during validation and comparison steps. To conservatively account for peak storage usage during these concurrent analyses, and applying the expected expansion factor, the total estimated storage requirement is around 1000 GiB (with file counts on the order of 10^5).
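The arithmetic behind the request can be checked with a short sketch. The expansion factor of roughly 9.3x is an assumed value chosen to be consistent with the 107.5 GiB active dataset and the ~1000 GiB peak figure quoted above; it is not stated explicitly in the project plan.

```python
# Back-of-the-envelope check of the requested storage volume.
active_gib = 107.5         # current active dataset size
expansion_factor = 9.3     # assumed peak multiplier from pipeline intermediates
peak_gib = active_gib * expansion_factor

backed_up_gib = 0.8 * active_gib  # ~80% of active data in backed-up storage

print(round(peak_gib))  # prints 1000
```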
This estimate intentionally overstates steady-state storage needs to ensure sufficient capacity; as intermediate files are cleaned according to the project’s storage policy, the effective storage footprint is expected to decrease toward typical values. Storage demand is otherwise expected to remain approximately constant over time, given that no new primary data are generated and datasets are reused across analyses.