SUPR
Developing an efficient NLP pipeline for biomedical data extraction
Dnr:

NAISS 2023/22-1044

Type:

NAISS Small Compute

Principal Investigator:

Rasool Saghaleyni

Affiliation:

Chalmers tekniska högskola

Start Date:

2023-10-06

End Date:

2024-11-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Abstract

The volume of biomedical and clinical literature is constantly increasing [Chang et al., 2022]. It is therefore necessary to develop effective NLP methods that handle unstructured text for tasks such as information retrieval and information extraction [Erdengasileng et al., 2022]. In addition, identifying biomedical entities such as genes, proteins, cell types, and cell lines, and the relations between them, using Named Entity Recognition (NER) and Relation Detection (RD) is recognized as a challenging task [Abdurxit et al., 2022; Perera et al., 2020]. Given the emerging developments in genome editing that led to the 2020 Nobel Prize in Chemistry [Anzalone et al., 2020], NLP tools that automatically extract and categorize associations between genes, diseases, tissues, and genome-engineered cell lines have promising prospects in healthcare research and industry.

To this end, we see an opportunity for a project that combines currently available supervised and unsupervised NLP models and examines which methodologies extract valuable biomedical information from scientific publications with the highest accuracy. To train and test the models, we have already generated a manually curated database of the information the algorithms need to extract. The pipeline could be valuable for accelerating work on currently open research questions in biotechnology and has high potential to be translated into a popular service.

References

Chang, L., Zhang, R., Lv, J., Zhou, W., & Bai, Y. (2022). A review of biomedical named entity recognition. J. Comput. Methods Sci. Eng., 22, 893-900.

Erdengasileng, A., Han, Q., Zhao, T., Tian, S., Sui, X., Li, K., Wang, W., Wang, J., Hu, T., Pan, F., Zhang, Y., & Zhang, J. (2022). Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification. Database: The Journal of Biological Databases and Curation, 2022.

Perera, N., Dehmer, M., & Emmert-Streib, F. (2020). Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Frontiers in Cell and Developmental Biology, 8.

Abdurxit, M., Tohti, T., & Hamdulla, A. (2022). An Efficient Method for Biomedical Entity Linking Based on Inter- and Intra-Entity Attention. Applied Sciences.

Anzalone, A.V., Koblan, L.W., & Liu, D.R. (2020). Genome editing with CRISPR–Cas nucleases, base editors, transposases and prime editors. Nature Biotechnology, 38, 824-844.