Lymphoma is a type of cancer that affects lymphocytes, cells of the immune system. These cells are found in several organs, such as the lymph nodes, spleen, thymus and bone marrow.
Lymph nodes are typically the site examined to diagnose lymphoma.
Currently, lymphoma is usually diagnosed by biopsy. This method has some weaknesses, such as the invasiveness of the procedure and the delay between sample collection and diagnosis.
With the advent of AI-based technologies and the availability of massive amounts of medical data and computational resources, new diagnostic approaches have been developed in oncology. Medical imaging techniques can be combined with AI methods to build accurate and reliable cancer detection and prediction algorithms. Compared with a standard biopsy, AI-based techniques offer advantages such as faster processing, non-invasiveness and high accuracy.
In oncology, a widely used imaging technique is PET-CT, which combines two acquisitions: a CT scan, which produces a 3D anatomical image of the patient, and a PET scan, which records the patient's metabolic activity and maps it to a 3D volumetric representation.
Several studies have shown the potential of Deep Learning (DL)-based methods applied to medical imaging tasks such as cancer prediction, detection and segmentation. To classify the presence of lymphoma correctly, the first step of a fully automated workflow needs to focus on lymph node detection in PET-CT.
In this study, a Vision Transformer (ViT)-based model will be employed to predict, from an initial estimate, the 3D bounding boxes of the lymphoma lesions in the PET-CT. In detail, the network architecture comprises two transformer encoders and one transformer decoder. The primary encoder branch will learn the cross-relations among an initial estimate of 100 bounding boxes, and this information will be fused with the output of a Swin transformer that learns spatial relationships in the PET-CT volumes. The fused block (ViT + Swin) will then be forwarded to the transformer decoder block, where, by means of a bipartite matching loss function, the network learns to retain only plausible bounding boxes around the lymphoma lesions.
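To illustrate the idea behind a bipartite matching loss, the following minimal sketch assigns each ground-truth lesion box to exactly one predicted box so that the total matching cost is minimal; predictions left unmatched would be treated as "no lesion". It is not the implementation used in this study. The box format (x_min, y_min, z_min, x_max, y_max, z_max), the cost (L1 distance plus one minus IoU), the function names and the toy coordinates are all assumptions made for illustration. The brute-force search is only practical for a handful of boxes; with 100 predictions, an assignment solver such as the Hungarian algorithm would be used instead.

```python
from itertools import permutations

# Assumed box format: (x_min, y_min, z_min, x_max, y_max, z_max) in voxels.


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def match_cost(pred, target):
    """Hypothetical pairwise cost: L1 coordinate distance plus (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1 + (1.0 - iou_3d(pred, target))


def bipartite_match(preds, targets):
    """Brute-force optimal one-to-one assignment of targets to predictions.

    Returns (pairs, total_cost), where pairs is a list of
    (prediction_index, target_index) tuples.
    """
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_assign = float("inf"), ()
    # Each candidate picks a distinct prediction for every target.
    for assign in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return [(p, t) for t, p in enumerate(best_assign)], best_cost


# Toy example: four predicted boxes, two ground-truth lesions.
preds = [
    (0, 0, 0, 10, 10, 10),
    (20, 20, 20, 30, 30, 30),
    (50, 50, 50, 60, 60, 60),
    (21, 19, 20, 31, 29, 30),
]
targets = [
    (20, 20, 20, 30, 30, 30),
    (1, 0, 0, 11, 10, 10),
]

matches, total_cost = bipartite_match(preds, targets)
matched_preds = {p for p, _ in matches}
unmatched = [i for i in range(len(preds)) if i not in matched_preds]
print("matches (pred, target):", matches)
print("unmatched predictions:", unmatched)
```

In this toy case, the two predictions closest to the ground-truth boxes are matched and the remaining two are left unmatched. In a training loop, the box regression loss would be computed only over the matched pairs, while unmatched predictions would be pushed towards the "no lesion" class.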