Symbolic representations of scenes, also known as Scene Graphs, can be used in various downstream tasks, such as Visual Question Answering (VQA) or Image Captioning, to understand scene dynamics at a fine-grained level. Recently, Scene Graphs have also been adopted for reasoning in embodied agents and in Robotics. These new applications require real-time, low-resource approaches, which has sparked the new field of Real-Time Scene Graph Generation.
In this work, we address this need by proposing a new model for Real-Time Scene Graph Generation based on the latest developments in transformer architectures (DINO, Deformable Transformers, Low-Rank Adapters, etc.). We aim to push the boundaries of relationship modeling while maintaining a low parameter count, for an efficient trade-off between accuracy and latency. We will explore different methods, such as hyperparameter tuning, fine-tuning, and low-rank adaptation of object detector backbones.