Conventional object detectors are built upon the assumption that the model will only encounter ‘known’ object classes, i.e., those it has encountered during training. Recently, the problem of open-world object detection (OWOD) has received attention, where the objective is to detect both known and ‘unknown’ objects and then incrementally learn these ‘unknown’ objects once they are introduced with labels in subsequent tasks. In this problem setting, the newly identified unknowns are first forwarded to a human oracle, which can label new classes of interest from the set of unknowns. The model then continues to learn and update its understanding with the new classes without retraining from scratch on the previously known data. Thus, the model is expected to identify and subsequently learn new object classes in an incremental manner as new data arrives. Although the standard OWOD setting provides the flexibility to detect unknown object categories and then incrementally learn new ones, the general problem of incremental learning of new classes requires training in a fully supervised manner. To this end, current OWOD approaches rely on strong oracle support to consistently label all identified unknowns with their respective semantic classes and precise box locations.
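To make the setting concrete, the sketch below outlines the standard OWOD loop in minimal Python; the `Detector`, `Oracle`, and `owod_loop` names and the detection dictionaries are purely illustrative placeholders under our reading of the setting, not an implementation from any existing OWOD method.

```python
# Illustrative sketch of the standard OWOD protocol described above.
# All classes and functions here are hypothetical placeholders.

class Detector:
    """Toy open-world detector that tracks its set of known classes."""

    def __init__(self, known_classes):
        self.known_classes = set(known_classes)

    def detect(self, image):
        # Would return detections as dicts {"class": ..., "box": ...};
        # objects outside `known_classes` are reported as 'unknown'.
        return []  # placeholder

    def incremental_update(self, labelled_unknowns):
        # Learn the newly labelled classes without retraining from scratch
        # on previously known data.
        self.known_classes.update(d["class"] for d in labelled_unknowns)


class Oracle:
    """Human annotator: in the standard fully supervised OWOD setting,
    every identified unknown receives a semantic class and a precise box."""

    def label(self, unknown_detections):
        return [dict(det, **{"class": "new_class"}) for det in unknown_detections]


def owod_loop(detector, oracle, tasks):
    for task_images in tasks:                      # tasks arrive sequentially
        unknowns = []
        for image in task_images:
            for det in detector.detect(image):
                if det["class"] == "unknown":      # flagged for the oracle
                    unknowns.append(det)
        labelled = oracle.label(unknowns)          # strong oracle reliance
        detector.incremental_update(labelled)      # incremental class update
```

The point of the sketch is the last two calls: the loop assumes the oracle labels every identified unknown, which is precisely the dependence this project seeks to relax.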
In this project, we strive to decrease the aforementioned reliance on the human oracle to provide annotations at run time for the unknown classes, as well as during the continual learning stages. We argue that it is less realistic to assume that an interacting oracle will provide annotations for large amounts of data. The annotation problem becomes especially laborious in domains requiring a much higher number of dense oriented box annotations, in the presence of background clutter and small object sizes. We aim to extend the realistic open-world setting to the 3D visual recognition field, with potential real-world applications in robotics and autonomous systems. Furthermore, we aim to investigate the development of efficient and robust foundational vision-language models for learning novel visual recognition tasks at scale in realistic real-world settings.