Enhancing Robotic Automation with Vision-Language Models for Failure Detection
Dnr:

NAISS 2024/22-1220

Type:

NAISS Small Compute

Principal Investigator:

Faseeh Ahmad

Affiliation:

Lunds universitet

Start Date:

2024-09-23

End Date:

2025-10-01

Primary Classification:

10207: Computer Vision and Robotics (Autonomous Systems)

Webpage:

Abstract

This project aims to enhance robotic task execution by integrating advanced Vision-Language Models (VLMs) to detect and identify failures during operation. The focus is on the peg-in-a-hole task, a fundamental robotic challenge that involves inserting a peg into a hole in a box-shaped object. Traditional approaches based on Behavior Trees (BTs) often struggle with unexpected conditions, such as an obstacle blocking the hole, leading to task failures. Our approach addresses this limitation by using CLIP, a state-of-the-art VLM, to provide real-time feedback and adaptive responses. CLIP interprets visual input from cameras and associates it with descriptive text, allowing it to describe what it sees even in new, untrained scenarios.

Before executing the planned BT for the task, the system captures the environment's current state with a camera. CLIP processes these images to detect issues such as obstacles or misalignments and provides feedback that informs adjustments to the BT. For example, if the hole is blocked, CLIP can identify the problem and suggest removing the obstacle before attempting the task again.

The project's main goals are to:

1. Train CLIP on task-specific data to improve its ability to recognize and describe task failures.
2. Integrate CLIP's outputs with the BT framework, allowing the robot to adapt its behavior based on real-time visual feedback.
3. Test and validate the system in various scenarios to ensure that it improves task success rates and can handle different types of failures.

To achieve these goals, we will fine-tune CLIP on a dataset of images and videos showing both successful and failed task executions, labeled with descriptions of what the robot sees and the steps it takes, including pre-conditions and post-conditions for each action. This training requires significant computational power, particularly for adapting CLIP to the specific task scenarios, which is where the requested resources are critical.

The computational resources will be used for training and fine-tuning CLIP and for integrating it with the BT framework. GPU processing power is needed because of the model's complexity and the volume of data required to teach CLIP to accurately detect and describe task failures. The resources will also support the iterative testing process, in which the robot is repeatedly tested in real-world scenarios to refine its performance.

The outcome of this project will be a more flexible and reliable robotic system that can autonomously detect and respond to failures, making it well-suited for dynamic and unstructured environments. By enhancing the robot's ability to see, understand, and react to its surroundings in real time, we aim to significantly improve the success rate of tasks like peg-in-a-hole and beyond, ultimately pushing the boundaries of what autonomous robots can achieve.
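
As a rough illustration of the runtime check described above, the sketch below scores a single camera frame against a few hand-written scene descriptions with a pretrained CLIP checkpoint and maps the result to a BT adjustment. It is a minimal sketch, not the project's implementation: the checkpoint name, prompts, image path, and the recovery mapping are illustrative placeholders, and the Hugging Face transformers and Pillow packages are assumed to be available.

# Sketch: zero-shot failure detection with CLIP before executing the peg-in-a-hole BT.
# Assumptions: the prompts, checkpoint name, file path, and recovery mapping below
# are illustrative placeholders, not the final design.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

# One candidate description per scene state: the nominal case plus failure modes.
SCENE_PROMPTS = [
    "the hole is clear and the peg can be inserted",   # nominal
    "an obstacle is blocking the hole",                # blocked hole
    "the peg is misaligned with the hole",             # misalignment
    "the peg is missing from the gripper",             # lost peg
]

# Hypothetical mapping from a detected failure to the BT adjustment to apply.
RECOVERY_SUBTREE = {
    "an obstacle is blocking the hole": "add_remove_obstacle_subtree",
    "the peg is misaligned with the hole": "add_realign_subtree",
    "the peg is missing from the gripper": "add_regrasp_subtree",
}

def assess_scene(image_path):
    """Score the camera image against the candidate descriptions, return the best match."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=SCENE_PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return SCENE_PROMPTS[best], float(probs[best])

if __name__ == "__main__":
    description, confidence = assess_scene("camera_frame.png")
    print(f"CLIP scene assessment: '{description}' (p={confidence:.2f})")
    print("BT adjustment:", RECOVERY_SUBTREE.get(description, "none, run nominal BT"))

In the actual system the prompts would be derived from the pre- and post-conditions of the BT actions, and the lookup would trigger concrete subtree insertions rather than returning a string; the sketch only shows the shape of the CLIP-to-BT interface.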
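
The fine-tuning work that motivates the GPU request could follow the pattern sketched below, assuming labeled image-caption pairs of the kind described above. The dataset layout, checkpoint name, and hyperparameters are placeholders; the actual runs will add validation, checkpointing, and multi-GPU training on the requested allocation.

# Sketch: fine-tuning CLIP on peg-in-a-hole images paired with short success/failure
# descriptions, using the symmetric image-text contrastive loss built into the model.
# Dataset paths, captions, and hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

class PegInHoleDataset(Dataset):
    """Pairs of (image path, caption), e.g. ('frame_0042.png', 'an obstacle is blocking the hole')."""
    def __init__(self, samples):
        self.samples = samples  # list of (image_path, caption) tuples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption = self.samples[idx]
        return Image.open(path).convert("RGB"), caption

def collate(batch, processor):
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True)

def finetune(samples, epochs=5, lr=1e-6, batch_size=16, device="cuda"):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    loader = DataLoader(PegInHoleDataset(samples), batch_size=batch_size,
                        shuffle=True, collate_fn=lambda b: collate(b, processor))
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for epoch in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # return_loss=True makes CLIP return its contrastive image-text loss.
            loss = model(**batch, return_loss=True).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
    return model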