Over the past two decades, machine learning (ML) has come to dominate the field of computer vision and has accelerated its progress at an unprecedented rate. Despite this strong performance, deploying ML-based vision systems in sensitive domains has raised long-standing concerns, including fairness and explainability (e.g., in healthcare and policymaking), safety (e.g., in autonomous vehicles), and privacy (e.g., in surveillance and recommender systems).
Adversarial attacks are a well-known vulnerability of ML-based vision systems. Through an optimization process that leverages gradient information from the victim model, an attacker can craft a perturbation that misleads the model into producing a faulty output, even in domains where the model is known to perform well. Moreover, such perturbations can be deployed in both the digital and the physical domain, and they can even be optimized to evade human supervision, either by being visually imperceptible (in the digital domain) or by resembling natural, inconspicuous patterns (in the physical domain).
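To make the gradient-based optimization concrete, the following is a minimal sketch of an untargeted L-infinity projected gradient descent (PGD) attack, which repeatedly takes signed gradient steps that increase the loss and projects the result back into an ε-ball around the clean input. It assumes a hypothetical toy victim (a logistic-regression classifier whose input gradient can be written in closed form); the weights, inputs, and the names `pgd_linf`, `eps`, `alpha`, and `steps` are illustrative assumptions, not part of this project. For deep vision models the input gradient would instead be obtained by backpropagation, and the adversarial input would usually also be clipped to the valid pixel range.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob_class1(w, b, x):
    """Hypothetical toy victim: logistic regression returning P(y = 1 | x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_input_grad(w, b, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. the INPUT x: (p - y) * w."""
    p = prob_class1(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Untargeted L-infinity PGD: ascend the loss, then project onto the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_input_grad(w, b, x_adv, y)
        # Signed gradient step that increases the loss of the true label y.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip each coordinate back into [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Usage with made-up weights and a clean input that the toy model labels as class 1.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, 0.1, 0.2], 1
x_adv = pgd_linf(w, b, x, y, eps=0.2, alpha=0.05, steps=10)
print(f"clean P(y=1) = {prob_class1(w, b, x):.3f}")        # 0.646
print(f"adversarial P(y=1) = {prob_class1(w, b, x_adv):.3f}")  # 0.475
```

In this toy case no coordinate moves by more than 0.2, yet the predicted class flips; the same loop applies unchanged to any differentiable model once `loss_input_grad` is replaced by backpropagation.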
A large body of work has succeeded in improving the robustness of ML vision models to such perturbations in relevant scenarios. However, these scenarios often rest on strict assumptions about the attacker, and the resulting defenses remain vulnerable when those constraints do not hold. Furthermore, later and more advanced attacks frequently break such defenses even under their own deployment assumptions.
Moreover, the extensive research devoted to detecting, mitigating, and preventing such attacks rarely pursues a general solution. This is because it focuses on “realistic” conditions, which depart from the original worst-case formulation of adversarial attacks; that formulation treats them not as a necessarily realistic threat model, but as an inherent feature of neural networks. Under this original formulation, simple attack algorithms produce highly effective adversarial examples both in relatively simple settings, such as early convolutional neural networks trained on small, low-resolution datasets, and in complex tasks involving recent vision-language models with tens of billions of parameters.
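The worst-case formulation can be stated as an optimization over a bounded perturbation. One common way to write it, assuming a classifier $f_\theta$ with parameters $\theta$, a loss $\mathcal{L}$, a clean input $x$ with label $y$, and an $\ell_p$ budget $\epsilon$ (symbols introduced here for illustration), is given below, together with the iterative update used by projected gradient descent (PGD) in the $\ell_\infty$ case, where $\Pi_{B_\epsilon(x)}$ denotes projection onto the $\epsilon$-ball around $x$ and $\alpha$ is the step size:

```latex
\delta^\star = \arg\max_{\|\delta\|_p \le \epsilon} \; \mathcal{L}\big(f_\theta(x + \delta),\, y\big)

x^{t+1} = \Pi_{B_\epsilon(x)}\Big( x^{t} + \alpha \, \operatorname{sign}\big(\nabla_{x} \mathcal{L}(f_\theta(x^{t}), y)\big) \Big), \qquad x^{0} = x
```

In the white-box setting the attacker has full access to $\theta$ and can therefore compute $\nabla_x \mathcal{L}$ exactly; the budget $\epsilon$ is the only constraint, with no further assumptions about how realistic the perturbation is.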
This project aims to broaden the understanding of adversarial attacks as an inherent property of neural networks, thereby providing further insight into potential detection and recovery schemes and their limitations. Departing from established benchmarks oriented towards realistic threat models, we will evaluate the vulnerability of state-of-the-art defenses and victim vision models to the original worst-case white-box adversarial attacks, and assess how well existing countermeasures can limit their effectiveness.