ML systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, the safety of ML systems should be a leading research priority. ML Safety is about ensuring systems can withstand hazards (Robustness), identifying hazards (Monitoring), reducing inherent ML system hazards (Alignment), and reducing systemic hazards (Systemic Safety).
This project will initially focus on developing a Mechanistic Anomaly Detection benchmark (https://github.com/ejnnr/cupbearer), which falls under the Monitoring category above. Mechanistic Anomaly Detection aims to detect out-of-distribution behavior in the computations of a neural network, regardless of whether the input itself would be considered out of distribution. This is an important criterion for systems that are expected to generalize outside their training distribution.
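To make the idea concrete, here is a minimal sketch of one common baseline for activation-based anomaly detection: fit a Gaussian to a model's hidden activations on trusted data, then flag inputs whose activations are far from that fit under the Mahalanobis distance. This is not the cupbearer API; the network is replaced by a fixed random feature map, and all names here are illustrative.

```python
# Minimal sketch of mechanistic anomaly detection via a Mahalanobis-distance
# baseline on hidden activations (illustrative only, not the cupbearer API).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # stand-in for a trained network's weights

def get_activations(inputs: np.ndarray) -> np.ndarray:
    """Stand-in for extracting a hidden layer's activations from a real model."""
    return np.tanh(inputs @ W)

# Calibrate the detector on "trusted" data.
trusted = rng.standard_normal((500, 16))
acts = get_activations(trusted)
mean = acts.mean(axis=0)
cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
cov_inv = np.linalg.inv(cov)

def anomaly_score(x: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each input's activations from the trusted fit."""
    a = get_activations(x) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", a, cov_inv, a))

# Flag anything above the 99th percentile of trusted-data scores.
threshold = np.quantile(anomaly_score(trusted), 0.99)
normal_scores = anomaly_score(rng.standard_normal((100, 16)))
shifted_scores = anomaly_score(rng.standard_normal((100, 16)) + 2.0)
print(f"flagged normal:  {(normal_scores > threshold).mean():.2f}")
print(f"flagged shifted: {(shifted_scores > threshold).mean():.2f}")
```

A real detector would run on activations from a trained model, and the interesting cases are ones where the anomaly lives in the mechanism rather than the raw input; the input shift above is only a crude stand-in for that situation.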