NAISS
SUPR
Studying Inductive Bias Transformers and MLPs
Dnr: NAISS 2025/22-1631

Type: NAISS Small Compute

Principal Investigator: Alasdair Paren

Affiliation: Chalmers tekniska högskola

Start Date: 2025-11-23

End Date: 2026-12-01

Primary Classification: 10210: Artificial Intelligence

Allocation

Abstract

Project Description

The Simplicity Bias (SB) is an inductive bias, exhibited by neural networks trained with standard techniques, toward learning "simple" functions [2]. The phenomenon acts like Occam's razor and can be a benefit or a drawback depending on the relative complexity of the function to be learnt from the data. The SB is highly beneficial when the desired function is the simplest one that minimises the loss over the training data. However, when the problem is underspecified and there exist simpler functions, commonly called "shortcuts" or "spurious features", that are highly predictive on the training data but absent from the test environment, the SB can result in poor generalisation. Recent work suggests that the SB is largely due to specific architectural choices made in typical networks, such as the choice of activation function [3, 4]. Most existing investigations of the SB make unrealistic assumptions about the number or type of features present in the train and test domains, such as the two-feature assumption common to many SB benchmarks. In this project we will investigate the effects of the SB on specially designed datasets, such as hierarchical datasets, which contain a large number of features of increasing predictability and complexity [1]. Another interesting choice is algorithmic datasets, in which input-output pairs are linked by increasingly complex sequence-to-sequence algorithms. We will study the learning dynamics of two commonly used model architectures (Transformers and MLPs) across a range of dataset scales and data regimes, providing new insights into how different features are learnt over the course of training.

References

[1] Katherine L. Hermann, Hossein Mobahi, Thomas Fel, and Michael C. Mozer. On the foundations of shortcut learning. ICLR, 2024.
[2] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585, 2020.

[3] Damien Teney, Liangze Jiang, Florin Gogianu, and Ehsan Abbasnejad. Do we always need the simplicity bias? Looking for optimal inductive biases in the wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 79–90, 2025.

[4] Damien Teney, Armand Mihai Nicolicioiu, Valentin Hartmann, and Ehsan Abbasnejad. Neural Redshift: Random networks are not random functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4786–4796, 2024.
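For concreteness, the kind of multi-feature dataset described in the project plan, with several features of differing predictability, could be sketched as follows. This is a minimal illustrative sketch in the spirit of [1], not an implementation from any of the cited works; the function name and parameters are hypothetical, and the complexity axis (how nonlinearly each feature encodes the label) is omitted for brevity.

```python
import numpy as np

def make_multi_feature_dataset(n=1000, predictivities=(1.0, 0.9, 0.8), seed=0):
    """Toy binary-classification dataset with one feature per predictivity level.

    Feature k agrees with the label y with probability predictivities[k]
    (and is flipped otherwise), so the features range from fully predictive
    "core" features to noisier "shortcut" candidates.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)          # binary labels
    feats = []
    for p in predictivities:
        agree = rng.random(n) < p           # mask: does feature match label?
        feats.append(np.where(agree, y, 1 - y))
    X = np.stack(feats, axis=1).astype(np.float32)
    return X, y

X, y = make_multi_feature_dataset()
```

Training a Transformer or MLP on such data and tracking per-feature agreement of the model's predictions over time is one simple way to observe which features are learnt first.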