Randomness in non-linear Machine Learning

Dnr: NAISS 2025/22-888
Type: NAISS Small Compute
Principal Investigator: Arien Haghshenas
Affiliation: Stockholms universitet
Start Date: 2025-06-11
End Date: 2026-07-01
Primary Classification: 50202: Business Administration
Webpage:

Allocation

Klemming at PDC: 500 GiB
Mimer at C3SE: 500 GiB
Alvis at C3SE: 250 GPU-h/month
Dardel at PDC: 10 x 1000 core-h/month

Abstract

This project investigates how seed-induced randomness in nonlinear machine learning models affects empirical asset pricing. Methods such as Random Forest, XGBoost, LightGBM, CatBoost, and neural networks have advanced return prediction and portfolio construction, but they rely heavily on stochastic elements such as bootstrap sampling, feature subsampling, and random weight initialization. Despite the widespread use of these methods, most studies and financial applications evaluate them using only a single random seed, which raises serious concerns about reproducibility and the reliability of reported performance metrics.

To address this gap, the project systematically quantifies the variation in out-of-sample Sharpe ratios, CAPM alphas, and predictive R² that arises solely from seed choice. Using a panel of U.S. equity returns with over 150 firm-level predictors, one of the largest and most widely studied datasets in asset pricing, I will train multiple models across 1000 distinct random seeds per algorithm. At each iteration, models generate one-month-ahead return forecasts and construct long-only top-quintile portfolios. This setup yields a distributional view of performance metrics rather than single-point estimates.

The scale of this analysis requires significant computational power. Each model involves deep hyperparameter tuning, high-dimensional data, and multi-year rolling forecasts, all repeated across thousands of seeds. Running this pipeline efficiently requires access to a high-performance computing cluster with parallel processing capabilities.

The project's goal is twofold: first, to empirically assess how randomness affects model outcomes in a high-stakes financial context; second, to develop methodological best practices for ensuring robustness and transparency in machine learning-based asset pricing. The findings will contribute to the literature on financial machine learning and provide actionable guidance for academics and practitioners using these tools in production.
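
The seed experiment described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the panel is synthetic, the "model" is a bootstrapped least-squares fit on a random feature subset (a stand-in for the tree ensembles and neural networks named above), and a single train/test split replaces the multi-year rolling forecasts. All function names and parameter values are hypothetical.

```python
import numpy as np


def make_panel(n_months=60, n_firms=200, n_predictors=10, data_seed=0):
    """Synthetic panel standing in for the real equity data (assumption).

    X[t] holds predictors known at the start of month t; y[t] is the return
    realized over month t, so a forecast from X[t] is one month ahead.
    """
    rng = np.random.default_rng(data_seed)
    X = rng.standard_normal((n_months, n_firms, n_predictors))
    beta = rng.normal(0.0, 0.01, n_predictors)
    y = X @ beta + rng.normal(0.0, 0.05, (n_months, n_firms))
    return X, y


def seeded_forecast(X_train, y_train, X_test, seed, n_features=5):
    """Stochastic stand-in model: bootstrap rows, subsample features, fit OLS."""
    rng = np.random.default_rng(seed)
    n_obs, n_pred = X_train.shape
    rows = rng.integers(0, n_obs, n_obs)                       # bootstrap sample
    cols = rng.choice(n_pred, size=n_features, replace=False)  # feature subsample
    coef, *_ = np.linalg.lstsq(X_train[rows][:, cols], y_train[rows], rcond=None)
    return X_test[..., cols] @ coef                            # (months, firms)


def top_quintile_sharpe(forecasts, realized):
    """Annualized Sharpe ratio of an equal-weight, long-only top-quintile portfolio."""
    k = forecasts.shape[1] // 5
    top = np.argsort(forecasts, axis=1)[:, -k:]                # highest forecasts
    monthly = np.take_along_axis(realized, top, axis=1).mean(axis=1)
    return np.sqrt(12) * monthly.mean() / monthly.std(ddof=1)


def sharpe_across_seeds(n_seeds=50, train_months=36):
    """Hold the data fixed and vary only the model seed."""
    X, y = make_panel()
    n_pred = X.shape[2]
    X_train = X[:train_months].reshape(-1, n_pred)
    y_train = y[:train_months].reshape(-1)
    X_test, y_test = X[train_months:], y[train_months:]
    return np.array([
        top_quintile_sharpe(seeded_forecast(X_train, y_train, X_test, s), y_test)
        for s in range(n_seeds)
    ])


sharpes = sharpe_across_seeds()
lo, med, hi = np.percentile(sharpes, [5, 50, 95])
print(f"Sharpe across seeds: 5%={lo:.2f}, median={med:.2f}, 95%={hi:.2f}")
```

Because the data seed is held fixed, any spread in the printed percentiles comes from the model seed alone, which is the quantity the project sets out to measure. Each seed is independent of the others, so the loop in `sharpe_across_seeds` can in principle be distributed across cluster nodes.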