This project applies machine learning to a large-scale discrete choice prediction problem in demography: predicting which partner an individual actually chose from among a large set of realistic alternatives. The goal is to quantify how predictable partner choice is, a question that has remained open because prior research has relied on explanatory regression models rather than predictive evaluation.
The core computational task is as follows. For each individual in the data, we construct a choice set containing the observed partner and a large number of alternative partners sampled from the same local market (defined by geography and time). Each choice set thus represents a realistic decision environment. The prediction task is to identify the true partner among these alternatives, given observable characteristics of both the individual and each candidate. This is a standard discrete choice setup, but one whose scale and required flexibility call for machine learning.
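The choice-set construction described above can be sketched as follows. This is a minimal illustration with hypothetical toy data: the market sizes, the 50-alternative default, and all variable names are illustrative assumptions, not the project's actual sampling design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy candidate pool: 5 local markets with 100 candidates each (illustrative).
candidate_market = np.repeat(np.arange(5), 100)

def build_choice_set(true_partner, n_alternatives=50):
    """Observed partner plus alternatives sampled from the same local market."""
    market = candidate_market[true_partner]
    pool = np.flatnonzero(candidate_market == market)
    pool = pool[pool != true_partner]      # never re-sample the observed partner
    alts = rng.choice(pool, size=n_alternatives, replace=False)
    candidates = np.concatenate(([true_partner], alts))
    labels = np.zeros(n_alternatives + 1, dtype=int)
    labels[0] = 1                          # prediction target: which row is the true partner
    return candidates, labels
```

Sampling only from the candidate's own market keeps each choice set a realistic decision environment, as in the design above.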
We use conditional logit forests, a tree-based ensemble method designed for ranked discrete choice data. Unlike standard conditional logit models, these forests can capture high-dimensional interactions and nonlinearities (for example, complex trade-offs between a partner's education, age, and ethnicity) without requiring the researcher to specify them in advance. This flexibility is what makes the method suitable for the question: if partner choice follows patterns that are richer than what linear models can express, a flexible model should achieve higher predictive accuracy.
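At the heart of each tree fit is a conditional logit estimation within choice sets. A minimal numpy sketch of the within-set likelihood such an estimation maximizes (the feature matrix and coefficient names are illustrative, and this shows only the linear-utility building block, not the forest itself):

```python
import numpy as np

def conditional_logit_nll(beta, X, chosen):
    """Negative log-likelihood of one choice set under a conditional logit.

    X: (n_alternatives, n_features) features of each candidate pairing;
    chosen: row index of the observed partner. Each alternative's choice
    probability is a softmax over linear utilities within the set.
    """
    utilities = X @ beta
    log_probs = utilities - np.logaddexp.reduce(utilities)  # log-softmax over the set
    return -log_probs[chosen]
```

Summing this quantity over all choice sets and minimizing over `beta` is standard conditional logit estimation; the forest's flexibility comes from replacing the single linear utility with many tree-structured fits on subsets of the data.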
Computational demand arises from the combination of large choice sets, high-dimensional feature spaces, and the need for many repeated model fits. Each individual generates one choice set with dozens to hundreds of alternatives. Feature engineering expands each alternative into a wide set of pairwise and interaction terms. The conditional logit forest then fits many trees, each involving repeated conditional logit estimation within subsets of the data. On top of this, the research design requires systematic variation across hyperparameters, random seeds, train/test splits, and model specifications, and each combination is an independent run well suited to parallel execution on a cluster.
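The workload described above is naturally expressed as a flat grid of independent jobs, for example one task per combination in a cluster array job. A sketch with illustrative grid values (the actual hyperparameters and counts will differ):

```python
from itertools import product

# Illustrative grid; the project's real hyperparameters and counts will differ.
n_trees_grid = [200, 500]
max_depth_grid = [4, 8]
seeds = range(5)
splits = range(3)

jobs = [
    {"n_trees": t, "max_depth": d, "seed": s, "split": k}
    for t, d, s, k in product(n_trees_grid, max_depth_grid, seeds, splits)
]
# 2 x 2 x 5 x 3 = 60 independent runs, each executable as one array-job task.
```

Because no job depends on another's output, the grid can be dispatched directly to a scheduler's array-job mechanism and scales linearly with available cores.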
The pilot uses US census microdata (IPUMS) with approximately 800,000 focal individuals (in unions) and 50 alternatives each, producing roughly 40 million row-level observations before feature expansion. Subsequent stages will increase choice set sizes, add richer feature sets, and eventually move to Scandinavian population register data. Even at the pilot stage, the scale makes local computing infeasible: runtime grows multiplicatively with sample size, choice set size, number of features, and number of model specifications, and the workload is embarrassingly parallel by design.
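A back-of-envelope check of the pilot's scale makes the infeasibility of local computing concrete. The post-expansion feature count below is an illustrative assumption, not a figure from the project:

```python
n_focal = 800_000
n_alternatives = 50
rows = n_focal * n_alternatives            # ~40 million row-level observations

n_features = 100                           # assumed width after feature expansion
bytes_per_float64 = 8
gib = rows * n_features * bytes_per_float64 / 2**30   # dense float64 design matrix
# roughly 30 GiB before any model state, data copies, or parallel workers
```

Doubling any single factor (sample, choice set size, features, or specifications) doubles this footprint and the associated runtime, which is the multiplicative growth noted above.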
The project benchmarks the machine learning approach against standard conditional logit models commonly used in assortative mating research. The comparison provides both a substantive answer (how much predictive power is gained from flexible methods) and a methodological contribution to computational social science.
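Both the forest and the conditional logit baseline can be scored with the same within-set metric, for example how often the observed partner ranks among the top k candidates in the choice set. A minimal sketch (the specific metric is an illustrative choice, not the project's stated evaluation protocol):

```python
import numpy as np

def top_k_accuracy(scores_per_set, true_index_per_set, k=1):
    """Share of choice sets where the observed partner's score ranks in the top k."""
    hits = 0
    for scores, true_index in zip(scores_per_set, true_index_per_set):
        top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring candidates
        hits += int(true_index in top_k)
    return hits / len(scores_per_set)
```

The gap in this metric between the flexible model and the linear baseline is then a direct measure of how much predictive power richer methods add.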