One of the principal challenges when estimating the effects of interventions from
observational data (e.g. historical data) is controlling for confounding factors that may
bias the results. The causal inference literature is rich with methods for removing
confounding bias in settings where the confounders have been identified and measured.
However, such settings are rare in the social sciences, where information about
confounders tends to be captured in less precise formats such as text, if it is
captured at all. The goal of the project, and of my PhD, is to develop and apply causal inference
methods that can make use of text data for more precise and robust estimates in the
humanities and social sciences. I hope that developing better causal estimation
methods will enable researchers to answer more questions about the society we live in
and, in turn, lead to better-informed policy and institutional decisions. These methods
will draw on recent techniques from natural language processing (NLP) to automatically
extract information from text.
The starting point for my research is a paper by Richard Johansson and Adel Daoud
titled “Conceptualizing Treatment Leakage in Text-based Causal Inference”. In this
paper, they lay out the problem of treatment leakage: if the text used to control for
confounders is predictive of both treatment assignment and the outcome, then
conditioning on it can introduce bias and degrade the quality of the causal estimate.
They show, in a simplified experiment, that removing the influence of
treatment assignment from the text removes this bias.
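The leakage mechanism can be illustrated with a small simulation. What follows is a minimal sketch in a toy linear setting, not the experiment from the paper: the data-generating process, the scalar "text feature" `w`, the `leak` parameter, and the function names are all assumptions made for illustration. A single confounder drives both treatment and outcome, the text feature is a noiseless copy of the confounder plus a multiple of the treatment, and the treatment effect is estimated by ordinary least squares.

```python
import numpy as np

def simulate(leak, n=20000, tau=2.0, seed=0):
    """Toy data (hypothetical): one confounder u, binary treatment t, outcome y,
    and a scalar text feature w that may leak the treatment."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                      # confounder
    p = 1.0 / (1.0 + np.exp(-u))                # treatment probability depends on u
    t = (rng.random(n) < p).astype(float)       # treatment assignment
    y = tau * t + u + rng.normal(size=n)        # outcome with true effect tau
    w = u + leak * t                            # text feature: confounder plus leaked treatment
    return t, y, w

def ols_treatment_coef(t, y, control):
    """Coefficient on t from an OLS regression of y on [1, t, control]."""
    X = np.column_stack([np.ones_like(t), t, control])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Same seed, so t and y are identical across the two calls; only w differs.
t, y, w_clean = simulate(leak=0.0)
_, _, w_leaky = simulate(leak=1.0)

est_clean = ols_treatment_coef(t, y, w_clean)               # close to tau
est_leaky = ols_treatment_coef(t, y, w_leaky)               # pulled away from tau by the leak
est_repaired = ols_treatment_coef(t, y, w_leaky - 1.0 * t)  # leak removed, assuming it is known

print(f"no leakage:   {est_clean:.2f}")
print(f"with leakage: {est_leaky:.2f}")
print(f"leak removed: {est_repaired:.2f}")
```

In this toy model the leaked component absorbs part of the treatment effect, so the estimate falls from roughly `tau` to roughly `tau - leak`. Subtracting the leak restores it, but only because the leak is known exactly here; with real text the treatment signal would have to be identified and removed by other means.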
The first goal of my research is to replicate and extend these experiments to study
further how using text data might introduce bias into the causal estimate. A second
research direction is to apply these methods to investigate the effect of
International Monetary Fund (IMF) programs on countries' economies. The IMF
produces yearly country reports for several countries around the world. The question
is: can we use these reports alongside existing tabular data to improve causal estimates
of the efficacy of IMF programs? More concretely, the project involves
gathering and preprocessing all the relevant PDFs from the IMF archives, testing the
predictive power of the country reports to see whether they contain relevant information, and
then designing and applying methods to the data to see whether these methods improve prediction.