One of the principal challenges when estimating the effects of interventions from
observational data (e.g. historical data) is controlling for confounding factors that may
bias the results. The causal inference literature is rich with methods for removing
confounding bias in settings where the confounders have been identified and measured.
However, such settings are rare in the social sciences: information about confounders
tends to be captured in less precise data formats such as text, if it is captured at
all. The goal of this project, and of my PhD, is to develop and apply causal inference
methods that can make use of text data for more precise and robust estimates in the
humanities and social sciences. We hope that developing better causal estimation
methods will enable researchers to answer more questions about the society we live in
and, in turn, lead to more informed policy and institutional decisions. These methods
will make use of recent techniques developed in the field of NLP to automatically
capture information from text.
This project would contribute to my current main research focus: testing the effectiveness of the Design-based Supervised Learning (DSL) framework for debiasing statistics computed from Large Language Model (LLM) annotations. Depending on the results of these experiments, I also want to test how effective DSL is at producing unbiased estimates of the Average Treatment Effect when text documents are used as proxies for an unobserved confounder.
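To make the DSL idea concrete, the sketch below shows the design-based correction as I understand it from the DSL literature: LLM annotations are available for every document, expert ("gold") labels only for a random subsample with a known sampling probability, and the two are combined into pseudo-outcomes whose expectation matches the gold labels. The variable names, the simple proportion estimand, and the uniform sampling design are illustrative assumptions, not the project's actual setup.

```python
import numpy as np


def dsl_pseudo_outcomes(llm_labels, gold_labels, labeled_mask, sampling_prob):
    """Design-based correction (sketch): combine LLM annotations on all
    documents with expert ("gold") labels on a random subsample.

    llm_labels    -- LLM annotation Q_i for every document
    gold_labels   -- expert label Y_i (only meaningful where labeled_mask == 1)
    labeled_mask  -- R_i, 1 if document i was expert-annotated, else 0
    sampling_prob -- known probability pi that a document is expert-annotated
    """
    # Pseudo-outcome Q_i + (R_i / pi) * (Y_i - Q_i): its expectation equals Y_i
    # under random expert sampling, even if the LLM annotations are biased.
    return llm_labels + labeled_mask / sampling_prob * (gold_labels - llm_labels)


# Toy usage: estimate a population proportion from systematically biased
# LLM annotations plus a 10% expert-labeled subsample.
rng = np.random.default_rng(0)
n, pi = 1000, 0.1
truth = rng.binomial(1, 0.4, size=n)                        # true (unobserved) labels
llm = np.clip(truth + rng.binomial(1, 0.15, size=n), 0, 1)  # LLM flips some 0s to 1s
sampled = rng.binomial(1, pi, size=n).astype(float)         # expert-labeled subsample
gold = np.where(sampled == 1, truth, 0).astype(float)       # gold labels where sampled

pseudo = dsl_pseudo_outcomes(llm.astype(float), gold, sampled, pi)
print("naive LLM estimate:", llm.mean())     # biased upward
print("debiased estimate: ", pseudo.mean())  # approximately unbiased for the true 0.4
```

As I understand the full framework, the same corrected pseudo-outcomes feed into downstream regressions or moment-based estimators rather than a simple mean, which is what makes DSL a candidate for the ATE setting described above.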
Additionally, together with some master's students I will work on a continuation of my previous project "Can Large Language Models (or Humans) Disentangle Text?", in which we seek to develop methods that use LLMs to remove information from text in an interpretable way.