SUPR
Validating lexicostatistic method using simulated data and analytical methods
Dnr:

NAISS 2024/22-143

Type:

NAISS Small Compute

Principal Investigator:

Philipp Rönchen

Affiliation:

Uppsala universitet

Start Date:

2024-03-01

End Date:

2025-03-01

Primary Classification:

60201: General Language Studies and Linguistics

Webpage:

Allocation

Abstract

This project aims to test the validity of the lexicostatistic methods used by Bouckaert et al. (2012) and Chang et al. (2015). Chang et al. used Bayesian methods to estimate the age of Indo-European, given a certain model of vocabulary evolution. Since they reached remarkably different conclusions from Bouckaert et al. (2012) while using similar methods, further investigation into the validity of their inference methods is warranted. This will be done in two parts:

1) A coarse-grained, full-scale analysis of the behaviour of the methods of Chang et al. (2015) and Bouckaert et al. (2012) on simulated data sets.
2) A fine-grained analysis of the core part of Chang et al. (2015), Bouckaert et al. (2012) and similar studies, that is, of the common cognate substitution models used, investigating how various parameter choices affect inferences.

Part 1) has mainly been carried out in previous SNIC/NAISS projects. What remains is an analysis of the tree topologies (as opposed to diversification times) of Indo-European that the methods of Bouckaert et al. (2012) and Chang et al. (2015) produced on simulated data sets. To do this, we will calculate the quartet distances between the posterior tree samples produced by these methods and the trees that we provided as input to the simulation method.

For part 2), we will investigate the effect of different choices of birth and death rates on the inferences of the cognate evolution models used by Bouckaert et al. (2012) and Chang et al. (2015). Abstracting away from implementation details and considering only the core evolution models allows a more fine-grained analysis that provides qualitative and generalizable insights about these models (which are also used in other contexts). We want to establish:

A) What are the effects of different reasonable parameter choices on the inferences of the models?
B) What is the effect of priors on the fitting of parameters in a Bayesian context? Do the priors used invalidate the inferences?
C) What are the effects of so-called ascertainment corrections, which in essence condition the inference models on certain properties of the data?
D) How do the models investigated differ from other reasonable models of cognate evolution that are not parametrised by a birth and death rate?

The questions will be investigated by a mix of simulation studies and combinatorial/analytical calculations. To our knowledge, no fine-grained investigation of the effects of (birth and death rate) parameter choices and ascertainment correction on inferences using cognate substitution models has been carried out.

_____

Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., ... & Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337(6097), 957-960.

Chang, W., Hall, D., Cathcart, C., & Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 194-244.