Topic: Statistical inference methods for complex models and large datasets: methodological developments and applications in evolutionary population genomics
Dates: 15 September 2022 – 14 September 2026
CBGP supervisor: A. Estoup
University: University of Montpellier, Institut Montpelliérain Alexandre Grothendieck (IMAG), Doctoral School of Information, Structures and Systems (I2S)
The overall aim of this thesis is to develop, evaluate and apply inferential methods suited to complex stochastic models and high-dimensional datasets, with a particular focus on the challenges and questions specific to evolutionary population genomics.
The analysis of genetic polymorphism (both evolutionarily neutral and under natural selection) enables the estimation of past evolutionary parameters (demographic, historical or selective) of populations, such as population sizes or densities, dispersal parameters, divergence times, demographic changes or genomic signatures of natural selection. These analyses rely on the combination of (1) stochastic models of population evolution, such as the Kingman coalescent (Kingman, 1982), and (2) statistical inference methods. The most powerful of these methods are based on likelihood estimation for the simplest evolutionary models (e.g. Rousset et al. 2018), or on the comparison of simulated datasets with real datasets through a set of summary statistics for more complex models (Approximate Bayesian Computation, or ABC; Marin et al. 2012). Population genetics inference methods have evolved significantly over the past 10 years, particularly to adapt to the drastic change in the type and size of genetic and genomic datasets resulting from the rapid development of whole-genome sequencing techniques (Next-Generation Sequencing, or NGS). The accelerated development of genetic markers derived from NGS technologies now provides biologists with massive amounts of data, enabling them to explore, evaluate and compare hypotheses concerning the evolutionary history of populations with a precision that was unimaginable until recently.
These advances require the development of new statistical inference methods that can make the most of these enormous datasets and that are applicable to realistic, and therefore complex, evolutionary scenarios. To address these challenges, we (IMAG and CBGP) have co-developed a new statistical inference methodology called ABC Random Forest (Pudlo et al. 2016; Raynal et al. 2018; Collin et al. 2021), in which Random Forest algorithms (which fall within the field of artificial intelligence and, more specifically, supervised machine learning) are combined with ABC simulation algorithms.
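To make the simulation-based idea concrete, here is a minimal sketch of plain rejection ABC, the basic scheme that ABC Random Forest builds on. It is not the ABC Random Forest method itself: the random forest step is omitted, and parameters are simply kept when their simulated summary statistic falls close to the observed one. Everything specific in it is an assumption chosen for illustration: the toy model (number of segregating sites at independent loci under a neutral Kingman coalescent with infinite-sites mutation), the single summary statistic (mean count across loci), the uniform prior bounds on the scaled mutation rate theta, and the hypothetical function names `simulate_segregating_sites`, `summary` and `abc_rejection`.

```python
import random


def simulate_segregating_sites(theta, n_samples, n_loci, rng):
    """Simulate segregating-site counts at independent loci under a neutral
    Kingman coalescent with infinite-sites mutation (toy model, assumed)."""
    counts = []
    for _ in range(n_loci):
        # Total branch length of the genealogy, in coalescent time units.
        total_length = 0.0
        for k in range(n_samples, 1, -1):
            # While k lineages remain, the waiting time to the next
            # coalescence is exponential with rate k(k-1)/2.
            total_length += k * rng.expovariate(k * (k - 1) / 2)
        # Mutations fall along the branches as a Poisson process with rate
        # theta/2 per unit length; count them via exponential gaps.
        mutations = 0
        position = rng.expovariate(theta / 2)
        while position < total_length:
            mutations += 1
            position += rng.expovariate(theta / 2)
        counts.append(mutations)
    return counts


def summary(counts):
    """Single summary statistic (assumed): mean count across loci."""
    return sum(counts) / len(counts)


def abc_rejection(observed, n_samples, n_sims, keep_fraction, rng):
    """Return the theta values whose simulated summary statistic is closest
    to the observed one (plain rejection ABC)."""
    obs_stat = summary(observed)
    table = []
    for _ in range(n_sims):
        theta = rng.uniform(0.1, 10.0)  # assumed uniform prior on theta
        simulated = simulate_segregating_sites(
            theta, n_samples, len(observed), rng
        )
        table.append((abs(summary(simulated) - obs_stat), theta))
    table.sort()  # smallest distance first
    n_keep = max(1, int(keep_fraction * n_sims))
    return [theta for _, theta in table[:n_keep]]


rng = random.Random(42)
# Pseudo-observed data generated with a known theta, for illustration only.
observed = simulate_segregating_sites(3.0, 10, 50, rng)
posterior = abc_rejection(observed, 10, 2000, 0.05, rng)
print("approximate posterior mean of theta:", sum(posterior) / len(posterior))
```

In this sketch the accepted theta values form a crude approximation of the posterior distribution. With many summary statistics, choosing a distance and an acceptance threshold becomes difficult, which is one motivation for replacing the rejection step with a supervised learner trained on the table of simulations.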
Among the general issues that will be addressed in more detail in the thesis, we can highlight three:
‘All models are wrong, but some are useful’ (George Box). How can we measure goodness of fit when the dataset is so large that it can, in effect, reject every model explored? In such situations, what information should be prioritised? Should we, for example, focus solely on the goodness of fit of those aspects of the models in which the experimenter has a particular interest?