Welcome to the home page of Bruce Rannala. I develop statistical methods, computer software, and sometimes new theory. I also teach a course on human genetic variation (EVE 131) each Fall. My research focuses on statistical genetics, population genetics, and phylogenetics, and our methods are often applied to real-world problems such as pandemics (HIV, COVID-19), cancer genetics, conservation biology, and disease gene mapping.
Examples of Current Projects
Bayesian inference of ages of latent lineages of HIV using sequence data (PhD student Anna Nagel)
Bayesian inference of hybrid and backcross individuals using genomic sequences (Postdoc Sneha Chakraborty)
New models aimed at early identification of SARS-CoV-2 variants of concern using phylogenetic information (Postdoc Mike May)
Epidemiology has been transformed by the advent of Bayesian phylodynamic models that allow researchers to infer the geographic history of pathogen dispersal over a set of discrete geographic areas [1, 2]. These models provide powerful tools for understanding the spatial dynamics of disease outbreaks, but contain many parameters that are inferred from minimal geographic information (i.e., the single area in which each pathogen was sampled). Consequently, inferences under these models are inherently sensitive to our prior assumptions about the model parameters. Here, we demonstrate that the default priors used in empirical phylodynamic studies make strong and biologically unrealistic assumptions about the underlying geographic process. We provide empirical evidence that these unrealistic priors strongly (and adversely) impact commonly reported aspects of epidemiological studies, including: 1) the relative rates of dispersal between areas; 2) the importance of dispersal routes for the spread of pathogens among areas; 3) the number of dispersal events between areas; and 4) the ancestral area in which a given outbreak originated. We offer strategies to avoid these problems, and develop tools to help researchers specify more biologically reasonable prior models that will realize the full potential of phylodynamic methods to elucidate pathogen biology and, ultimately, inform surveillance and monitoring policies to mitigate the impacts of disease outbreaks.
@article{gao2023model,title={Model misspecification misleads inference of the spatial dynamics of disease outbreaks},author={Gao, Jiansi and May, Michael R and Rannala, Bruce and Moore, Brian R},journal={Proceedings of the National Academy of Sciences},volume={120},number={11},pages={e2213913120},year={2023},doi={10.1073/pnas.2213913120},publisher={National Academy of Sciences},}
Bioinformatics
PrioriTree: a utility for improving phylodynamic analyses in BEAST
Jiansi Gao, Michael R May, Bruce Rannala, and Brian R Moore
Phylodynamic methods are central to studies of the geographic and demographic history of disease outbreaks. Inference under discrete-geographic phylodynamic models—which involve many parameters that must be inferred from minimal information—is inherently sensitive to our prior beliefs about the model parameters. We present an interactive utility, PrioriTree, to help researchers identify and accommodate prior sensitivity in discrete-geographic inferences. Specifically, PrioriTree provides a suite of functions to generate input files for—and summarize output from—BEAST analyses for performing robust Bayesian inference, data-cloning analyses and assessing the relative and absolute fit of candidate discrete-geographic (prior) models to empirical datasets.
@article{gao2023prioritree,title={PrioriTree: a utility for improving phylodynamic analyses in BEAST},author={Gao, Jiansi and May, Michael R and Rannala, Bruce and Moore, Brian R},journal={Bioinformatics},volume={39},number={1},pages={btac849},year={2023},url={https://academic.oup.com/bioinformatics},doi={10.1093/bioinformatics/btac849},publisher={Oxford University Press},}
Genetics
An efficient exact algorithm for identifying hybrids using population genomic sequences
The identification of individuals that have a recent hybrid ancestry (between populations or species) has been a goal of naturalists for centuries. Since the 1960s, codominant genetic markers have been used with statistical and computational methods to identify F1 hybrids and backcrosses. Existing hybrid inference methods assume that alleles at different loci undergo independent assortment (are unlinked or in population linkage equilibrium). Genomic datasets include thousands of markers that are located on the same chromosome and are in population linkage disequilibrium, which violates this assumption. Existing methods may therefore be viewed as composite likelihoods when applied to genomic datasets, and their performance in identifying hybrid ancestry (which is a model-choice problem) is unknown. Here, we develop a new program, Mongrail, that implements a full-likelihood Bayesian hybrid inference method that explicitly models linkage and recombination, generating the posterior probability of different F1 or F2 hybrid, or backcross, genealogical classes. We use simulations to compare the statistical performance of Mongrail with that of an existing composite likelihood method (NewHybrids) and apply the method to analyze genome sequence data for hybridizing species of barred and spotted owls.
@article{chakraborty2023efficient,title={An efficient exact algorithm for identifying hybrids using population genomic sequences},doi={10.1093/genetics/iyad011},author={Chakraborty, Sneha and Rannala, Bruce},journal={Genetics},pages={iyad011},year={2023},publisher={Oxford University Press},}
SystBiol
Estimation of species divergence times in presence of cross-species gene flow
George P Tiley, Tomás Flouri, Xiyun Jiao, Jelmer W Poelstra, Bo Xu, Tianqi Zhu, Bruce Rannala, Anne D Yoder, and Ziheng Yang
Cross-species introgression can have significant impacts on phylogenomic reconstruction of species divergence events. Here, we used simulations to show how the presence of even a small amount of introgression can bias divergence time estimates when gene flow is ignored in the analysis. Using advances in analytical methods under the multispecies coalescent (MSC) model, we demonstrate that by accounting for incomplete lineage sorting and introgression using large phylogenomic data sets this problem can be avoided. The multispecies-coalescent-with-introgression (MSci) model is capable of accurately estimating both divergence times and ancestral effective population sizes, even when only a single diploid individual per species is sampled. We characterize some general expectations for biases in divergence time estimation under three different scenarios: 1) introgression between sister species, 2) introgression between non-sister species, and 3) introgression from an unsampled (i.e., ghost) outgroup lineage. We also conducted simulations under the isolation-with-migration (IM) model, and found that the MSci model assuming episodic gene flow was able to accurately estimate species divergence times despite high levels of continuous gene flow. We estimated divergence times under the MSC and MSci models from two published empirical datasets with previous evidence of introgression, one of 372 target-enrichment loci from baobabs (Adansonia), and another of 1,000 transcriptome loci from fourteen species of the tomato relative, Jaltomata. The empirical analyses not only confirm our findings from simulations, demonstrating that the MSci model can reliably estimate divergence times, but also show that divergence time estimation under the MSC can be robust to the presence of small amounts of introgression in empirical datasets with extensive taxon sampling.
@article{tiley2023,author={Tiley, George P and Flouri, Tomás and Jiao, Xiyun and Poelstra, Jelmer W and Xu, Bo and Zhu, Tianqi and Rannala, Bruce and Yoder, Anne D and Yang, Ziheng},title={Estimation of species divergence times in presence of cross-species gene flow},journal={Systematic Biology},year={2023},month=mar,issn={1063-5157},doi={10.1093/sysbio/syad015},}
RSocInterface
Bayesian phylogenetic inference of HIV latent lineage ages using serial sequences
HIV evolves rapidly within individuals, allowing phylogenetic studies to infer histories of viral lineages on short time scales. Latent HIV sequences are an exception to this rapid evolution, as their transcriptional inactivity leads to negligible mutation rates compared with non-latent HIV lineages. This difference in mutation rates generates potential information about the times at which sequences entered the latent reservoir, providing insight into the dynamics of the latent reservoir. A Bayesian phylogenetic method is developed to infer integration times of latent HIV sequences. The method uses informative priors to incorporate biologically sensible bounds on inferences (such as requiring sequences to become latent before being sampled) that many existing methods lack. A new simulation method is also developed, based on widely used epidemiological models of within-host viral dynamics, and applied to evaluate the new method—showing that point estimates and credible intervals are often more accurate than existing methods. Accurate estimates of latent integration dates are crucial in relating integration times to key events during HIV infection, such as treatment initiation. The method is applied to publicly available sequence data from four HIV patients, providing new insights regarding the temporal pattern of latent integration.
@article{Nagel2023,author={Nagel, Anna A and Rannala, Bruce},title={Bayesian phylogenetic inference of HIV latent lineage ages using serial sequences},journal={J. R. Soc. Interface},year={2023},doi={10.1098/rsif.2023.0022},}