# Allele Age Inference

# Introduction

Between 1997 and 2003 I coauthored five papers on inferring allele age using linked genetic markers. Three of these papers present new theory and methods for inferring allele ages [1,2,3] and two are review articles [4,5]. The age of an allele is the time in the past at which the mutation that first created the allele occurred. The notion of age is meaningful only when applied to a unique, non-recurrent mutation. The concept of allele age goes back to the earliest days of population genetics, and much theoretical work during the 1970s focused on the relationship between the population frequency of an allele and its age [4]. The methods used in the papers presented here draw on an additional source of information, the so-called "intra-allelic variability." The basic principle underlying these methods is that a novel allele arises by mutation on a particular chromosome at some time in the past and is initially associated with a set of alleles at nearby genetic markers. Thus, there is a characteristic haplotype associated with the allele. If the allele persists in the population, subsequent mutation and recombination events in the descendant chromosomes can modify the surrounding markers. These modifications at nearby sites (intra-allelic variants) carry potential information about the age of the mutation.

The possibilities for using intra-allelic variation to estimate allele ages expanded explosively with the development of the PCR method for targeted DNA amplification during the 1990s. This development came shortly after the first successful attempts to map and sequence the specific gene mutations causing simple Mendelian diseases in humans by positional cloning during the late 1980s and early 1990s. One of the earliest applications of intra-allelic variation to estimate allele age was the paper by Serre et al. (1990), who used the RFLP variation flanking a common mutation in the CFTR gene causing cystic fibrosis to estimate the age of the mutation with a simple method-of-moments estimator. Their age inference method inspired the series of coauthored papers presented here.

If an allele is rare, its dynamics can be approximated using a birth-death (BD) process model, much as R.A. Fisher approximated the dynamics of a rare allele using the closely related Galton-Watson branching process. The practical importance of this insight is that an appropriately parameterized BD process can be used to obtain the probability distribution of gene trees and branch lengths for a sample of individuals carrying a particular allele, even when the allele is under selection or the population is undergoing exponential growth. Given the probability distribution for the allelic genealogy (tree topology and branch lengths of the gene tree for the allele), it is possible to construct a parametric method for inferring allele age. If the allele arose by mutation at time *t* in the past, the number of copies *i* of the allele in a sample follows a geometric distribution, and a maximum likelihood estimator of the allele age is derived in explicit form (equation 8 of [1]). The papers discussed below all employ the BD approximation for the intra-allelic genealogy but make different assumptions about mutation and recombination, with the goal of estimating allele age using information from intra-allelic variability.
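
To make the geometric result concrete, the sketch below uses the classical result for a linear birth-death process started from a single copy: conditional on the lineage surviving to time *t*, the number of copies in the whole population is geometric. This is an illustration only. It ignores sampling, assumes a birth rate `lam` greater than the death rate `mu` (both per generation), and is not equation 8 of [1]. The function names are hypothetical.

```python
import math


def kendall_beta(lam: float, mu: float, t: float) -> float:
    """Geometric parameter beta(t) of a linear birth-death process (lam != mu)."""
    growth = math.exp((lam - mu) * t)
    return lam * (growth - 1.0) / (lam * growth - mu)


def copies_pmf(n: int, lam: float, mu: float, t: float) -> float:
    """P(n copies at time t | the allele has not gone extinct), for n >= 1."""
    beta = kendall_beta(lam, mu, t)
    return (1.0 - beta) * beta ** (n - 1)


def age_mle_from_count(n: int, lam: float, mu: float) -> float:
    """Age t that maximizes copies_pmf for an observed count n (assumes lam > mu).

    The geometric likelihood peaks at beta = (n - 1) / n; solving
    kendall_beta(lam, mu, t) = (n - 1) / n for t gives the expression below.
    """
    r = lam - mu
    return math.log((n * r + mu) / lam) / r


if __name__ == "__main__":
    # Illustrative rates only, not values taken from the papers.
    t_hat = age_mle_from_count(n=50, lam=0.10, mu=0.05)
    print(f"estimated age: {t_hat:.1f} generations")
```

The estimate depends only on the count and the assumed rates, which is why the papers below add information from linked markers.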

# The papers

[1] Slatkin, M., and B. Rannala. 1997. Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics 60: 447-458.

In this paper, a numerical method is developed for obtaining a maximum likelihood estimate of allele age using both the frequency and the intra-allelic variation. Here intra-allelic variation is measured by the number of segregating sites, assuming an infinite-alleles model of mutation for completely linked markers (each mutation is unique and creates a new haplotype). This model is at one extreme in terms of its assumptions: relatively high mutation rates produce novel alleles at linked variants that are very close to the allele locus and do not undergo recombination. The model is best suited for analyzing, for example, microsatellite markers that are closely linked to an allele locus. Under this model, given *T*, the number of mutations follows a Poisson distribution with expectation and variance *u T*, where *T* is the sum of the branch lengths on the genealogy and *u* is the overall rate of mutation for the region surrounding the allele locus. There is no analytical expression for the probability density of *T*, so Monte Carlo simulations were used to evaluate the likelihood and obtain a maximum likelihood estimate of the allele age.
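
The Monte Carlo idea can be sketched as follows: draw values of *T* for a candidate age, average the Poisson probability of the observed number of mutations over those draws, and pick the age with the highest average. The sketch assumes a caller-supplied sampler for *T*. The sampler shown here is a placeholder star genealogy (every lineage has length *t*), not the birth-death genealogy simulation used in [1], and all names are hypothetical.

```python
import math
import random
from typing import Callable, Iterable, Optional

# A sampler takes a candidate age t and a random generator, returns one draw of T.
LengthSampler = Callable[[float, random.Random], float]


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of k events; requires mean > 0."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def mc_likelihood(k: int, u: float, t: float, draw_total_length: LengthSampler,
                  n_draws: int = 1000, rng: Optional[random.Random] = None) -> float:
    """Monte Carlo estimate of P(k mutations | age t), averaging over draws of T."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_draws):
        total += poisson_pmf(k, u * draw_total_length(t, rng))
    return total / n_draws


def star_total_length(n_copies: int) -> LengthSampler:
    """Placeholder sampler (assumption): star genealogy, so T = n_copies * t."""
    def draw(t: float, rng: random.Random) -> float:
        return n_copies * t
    return draw


def grid_mle(k: int, u: float, draw_total_length: LengthSampler,
             grid: Iterable[float]) -> float:
    """Candidate age in the grid with the largest estimated likelihood."""
    return max(grid, key=lambda t: mc_likelihood(k, u, t, draw_total_length, n_draws=50))


if __name__ == "__main__":
    # Illustrative values only: 12 mutations among 40 copies, u = 0.002 per generation.
    sampler = star_total_length(n_copies=40)
    print(grid_mle(k=12, u=0.002, draw_total_length=sampler, grid=range(10, 400, 10)))
```

Swapping `star_total_length` for a sampler that simulates the intra-allelic genealogy leaves the rest of the code unchanged, which is the point of passing the sampler as an argument.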

[2] Rannala, B., and M. Slatkin. 1998. Likelihood analysis of disequilibrium gene mapping and related problems. American Journal of Human Genetics 62: 459-473.

The method developed in this paper uses the same BD approximation for the intra-allelic genealogy but considers a single linked locus, with recurrent mutation between only two allele types and recombination on the interval between the allele locus and the marker. A maximum likelihood estimator of allele age is again developed using Monte Carlo simulations. This method is most appropriate for markers with few alleles (low mutation rates), such as single nucleotide polymorphisms (SNPs), that are sufficiently far from the allele locus that multiple recombination events occur over the intra-allelic genealogy.
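
As a deterministic point of comparison for the likelihood approach, the sketch below inverts the expected decay of the association between the allele and a single two-allele marker. It assumes recombination only (marker mutation is ignored), a constant background frequency `y_background` of the ancestral marker allele, and a single lineage of *t* generations, so that the expected frequency of the ancestral marker allele among allele-carrying chromosomes is x = y + (1 - y)(1 - c)^t. This is a generic moment-style calculation, not the estimator of [2], and the function name is hypothetical.

```python
import math


def moment_age(x_carriers: float, y_background: float, c: float) -> float:
    """Generations t solving x = y + (1 - y) * (1 - c)**t.

    x_carriers:   frequency of the ancestral marker allele on allele-carrying chromosomes
    y_background: frequency of that marker allele on other chromosomes
    c:            recombination fraction between the allele locus and the marker
    """
    if not (y_background < x_carriers <= 1.0) or not (0.0 < c < 1.0):
        raise ValueError("need y_background < x_carriers <= 1 and 0 < c < 1")
    excess = (x_carriers - y_background) / (1.0 - y_background)
    return math.log(excess) / math.log(1.0 - c)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(f"{moment_age(x_carriers=0.85, y_background=0.30, c=0.001):.0f} generations")
```

A point estimate of this kind carries no information about uncertainty arising from the genealogy, which is what the likelihood treatment is designed to capture.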

[3] Rannala, B., and J.P. Reeve. 2003. Joint Bayesian estimation of mutation location and age using linkage disequilibrium. Pacific Symposium on Biocomputing 8: 526-534.

The method developed in this paper extends the allele age inference method to allow multiple linked genetic markers surrounding an allele and undergoing recombination (but assuming no mutation). It is assumed that a linkage map of the distances between markers (in units of cM) is available, but the actual position of the allele may be unknown. A Bayesian algorithm is developed that allows the joint posterior density of the allele position and allele age to be estimated. In other words, the program provides a fine-scale map of the position as well as the age of an allele (usually the allele is associated with a phenotype, often a simple Mendelian disease). Two examples are presented. The first uses sample data for 23 restriction fragment length polymorphism (RFLP) markers spanning a total distance of 1.8 Mb from Europeans carrying the most common mutation causing cystic fibrosis (CF). The second analyzes a sample from the Finnish population carrying the mutation causing diastrophic dysplasia (DTD), using 2 RFLP and 3 microsatellite markers spanning a total distance of 20 kb.
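
The reason a marker map carries joint information about position and age can be illustrated with a small forward calculation: markers farther from the mutation are more likely to have been separated from it by recombination, and the rate of that decay with distance depends on the age. The sketch converts map distance in cM to a recombination fraction with Haldane's map function (an assumption made here for illustration) and computes, for a single lineage of *t* generations, the probability that no recombination has occurred between the mutation and each marker. It is not the Bayesian algorithm of [3], and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def haldane_recomb_fraction(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM under Haldane's map function."""
    d_morgans = abs(distance_cm) / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def ancestral_retention(marker_positions_cm: Sequence[float],
                        mutation_position_cm: float, t: float) -> List[float]:
    """P(no recombination between the mutation and each marker over t generations)."""
    return [
        (1.0 - haldane_recomb_fraction(m - mutation_position_cm)) ** t
        for m in marker_positions_cm
    ]


if __name__ == "__main__":
    # Illustrative map and age only, not data from the paper.
    markers = [0.0, 0.5, 1.0, 1.5]
    for pos, p in zip(markers, ancestral_retention(markers, mutation_position_cm=0.5, t=100)):
        print(f"marker at {pos:.1f} cM: {p:.3f}")
```

Retention peaks at the marker closest to the assumed mutation position and falls off on either side, so the observed pattern across markers constrains both the position and the age.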

# Review articles

[4] Slatkin, M., and B. Rannala. 2000. Estimating allele age. Annual Review of Genomics and Human Genetics 1: 225-249.

[5] Rannala, B., and G. Bertorelle. 2001. Using linked markers to infer the age of a mutation. Human Mutation 18: 87-100.