Are molecular taxonomists lost upstream?

DNA-based species delimitation is a relatively new approach for evaluating whether populations of organisms are isolated from one another on an “evolutionary timescale”, thus supporting the hypothesis that they may be distinct species. Most current species delimitation methods require a priori assignments of individuals to populations, as well as a “guide tree” specifying the phylogenetic relationships among the populations. The delimitation methods then determine the level of support for different nodes (splits) as defining distinct species rather than simply populations that have some admixture or a relatively short history of complete isolation.

The current process of species delimitation has 3 steps: (1) define the populations; (2) infer the phylogenetic relationships (guide tree) among populations; (3) evaluate the statistical support for species delimitations. The division of the delimitation process into these 3 steps is done for pragmatic reasons: it is simply not computationally feasible to jointly infer population assignment, phylogeny and species delimitation using currently available approaches. One exception is Brian O’Meara’s method, which attempts a joint assignment and delimitation; however, his approach relies on several heuristics which may detract from its statistical performance. In the 3-step delimitation scheme, steps 1 and 2 are “upstream” of the actual delimitation step. An important question, addressed in a paper by Olave et al. that appeared this month in Systematic Biology (see the paper here), is how errors at step 1 or 2 may affect the inferences about species obtained at step 3.

The study of Olave et al. used computer simulations to examine the effect of “upstream” analysis methods on the accuracy of DNA-based species delimitation. The study specifically focused on the performance of one particular species delimitation method, BPP (Bayesian Phylogenetics and Phylogeography), recently developed by Ziheng Yang and myself. The authors point out that previous simulation studies, which generally demonstrated good statistical performance of the BPP method, all assumed that individuals were correctly assigned to populations and that the guide tree topology was correct. These earlier studies thus do not address the effects of upstream error. The effects of a grossly mis-specified guide tree on delimitations obtained using BPP were studied by Leaché and Fujita, who found that it increased the frequency of false species delimitations, or “over-splitting” (see their paper here). However, the extreme forms of incorrect topologies that they used as guide trees are unlikely to be obtained using any reasonable phylogenetic inference algorithm, and the practical importance of guide tree mis-specification (when using popular phylogenetic inference methods to obtain the guide tree) thus remains to be determined.

The Olave et al. study is an important first step in exploring the sensitivity of species delimitation methods to errors in the upstream analysis. However, the study has some problems which make the results difficult to interpret. First, the summary of the simulation results is misleading, particularly with regard to the interpretation of delimitation errors. My perspective on species delimitation is that over-splitting is a much more serious problem than under-splitting. It seems reasonable that if the data are uninformative, for example, a well-behaved statistical method should tend to lump species together rather than split them. Thus, failing to delimit species should be interpreted to mean that EITHER the power is low OR there is a single species. On the other hand, splitting should be interpreted as meaning there is strong evidence for distinct species. Olave et al. pool both forms of delimitation error (over-splitting and under-splitting) in Figure 2, which makes the results of their study difficult to interpret.
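To make the distinction concrete, here is a minimal Python sketch of how the two error types could be tallied separately across simulation replicates. This is not how Olave et al. scored their results; the function name, the label dictionaries and the toy replicates are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def classify_delimitation(true_species, inferred_species):
    """Return (over_split, under_split) for one replicate.

    Both arguments map a unit (individual or population) to a species
    label. The input format is an assumption made for this sketch.
    """
    true_to_inferred = {}
    inferred_to_true = {}
    for unit, true_label in true_species.items():
        inferred_label = inferred_species[unit]
        true_to_inferred.setdefault(true_label, set()).add(inferred_label)
        inferred_to_true.setdefault(inferred_label, set()).add(true_label)
    # Over-splitting: one true species is spread over several inferred species.
    over_split = any(len(s) > 1 for s in true_to_inferred.values())
    # Under-splitting: one inferred species lumps several true species.
    under_split = any(len(s) > 1 for s in inferred_to_true.values())
    return over_split, under_split

# Toy example: populations p1 and p2 are truly one species, p3 is another.
truth = {"p1": "A", "p2": "A", "p3": "B"}
replicates = [
    {"p1": "x", "p2": "x", "p3": "y"},  # correct delimitation
    {"p1": "x", "p2": "y", "p3": "z"},  # over-split
    {"p1": "x", "p2": "x", "p3": "x"},  # under-split
]

tally = Counter()
for inferred in replicates:
    over, under = classify_delimitation(truth, inferred)
    tally["over_split"] += over
    tally["under_split"] += under
    tally["correct"] += not (over or under)
```

Reporting the two counts side by side, rather than a single pooled error rate, would show whether a method errs mostly by lumping (low power) or by splitting (false positives).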

A second problem of the study is the simulation design. Olave et al. essentially pool all the simulated sequences together as if there were no prior information about the population affiliations of individuals. They then use the Structurama program to cluster the individuals into populations based on their multi-locus genotypes. They do this under a model where there are in reality 8 species. The biological interpretation of this design is therefore that 8 sympatric cryptic species exist in a sample with nothing to distinguish them a priori – this is arguably a much more extreme situation than most datasets to which our method has been applied. Most often, BPP has been applied to datasets where distinct geographical subtypes exist and thus the individuals are assigned to populations a priori based on their geographical locations. The residual uncertainty is not about population affiliation but whether the populations constitute distinct species.

Using an assignment program like Structurama to assign individuals to groups a priori based on the relatively small number of loci used in a typical species delimitation study is obviously problematic. In the simulation analysis of Olave et al. the number of loci was between 4 and 14. Assignments are bound to be error-prone with such small numbers of loci, as the authors noted. However, there is information about this uncertainty that the authors ignore. Structurama provides posterior probabilities for assignments, and with limited data these assignment probabilities will typically not be large. A cautious researcher would not use species assignments obtained from Structurama that are poorly supported. The correct approach, in this case, would be to use a threshold such as 0.95 for the posterior probability of an individual population assignment; if an individual’s assignment probability falls below this threshold, the individual should not be assigned to a population. The individual could be excluded from the delimitation analysis or, more rigorously, one could apply the delimitation method to multiple assignments and weight the delimitations by the assignment probabilities. This technically appealing strategy is probably too computationally demanding, however, to be of any practical interest. Another solution, possibly less appealing to experimentalists with limited funds, would be to increase the number of loci until the pre-specified assignment probability threshold is reached for most sampled individuals.
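The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary input is a hypothetical format (it is not Structurama's actual output format), and the function name and example probabilities are invented.

```python
def filter_assignments(posteriors, threshold=0.95):
    """Split individuals into confidently assigned and unassigned.

    `posteriors` maps each individual to a dict of
    {population: posterior assignment probability}. This layout is an
    assumption for the sketch, not a real program's output format.
    """
    assigned = {}
    unassigned = []
    for individual, probs in posteriors.items():
        # Population with the highest posterior probability.
        best_pop = max(probs, key=probs.get)
        if probs[best_pop] >= threshold:
            assigned[individual] = best_pop
        else:
            # Poorly supported: leave out of the delimitation analysis.
            unassigned.append(individual)
    return assigned, unassigned

# Invented example probabilities for three individuals.
example = {
    "ind1": {"popA": 0.99, "popB": 0.01},
    "ind2": {"popA": 0.60, "popB": 0.40},
    "ind3": {"popA": 0.03, "popB": 0.97},
}
assigned, unassigned = filter_assignments(example, threshold=0.95)
```

Here only the well-supported individuals would be passed on to the delimitation step. The fraction left unassigned also gives a rough indication of whether more loci are needed before the pre-specified threshold is reached for most of the sample.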

Unfortunately, the complexity of assigning individuals to populations is glossed over in the Olave et al. study. This is particularly troublesome because it is clear that the bulk of the delimitation errors (as they note) are due to upstream assignment errors. The importance of individual population assignments for delimitation studies cannot be overstated. If individuals of the same species are mis-assigned to different populations it is impossible for any delimitation method to achieve a correct delimitation. The outstanding issue here is how often this degree of mis-assignment occurs in practice (for realistic scenarios) when a rational approach is used to evaluate the quality of individual assignments (such as a threshold based on the posterior probability of the assignment).

Ultimately, the solution to the “problems” of “upstream analyses” in species delimitations is to move the species delimitation process upstream by jointly inferring the species tree, assignments and delimitation. One intriguing result from the Olave et al. study was the finding that assigning each individual to a distinct population produced very accurate delimitations. This approach essentially lets the BPP program consider all assignments of individuals to populations and effectively both “assign” and “delimit” individuals and species. Undoubtedly, effective algorithms will be developed for making joint assignments and delimitations in the near future. However, my prediction is that the problems introduced by upstream analyses in our current situation have been exaggerated in the Olave et al. study. Once a joint delimitation method is available, a simple re-analysis of published studies will resolve the question of whether upstream errors are indeed an important source of errors of delimitation.
