## Tuesday, November 29, 2016

### Site-specific selection and phylogenetic accuracy

In my , I argued that classical amino-acid replacement matrices do not faithfully describe the long-term evolutionary process at typical positions of protein-coding sequences. More specifically, those matrices do not correctly capture the long-run restrictions induced by puriyfing selection on the spectrum of amino-acids that are accepted at a typical site -- they anticipate too many amino-acids in the long run. In the words of Claus Wilke, empirical amino-acid replacement matrices are derived by pooling sites, but in the end, no specific, individual site behaves like a ‘pooled’ site.

Now, this has very important consequences on phylogenetic accuracy. To illustrate this point, here is a case presented by Andrew Roger in his phyloseminar, concerning the position of Microsporidia in the phylogeny of eukaryotes. This is based on a phylogenomic dataset of Brinkmann et al (2005), with 133 genes ($\sim$ 24000 aligned positions). I have reanalyzed this dataset, using two models, one that uses the same amino-acid replacement matrix for all sites (LG), and another one that models site-specific amino-acid equilibrium frequencies (CAT-GTR).

The tree reconstructed by the LG model gives Microsporidia sister group to all other eukaryotes:

In contrast, the tree obtained once you account for site-specific amino-acid preference in the long run (using CAT-GTR) shows Microsporidia sister-group to Fungi:

We used to believe that Microsporidia were ‘early-emerging’ eukaryotes, as suggested here by LG. The fact that they lack mitochondria was interpreted as a primitive trait: supposedly, the split between Microsporidia and the rest of eukaryotes would have pre-dated the endosymbiotic event leading to the acquisition of the mitochondrion by the so-called ‘crown’ eukaryotes. However, there are now quite a few lines of evidence suggesting that Microsporidia are in fact closely related to Fungi, as suggested here by CAT-GTR; also, their lack of mitochondria is probably due to a secondary loss (or, more accurately, a secondary simplification, since Microsporidia have hydrogenosome-like organelles that might well be the vestige of their mitochondria).

In fact, the tree obtained under the LG model does not just show Microsporida sister to all other eukaryotes. It also gives a somewhat unusual phylogeny for those other eukaryotes, with Fungi branching first. However, if we ignore the rooting, this tree differs from the tree obtained by CAT-GTR by just moving Archaea from their branching point between unikonts and bikonts to a position where they are sister group to Microsporidia. Thus, here, it is not so much that Microsporidia ‘move down’ in the tree because they are attracted by the archaean outgroup. Instead, it is the outgroup itself which ‘goes up’ in the tree, being attracted by Microsporidia. In any case, whichever way you decide to see it, this is basically a long branch attracted by another long branch -- thus, a paradigmatic case of a long-branch attraction artifact.

So, we have a clear example here, where accounting for site-specific amino-acid preference appears to result in a greater robustness against tree reconstruction errors. See also Lartillot et al (2007) for other examples, concerning the phylogenetic position of nematodes and platyhelminths.

Why is it that models accounting for site-specific amino-acid restrictions are more accurate? The main reason is relatively simple and can be summarized as follows.

Convergent substitutions (or, equivalently, reverting substitutions) represent the primary source of tree reconstruction errors. Therefore, in order to correctly discriminate between shared-derived characters (signal) and convergent or reverting substitutions (noise), it is particularly important to correctly model all of the factors that could cause a high rate of convergent substitution in real data.

Now, the probability of convergent evolution toward the same amino-acid at the same site, in two unrelated species separated by a large evolutionary distance, is roughly the inverse of the number of possible amino-acids at that site. Thus, a small number of acceptable amino-acids per site mechanically implies a high probability of convergent evolution.

And thus, one can easily imagine that a model not correctly accounting for site-specific amino-acid restrictions, because it essentially assumes that all amino-acids are accepted at all sites in the long run, will automatically underestimate the probability of convergent evolution. Therefore, by mistaking noise for signal, it will tend to produce artifacts.

In contrast, a model explicitly accounting for site-specific amino-acid restrictions is in a better position to correctly estimate the probability of convergent substitution, and as a result, is better calibrated in terms of the signal/noise ratio. This is what make such site-specific models more robust against tree reconstruction artifacts.

All of what I have seen thus far suggests that this effect due to site-specific amino-acid restrictions is quite strong and that it can have a significant impact on phylogenetic accuracy over deep evolutionary times. In a sense, this is not so surprising: the number of amino-acids typically accepted at a given site is on average of the order of 4 or 5, out of 20 amino-acids -- which means that there is room here for an underestimation of the probability of convergent evolution (and thus, an underestimation of the prevalence of noise) by a factor of 4. I will come back to this point in later posts, in order to make it more precise and more quantitative.

In any case, this connection between site-specific amino-acid restrictions and the rate of convergent evolution, combined with what we have seen earlier concerning the problems inherent to the idea of imposing a single amino-acid replacement matrix to a heterogeneous collection of amino-acid positions, each with its own selective requirements, captures one of the most fundamental limitations of the simple amino-acid replacement models that are still currently used in phylogenomics.

--

Henner Brinkmann, Mark van der Giezen, Yan Zhou, Gaëtan Poncelin de Raucourt, and Hervé Philippe (2006). An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics., 54(5), 743–757. http://doi.org/10.1080/10635150500234609

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artifacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. http://doi.org/10.1186/1471-2148-7-S1-S4

## Friday, November 25, 2016

### Double standard (2)

Just a side remark: in my last post, I pointed out that classical amino-acid replacement matrices implicitly encode site-specific selection in the first-order Markov dependencies of the process, and that doing so is optimal in situations of low sequence divergence (more specifically, when the fractions of sites that have more than one substitution event over the whole tree is negligible).

But then, in this situation, there is also no information that can be gained about rate variation across sites. In this situation, sites have either 0 or 1 substitution event, and the only meaningful statistic that we have is the fraction of sites with one substitution — which gives us an estimate of the mean rate of substitution across sites, but otherwise, does not tell us anything about the variance (the shape parameter alpha).

An analogy can be made here with an imaginary coin-tossing experiment, in which you would make each toss with a different coin, all of which have potentially different biases (different probabilities of giving a head). Thus, there is an unknown distribution, over coins, of the probability of getting head.

In this situation, with only one draw per coin, your fraction of heads is equal to the average probability of getting head over all coins. However, this is the only information that you can get from this experiment. You need to make at least 2 tosses per coin in order to have information about the variance of the head probability over coins. An increasing number of tosses per coin give you information about an increasing number of the moments of the distribution.

In any case, coming back to sequence evolution, the main point I wanted to make is this: empirical amino-acid replacement matrices are the perfect solution in those situations where you also cannot meaningfully estimate the variance in rates across sites. Conversely, as soon as you have sufficient sequence divergence, giving you empirical information about rate-heterogeneity, then you also have useful information concerning pattern-heterogeneity — which will not be captured by a single amino-acid replacement matrix.

On a more general note, making the parallel between rate- and pattern-heterogeneity is, I think, insightful.

Also, to address one of the comments: of course, there are computational issues behind our potential methodological inconsistencies. But this does not prevent us from pointing them out and raising a little sense of discomfort about what we easily take for granted in our daily routine. That's the only way we can make any progress.

## Wednesday, November 23, 2016

### Long-term site-specific selective constraints

There is a catch in my argument about the double standards of some of the current phylogenetic models, concerning rate- versus pattern-heterogeneity across sites. It is unfair to suggest that a model does not capture the consequences of varying selection along the sequence just because it uses the same amino-acid replacement matrix across all sites. In fact, empirical amino-acid replacement matrices do capture the effects of site-specific selection, but it is just that they do it implicitly and globally -- simply by proposing higher exchange rates among biochemically similar amino-acids.

If we look for instance at the WAG matrix:

we see that this matrix gives particularly high exchange rates between I and V and E and D. The way those exchange rates capture site-specific selection is implicit, the logic being as follows: given that a site is currently is state D, then, probably, this site is under (site-specific) selection for a negatively charged amino-acid, and therefore, we expect that the next substitution at that site will be an E with high probability.

So, classical amino-acid replacement matrices do model site-specific selection. Technically, they encode it in the first-order Markov dependencies of the amino-acid replacement process. And in fact, this approach is optimal in non-saturated situations, where each site of the alignment typically has 0 or 1 substitution event over the whole tree, but rarely more than that. In this situation, there is no hope to get more information than what is already captured in this first-order Markov structure.

However, things are quite different in a deep evolutionary context, where each site makes potentially many substitution events. Here is an example (taken from an empirical alignment at the scale of metazoans):

We see that this site is apparently under strong selective constraint to accept negatively charged amino-acids, but otherwise, can make repeated substitutions between D and E (and sometimes visit other amino-acids). In other words, this site (and many other sites) tends to be confined, in the long-run, within a very restricted subset of all possible amino-acids.

Are such site patterns easily predicted (and reproduced) by classical empirical matrices? To check this, we could for instance simulate repeated substitution histories for a given site that would start on an aspartic acid (D), and then tabulate the frequencies at which the 20 possible amino-acids are observed after 1, 2, ... k substitutions. Here is what you get with the WAG model for a site starting on an aspartic acid (D) or on a lysine (K), and over the first 5 substitution events:

As you can see, the substitution process implied by the WAG matrix does not maintain itself for a very long time within the restricted orbit of the two negatively charged amino-acids D and E (same thing for the positively charged amino-acids K and R). Instead, a typical site very quickly looses memory of the selective constraint it is supposed to stay under in the long run, such that after as few as 4 substitution events, it ends up sampling amino-acids according to proteome-wide average frequencies.

Why doesn't that work? There are at least two reasons. The first is that, as you can check from the matrix of relative exchange rates shown above, there is a high exchange rate, not just between D and E, but also between D and N. Then, from N, there is a high rate to S, etc. So, basically, what happens is that, in reality, you have some sites that make repeated substitutions between D and E, other sites between D and N, and still many other sites visiting other different but overlapping amino-acid orbits. In terms of selection pressure, this betrays the fact that the same amino-acid can be selected at different sites for different reasons: D and E are both negatively charged, but on the other hand, D and N have similar geometries for their side chain.

In this situation, when you estimate an average amino-acid replacement matrix simultaneously over all of those sites, you end up with significant exchange rates between D and a broad set of alternative amino-acids: E, N, but also G and H -- which then makes it very difficult for this matrix to explain the long-term confinement that can be seen at typical sites under long-term purifying selection.

The second reason is more conceptual and connected to the following question: what should be encoded at the level of the transient regime of the process, versus what should be captured by its stationary regime?

The site shown above suggests that the selective constraint experienced by a typical site under strong purifying selection is stable in the long run. This may not always be true, but this is generally the case. This might in fact be even more pronounced in the context of phylogenetic analyses, where there is a selection bias in favor of proteins that have a highly conserved structure (this is what makes the alignment possible at such a large evolutionary scale in the first place). And thus, sites of those proteins tend to be in a conserved biochemical environment over very long evolutionary times.

If selection is stable in the long run, then it would perhaps make more sense to encode it in the stationary regime of the substitution process at a given site. Instead, the relative exchange rates of an amino-acid replacement matrix encode the transient regime of the substitution process, whereas the stationary regime is entirely determined by the equilibrium frequencies, which are basically equal to the observed amino-acid frequencies over the entire alignment. And what the figure above indicates is that the transient regime encoded by a typical amino-acid replacement matrix is, indeed, very transient.

This argument suggests that site-specific amino-acid preferences should perhaps be encoded at the level of the equilibrium frequency vector over the 20 amino-acids. But then this means that the equiibrium frequencies of the substitution process should be site-speciifc. This idea was first proposed, and explored, by Halpern and Bruno, back in 1998.

## Tuesday, November 22, 2016

### Double standard

Nobody today would publish a phylogeny that has been inferred using a model that does not account for rate variation across sites, at least not a phylogeny over a large evolutionary scale (e.g. animals, eukaryotes, etc).

And indeed, using such rates-across-sites models, we usually do uncover substantial rate variation in empirical sequence alignments: the shape parameter of the gamma distribution (alpha) is typically less than 1, of the order of 0.7. Below is a plot of the density of the gamma for alpha=0.7. The 0.25 and 0.75 quantiles are 0.18 and 1.38 respectively. Thus, the 25% fastest evolving sites are evolving about 10 times faster than the 25% slowest sites.

Also, it is clear that much of this variation occurs within genes: in about any gene, you expect to find both slow and fast sites (you just need to fit the gamma model on each gene separately and observe that the estimated value for the alpha parameter is, again, typically less than 1).

Now, why is there so much variation across sites in the overall rate of substitution? Mostly because of selection. There is some contribution from varying mutation rates. However, this is probably minor. For the most part, what explains rate variation across sites is simply that some sites are highly conserved (strong purifying selection), other sites are less conserved, and a minor fraction is under positive selection.

But then, if the selective pressure is so different at different sites, in terms of its overall strength, don’t we expect that it should also be different in terms of which types amino-acids are preferred at each site?

What I am suggesting here is that using a model that allows for rate variation across sites, but imposes the same amino-acid replacement matrix across all sites of a given gene — or, worse, across an entire multi-gene alignment — amounts to applying a double standard with respect to the consequences of varying selection along the sequence.

And yet this is what many current phylogenetic reconstruction models still do.

## Monday, November 21, 2016

### Ctenophores and the role of models in phylogenetics

I gave a phyloseminar last week, on the problem of dealing with patter-heterogeneity across sites in phylogenetic inference. I find this idea of phyloseminars really neat. Many thanks (and kudos) to Erick Matsen, for organizing all this. There is now quite a rich collection of phyloseminars given by many people and on many subjects in evolutionary sciences, which represents a such a valuable resource on the web. Here is the link:

To put things in context, I have been caught in the middle of a controversy about the phylogeny of animals — more specifically, but not only, concerning the phylogenetic position of ctenophores (comb jellies).  As you may know, ctenophores have been recently proposed as being the sister-group to all other animals, including sponges (Moroz et al, 2014, Ryan et al, 2014). This ctenophore-sister hypothesis is not supported, however, when you use more sophisticated models accounting for variation across sites in the amino-acid replacement process (the CAT and CAT-GTR models, implemented in PhyloBayes). Instead, these two models give sponges sister to all other animals, as is more traditionally believed (although never with really strong support, and in a way that depends on the exact choice for the outgroup). A detailed analysis is given in Pisani et al (2015).

My interpretation of all this (which is also the interpretation suggested in Pisani et al, 2015) is that ctenophore-sister is fundamentally a long-branch attraction artifact caused by the use of inadequate models of sequence evolution. But then, not everyone agree with this interpretation (e.g. Halanych et al, 2016). And progressively, this controversy about ctenophores is becoming a question about which models — and which software programs —  should be used in the context of large-scale phylogenetic reconstructions. The phyloseminar I gave last week is in fact the last of a series of three (the other two were given by Andrew Roger and Nathan Whelan), series which can be seen as organizing a debate about the role of site-heterogeneity in phylogenetic inference.

In any case, all this raises very important questions about the current situation in the field of phylogenetic studies. Are current models and methods adequate, in particular, for reconstrcuting ‘deep’ phylogenies (over deep evolutionary times)? Where do we stand today, and what can we hope, in terms of methodological progress, for the next years on this front?

In the next posts I will publish on this blog, I will come back to some of those questions. So, bear with me!..

References

Halanych, K. M., Whelan, N. V., Kocot, K. M., Kohn, A. B., & Moroz, L. L. (2016). Miscues misplace sponges. Proceedings of the National Academy of Sciences, 113(8), E946–7. http://doi.org/10.1073/pnas.1525332113

Moroz, L. L., Kocot, K. M., Citarella, M. R., Dosung, S., Norekian, T. P., Povolotskaya, I. S., et al. (2014). The ctenophore genome and the evolutionary origins of neural systems. Nature, 510(7503), 109–114. http://doi.org/10.1038/nature13400

Pisani, D., Pett, W., Dohrmann, M., Feuda, R., Rota-Stabelli, O., Philippe, H., et al. (2015). Genomic data do not support comb jellies as the sister group to all other animals. Proceedings of the National Academy of Sciences, 112(50), 15402–15407. http://doi.org/10.1073/pnas.1518127112

Ryan, J. F., Pang, K., Schnitzler, C. E., Nguyen, A.-D., Moreland, R. T., Simmons, D. K., et al. (2013). The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution. Science, 342(6164), 1242592. http://doi.org/10.1126/science.1242592

### Resurrection

This blog has been silent for such a long time (two years…). But there are quite a few things I would like to share with you, concerning various current developments in phylogenetics and in Bayesian inference. So, let start it up again.