## Friday, November 25, 2016

### Double standard (2)

Just a side remark: in my last post, I pointed out that classical amino-acid replacement matrices implicitly encode site-specific selection in the first-order Markov dependencies of the process, and that doing so is optimal in situations of low sequence divergence (more specifically, when the fraction of sites that have more than one substitution event over the whole tree is negligible).

But then, in this situation, there is also no information to be gained about rate variation across sites. Each site has either 0 or 1 substitution event, and the only meaningful statistic that we have is the fraction of sites with one substitution. This gives us an estimate of the mean rate of substitution across sites, but tells us nothing about the variance (the shape parameter alpha of the gamma distribution).
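To make this concrete, here is a rough sketch under assumptions of my own choosing for illustration: the number of substitutions $N$ at a site is Poisson with mean $rT$, where $T$ is the total tree length, and the site rate $r$ is gamma-distributed with mean $\mu$ and shape $\alpha$. Averaging over $r$ gives

$$
P(N=0) = \left(1 + \frac{\mu T}{\alpha}\right)^{-\alpha},
\qquad
P(N=1) = \mu T \left(1 + \frac{\mu T}{\alpha}\right)^{-(\alpha+1)} .
$$

Expanding for small $\mu T$:

$$
P(N=1) = \mu T - \left(1 + \frac{1}{\alpha}\right)(\mu T)^2 + O\!\left((\mu T)^3\right),
\qquad
P(N \ge 2) = \frac{1}{2}\left(1 + \frac{1}{\alpha}\right)(\mu T)^2 + O\!\left((\mu T)^3\right).
$$

Under these assumptions, alpha enters only through terms of order $(\mu T)^2$, which is the same order as the fraction of sites with multiple substitutions. When that fraction is negligible, so is the information about alpha.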

An analogy can be made here with an imaginary coin-tossing experiment, in which you would make each toss with a different coin, all of which have potentially different biases (different probabilities of landing heads). Thus, there is an unknown distribution, over coins, of the probability of getting heads.

In this situation, with only one toss per coin, your fraction of heads is an estimate of the average probability of getting heads over all coins. However, this is the only information that you can get from this experiment. You need to make at least 2 tosses per coin in order to have information about the variance of the heads probability over coins. An increasing number of tosses per coin gives you information about an increasing number of the moments of the distribution.
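The coin experiment is easy to simulate. The sketch below is only an illustration: the Beta distributions for the coin biases, the function names and the sample sizes are all my own assumptions. It uses two bias distributions with the same mean (0.5) but very different variances. With one toss per coin the two cannot be told apart. With two tosses per coin, the fraction of coins giving two heads estimates the second moment of the heads probability, from which the variance follows.

```python
import random


def simulate(a, b, n_coins, tosses, seed=0):
    """Toss each of n_coins coins `tosses` times.

    Coin biases are drawn from Beta(a, b) (an illustrative assumption).
    Returns the list of per-coin head counts.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_coins):
        p = rng.betavariate(a, b)  # this coin's probability of heads
        counts.append(sum(rng.random() < p for _ in range(tosses)))
    return counts


def mean_estimate(counts, tosses):
    """Overall fraction of heads: estimates the mean bias E[p]."""
    return sum(counts) / (len(counts) * tosses)


def variance_estimate(counts):
    """Only valid with exactly 2 tosses per coin.

    P(two heads) = E[p^2], so Var[p] is estimated by frac(HH) - mean^2.
    """
    frac_hh = sum(c == 2 for c in counts) / len(counts)
    m = mean_estimate(counts, 2)
    return frac_hh - m * m


if __name__ == "__main__":
    n = 50_000
    # Same mean (0.5), very different spread of biases across coins.
    for label, (a, b) in {"wide": (0.5, 0.5), "narrow": (50, 50)}.items():
        one = simulate(a, b, n, tosses=1, seed=1)
        two = simulate(a, b, n, tosses=2, seed=2)
        print(label,
              "mean from 1 toss:", round(mean_estimate(one, 1), 3),
              "variance from 2 tosses:", round(variance_estimate(two), 3))
```

With one toss per coin, each observation is a single Bernoulli draw whose success probability is the mean bias, so no statistic of the data can depend on the variance. The second toss is what makes the variance identifiable.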

In any case, coming back to sequence evolution, the main point I wanted to make is this: empirical amino-acid replacement matrices are the perfect solution in those situations where you also cannot meaningfully estimate the variance in rates across sites. Conversely, as soon as you have sufficient sequence divergence, giving you empirical information about rate-heterogeneity, then you also have useful information concerning pattern-heterogeneity — which will not be captured by a single amino-acid replacement matrix.

On a more general note, making the parallel between rate- and pattern-heterogeneity is, I think, insightful.

Also, to address one of the comments: of course, there are computational issues behind our potential methodological inconsistencies. But this does not prevent us from pointing them out and raising a little sense of discomfort about what we easily take for granted in our daily routine. That's the only way we can make any progress.