Wednesday, December 7, 2016

The effective number of amino-acids per site

As we have seen, a fundamental consequence of site-specific selective constraints is that the number of amino-acids allowed at each site is typically much smaller than 20. This in turn results in a purely combinatorial effect: few amino-acids per site means that convergent substitution events are more frequent than what is anticipated by a model assuming that all 20 amino-acids are equally acceptable at each site — with potentially important consequences on phylogenetic accuracy.

Here is a more quantitative version of this argument.

First, imagine a position that can accept only $K$ amino-acids, but accepts each of them with equal frequency $1/K$. Then, if this position is sufficiently fast evolving, the probability of observing the same amino-acid at that position in two distantly related species (i.e. the long-term probability of convergent evolution) is just $p_c = 1/K$. Or, equivalently, $K = p_c^{-1}$.

Now, consider an arbitrary amino-acid replacement process, with equilibrium frequency vector $(\pi_k)_{k=1..20}$ over the 20 amino-acids. The probability of sampling twice the same amino-acid by chance from this equilibrium frequency vector is:

$p_c \, = \, \sum_{k=1}^{20} \, \pi_k^2$.

By analogy with the simple case considered above, we can define the effective number of amino-acids (or the effective size of the state space) implied by this amino-acid replacement process as:

$K_e \, = \, p_c^{-1} = \, \left( \sum_{k=1}^{20} \, \pi_k^2 \right) ^{-1}$

Note that this definition of $K_e$ is different from the one often used (e.g. Pollock et al, 2006, Echave and Wilke, 2016), which is instead related to Shannon entropy. The present definition is, I think, more adequate in the present context, as it more directly captures the fundamental relation between amino-acid restrictions and probability of convergent evolution ($p_c = 1/K_e$). The two definitions give qualitatively similar results in practice.

We can calculate the effective number of amino-acids at each site, based on the site-specific amino-acid equilibrium frequency profiles estimated by the CAT-GTR model. Here is the distribution of $K_e$ across the entire alignment in the case of Microsporidia:

The average across sites is $<K_e> =7.22$, and the first 25% quantile is 4.1. In other words, according to CAT-GTR, 25% of the sites effectively visit less than 4 distinct amino-acids in the long-run. Importantly, the mean substitution rate for those sites is such that they make on average 16.5 substitution events across the whole tree, and 25% of them make more than 20 substitutions repeatedly over at most 4 amino-acids. Thus, sites with a small $K_e$ are not particularly slowly evolving, and many of them are even highly saturated.

In contrast, the bar shown in red on this plot represents the effective number of amino-acids implied by the use of single amino-acid replacement matrix for all sites (i.e. calculated on the empirical frequencies over the whole alignment). This effective number is equal to 16.1, which is more than twice as much as the mean across sites estimated by CAT-GTR, and 4 times as large as the effective number for the 25% most restricted positions of the alignment. 

Also, the tail-area under the blue curve, to the right of this red bar, is only about 0.2% — which clearly illustrates that the amino-acid replacement process estimated by pooling sites is definitely not typical of what happens at most sites.

Coming back to the question of phylogenetic accuracy, all this means that using a single amino-acid replacement matrix for all sites leads to underestimating the long-run probability of convergent substitution for 99.8% of all sites, by a factor 2 on average, and by a factor at least 4 for 25% of the positions — clearly not a well-calibrated model, in terms of its estimation of the phylogenetic noise.


Pollock, D. D., Thiltgen, G., & Goldstein, R. A. (2012). Amino acid coevolution induces an evolutionary Stokes shift. Proceedings of the National Academy of Sciences of the United States of America, 109(21), E1352–9.

Echave, J. and Wilke, C. O. (2016). Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence.


  1. Unless I'm mistaken, Ke is the inverse Simpson diversity index, yes?

  2. yes, exactly

    you can also derive the same argument in pop. gen. and define the effective number of alleles at a locus (which is basically the inverse of the homozygosity)