Coming back to the Microsporidia example, the LG model does not just give an artifactual position for Microsporidia. It also gives an estimate for the total tree length of about 10.3 amino-acid changes per site. On the same dataset, CAT-GTR, which accounts for site-specific amino-acid preferences, infers a total tree length of 18.6.
And this makes total sense. In fact, the problem was very clearly identified more than 18 years ago, by Halpern and Bruno (1998, from the abstract):
Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial for estimating long distances.
Thus, nothing new, although I find it instructive to measure this effect more quantitatively. Here, if CAT-GTR is correct, then by ignoring site-specific amino-acid preferences, we are missing 8.3 of the 18.6 inferred changes per site, or about 45%, which is more than one third of all amino-acid substitutions. The same pattern is obtained (with a difference of about 30 to 40%) on recent phylogenomic datasets at the scale of metazoans.
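The mechanism behind this underestimation can be illustrated with a toy calculation. The sketch below is not LG or CAT-GTR: it assumes a much simpler F81-like (Poisson) model, in which the expected dissimilarity between two sequences at distance d is H(1 - exp(-d/H)), with H = 1 - sum of squared stationary frequencies. The profiles, the distance of 2.0, and the function names are all hypothetical, chosen only to show the direction of the bias.

```python
import math

def expected_dissimilarity(d, freqs):
    """Expected fraction of differing sites at distance d (substitutions per site)
    under an F81-like model with stationary frequencies `freqs`."""
    h = 1.0 - sum(f * f for f in freqs)  # dissimilarity plateau at saturation
    return h * (1.0 - math.exp(-d / h))

def estimated_distance(p, freqs):
    """Invert the formula above: the distance implied by an observed dissimilarity p."""
    h = 1.0 - sum(f * f for f in freqs)
    if p >= h:
        raise ValueError("dissimilarity at or above the saturation plateau")
    return -h * math.log(1.0 - p / h)

# Hypothetical truth: every site accepts only 4 of the 20 amino acids.
site_profile = [0.25] * 4
# Site-homogeneous assumption: all 20 amino acids are acceptable at every site.
global_profile = [0.05] * 20

true_distance = 2.0
p = expected_dissimilarity(true_distance, site_profile)
naive = estimated_distance(p, global_profile)    # ignores site-specific preferences
recovered = estimated_distance(p, site_profile)  # uses the true site profile

print(f"observed dissimilarity: {p:.3f}")
print(f"distance inferred with the global profile: {naive:.2f}")
print(f"distance inferred with the site profile:   {recovered:.2f}")
```

With these made-up numbers, the plateau is 0.75 under the site-specific profile but 0.95 under the global one, so a dissimilarity that is in fact close to saturation looks like a moderate distance to the site-homogeneous model, which returns roughly 1.3 instead of 2.0. The bias goes in the same direction as the LG versus CAT-GTR tree lengths, although the magnitude here depends entirely on the assumed profiles.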
Incidentally, it is also interesting to note that, in spite of the importance of the problem raised by Halpern and Bruno, the 'current models' of 1998 are still more or less the default option in many phylogenetic reconstruction programs.
Halpern, A. L., & Bruno, W. J. (1998). Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Molecular Biology and Evolution, 15(7), 910–917.