## Tuesday, October 28, 2014

### Bayesian automatic control of model complexity

Still concerning Bayes factors and model selection, I wanted to show a little example, illustrating how one can rely on Bayesian model averaging and hierarchical priors to implement efficient and automatic control of model complexity -- model auto-tuning, as I like to call it.

First, I have simulated a single-gene amino-acid alignment of ~600 positions under a very simple model, using the LG amino-acid empirical matrix (Le and Gascuel, 2008) and assuming no variation in substitution rates among sites. For this simulation, I used as a template an alignment of elongation factor 2 in 30 species across eukaryotes. The simulation was done using the tree estimated on this dataset, thus under empirically reasonable settings.

Then, I have analyzed this simulated dataset directly using one of the most complex models I have implemented thus far: CAT-GTR + Gamma (Lartillot and Philippe, 2004). This model implements variation among sites in the rate of substitution (distributed across sites according to a discretized gamma distribution, with 4 categories), but also in the equilibrium frequency profile over the 20 amino-acids, using a Dirichlet process non-parametric prior. Finally, the 190 relative exchangeability parameters are also considered as unknown and are therefore estimated directly on the dataset.

According to accepted wisdom, one would think that this model is way too complex, way too rich, to be applied to such a small dataset. Even using a GTR model, instead of a LG model, in the present case would often be considered as excessive (190 free parameters! on a 600-site alignment!). Therefore, my estimation will certainly suffer from excessive variance.

Is this true?

Let us see, by just trying both models: the complex model, CAT-GTR + Gamma, and the true model, LG without Gamma, and see what comes up in terms of phylogenetic estimation (I let you guess which picture corresponds to which model):

Where is the excess variance here? There is virtually no detectable difference between the two estimates.

This represents a typical example showing that a complex model can automatically adapt to the specific case you are dealing with, making model selection totally dispensable. All this, however, is true only if your priors have been correctly designed, so as to get efficient shrinkage of your parameters, in particular under those circumstances where the data in fact contain none of the subtle variation that your model is meant to account for. So, perhaps I should spend some time explaining how my priors have been elicited in the present case. The main point is that these priors are hierarchical.

For the shape parameter $\alpha$ of the gamma distribution of rates across sites, I have used a log-uniform prior restricted between 0.1 and 100, thus allowing a broad range of variance among sites in relative rates. Large values of $\alpha$ effectively implement uniform rates across sites.

A Dirichlet process prior is used over the unknown distribution of amino-acid equilibrium frequency profiles across sites. This non-parametric prior effectively runs over infinite mixtures of amino-acid profiles. There are two main things that have to be controlled in the distribution of the components of the mixture: the relative weights of the components and the dispersion of the profiles attached to them.

The weights are regulated by a granularity parameter $\kappa$. For small values of $\kappa$, the mixture is such that one component of the mixture has an overwhelmingly large weight compared to the infinite series of all other components. In this regime, nearly all sites will be allocated to the large-weight component, and the model will implement strictly identical amino-acid profiles across sites. In contrast, for large values of $\kappa$, the Dirichlet process implements many components of very small weights, so that each site has effectively its own equilibrium frequency profile over amino-acids. Here, $\kappa$ is endowed with an exponential prior of mean 10, so that there is some room for implementing potentially rich mixtures by setting $\kappa$ to reasonably large values, but also to reduce the number of components to 1 or 2 (essentially, when $\kappa$ is less than 0.1).

As for the distribution of profiles across components, it is regulated by a concentration parameter, $\delta$. For large values of $\delta$, amino-acid profiles implemented by the mixture will all be very similar, whereas for small value of $\delta$, they will tend to be very different. Here, $\delta$ is endowed with an exponential prior of mean 20. So, it does not especially favor very similar components, but it can, if needed (when its value is typically of the order of 50 or above).

Finally, the prior on the exchangeability parameters is centered around the LG exchangeabilities, with a concentration hyperparameters $\zeta$ itself endowed with a log-uniform distribution between 0.1 and 100. Large values of $\zeta$ effectively shrink the exchangeabilities very close to those of LG.

Therefore, the most important priors and distributions are tuned by four key-parameters, such that, with large $\alpha$, large $\zeta$, small $\kappa$ and/or large $\delta$, the model reduces to the LG model with uniform rates and profiles across sites. Note that there are two redundant ways the model can come back to virtually identical profiles across sites: either by implementing one single component of overwhelming weight (small $\kappa$), or by just letting the amino-acid profiles of the mixture be all very close to each other (large $\delta$).

So, what happens exactly when I run this model on the dataset simulated under the simple LG-Uniform model?

First, the posterior distribution on $\alpha$, the shape parameter of the distribution of rates across sites (left) is shifted to very large values: $\alpha$ is inferred to be so large that, essentially, all relative rates of substitution across sites are all strongly shrunken around 1 (right) -- in other words, the model effectively implements uniform rates across sites, as should be done in the present case.

Then, what about variation in amino-acid profiles among sites? The figure below shows the posterior distribution on the number of occupied components:

as you can see, we get a large posterior probability for the configuration where all sites are allocated to one single component (pp = 0.6). Even when sites are distributed over more than one component (which still happens around 40% of the time), the components effectively implement nearly identical frequency profiles: $\delta$ is on average around 80 (not shown). Thus, efficient shrinkage of equilibrium frequency profiles across sites is achieved by a combination of discrete model selection (by playing on the number of occupied profiles) and continuous model averaging (by shrinking component-specific frequency profiles toward their mean).

We can have a more direct look at the average profiles estimated across sites, and check that they are indeed, for all practical means, constant across sites:

It is also interesting to get a look at the accuracy of our estimation of the estimated relative exchangeabilities, by comparing them to the true values:

Such a beautifully accurate estimation of the relative exchangeabilities may in fact suggest that using a model centered on LG is just too easy... So, what happens if I use instead a uniform prior over exchangeabilities? In that case, shrinkage is somewhat less efficient, both in terms of the accuracy of the estimated exchangeabilities:

and in terms of the number of occupied components in the mixture (the posterior probability of getting just one occupied component is now less than 0.2):

On the other hand, $\delta$ is still large (again of the order of 80), which probably contributes to inducing a series of very similar amino-acid profiles across sites:

Finally, the estimated consensus tree is exactly the same as with the model centered on LG:

Thus, globally, there is no real loss incurred by using the more complex model, when in fact a much simpler model would be applicable to our data.

In contrast, on the real sequence alignment, there is rate variation among sites, and there is also a great deal of variation in amino-acid preferences along the sequence. We can get a clear picture of that by just running the CAT-GTR + Gamma model directly on the original empirical dataset of elongation factor sequences. Here, the posterior distributions that we get are totally different, in particular on the distribution of rates across sites ($\alpha$ is now clearly less than 1, of the order of 0.55) and concerning the distribution of amino-acid profiles across sites:

Other analyses with CAT-GTR on a now relatively large set of cases suggest that this is a very general pattern: on empirical sequence data, one always sees substantial variation among sites of both rates and amino-acid propensities.

Overall, one should probably not over-interpret those results, which all entirely rely on the fact that the true model is included in the set of possible configurations explored by CAT-GTR. In the real world, things are much more complicated: you almost certainly have other types of model violations (compositional heterogeneity among taxa, epistatic interactions between positions, due to the tertiary structure of the protein), which have been totally ignored in our analysis. One big question is, for instance: how does a model such as CAT-GTR react to these violations, compared to LG or to other models? Not an easy question, for sure.

However, this is not the kind of issues that you can hope to address by just comparing the empirical fit of the models. Addressing such problems requires more substantive arguments, about exactly what you want to estimate and how violations are likely to influence your estimation. In this direction, we may not know the specific impact of other model violations on CAT-GTR, but we already know that, by using LG, we are for sure creating one additional model violation, since we are ignoring what appears to be substantial and ubiquitous variation among sites in amino-acid propensities -- and this does have an impact on the accuracy of phylogenetic estimation (Lartillot et al, 2007).

But in any case, these model violation problems distract us from the main point of this post, which is more theoretical, being concerned with how Bayesian inference deals with model complexity. And the main message is this: unlike what happens in the context of maximum likelihood estimation, where we have to compensate for the intrinsic advantage of more complex models by implementing explicit model penalization and explicit model selection, in a Bayesian inference context, and using correctly specified priors, model complexity is just under automatic control. As a result: (1) it is possible to use a complex model by default, even on small datasets; this model will automatically reduce to the simpler model if needed (2) therefore, we do not need Bayes factors to be sure that we are using the right model, with the right complexity; (3) nor do we not need Bayes factors to see that, on a real dataset, there is variation in rate and in equilibrium frequency profiles: we just see it in the posterior distribution. And if we are really anxious, we can always check, on simulations under simpler models, that our Bayesian analysis would not infer such a variation if there weren't any.

===

Le, S. Q., & Gascuel, O. (2008). An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7), 1307–1320. doi:10.1093/molbev/msn067

Lartillot, N., & Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21(6), 1095–1109. doi:10.1093/molbev/msh112

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. doi:10.1186/1471-2148-7-S1-S4