Wednesday, October 1, 2014

Selecting models without computing Bayes factors

Model comparison, or model selection, is often considered a crucial step in a typical Bayesian analysis. It is usually done by calculating Bayes factors or, more generally, marginal likelihoods.

I used to be a fan of Bayes factors. I even spent quite some time on numerical methods for reliably computing them, with the aim of comparing amino-acid substitution models or alternative relaxed molecular clocks. But I have progressively changed my mind about Bayes factors, or at least about the need to explicitly calculate them.

First, numerical evaluation of marginal likelihoods is tricky. For quite some time, it was not even done properly: unreliable methods such as the harmonic mean estimator (‘the worst Monte Carlo method ever’) were most often used for that purpose. Now, better numerical methods are employed, such as thermodynamic integration or its discretized equivalent, the stepping-stone approach. But the reliability of these numerical methods comes at a price: they are computationally expensive.

Given the costs associated with the whole enterprise, it is important to realize that these laborious numerical methods are most often unnecessary. A much simpler alternative is to define a general model encompassing all of the models we wish to compare. Then, by carefully designing the prior over the parameters of this general model, we obtain a probabilistic system that, once conditioned on a given dataset, automatically selects the parameter regime effectively implementing whichever model happens to have the highest fit for these data.

As a simple but not too trivial example, let us assume we have an amino-acid alignment and want to choose between two alternative substitution models: say, the empirical JTT matrix and the most general time-reversible model (GTR). The two models greatly differ in their dimensionality: the 190 exchangeability parameters are fixed to pre-determined values in the case of JTT, whereas they are estimated directly from the data in the case of GTR.

In this specific case, instead of numerically evaluating the marginal likelihoods of the two models, one could design a GTR model such that the prior on its exchangeability parameters is centered on the exchangeabilities of JTT. The concentration of this prior around its center would be controlled by a single concentration parameter, which should in turn be endowed with a broad prior. Typically, I would use a log-uniform prior (possibly truncated to make it proper) or a half-Cauchy. Then, I would just condition the model on the data, without any further complicated model selection process. If the JTT model really fits the data well, then the concentration parameter will automatically take on large values under the posterior, and as a result, the exchangeabilities of the model will shrink towards those defined by JTT. In other words, the GTR model will effectively implement the JTT model if this is what the data prefer. Importantly, this automatic 'collapse' of the model is driven by just one hyperparameter, so that the approach should work well even for relatively small datasets.
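As a toy illustration of the shrinkage behavior of such a centered prior (a sketch only, not an implementation of the full phylogenetic model), one can draw exchangeability vectors from a Dirichlet distribution whose parameter is the concentration times a fixed center. The 6-dimensional center used below is a hypothetical stand-in for the 190 normalized JTT exchangeabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the (normalized) JTT exchangeabilities:
# a 6-dimensional center, as in a DNA GTR model, rather than the full
# 190-dimensional amino-acid case.
center = np.array([0.10, 0.20, 0.30, 0.15, 0.15, 0.10])

def sample_exchangeabilities(concentration, n=10000):
    # Dirichlet prior centered on `center`: the prior mean is `center`
    # for any concentration, but large concentrations shrink the draws
    # tightly around it.
    return rng.dirichlet(concentration * center, size=n)

# The mean absolute deviation from the center shrinks as the
# concentration parameter grows:
for c in (1.0, 100.0, 10000.0):
    draws = sample_exchangeabilities(c)
    print(c, np.abs(draws - center).mean())
```

In the full model, the concentration parameter would itself receive a broad prior (log-uniform or half-Cauchy, as above) and be estimated; large posterior values of the concentration then correspond to an effective collapse onto JTT.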

Note that the model will never exactly collapse onto the simpler model: using this approach, the exchangeabilities of the GTR model are never exactly equal to those of the JTT model. In all regimes, the model always has its full dimensionality. Is this really a problem?

A typical objection here would be that our estimation might suffer increased variance. However, strictly speaking, this is true only if the data have been produced by a JTT model, which is never the case in practice, except in simulation studies. Even in simulation studies, I would be surprised to see any detectable difference in terms of estimation error (above the MCMC noise) between the hierarchical model setting suggested above and the true model: thanks to the specific hierarchical design of the prior, the extra variance induced by the more flexible parameterization should be very small — technically, the effective number of parameters is much smaller than the nominal number of parameters.

More fundamentally, nobody believes that JTT is the true model for any real data, and therefore, it does not really make sense to see it as a relevant point null hypothesis to be tested against a more general alternative, as would be implied by a Bayes factor analysis between JTT and GTR. Instead, JTT at best represents a good first guess for the exchangeabilities, to be used when the sequence alignment under investigation is too small to provide this information entirely from scratch. But then, a first guess should probably not be formalized as a point null. Instead, it should be seen as a good center for the prior — which is exactly what the hierarchical prior suggested here does: using JTT as a first guess, while implementing some empirically adjustable flexibility around it.

This method can easily be generalized to deal with all sorts of model selection problems. In all cases, this will be done by defining hierarchical priors in a way that ensures that the model will be able to effectively collapse onto a simpler model if needed. Many examples can be imagined: a large value for the shape parameter of the gamma distribution of rates across sites effectively implements the uniform-rate model; efficient shrinkage of branch-specific equilibrium frequencies for the substitution model can effectively emulate a time-homogeneous model; or a suitably chosen prior over partition models gives the data the opportunity to efficiently funnel the model onto uniform substitution matrices or uniform branch lengths across partitions, if relevant.
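The first of these examples can be made concrete with a few lines of simulation (again only a sketch; the mean-one gamma parameterization below is the standard one for rates across sites): as the shape parameter grows, the distribution of site-specific rates concentrates around 1, effectively recovering the uniform-rate model.

```python
import numpy as np

rng = np.random.default_rng(1)

def site_rates(shape, n=100000):
    # Mean-one gamma distribution of rates across sites:
    # mean = shape * scale = 1, variance = 1 / shape.
    return rng.gamma(shape, 1.0 / shape, size=n)

# The variance of the rates vanishes as the shape parameter grows,
# so a large shape effectively implements the uniform-rate model:
for a in (0.5, 5.0, 500.0):
    rates = site_rates(a)
    print(a, rates.mean(), rates.var())
```

A broad prior on the shape parameter then plays exactly the same role as the concentration parameter in the GTR/JTT example: the data are free to push it towards large values if rate variation among sites is negligible.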

Technically, the approach could be seen as a case of model averaging. However, the expression ‘model averaging’ is meant to emphasize the ‘averaging’ aspect: the fact that you don’t want to bet 100% on one single model. Here, the averaging is perhaps not the most important point (even if it does play some role). For that reason, I would call this procedure model auto-tuning, thus referring to the fact that the model automatically finds the most adequate parameter regime.
In any case, I believe that substantial amounts of computational resources could be saved, simply by not computing Bayes factors between phylogenetic models: in most cases, this is at best unnecessary and, at worst, just an idle exercise.


  1. Hi Nicolas. I only recently discovered your blog - great posts!

    Re the present post, I couldn't agree more - this is exactly the right way to think about it. I also used to think Bayes factors are the way to go, when I was still too much under the influence of the ideas that underlie hypothesis testing. I now think that just about every point null (at least in evolutionary modelling) can be rejected a priori, so model testing questions based on point nulls don't really make sense.

    That said, we do routinely use LRT or BF (or posterior prob) to evaluate the evidence for hypotheses of interest, e.g. omega>1 at a given site, where the null is not a point null. I'm guessing your objection to BF does not extend to such examples (especially since one would not normally describe them as model selection)?

  2. Hi Konrad,

    Thanks for your feedback.

    Concerning the example you mention, that of testing whether omega > 1 on a site-by-site basis: I would not use Bayes factors. Instead, I would rely on posterior probabilities (that omega > 1). This makes an important difference, in particular if the distribution of omega values across sites is itself hyperparameterized, or even estimated non-parametrically: in that case, your posterior probabilities have relatively good frequentist properties, in terms of false discovery rate. Bayes factors do not have such nice properties.
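    As a small numerical sketch of this point (all posterior probabilities below are made-up numbers, not taken from any real analysis): the expected false discovery rate among sites called positive at a given threshold is simply the average of (1 - pp) over the called sites.

```python
import numpy as np

def expected_fdr(post_probs, threshold):
    # Bayesian FDR: among sites whose posterior probability (that
    # omega > 1) is at or above the threshold, the expected fraction
    # of false discoveries is the mean of (1 - pp).
    called = post_probs[post_probs >= threshold]
    if called.size == 0:
        return 0.0
    return float(np.mean(1.0 - called))

# Hypothetical site-specific posterior probabilities:
pp = np.array([0.99, 0.95, 0.90, 0.60, 0.30])
print(expected_fdr(pp, 0.90))  # (0.01 + 0.05 + 0.10) / 3, about 0.053
```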

    In fact, this is a case of empirical Bayes hypothesis testing: see my post:

    and Guindon, Black, Rodrigo (2006). Molecular biology and evolution, 23(5), 919-926.


  3. Agreed on the frequentist properties. I am thinking of FUBAR (Murrell et al, MBE 2013), where our implementation does actually report posterior probs as the main result. However, in the paper we also mention that one can calculate BF (posterior odds ratio divided by prior odds ratio) - the advantage would be not being sensitive to the prior. So it may be preferable if one does not trust the prior (which is estimated from gene-wide data, the assumption being that other sites in the same gene are informative with respect to a given site of interest).

  4. OK, I can see your argument -- but I am not sure I agree with it.

    By using BFs instead of posterior probs, you lose the natural decision-theoretic interpretation of Bayesian inference, which relies exclusively on posterior probabilities, without gaining anything on the side of the frequentist properties.

    Personally, I would either use posterior probabilities, if I am ready to borrow strength across sites, or p-values, if I don't and want to stay within a type-I error paradigm.

    Somehow, BFs stand in the middle: no good frequentist properties, but no clear Bayesian interpretation either.

    But perhaps your practical experience with BFs is more positive than what I am suggesting here?