## Friday, October 3, 2014

### Model averaging versus model selection

Coming back to my last post on model selection using Bayes factors, in which I suggested that explicit Bayes factor evaluation is most often unnecessary: the model-averaging aspect is in fact more important than I suggested there. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.

I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.

In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc.) sorted by increasing dimensionality.

However, this paradigm is, I think, inadequate in a Bayesian framework. In fact, the model ‘auto-tuning’ example I mentioned in my last post suggests exactly the opposite: always use the most general model by default, the one that has the highest dimension, given that it will automatically include whichever simpler alternative may turn out to be adequate for the particular dataset under consideration.

Why this difference between the two frameworks? I think there are two (related) reasons:

(1) point estimation versus marginalization: maximum likelihood is an either/or method, in which you use either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really end up with two distinct, non-overlapping models, and you have to choose between them.

In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means choosing between a risky bet on a single horse (JTT) and a bet-hedging strategy that averages over all possible configurations of the unknown exchangeability parameters.

(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to overfitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning and therefore does not have overfitting problems.

Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.
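The two points can be illustrated on a toy problem that is far simpler than a substitution model. In this minimal sketch, a coin with unknown heads probability `theta` stands in for the exchangeability parameters, the fixed value `theta = 0.5` plays the role of the simple model, and the uniform Beta(1, 1) prior is an assumption made purely for the example. The function names are hypothetical.

```python
def ml_predictive(heads: int, tails: int) -> float:
    """Probability of heads on the next flip under the maximum likelihood estimate."""
    return heads / (heads + tails)


def bayes_predictive(heads: int, tails: int, a: float = 1.0, b: float = 1.0) -> float:
    """Probability of heads on the next flip, averaged over the
    Beta(a + heads, b + tails) posterior (Beta(a, b) prior assumed)."""
    return (heads + a) / (heads + tails + a + b)


# A tiny dataset: three flips, all heads.
heads, tails = 3, 0

simple = 0.5                                    # simple model: theta is fixed
ml_general = ml_predictive(heads, tails)        # general model, point estimate
bayes_general = bayes_predictive(heads, tails)  # general model, marginalized
```

With so little data, the point estimate of the general model declares tails impossible, whereas the marginalized prediction lands between the simple model and the point estimate, which is the bet-hedging behavior described in point (1).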

Now, averaging over models also means that we have to correctly define the measure over which to take this average: in other words, we have to correctly design our prior over the models, which is not necessarily something obvious to do. I have suggested some ways to do it properly in my last post (mostly, by having key hyperparameters that are able to drive the whole model configuration down to some remarkable submodels), but there is probably much more to say about this.
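As a rough sketch of what such a hyperparameter can look like, the same kind of coin-flip toy model can be given a symmetric Beta(kappa/2, kappa/2) prior on `theta`: large values of `kappa` squeeze the general model down onto the submodel `theta = 0.5`. The grid of `kappa` values, the uniform prior over that grid, and all names below are assumptions chosen for illustration; this is not the construction used in the earlier post.

```python
from math import exp, lgamma
from typing import List, Sequence


def log_beta(x: float, y: float) -> float:
    """Logarithm of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def log_marginal(heads: int, tails: int, kappa: float) -> float:
    """Log marginal likelihood of a flip sequence when
    theta ~ Beta(kappa / 2, kappa / 2)."""
    a = kappa / 2.0
    return log_beta(a + heads, a + tails) - log_beta(a, a)


def posterior_over_kappa(heads: int, tails: int, grid: Sequence[float]) -> List[float]:
    """Posterior over kappa on a grid, assuming a uniform prior over grid points."""
    logs = [log_marginal(heads, tails, k) for k in grid]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical grid: larger kappa pulls theta more strongly toward 0.5.
KAPPA_GRID = [2.0, 20.0, 200.0, 2000.0]

balanced = posterior_over_kappa(50, 50, KAPPA_GRID)  # data compatible with theta = 0.5
skewed = posterior_over_kappa(90, 10, KAPPA_GRID)    # data far from theta = 0.5
```

In this toy setting, balanced data put most of the posterior mass on the largest `kappa` (the model tunes itself toward the submodel), while skewed data push the mass to the smallest `kappa`, without any explicit Bayes factor being computed between the two models.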

In any case, what all this suggests is that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design — including the prior over models.

1. Nice post. I would add that, in practice, computational considerations matter when making these decisions. In many contexts we can think of model selection as a cost-saving approximation to model averaging, and the question of interest becomes where we can cut corners without sacrificing statistical performance.

2. I can see some contexts where model selection is perhaps cheaper than model averaging.

But in many, if not most, cases, what I see in fact is the opposite: people go through a laborious series of Bayes factors, whereas they could in fact use the most general model, thus implicitly and rapidly averaging over all submodels.

One could argue that the most general model is also the one that is computationally the most challenging. But doing the Bayes factor analysis implies that this model will have to be conditioned on the data anyway.

So, I am not sure that the cost-saving approximation is really the main reason.

3. We often do model selection in a frequentist framework, using a greedy step-up procedure with AIC as a stopping criterion. This means we never have to run the most general (complex) versions of the model.

An example would be where one is using an n-category general discrete distribution to approximate a continuous reality, with computation time depending on n. We start with n=1 and increment it until model comparison shows no further improvement.
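The step-up procedure described in this comment might be sketched as follows. The `fit` callback, which is assumed to return a maximized log-likelihood and a parameter count for a given n, and the toy log-likelihood values are hypothetical placeholders rather than output from any real analysis.

```python
from typing import Callable, List, Tuple


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_lik


def step_up_by_aic(fit: Callable[[int], Tuple[float, int]], max_n: int) -> int:
    """Start at n = 1 and increment n until AIC stops improving."""
    best_n = 1
    best_aic = aic(*fit(1))
    for n in range(2, max_n + 1):
        candidate = aic(*fit(n))
        if candidate >= best_aic:  # no further improvement: stop here
            break
        best_n, best_aic = n, candidate
    return best_n


# Hypothetical maximized log-likelihoods with diminishing returns in n.
TOY_LOG_LIK = {1: -120.0, 2: -110.0, 3: -108.5, 4: -108.2, 5: -108.1}
fitted: List[int] = []  # records which models were actually fitted


def toy_fit(n: int) -> Tuple[float, int]:
    fitted.append(n)
    return TOY_LOG_LIK[n], n  # assumes n free parameters, for illustration only


chosen = step_up_by_aic(toy_fit, max_n=5)
```

On these toy numbers the search stops after fitting n = 4 and keeps n = 3, so the most complex model (n = 5) is never run, which is the computational saving the comment points to.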