Coming back to my last post about model selection using Bayes factors, in which I suggested that explicit Bayes factor evaluation is most often unnecessary: the model-averaging aspect is in fact more important than I made it sound. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.
I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.
In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc.) sorted by increasing dimensionality.
(1) point estimation versus marginalization: maximum likelihood is an either/or method, in which you pick either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really have two distinct, non-overlapping models at the end, and you will have to choose between them.
In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means choosing between a risky bet on one single horse (JTT) and a bet-hedging strategy consisting of averaging over all possible configurations for the unknown exchangeability parameters.
(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to overfitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning, and therefore does not have overfitting problems.
Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.
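To make the two points concrete, here is a minimal sketch on a deliberately tiny stand-in problem. It is not a phylogenetic model: a coin with success probability theta plays the role of the exchangeability parameters, the fixed value theta = 0.5 plays the role of JTT, and leaving theta free plays the role of GTR. The data (6 successes in 10 trials), the uniform prior, and all names are assumptions made for illustration only.

```python
import math

def log_lik(theta, k, n):
    """Binomial log-likelihood (constant term dropped); needs 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

k, n = 6, 10  # hypothetical data: 6 successes in 10 trials

# Maximum likelihood: two distinct point estimates, and you must pick one.
theta_fixed = 0.5   # pre-set value (the "JTT-like" option)
theta_hat = k / n   # value fitted to the data (the "GTR-like" option)
ll_fixed = log_lik(theta_fixed, k, n)
ll_free = log_lik(theta_hat, k, n)  # can never be lower than ll_fixed

# Bayesian: a uniform Beta(1, 1) prior on theta gives a
# Beta(k + 1, n - k + 1) posterior. Its mean averages over every
# value of theta, including 0.5, instead of committing to one.
post_mean = (k + 1) / (n + 2)

print(f"ML log-likelihood, fixed theta: {ll_fixed:.3f}")
print(f"ML log-likelihood, free theta:  {ll_free:.3f}")
print(f"ML estimate: {theta_hat:.3f}, posterior mean: {post_mean:.3f}")
```

The fitted log-likelihood is always at least as high as the fixed one, which is why a maximum likelihood analysis needs an explicit test before accepting the extra parameter. The posterior mean, in this toy setting, sits between the fixed value and the fitted one. Note that the uniform prior used here happens to be centered on 0.5; that is a convenience of the sketch, not a general property.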
In any case, what all this suggests is that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design — including the prior over models.
Nice post. I would add that, in practice, computational considerations matter when making these decisions. In many contexts we can think of model selection as a cost-saving approximation to model averaging, and the question of interest becomes where we can cut corners without sacrificing statistical performance.
I can see some contexts where model selection is perhaps cheaper than model averaging.
But in many, if not most, cases, what I see in fact is the opposite: people go through a laborious series of Bayes factors, whereas they could in fact use the most general model, thus implicitly and rapidly averaging over all submodels.
One could argue that the most general model is also the one that is computationally the most challenging. But doing the Bayes factor analysis implies that this model will have to be conditioned on the data anyway.
So, I am not sure that the cost-saving approximation is really the main reason.
We often do model selection in a frequentist framework, using a greedy step-up procedure with AIC as a stopping criterion. This means we never have to run the most general (complex) versions of the model.
An example would be where one is using an n-category general discrete distribution to approximate a continuous reality, with computation time depending on n. We start with n=1 and increment it until model comparison shows no further improvement.
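The step-up procedure described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the commenter's actual setup: the "n-category discrete distribution" is taken to be an equal-width histogram density on [0, 1) with n - 1 free parameters, the data sets are made up, and the function names and the `max_n` cap are hypothetical.

```python
import math

def hist_log_lik(data, n):
    """Maximized log-likelihood of an n-bin equal-width histogram on [0, 1)."""
    counts = [0] * n
    for x in data:
        counts[min(int(x * n), n - 1)] += 1
    total = len(data)
    # Fitted density in a bin is count / (total * width), with width = 1 / n.
    return sum(c * math.log(c * n / total) for c in counts if c > 0)

def aic(data, n):
    """AIC = 2 * (number of free parameters) - 2 * log-likelihood."""
    return 2 * (n - 1) - 2 * hist_log_lik(data, n)

def step_up(data, max_n=50):
    """Greedy step-up: increase n until AIC stops improving."""
    n = 1
    best = aic(data, n)
    while n < max_n:
        candidate = aic(data, n + 1)
        if candidate >= best:  # no improvement: stop here
            break
        n, best = n + 1, candidate
    return n

# Made-up data: an evenly spread sample and a sample skewed toward 0.
flat = [(i + 0.5) / 100 for i in range(100)]
skewed = [x * x for x in flat]

print("categories chosen for flat data:  ", step_up(flat))
print("categories chosen for skewed data:", step_up(skewed))
```

Because the loop stops at the first n that fails to improve AIC, the larger and more expensive values of n are never fitted, which is the cost saving the comment refers to. The flip side of a greedy stop is that it can miss a better model further along the sequence.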