Coming back to my last post about model selection using Bayes factors, in which I argued that explicit Bayes factor evaluation is most often unnecessary: the model-averaging aspect is in fact more important than I suggested. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.
I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.
In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc.) sorted by increasing dimensionality.
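To make the sequential procedure concrete, here is a minimal sketch of such a test series, written as a simple linear chain of likelihood ratio tests. It is a simplification for illustration, not the procedure of any particular program: the function name `select_model`, the 5% threshold and the log-likelihood values in the usage example are all assumptions of mine, and the maximized log-likelihoods are taken as given rather than computed.

```python
# Free parameters per nucleotide substitution model (rate parameters plus
# free base frequencies), listed by increasing dimensionality.
N_PARAMS = {"JC69": 0, "F81": 3, "HKY85": 4, "TN93": 5, "GTR": 8}
ORDER = ["JC69", "F81", "HKY85", "TN93", "GTR"]

# 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def select_model(max_log_lik):
    """Walk up the nested series and keep the simplest model that is not
    rejected in favor of the next, more complex one (hypothetical helper)."""
    current = ORDER[0]
    for candidate in ORDER[1:]:
        df = N_PARAMS[candidate] - N_PARAMS[current]
        lrt = 2.0 * (max_log_lik[candidate] - max_log_lik[current])
        if lrt <= CHI2_CRIT_05[df]:
            break  # the extra parameters are not justified: stop here
        current = candidate
    return current


# Made-up maximized log-likelihoods, for illustration only.
example = {"JC69": -5210.4, "F81": -5150.2, "HKY85": -5101.7,
           "TN93": -5101.1, "GTR": -5099.9}
print(select_model(example))
```

With these made-up numbers, the series stops at HKY85: moving to TN93 improves the log-likelihood by only 0.6 units, which is not enough to justify one additional parameter at the 5% level.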
Two differences between the two frameworks are at play here.
(1) point estimation versus marginalization: maximum likelihood is an either/or method. You use either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really have two distinct, non-overlapping models at the end, and you will have to choose between them.
In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means choosing between a risky bet on one single horse (JTT) and a bet-hedging strategy that averages over all possible configurations for the unknown exchangeability parameters.
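The contrast between betting on one value and averaging over all of them can be illustrated with a toy analogy, far simpler than a substitution model. In the sketch below, a single coin-flip probability `theta` stands in for the exchangeability parameters, the pre-set value 0.5 plays the role of JTT, and a free `theta` plays the role of GTR. The function names, the uniform prior and the grid approximation are all assumptions made for the sake of the example.

```python
def likelihood(theta, heads, tails):
    """Probability of one particular sequence of flips given theta."""
    return theta ** heads * (1.0 - theta) ** tails


def predictive_fixed(theta0=0.5):
    # "JTT-like": a single pre-set parameter value, whatever the data say.
    return theta0


def predictive_ml(heads, tails):
    # "GTR by maximum likelihood": a single point estimate of theta.
    return heads / (heads + tails)


def predictive_bayes(heads, tails, grid_size=10001):
    # "GTR, Bayesian": average the prediction over a grid of theta values,
    # each weighted by its posterior probability. With a uniform prior,
    # the posterior is proportional to the likelihood.
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [likelihood(t, heads, tails) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


heads, tails = 7, 3
print(predictive_fixed(), predictive_ml(heads, tails), predictive_bayes(heads, tails))
```

For 7 heads and 3 tails, the point estimate predicts heads with probability 0.7, whereas the grid average comes out close to 8/12, the closed-form value (heads + 1) / (heads + tails + 2) under a uniform prior. The value 0.5 is one of the grid points contributing to that average, which is the sense in which the general model does not exclude the simple one.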
(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to overfitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning, and therefore does not have overfitting problems.
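One way to see the difference between optimizing and conditioning is to compare a maximized likelihood with a likelihood averaged over the prior, on a toy problem. The sketch below uses coin flips, with a simple model that fixes the probability of heads at 0.5 and a general model that leaves it free. The function names and the uniform prior are assumptions of this example, not part of any phylogenetic method.

```python
from math import factorial


def likelihood_simple(heads, tails):
    # Simple model: theta is fixed at 0.5, nothing to fit.
    return 0.5 ** (heads + tails)


def max_likelihood_general(heads, tails):
    # General model, optimized: plug in the best-fitting theta.
    theta_hat = heads / (heads + tails)
    return theta_hat ** heads * (1.0 - theta_hat) ** tails


def marginal_likelihood_general(heads, tails):
    # General model, averaged over a uniform prior on theta:
    # integral of theta^h (1 - theta)^t over [0, 1] = h! t! / (h + t + 1)!
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)


for heads, tails in [(6, 4), (10, 0)]:
    print(heads, tails,
          likelihood_simple(heads, tails),
          max_likelihood_general(heads, tails),
          marginal_likelihood_general(heads, tails))
```

The maximized likelihood of the general model can never be lower than that of the simple model, since 0.5 is among the values it optimizes over. The averaged likelihood behaves differently: for 6 heads and 4 tails it is lower than the likelihood of the simple model, and for 10 heads out of 10 it is much higher. The extra parameter is rewarded only when the data call for it.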
Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.
In any case, all this suggests that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design, including the prior over models.