Saturday, December 28, 2013

Maximum marginal likelihood

In my last post, I mentioned the possibility of comparing alternative diversification models, conceived of as alternative priors on divergence times in a molecular dating context, using Bayes factors.

As an alternative to Bayes factors, one could integrate the likelihood over rates and times and maximize the resulting integrated (marginal) likelihood as a function of the diversification parameters (e.g. speciation and extinction rates, let us call them $\theta$). We can then use standard likelihood ratio tests to compare diversification models. Numerically maximizing and calculating the likelihood is a bit tricky, but this is just a technical problem, not a conceptual one.

Finally, for a given diversification model, and once an estimate $\hat \theta$ of its parameters has been obtained by maximum likelihood, divergence times can be inferred based on the plug-in posterior distribution, i.e. the posterior distribution on divergence times obtained by fixing the diversification parameters at $\hat \theta$.

This type of approach is sometimes called maximum marginal likelihood, or ML-II in the literature (Berger, 1985). It may also be called empirical Bayes, thus emphasizing the fact that the estimation of divergence times is still Bayesian in spirit, although now based on an empirically determined plug-in prior. However, for me, "empirical Bayes" has a more general meaning -- after all, the prior on divergence times is also empirically determined in the fully Bayesian approach. For that reason, I prefer the "maximum marginal likelihood" terminology.

Maximum marginal likelihood or full Bayes, which is best ? I am not really sure.

Conceptually speaking, the maximum marginal likelihood approach does not invoke any second stage prior on $\theta$, which can be seen as an advantage if one is not too sure about the choice of the prior nor about its philosophical meaning (subjective? uninformative?, etc). Maximum marginal likelihood may also be better for comparing models, as Bayes factors tend to be sensitive to the prior.

Computationally speaking, on the other hand, I find it easier to put a prior on $\theta$, because this makes the MCMC easier to implement.

Finally, there is a classical argument saying that the fully Bayesian approach has the advantage of integrating uncertainty about $\theta$. However, this again depends on the second-stage prior, a point that certainly requires further discussion, but that's for another day.

---

As for now, we can draw an interesting parallel here. Indeed, all this looks very much like another likelihood approach used in population genetics for estimating the scaled mutation rate (let's call it again $\theta = 4 N_e u$) using intra-specific sequence data (Kuhner et al, 1995). In that case, the sequence data are assumed to be generated by a mutation process along a genealogy, which is itself described by Kingman's coalescent. The probability of the sequence data given the genealogy is integrated over the coalescent and then maximized as a function of the scaled mutation rate $\theta$ (all this by Monte Carlo).

Here also, once $\theta$ has been estimated, one can infer the age of the ancestor of the sample by calculating the plug-in posterior distribution over this age (Thomson et al, 2000). Here also, instead of maximizing with respect to $\theta$, one can decide to put a second-stage prior on $\theta$ and then sample from the joint posterior over $\theta$ and the coalescence times (this is what is done in Beast, for instance, Drummond and Rambaut, 2007).

Altogether, there is a very close parallel between the two situations:
species / individuals
substitutions / mutations
phylogeny / genealogy
prior on divergence times / coalescent
diversification parameters / scaled mutation rate
divergence time of the last common ancestor / age of the ancestor of the sample.

The hierarchical structure of the two models is virtually the same. In both cases, we can either maximize the marginal likelihood with respect to $\theta$, or put a second-stage prior on $\theta$ and sample from the joint posterior over $\theta$ and divergence or coalescence times.

I find this interesting because the two situations are so similar, and yet we tend to consider them as essentially different. In particular, we call the birth-death process a prior on divergence times, and we consider divergence times as parameters of the model. On the other hand, coalescence times are not usually considered as parameters, and I doubt that we would call the coalescent a prior.

I do not think that the status of the intermediate level of the model (the distribution over divergence or coalescence times) should depend on wether we use a maximum likelihood or a Bayesian method for estimating the upper level of the model (the parameter $\theta$). Instead, I believe that these distinctions are merely cultural. They are just a historical accident: one of the two methods is originally a Bayesian approach, which could be modified and turned into a maximum marginal likelihood procedure. The other is a frequentist maximum likelihood method that has only secondarily been endowed with a second-stage prior and implemented in a Bayesian software program.

In any case, all this shows that, in everyday life, there is much more overlap between Bayesian inference and maximum likelihood than the traditional vision in terms of a fundamental opposition between frequentists versus subjectivists would seem to suggest. At the end, the differences are rather subtle.

Finally, this little comparison allows me to anticipate on another problem that will be the subject of a future post: how do you determine the number of parameters of a model?

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (1985 ed.). New-York: Springer-Verlag.

Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7, 214.

Kuhner, M. K., Yamato, J., & Felsenstein, J. (1995). Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 140(4), 1421–1430.

Thomson, R., Pritchard, J. K., Shen, P., Oefner, P. J., & Feldman, M. W. (2000). Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proceedings of the National Academy of Sciences of the United States of America, 97(13), 7360–7365.