## Thursday, November 6, 2014

### Should we condition on non-extinction?

Diversification studies are concerned with the problem of testing models of species diversification and estimating their parameters (such as speciation and extinction rates) based on time-calibrated phylogenies. One particular question that regularly comes up in this context is whether and how we should take into account, in the inference procedure, the fact that, by construction, we only consider non-extinct groups.

A good sign that this question is not so obvious is that many different solutions seem to be considered in practice: likelihood functions are sometimes conditional on non-extinction (Stadler et al, 2013), sometimes conditional on the age of the last common ancestor of the group (Nee et al, 1994), sometimes not conditioned at all (Maddison et al, 2007, FitzJohn, 2010), or even conditioned on the number of species sampled today (in the context of molecular dating, Rannala and Yang, 1996). So, clearly, the question requires some clarification.

As a simple illustrating case, let us assume a diversification process with constant but unknown rates of speciation ($\lambda$) and extinction ($\mu$), and let us call $\theta = (\lambda, \mu)$ our parameter vector. Technically, we also have a third parameter, $t_0$, the time of origin of the process, but let us leave this detail aside for the moment (it does not really change the main argument).

A first attempt at deriving a likelihood for this problem is simply to consider the probability of producing the observed tree $T$, conditional on the parameter $\theta$:

(1)    $p(T \mid \theta)$.

One problem, however, is that there is a positive probability that the process gets extinct and therefore returns an empty tree:

$p(T = \emptyset \mid \theta) = 1 - p(S \mid \theta) > 0$,

where $p(S \mid \theta)$ is the probability of ultimate survival of the group, given the parameters. Yet, by the very design of the experiment, the only cases that we consider in practice are non-empty trees, and we would like our likelihood to sum to one over all possible observable configurations for the data — thus, over all possible non-empty trees.

We can avoid this ‘missing mass’ problem and restore a correct normalization of the probability by just conditioning on ultimate survival:

(2)    $p(T \mid \theta, S) = \frac {p(T \mid \theta)} {p(S \mid \theta)}$.

We could then take this conditional probability as our likelihood and proceed to estimation, for instance, by maximizing this conditional likelihood with respect to $\theta$.

One important thing to note here is that, since the unconditional likelihood (1) accounts for the probability of surviving, while the conditional likelihood (2) factors this effect out of the probability, then, everything else being equal, parameter configurations that tend to increase the chances of ultimate survival will be favored by the unconditional likelihood (1), but not by the conditional likelihood (2). In particular, the probability of survival is strongly dependent on the net expected growth rate of the species group $r = \lambda - \mu$, with larger values of $r$ (higher growth rates) implying a higher chance of ultimate survival. Thus, if you compare your estimates returned by each of the two versions of the likelihood, you will in general obtain a larger value for $r$ under (1) than under (2).

So, which likelihood should we use? At first sight, the normalization argument suggested above would seem to imply that we should of course use the conditional version (equation 2). But is that so simple?

I am not sure. Let us try to derive an algorithmic model of our problem. Fundamentally, what our conditional likelihood means, in algorithmic terms, is this:

Nature chooses a fixed parameter configuration $\theta = (\lambda,\mu)$
repeat
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

First, note that the repeat loop here refers to a true, objective, repetition of the generating process: we use a conditional likelihood precisely because we consider that a typical instance of the experiment potentially involves several repeated attempts until a surviving species clade is produced. In practice, we can imagine that Nature has 'run' many clades in parallel, only some of which did not suffer from premature extinction — and the one we are considering is randomly chosen from this surviving subset. But since we assume that all the runs are independent, it is mathematically equivalent to imagine, as suggested by the 'repeat' loop above, that Nature runs the process repeatedly until the first surviving clade is produced — and we can then identify this first successful draw with our observation.

But then, it seems that we are assuming that Nature has repeatedly, stubbornly, tried under the same parameter configuration, until succeeding. However, I am not sure that this is really a good model of what actually happens.

Imagine for instance that we are more specifically interested in one particular order of mammals; say, we want to estimate the speciation and extinction rates of Carnivores. Carnivores are just one among ~25 mammalian orders. There is no doubt that there is quite some variation in diversification rates among mammalian orders: think about rodents versus primates. But then, if there is variation in diversification rates among surviving mammalian clades, there has surely been variation, more generally, among all ‘replicates’ that Nature has run in parallel, surviving or not.

So, perhaps our model should instead be something like the following:

Nature chooses a fixed distribution $\phi$ over the parameter $\theta = (\lambda, \mu)$
repeat
- Nature chooses a random $\theta = (\lambda,\mu)$ from $\phi$
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

Importantly, $\phi$ is not really a Bayesian prior. Instead, it is meant to be a representation of an objective property of the process: the true variation among clades, both extant and extinct.

Notice also that this new model introduces potential selection effects on the parameter configurations that you end up observing: those clades that survive are statistically enriched in values of $\theta$ that are less likely to lead to premature extinction — in particular, enriched in values of $\lambda$ and $\mu$ such that $r = \lambda - \mu$ is larger.

If we now try to write down the probability represented by this model, we first note that $\lambda$ and $\mu$ are now intermediate variables in the algorithm, thus they should be integrated out. In addition, we have a repeat-until loop, therefore, we condition on the exit clause -- which is also averaged over $\lambda$ and $\mu$. Altogether, the model leads to the following probability of producing our tree:

(3)    $p (T \mid \phi, S) = \frac{p(T \mid \phi)}{p(S \mid \phi)} = \frac{\int p(T \mid \theta) \phi(\theta) d \theta}{\int p(S \mid \theta) \phi(\theta) d \theta}$.

This probability is still normalized: it sums to one over all possible non-empty trees, although now, for fixed $\phi$. Note that this probability is not anymore a function of $\lambda$ and $\mu$, which have been integrated away, since they are now random effects. In other words, under this model, there is just no meaningful likelihood that would be a function of our parameters of interest.

If we knew $\phi$, we could still obtain a point estimate of $\theta= (\lambda, \mu)$, by taking the maximum a posteriori (MAP): that is, by calculating the value of $\theta$ maximizing the integrand of the numerator of equation 3:

(4)    $p(T \mid \theta) \phi(\theta)$.

But we generally do not know $\phi$, and without any further assumption, we cannot hope to estimate this distribution just based on one single tree, however non-empty. Still, we can at least consider the following two limiting cases:

At one extreme, $\phi$ might be highly peaked around some unknown value $\theta_0 = (\lambda_0, \mu_0)$. In the limit of a very sharp peak, $\phi$ is nearly a point mass (a ‘Dirac’) at $\theta_0$, and equation (3) then reduces to:

$p(T \mid \theta_0, S) = \frac{p(T \mid \theta_0)}{p(S \mid \theta_0)}$,

which we can maximize w.r.t. $\theta_0$.

This probability is identical to our initial conditional likelihood (equation 2) — which makes sense, since, in that regime, the process is indeed repeatedly running under a (nearly) fixed parameter configuration $\theta_0$, until producing a non-empty tree. This likelihood does not include any selection bias in favor of groups with higher growth rates — which also makes sense, since there is no meaningful variation on which this selection bias could act.

At the other extreme, $\phi$ could be a broad distribution, such that $\phi(\theta)$ is locally constant in the vicinity of the true parameter value. In that case, our posterior probability (equation 4) becomes proportional to

$p(T \mid \theta)$,

and the MAP estimate therefore reduces to the estimate obtained by maximizing the unconditional likelihood. As already mentioned above, this unconditional likelihood includes a bonus for higher growth rates $r$ — which makes sense, since there is now sufficient variation among clades to indeed experience a species selection effect on $\theta$ when considering only surviving groups. This unconditional likelihood is not normalized over observable data, but this is not a problem: this is just because, in fact, this is not really a likelihood in the present context — it is, up to a proportionality constant, a limiting case of a posterior probability.

So, what all this means is the following: if you use a conditional likelihood, this is because you believe that there is actual repetition of the stochastic process (there is a repeat-until loop in your algorithmic model of the process). Now, repetition opens the door to the possibility of variation in the value of the parameters across instances. If you also have good reasons to suspect that this variation is in fact substantial (if there is sufficient available variation in diversification rates among species clades, extant and extinct), then, apparently, you should not condition your likelihood on ultimate survival. You should not, because by doing so, you would factor out from your estimation the selection bias that is in fact induced on the diversification parameters by differential survival of the groups. Conversely, if you believe that variation in diversification rates is limiting, then you should use a conditional likelihood.

In practice, the difference between the two estimates is probably very small anyway. Nevertheless, regardless of the practical impact, it is interesting to contemplate the purely theoretical and philosophical implications of this problem. In particular, one can see here on a specific example that being a good Frequentist does not necessarily imply that you can always consider your parameter of interest as fixed. Sometimes, the design of the problem implies that you have to consider it instead as a random quantity. Or, to put it differently, insisting on using a conditional likelihood with a fixed parameter in the present case cannot be justified purely on the grounds of some preference for one statistical paradigm — it also seems to entail a specific hypothesis about the underlying objective process (no available variation in diversification rates among clades).

Conversely, the entire derivation done here is not, strictly speaking, Bayesian either: I did not invoke any subjective prior, nor any distribution that would not have an objective meaning in terms of the underlying macroevolutionary process. The MAP estimate derived above could in fact be considered as a frequentist MAP. But after all, there are other situations where Frequentists sometimes use MAP estimation: for instance, in population genetics, when you want to estimate the age of the last common ancestor of your sample using genetic sequence data (e.g. Thomson et al, 2000): there, you consider your genealogy, not as a fixed parameter, but as a random draw from Kingman’s coalescent. Therefore, you end up using a posterior distribution, and not a likelihood.

===

Fitzjohn, R. G. (2010). Quantitative traits and diversification. Systematic Biology, 59(6), 619–633. doi:10.1093/sysbio/syq053

Maddison, W., Midford, P., & Otto, S. P. (2007). Estimating a binary character's effect on speciation and extinction. Syst Biol, 56(5), 701–710. doi:10.1080/10635150701607033

Nee, S., May, R. M., & Harvey, P. H. (1994). The Reconstructed Evolutionary Process. Philosophical Transactions of the Royal Society B: Biological Sciences, 344(1309), 305–311. doi:10.1098/rstb.1994.0068

Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution, 43(3), 304–311.

Stadler, T., Kühnert, D., Bonhoeffer, S., & Drummond, A. J. (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1), 228–233. doi:10.1073/pnas.1207965110

Thomson, R., Pritchard, J. K., Shen, P., Oefner, P. J., & Feldman, M. W. (2000). Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proceedings of the National Academy of Sciences of the United States of America, 97(13), 7360–7365. doi:10.1073/pnas.97.13.6927