## Thursday, November 13, 2014

### Bayesian diversification models (2)

Concerning my last post, I realize that my notation for rejection sampling models is a bit elliptic. In particular, the rejection step corresponding to the fact that we condition on the observed data is never explicitly written. So, let me just reformulate the technical arguments and restate the main idea.

In the general case, i.e. for any Bayesian model, the rejection sampling algorithm could be written as follows:

REPEAT
sample $\theta$ from prior
simulate data given $\theta$
UNTIL simulated data match observed data $D$
return $\theta$

What this algorithm returns is one single value of $\theta$ sampled from the posterior distribution:

$p(\theta \mid D) \propto p(\theta) p(D \mid \theta)$.

Of course, you can imagine that one would typically run this algorithm many times and then use the resulting sample of independent draws from the posterior to estimate posterior mean, median, credibility intervals, etc.

Note that I use upper case letters for the REPEAT-UNTIL loop corresponding to the conditioning on empirical data -- to distinguish it from other repeat-until loops that might also be invoked, just because of the data acquisition protocol.

Now, in the case of a diversification model, we have to account for the fact that we observe only surviving species clades. In addition, we imagine that there is some natural variation among clades in diversification rates, represented by a distribution $\phi$ on the diversification parameter $\theta = (\lambda, \mu)$, such that, each clade produced by Nature would have its speciation and extinction rates independently sampled from $\phi$.

Thus, altogether, our complete rejection sampling would look like:

REPEAT
repeat
Nature chooses $\theta$ from $\phi$
Nature runs diversification process given $\theta$
until process survives
UNTIL simulated phylogeny matches observed phylogeny $T$
return $\theta$

We now have two repeat-until loops: the inner one represents the fact that, by construction, we only consider non extinct groups. The outer one represents our conditioning on the observed phylogeny $T$. However, this could just as well be written as:

REPEAT
Nature chooses $\theta$ from $\phi$
Nature runs diversification process given $\theta$
UNTIL simulated phylogeny matches observed phylogeny
return $\theta$

simply because an empty tree will never match our observed phylogeny anyway. Thus, we end up with a value of $\theta$ sampled from a very simple posterior distribution, one that does not invoke any conditional likelihood (this was my equation 1 in the previous post):

(1)    $p(\theta \mid T) \propto \phi(\theta) \, p(T \mid \theta)$

This posterior distribution combines a prior not conditional on survival $\phi(\theta)$, with a likelihood also not conditional on survival $p(T \mid \theta)$.

Now, you can always rewrite this as:

$p(\theta \mid T) \propto \phi(\theta) p(S \mid \theta) \, p(T \mid \theta) / p(S \mid \theta)$

or, equivalently:

(2)   $p(\theta \mid T) \propto p(\theta \mid S) \, p(T \mid \theta, S)$

where $p(S \mid \theta)$ is the probability of survival of the clade. Thus, we now have a prior conditional on survival together with a likelihood also conditional on survival. Equation 1 and 2 are equivalent, and you can choose between them, fundamentally, based on which prior information you have (fossils: equation 1, other neontological studies: equation 2).

But in any case, you never get, under this algorithmic model, an unconditional prior combined with a conditional likelihood. The only way to get this combination is by considering another model:

REPEAT
choose $\theta$ from some prior $p(\theta)$
repeat
Nature runs diversification process given $\theta$
until process survives
UNTIL simulated phylogeny matches observed phylogeny

which basically corresponds to the case where Nature has produced many clades, all under the exact same diversification rates (same parameter vector $\theta$), and we have randomly sampled one among the surviving clades thus produced. Here, we still draw $\theta$ from some prior distribution $p(\theta)$, but this distribution now merely represents our prior uncertainty about the fixed value of $\theta$ 'chosen by Nature' -- not the natural variation of $\theta$ among groups. So, here, we cannot collapse the two repeat-until loops, we have to keep the two: an internal loop corresponding to the actual 'repeated natural experiment', under fixed but unknown $\theta$, and an external loop simply representing our Bayesian inference about this unknown value of $\theta$, by rejection sampling, based on some subjective prior $p(\theta)$.

Personally, I consider this second model less convincing than the first one: we do see in practice quite some variation among closely related groups, so, we should invoke some distribution $\phi$ representing this variation.

Of course, ultimately, none of the two models is really convincing: as soon as you acknowledge the existence of variation in diversification rates between clades, then you should also expect variation within clades: if diversification rates differ between rodents and primates, then, they almost certainly vary among primates, or among rodents. Mathematically, $\theta$ should itself be the result of a stochastic process along the lineages. In fact, some interesting work has already been done along those lines, using a compound Poisson process (Rabosky, 2014).

===

Rabosky, D. L. (2014). Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9(2), e89543. doi:10.1371/journal.pone.0089543

#### 1 comment:

1. Hi Nicolas,

Thanks for your new post. I think I get your point. In the previous one, I understood that you was dealing with the last algorithmic model above. I feel this last one is more natural (i.e. considering $p(\theta|S)p(T|\theta)$). The condition on $\theta$ is for taking into account the fact that you are sampling among phylogenies which do survive. The condition on $T$ is for getting the Bayesian probability.
From a more practical point of view, I tried several kind of conditionings to (maximum likelihood) estimate the diversification rates on artificial evolution. I remember that results were quite disappointing (i.e. the most natural conditioning was not the best).

Cheers,

Gilles