Friday, October 31, 2014

Algorithmic models of probabilistic inference

Rejection sampling is clearly not the most efficient Monte Carlo estimator. Even so, it is a conceptually beautiful idea: fundamentally, rejection sampling represents an algorithmic model of Bayes theorem.

The algorithm works as follows. Given some observed data $d$ and a model parameterized by $\theta$:
- randomly choose parameter $\theta$ from the prior 
- simulate data $D$, using $\theta$ as your parameter vector
- reject if $D \neq d$, otherwise, keep sample aside
Repeat until exhaustion.

Rejection sampling represents an operational definition of what we mean by conditioning a distribution on the data: it identifies the posterior distribution with what you get by filtering out, from the random draws from the prior distribution, those parameter configurations that did not reproduce what you observed. We can get a quantitative derivation of this posterior by just reading out our algorithm: the probability of getting a particular parameter configuration $\theta$ in our final sample is proportional to the probability of first drawing this configuration from the prior, $p(\theta)$, multiplied by the probability of accepting it  which is just the probability that we simulate a dataset exactly equal to $d$, $p(d \mid \theta)$. Therefore, my final sample is distributed according to:

$p(\theta \mid d) \propto p(\theta) p(d \mid \theta)$.

Thus, rejection sampling gives us a simple intuitive 'proof' of Bayes theorem.

Personally, I find the idea of rejection sampling not just useful for explaining Bayesian inference in the context of an introductory course. I also use it regularly, reformulating my most complex problems of the moment in terms of this algorithmic representation, just to clarify my ideas. As I will try to show through some examples in later posts, rejection sampling and, more generally, algorithmic models of probabilistic settings, can sometimes be very helpful for clarifying the meaning our models and sorting out the calculations.

But first, a few more preliminaries, in the form of three little examples of increasing complexity, illustrating how this works, and what this fundamentally means, in practice. Example 1 and 2 are rather trivial, but they are interesting in that they represent two different epistemological meanings that Bayes theorem can have in practical situations. They also set the stage for example 3, which deals with conditional likelihoods.

Example 1. Imagine we want to infer the allele frequencies at a neutral bi-allelic locus, based on a random sample of chromosomes from a given population. Say we have sampled $n=5$ chromosomes, of which $k=3$ had allele 1 and $n-k=2$ allele 0, and we want to infer $\pi$, the frequency of allele 1 in the population. Of course, we could just take $k/n$ as a quick estimate, but if $n$ is small, there will be quite some uncertainty, which we also want to quantify.

If we are dealing with a randomly mating species, with known scaled mutation rate $\theta = 4 N_e u$, we know that the frequency distribution of neutral alleles in the population is a Beta distribution of parameter $\theta$ (figure below, thin line). We can use this distribution as our prior on $\pi$ and use the rejection sampling algorithm:

- randomly choose $\pi$ from prior 
- simulate $n$ chromosomes, each with prob. $\pi$ of being of type 1.
- count number $K$ of chromosomes carrying allele 1.
- reject if $K \neq k$.

Or, more compactly:

- randomly choose $\pi$ from prior 
- simulate $K \sim$ Binomial($\pi,n)$
- reject if $K \neq k$.
Reading out this algorithm leads to the following posterior distribution (figure, thick line):

$p(\pi \mid k) \propto p(\pi) p(k \mid \pi)$.

This is a rather trivial case, although one for which the inferential semantics of the algorithm is particularly compelling: our locus of interest is just any neutral locus for which $k$ out of $n$ randomly chosen chromosomes happen to have allele 1. Therefore, simulating a large number $P$ of random instances of neutral loci, each with its own population frequency $\pi$, reproducing the data-acquistion protocol on each of them and then selecting only those instances that match our observation (conditioning step), gives us a typical set, of which our true observed instance is then supposed to be just an additional random member. In particular, if we could sort in ascending order the $P+1$ values of $\pi$ (the $P$ that we have simulated + the true one), then, since our unknown $\pi$ is indistinguishable from simulated replicates, it should have an equal chance of being the first, or the second, or more generally, of having rank $p$ for any $p=1..P+1$. Therefore, the unknown $\pi$ has equal chance of being contained within any one among the successive $P+1$ intervals of our simulated series (including the two open-ended intervals at both ends): this is the epistemological meaning of a posterior distribution. 

Example 2. As another slightly less trivial problem, suppose now that we have genotyped, not just one, but $M$, unlinked neutral bi-allelic loci (or sites) in our $n$ individuals. We now we want to infer $\theta = 4 N_e u$, based on the allele counts $k_i$, $i=1..M$, across loci. Let us call $k$ the vector of observations. We have prior information about $\theta$, summarized in some prior distribution $p(\theta)$. In that case, the algorithm would run as follows:

randomly choose $\theta$ from $p(\theta)$

for site i=1..M
- randomly choose $\pi_i$ from $p(\pi_i \mid \theta)$
- simulate $K_i \sim$ Binomial$(\pi_i, n)$

reject unless $K_i = k_i$ for all $i=1..M$.

Here, the $\pi_i$’s are now just temporary random variables. The algorithm can therefore be written more compactly as:

randomly choose $\theta$ from $p(\theta)$
for i=1..M
- randomly simulate $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) d \pi_i$
reject unless $K_i = k_i$ for all $i=1..M$.

where we have averaged out over all possible ways we could have sampled the $\pi_i$'s. The 'for' loop leads to a product, which gives me the following posterior distribution:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta)$.

Still fairly simple. One little detail, however, concerning the epistemological meaning of the whole procedure in this new case: thus far, the formulation of the algorithm has been slightly incomplete. When I say ‘choose’, perhaps I should be more specific about who exactly ‘chooses’ here. In the case of problem 1:

Nature chooses $\pi$ from the neutral allele frequency distribution $p(\pi \mid \theta)$ 
Nature simulates $K \sim p(K \mid \pi)$
I observe and reject if $K \neq k$.

The fact that random draws are entirely under Nature's control is precisely why inference by conditioning is so compelling in that case. But for problem (2), it is a bit different:

I randomly choose $\theta$ from my prior $p(\theta)$ and imagine it is the true $\theta$

for i=1..M
- Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
- Nature simulates $K_i \sim p(K_i \mid \pi_i)$

reject unless $K_i = k_i$ for all $i=1..M$.

This (somewhat less compelling) application of rejection sampling is more specifically what we call Bayesian inference: happily mixing subjective priors ('I choose') with distributions that are meant to represent objective processes ('Nature chooses'). Epistemologically, what we are now doing is this: we are filtering out, from our prior beliefs about $\theta$, those possible values that don't easily reproduce the observed data — which still makes sense, although this assumes that we are comfortable with translating beliefs into probability distributions and vice versa. I won’t discuss here the philosophical implications of this methodological choice. I will in due time, but for the moment, I just wanted to be as accurate (and as honest...) as possible in my formulation. Making this distinction beween the subjective and the objective components of the model can be useful in less trivial contexts.

Examples 3. Now, a more complicated problem: I still want to estimate $\theta = 4 N_e u$ based on $M$ unlinked loci. But the crucial difference with the previous example is that I have removed all the constant sites of my dataset, so as to consider only segregating sites: that is, only loci that happen to show at least one variant in my sample of $n$ individuals. So, by design, for all $i=1..M$, one has $0 < k_i < n$ (whereas in the previous case, one had instead that $0 \le k_i \le n$). Let us denote this additional constraint by $S$: the sample is entirely made of segregating sites. For $i=1..M$, $S_i$ stands for the constraint that site $i$ is segregating.

I should really take into account this new mode of data acquisition in my inference. If I do not, I will greatly overestimate $\theta$ (since it would take very high values of $\theta$ to make it likely that $M$ randomly chosen neutral sites were all segregating). Thus, I clearly need to work out the probabilistic calculations for my constrained model. 

Since sites are unlinked and, therefore, independent, collecting $M$ segregating sites among a large collection contanining a majority of non-segregating sites is just like letting Nature ‘run’ the process, site by site, intercepting only those that turn out to be segregating, until obtaining a total number of $M$ accepted sites (and then I decide whether or not to reject the whole sample). Which leads to the following algorithmic representation:
I randomly choose $\theta$ from $p(\theta)$ and imagine it is the true $\theta$

for i=1..M
  - Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
  - Nature randomly simulates $K_i \sim p(K_i \mid \pi_i)$
  until $1 < K_i < n$

I reject unless $K_i = k_i$ for all $i=1..M$.

Note that there are now two levels of rejection: one that is ‘for free’ (in the repeat-until loop), and one that is not (the final step). By ‘free’, I mean: not accounted for in our total acceptance rate -- and, therefore, not accounted for in our total probability.

Now, if we want to write down the posterior, as above, we can first eliminate the $\pi_i$'s:

Randomly choose $\theta$ from $p(\theta)$

for i=1..M
  - Nature randomly simulates $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) d \pi_i$
  until $1 < K_i < n$

Reject unless $K_i = k_i$ for all $i=1..M$.

Then, for a given $\theta$ and a given $i$, carefully interpreting the repeat-until loop leads to the following proability for $k_i$, which is now conditioned on the constraint $S_i$ being fulfilled:

(1) $p(k_i \mid \theta, S_i) = p(k_i \mid \theta) / p(S_i \mid \theta)$.

This conditional likelihood is defined only over $k_i = 1..n-1$ (now excluding 0 and n, since I won't exit the repeat-until loop as long as these values are being produced) and has been suitably normalized, by dividing it by the probability of getting a value for $k_i$ in the allowed interval (i.e. by indeed getting a segregating site given $\theta$):

$p(S_i \mid \theta) = \sum_{l=1}^{n-1} p(k_i \mid \theta)$.

Then, according to the algorithm, I just need to take the product over the $M$ sites (the 'for' loop) and combine this with my prior:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta, S_i)$,

or, for short:

(2)    $p(\theta \mid k) \propto p(\theta) p(k \mid \theta, S)$.

This particular problem illustrates one general rule:

filtering the data $\implies$ repeat-until loops $\implies$ conditional likelihoods.

However, it is also interesting for another reason: there is no simple and safe way we can have $S$ appear at the correct place, in equation 2, by just blindly manipulating symbolic equations, as we often do in applied Bayesian inference, shuffling symbols around, sometimes to the left, sometimes to the right of the conditioning sign. We could try to write, for instance:

$p(\theta \mid k, S) \propto p(\theta \mid S) p(k \mid \theta, S)$,

but this looks different from equation 2. We could then be tempted to expand the first factor:

$p(\theta \mid S) \propto p(\theta) p(S \mid \theta)$

and combine it with the expanded form of the conditional likelihood (equation 1), leading to:

$p(\theta \mid k, S) \propto p(\theta) p(S \mid \theta) p(k \mid \theta) / p(S \mid \theta) \propto p(\theta) p(k \mid \theta)$,

but then, we are back to the simple posterior of problem 2, which is obviously wrong. One main problem here is that we have an ambiguity in the meaning of $p(\theta \mid k, S)$: usually, this would mean ‘the probability of $\theta$, given the observed data $k$ and given that $S$ is true’. However, what we really want in the present case is: ‘the probability of $\theta$, given the observed data $k$ and given that my data acquisition protocol is constrained to deliver something compatible with $S$’, which is different. Symbolic notation in terms of '$p(A \mid B)$' is just too sloppy. 

We could try to come up with better notations and still play the game of automatic symbolic calculus. However, when facing complex problems, perhaps it is just more helpful, and more insightful, to directly rely on a semantic model of our probabilities. Rejection sampling, and more generally algorithmic models of our sampling protocol, provide such a semantic model. As a result, they represent a relatively foolproof method for deriving the correct version of our posterior distribution (or of our likelihood): we just need to spell out each of the probabilities, such as operationally defined by the algorithm, which gives us the solution.

P.S. (Nov 1st):  In hindsight, I realize that there is something fallacious in this final argument. You can have a good semantic model and good notations. Here, I should perhaps be more careful and denote the probability that $A$ is true given that $B$ is true and given that $C$ will always be fulfilled owing to the data acquisition protocol, by $p(A \mid B;C)$. A correct form for Bayes theorem in the present case would then be:

$p(\theta \mid k; S) \propto p(\theta ; S) p(k \mid \theta; S)$

Since the data acquisition protocol does not influence the prior:

$p(\theta; S) = p(\theta)$

As for our likelihood, under this protocol, as suggested by the algorithmic argument above, it is equal to:

(3)    $p(k \mid \theta; S) = p(k \mid \theta, S) = p(k \mid \theta) / p(S \mid \theta) $

and therefore, equation 2 should be more properly written as:

(2b)    $p(\theta \mid k; S) \propto p(\theta) p(k \mid \theta, S)$.

Tuesday, October 28, 2014

Bayesian automatic control of model complexity

Still concerning Bayes factors and model selection, I wanted to show a little example, illustrating how one can rely on Bayesian model averaging and hierarchical priors to implement efficient and automatic control of model complexity -- model auto-tuning, as I like to call it.

First, I have simulated a single-gene amino-acid alignment of ~600 positions under a very simple model, using the LG amino-acid empirical matrix (Le and Gascuel, 2008) and assuming no variation in substitution rates among sites. For this simulation, I used as a template an alignment of elongation factor 2 in 30 species across eukaryotes. The simulation was done using the tree estimated on this dataset, thus under empirically reasonable settings.

Then, I have analyzed this simulated dataset directly using one of the most complex models I have implemented thus far: CAT-GTR + Gamma (Lartillot and Philippe, 2004). This model implements variation among sites in the rate of substitution (distributed across sites according to a discretized gamma distribution, with 4 categories), but also in the equilibrium frequency profile over the 20 amino-acids, using a Dirichlet process non-parametric prior. Finally, the 190 relative exchangeability parameters are also considered as unknown and are therefore estimated directly on the dataset.

According to accepted wisdom, one would think that this model is way too complex, way too rich, to be applied to such a small dataset. Even using a GTR model, instead of a LG model, in the present case would often be considered as excessive (190 free parameters! on a 600-site alignment!). Therefore, my estimation will certainly suffer from excessive variance.

Is this true?

Let us see, by just trying both models: the complex model, CAT-GTR + Gamma, and the true model, LG without Gamma, and see what comes up in terms of phylogenetic estimation (I let you guess which picture corresponds to which model):

Where is the excess variance here? There is virtually no detectable difference between the two estimates.

This represents a typical example showing that a complex model can automatically adapt to the specific case you are dealing with, making model selection totally dispensable. All this, however, is true only if your priors have been correctly designed, so as to get efficient shrinkage of your parameters, in particular under those circumstances where the data in fact contain none of the subtle variation that your model is meant to account for. So, perhaps I should spend some time explaining how my priors have been elicited in the present case. The main point is that these priors are hierarchical.

For the shape parameter $\alpha$ of the gamma distribution of rates across sites, I have used a log-uniform prior restricted between 0.1 and 100, thus allowing a broad range of variance among sites in relative rates. Large values of $\alpha$ effectively implement uniform rates across sites.

A Dirichlet process prior is used over the unknown distribution of amino-acid equilibrium frequency profiles across sites. This non-parametric prior effectively runs over infinite mixtures of amino-acid profiles. There are two main things that have to be controlled in the distribution of the components of the mixture: the relative weights of the components and the dispersion of the profiles attached to them.

The weights are regulated by a granularity parameter $\kappa$. For small values of $\kappa$, the mixture is such that one component of the mixture has an overwhelmingly large weight compared to the infinite series of all other components. In this regime, nearly all sites will be allocated to the large-weight component, and the model will implement strictly identical amino-acid profiles across sites. In contrast, for large values of $\kappa$, the Dirichlet process implements many components of very small weights, so that each site has effectively its own equilibrium frequency profile over amino-acids. Here, $\kappa$ is endowed with an exponential prior of mean 10, so that there is some room for implementing potentially rich mixtures by setting $\kappa$ to reasonably large values, but also to reduce the number of components to 1 or 2 (essentially, when $\kappa$ is less than 0.1).

As for the distribution of profiles across components, it is regulated by a concentration parameter, $\delta$. For large values of $\delta$, amino-acid profiles implemented by the mixture will all be very similar, whereas for small value of $\delta$, they will tend to be very different. Here, $\delta$ is endowed with an exponential prior of mean 20. So, it does not especially favor very similar components, but it can, if needed (when its value is typically of the order of 50 or above).

Finally, the prior on the exchangeability parameters is centered around the LG exchangeabilities, with a concentration hyperparameters $\zeta$ itself endowed with a log-uniform distribution between 0.1 and 100. Large values of $\zeta$ effectively shrink the exchangeabilities very close to those of LG.

Therefore, the most important priors and distributions are tuned by four key-parameters, such that, with large $\alpha$, large $\zeta$, small $\kappa$ and/or large $\delta$, the model reduces to the LG model with uniform rates and profiles across sites. Note that there are two redundant ways the model can come back to virtually identical profiles across sites: either by implementing one single component of overwhelming weight (small $\kappa$), or by just letting the amino-acid profiles of the mixture be all very close to each other (large $\delta$).

So, what happens exactly when I run this model on the dataset simulated under the simple LG-Uniform model?

First, the posterior distribution on $\alpha$, the shape parameter of the distribution of rates across sites (left) is shifted to very large values: $\alpha$ is inferred to be so large that, essentially, all relative rates of substitution across sites are all strongly shrunken around 1 (right) -- in other words, the model effectively implements uniform rates across sites, as should be done in the present case.

Then, what about variation in amino-acid profiles among sites? The figure below shows the posterior distribution on the number of occupied components:

as you can see, we get a large posterior probability for the configuration where all sites are allocated to one single component (pp = 0.6). Even when sites are distributed over more than one component (which still happens around 40% of the time), the components effectively implement nearly identical frequency profiles: $\delta$ is on average around 80 (not shown). Thus, efficient shrinkage of equilibrium frequency profiles across sites is achieved by a combination of discrete model selection (by playing on the number of occupied profiles) and continuous model averaging (by shrinking component-specific frequency profiles toward their mean).

We can have a more direct look at the average profiles estimated across sites, and check that they are indeed, for all practical means, constant across sites: 

It is also interesting to get a look at the accuracy of our estimation of the estimated relative exchangeabilities, by comparing them to the true values:

Such a beautifully accurate estimation of the relative exchangeabilities may in fact suggest that using a model centered on LG is just too easy... So, what happens if I use instead a uniform prior over exchangeabilities? In that case, shrinkage is somewhat less efficient, both in terms of the accuracy of the estimated exchangeabilities:

and in terms of the number of occupied components in the mixture (the posterior probability of getting just one occupied component is now less than 0.2):

On the other hand, $\delta$ is still large (again of the order of 80), which probably contributes to inducing a series of very similar amino-acid profiles across sites:

Finally, the estimated consensus tree is exactly the same as with the model centered on LG:

Thus, globally, there is no real loss incurred by using the more complex model, when in fact a much simpler model would be applicable to our data.

In contrast, on the real sequence alignment, there is rate variation among sites, and there is also a great deal of variation in amino-acid preferences along the sequence. We can get a clear picture of that by just running the CAT-GTR + Gamma model directly on the original empirical dataset of elongation factor sequences. Here, the posterior distributions that we get are totally different, in particular on the distribution of rates across sites ($\alpha$ is now clearly less than 1, of the order of 0.55) and concerning the distribution of amino-acid profiles across sites:

Other analyses with CAT-GTR on a now relatively large set of cases suggest that this is a very general pattern: on empirical sequence data, one always sees substantial variation among sites of both rates and amino-acid propensities.

Overall, one should probably not over-interpret those results, which all entirely rely on the fact that the true model is included in the set of possible configurations explored by CAT-GTR. In the real world, things are much more complicated: you almost certainly have other types of model violations (compositional heterogeneity among taxa, epistatic interactions between positions, due to the tertiary structure of the protein), which have been totally ignored in our analysis. One big question is, for instance: how does a model such as CAT-GTR react to these violations, compared to LG or to other models? Not an easy question, for sure.

However, this is not the kind of issues that you can hope to address by just comparing the empirical fit of the models. Addressing such problems requires more substantive arguments, about exactly what you want to estimate and how violations are likely to influence your estimation. In this direction, we may not know the specific impact of other model violations on CAT-GTR, but we already know that, by using LG, we are for sure creating one additional model violation, since we are ignoring what appears to be substantial and ubiquitous variation among sites in amino-acid propensities -- and this does have an impact on the accuracy of phylogenetic estimation (Lartillot et al, 2007).

But in any case, these model violation problems distract us from the main point of this post, which is more theoretical, being concerned with how Bayesian inference deals with model complexity. And the main message is this: unlike what happens in the context of maximum likelihood estimation, where we have to compensate for the intrinsic advantage of more complex models by implementing explicit model penalization and explicit model selection, in a Bayesian inference context, and using correctly specified priors, model complexity is just under automatic control. As a result: (1) it is possible to use a complex model by default, even on small datasets; this model will automatically reduce to the simpler model if needed (2) therefore, we do not need Bayes factors to be sure that we are using the right model, with the right complexity; (3) nor do we not need Bayes factors to see that, on a real dataset, there is variation in rate and in equilibrium frequency profiles: we just see it in the posterior distribution. And if we are really anxious, we can always check, on simulations under simpler models, that our Bayesian analysis would not infer such a variation if there weren't any.


Le, S. Q., & Gascuel, O. (2008). An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7), 1307–1320. doi:10.1093/molbev/msn067

Lartillot, N., & Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21(6), 1095–1109. doi:10.1093/molbev/msh112

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. doi:10.1186/1471-2148-7-S1-S4

Monday, October 27, 2014

Multiple testing: a damnation or a blessing?

Just to follow up on a post by Joe Pickrell. I cannot agree more with Joe on the idea that, at least if your aim is to control for the false discovery rate (FDR), then multiple testing is not a damnation — it is, in fact, a blessing.

Why don't people see that the 'problem of multiple testing' has entirely disappeared with the advent of FDR ? Perhaps simply because FDR is still often presented as a method meant to 'correct for multiple testing', which I think is a very misleading way to see it.

As I already discussed, if you see FDR as a method correcting for multiple testing, then, somehow, this conveys the (wrong) idea that, the more tests you conduct, the more you have to fight against the combinatorial curse. But this is not what happens. FDR is just the fraction of true nulls among your rejected items, for a given threshold $\alpha$ on the p-values. Therefore, it depends only on the proportions of true nulls and true alternatives in your sample  not the absolute number of tests.

If, among 100 000 SNPs, using a cutoff of p < 1e-5 gives you an expected 10% FDR, this means that, among all findings with p-values below 1e-5, 10% are false discoveries. Now, you may randomly subsample your dataset, say, take one out of 10 of your original 100 000 SNPs: among those that have a p-value below 1e-5 in your subsample, you will obviously still have 10% of false discoveries. Thus, for a given threshold (here 1e-5), subsampling your original data does not decrease your FDR (or equivalently, for a given target FDR, you would not have to relax your threshold to analyze the subsample) -- which is however what you would expect if you were fighting against some combinatorial effect created by the sheer number of tests.

So, the absolute number of tests is not, per se, a damnation.

Conversely, an empirical Bayes perspective on the problem suggests that the FDR can be interpreted as the posterior probability that a rejected item is in fact a true null. This posterior probability is empirical, in the sense that it is based on priors (prior fraction of true nulls and prior distribution of effect sizes among true alternatives) that have been estimated directly on the dataset: these priors have a frequentist interpretation.

From this empirical Bayes perspective, it is immediately clear that, the more items you have, the better your priors are estimated. So, it is in fact a good idea to work on large datasets: it improves the quality of your FDR control -- and this, without causing any deterioration of your power.

So, the absolute number of tests is in fact, a blessing.

Now, of course, all this assumes that the proportion of true alternatives is the same, whichever the size of your dataset. So, this does not apply to the situations where you would for instance work on a subset of SNPs that have been pre-selected based on some objective criterion (say, SNPs close to protein coding genes), which you hope will increase the proportion of interesting items in your sample. Conversely, if working on large datasets generally means that you are working more indiscriminately (i.e. that you are working on sets that are likely to contain much more garbage), then, yes of course, you will have to use a more stringent threshold to achieve the same FDR. However, this is not, in itself, because you are working on a larger number of cases, but only because you have also swamped your interesting cases with a larger proportion of true nulls.

In fact, once you are comfortable with the empirical Bayes perspective on the concept of FDR, you can see ways of obtaining the best of both worlds: work on a large and indiscriminate set of cases, while at the same time enjoying the power that you would get from first enriching your sample based on good heuristic pre-selection criteria.

The idea is to make an explicit empirical Bayes model, in which your prior of being a true alternative now varies among items, in a way that systematically depends on the criteria that you would have used for pre-selection: essentially, doing a logistic regression model for the (now item-specific) prior of being a true alternative, with covariates provided by external criteria. The parameters of this model would be estimated on the dataset, by empirical Bayes. Doing this will automatically capture the potential enrichment effects within certain interesting subsets and will thus lead to an increase in your global number of findings, compared to the standard, uniform version of FDR analysis, while still controlling for the same level of global FDR.

Fundamentally, it is misleading to see FDR as a method for 'correcting for the devastating effects of multiple comparison'. It is far more adequate to see it as a method that takes advantage of the multiple comparison settings, to derive a frequentist control that is much more useful than the control of type I error.

So, yes, I am definitely up for a multiple-testing party!

Tuesday, October 7, 2014

Cases where Bayes factors can be useful

Here are some situations where explicit computation of Bayes factors could still be meaningful and useful:

Bayes factors can be used for comparing a small number of alternative tree topologies in the context of a phylogenetic analysis. In particular, if the topology that would have been considered the most reasonable one before the analysis turns out to be totally absent from the sample obtained by Monte Carlo, we would typcially want to now how strong the empirical evidence is against what we have deemed thus far to be true. It would therefore make sense to compute a Bayes factor in that case. This would have to be done explicitly, using thermodynamic integration or the stepping-stone method.

Another reason for explicitly comparing the marginal likelihoods of several candidate tree topologies is to make sure that the MCMC did not get stuck in a local maximum, returning a high apparent posterior probability for a tree that has in fact a low marginal likelihood.

On the other hand, the differences in log-marginal likelihoods between alternative (reasonable) topologies are usually very small compared to the absolute values of the log marginal likelihoods, pushing the numerical methods to their limits in terms of numerical accuracy.

Even in the situations mentioned in my previous post, where automatic model selection can easily be implemented, it can still be useful to compute marginal likelihoods or Bayes factors: just to make a point, for instance, about how a new model improves on the models currently considered as the standards of the field or, more generally, what features of a model are important in terms of empirical fit.

Finally, in those cases where it is difficult to implement an efficient reversible-jump MCMC for averaging over non-nested models, model averaging may sometimes be replaced by model selection through explicit calculation of marginal likelihoods (although it is often easy to come up with a more general model encompassing all of the non-nested models that we want to compare).

But then, these are relatively specific situations, quite distinct from the idea of systematically computing Bayes factors in order to first select the best model, and then conduct inference by parameter estimation, which is more specifically what I was arguing against.

For me, systematic and explicit model selection is one of those old reflexes inherited from maximum likelihood thinking -- but not really adapted to the context of Bayesian inference.

Friday, October 3, 2014

Model averaging versus model selection

Coming back to my last post about model selection using Bayes factors, in which I was suggesting that it is most often not necessary to implement explicit Bayes factor evaluation: the model-averaging aspect is in fact more important than I suggested. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.

I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.

In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc) sorted by increasing dimensionality.

However, this paradigm is, I think, inadequate in a Bayesian framework. In fact, the model ‘auto-tuning’ example I was mentioning in my last post suggests exactly the opposite: to always use the most general model by default, the one that has the highest dimension, given that it will automatically include whichever simpler alternative may turn out to be adequate for the particular dataset under consideration.

Why this difference between the two frameworks? I think there are two (related) reasons:

(1) point estimation versus marginalization: maximum-likelihood is an either/or method: either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really have two distinct, non-overlapping models at the end, and you will have to choose between them.

In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means, choosing between a risky bet on one single horse (JTT) or a bet-hedging strategy consisting of averaging over all possible configurations for the unknown exchangeability parameters.

(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to overfitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning, and therefore, does not have over-fitting problems.

Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.

Now, averaging over models also means that we have to correctly define the measure over which to take this average: in other words, we have to correctly design our prior over the models, which is not necessarily something obvious to do. I have suggested some ways to do it properly in my last post (mostly, by having key hyperparameters that are able to drive the whole model configuration down to some remarkable submodels), but there is probably much more to say about this.

In any case, what all this suggests is that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design — including the prior over models.

Wednesday, October 1, 2014

Selecting models without computing Bayes factors

Model comparison or selection is often considered to be a very important step in a typical Bayesian analysis. It is often done by calculating Bayes factors or, more generally, marginal likelihoods.

I used to be a fan of Bayes factors. I even spent quite some time on the numerical methods for reliably computing them, in the aim of comparing amino-acid substitution models or alternative relaxed molecular clocks. But I have progressively changed my mind about Bayes factors, or at least about the need to explicitly calculate them. 

First, numerical evaluation of marginal likelihoods is tricky. For quite some time, it was not even done properly: unreliable methods such as the harmonic mean estimator (‘the worst Monte Carlo method ever’) were most often used for that purpose. Now, better numerical methods are employed, such as thermodynamic integration or its discretized equivalent, the stepping-stone approach. But the reliability of these numerical methods comes at a price: they are computationally expensive.

Given the costs associated with the whole enterprise, it is therefore important to realize that these laborious numerical methods are most often unnecessary. A much simpler alternative is to make a general model encompassing all of the models that we wish to compare. Then, by just carefully designing the prior over the parameters of this model, what we obtain is a probabilistic system that, upon being conditioned on a given dataset, automatically selects the parameter regime effectively implementing the model that happens to have the highest fit for these data.

As a simple but not too trivial example, let us assume we have an amino-acid alignment and want to choose between two alternative substitution models: say, the empirical JTT matrix and the most general time-reversible model (GTR). The two models greatly differ in their dimensionality: the 190 exchangeability parameters are fixed to pre-determined values in the case of JTT, whereas they are estimated directly from the data in the case of GTR.

In this specific case, instead of numerically evaluating the marginal likelihoods of the two models, one could design a GTR model such that the prior on its exchangeability parameters would be centered on the exchangeabilities of JTT. The concentration of this prior around its center would be controlled by one single concentration parameter, which should in turn be endowed with a broad prior. Typically, I would use a log-uniform prior (possibly truncated to make it proper) or a half-Cauchy. Then, I would just condition the model on the data, without any further complicated model selection process. If the JTT model is really fit, then the concentration parameter will automatically take on large values under the posterior, and as a result, the exchangeabilities of the model will shrink towards those defined by JTT. In other words, the GTR model will effectively implement the JTT model if this is what the data prefer. Importantly, this automatic 'collapse' of the model is driven by just one hyperparameter, so that the approach should work well even for relatively small datasets.

Note that the model will never exactly collapse onto the simpler model: using this approach, the exchangeabilities of the GTR model are never exactly equal to those of the JTT model. In all regimes, the model always has its full dimensionality. Is this really a problem?

A typical objection here would be that our estimation might suffer increased variance. However, strictly speaking, this is true only if the data have been produced by a JTT model, which is never the case in practice, except in simulation studies. Even in simulation studies, I would be surprised to see any detectable difference in terms of estimation error (above the MCMC noise) between the hierarchical model settings suggested above and the true model: thanks to the specific hierarchical design of the prior, the extra-variance induced by the more flexible parameterization should be very small — technically, the effective number of parameters is much smaller than the nominal number of parameters.

More fundamentally, nobody believes that JTT is the true model for any real data, and therefore, it does not really make sense to see it as a relevant point null hypothesis to be tested against a more general alternative, as would be implied by a Bayes factor analysis between JTT and GTR. Instead, JTT at best represents a good first guess for the exchangeabilities, to be used when the sequence alignment under investigation is too small to provide this information entirely from scratch. But then, a first guess should probably not be formalized as a point null. Instead, it should be seen as a good center for the prior — which is exactly what the hierarchical prior suggested here does: using JTT as a first guess, while implementing some empirically adjustable flexibility around it.

This method can easily be generalized to deal with all sorts of model selection problems. In all cases, this will be done by defining hierarchical priors in a way that ensures that the model will be able to effectively collapse onto a simpler model if needed. Many examples can be imagined: a large value for the shape parameter of the gamma distribution of rates across sites effectively implements the uniform-rate model; efficient shrinkage of branch-specific equilibrium frequencies for the substitution model can effectively emulate a time-homogeneous model; or a suitably chosen prior over partition models gives the data the opportunity to efficiently funnel the model onto uniform substitution matrices or uniform branch lengths across partitions, if relevant.

Technically, the approach could be seen as a case of model averaging. However, the expression ‘model averaging’ is meant to emphasize the ‘averaging’ aspect: the fact that you don’t want to bet 100% on one single model. Here, the averaging is perhaps not the most important point (even if it does play some role). For that reason, I would call this procedure model auto-tuning, thus referring to the fact that the model automatically finds the most adequate parameter regime.
In any case, I believe that substantial amounts of computational resources could be saved, simply by not computing Bayes factors between phylogenetic models: in most cases, this is at best unnecessary and, at worst, just an idle exercise.