## Saturday, January 11, 2014

### The empirical Bayes future of genomics

Empirical Bayes is the future of genomics. None explains this point more clearly and more convincingly than Brad Efron, in a book that should really belong to your personal library if you are into statistical bioinformatics (Efron, 2010, see also Efron 2008).

The fundamental and very simple idea is that large amounts of data make it possible to estimate the priors directly from them, thus ensuring good frequentist properties for Bayesian estimation and hypothesis testing in this context.

In fact, it is not just that you have good frequentist properties: in many respects, you have a better frequentist paradigm than the classical one -- a paradigm that speaks to your intuition (it gives you the probability that your hypothesis is true!), is more relevant in what it controls for (see below), often has optimal properties, and finally, can much more easily be adapted, modulated, shrunk in various directions, depending on the specific needs of the empirical question of interest.

In evolutionary biology, empirical Bayes is used, among other things, for estimating gene trees, site-specific rates, site-specific selective regimes (dN/dS), or for reconstructing ancestral sequences. In genomics, for calling SNPs, detecting differentially expressed genes or for conducting genome-wide association studies. Empirical Bayes is a very natural idea in a fully Bayesian context but also in a maximum likelihood framework (in fact, it is more often associated with the latter). So, empirical Bayes is already at home in evolutionary biology.

Still, I have the impression that it is not always clearly realized, at least in our community, that these approaches should in principle have good asymptotic frequentist calibration properties. For instance, it is rarely emphasized that clade posterior probabilities might be well calibrated in a gene-tree species-tree reconciliation framework, that the credible intervals attached to site-specific estimates of evolutionary rates or dN/dS should be true confidence intervals (group-wise, i.e. 95% of them across positions contain the true value), and that there is a built-in asymptotic control for the false discovery rate (FDR) when testing for positively selected coding positions.

In particular, I think it is under-appreciated that the good frequentist property that you get nearly immediately in an empirical Bayes hypothesis testing context is, not control for type I error, but control for the false discovery rate.

For instance, if you decide to call "positively selected" all positions that have posterior probablity $p>0.95$ of having $dN/dS>1$, you are not ensuring that you will get at most 5% false positives among sites not under positive selection, which corresponds to the classical type I error idea. On the other hand, you are guaranteeing at most 5% of false discoveries among the set of all sites which you declared significantly under positive selection. Which is perhaps more useful in practice.

There are ways to obtain a good estimate of the actual false discovery rate, instead of just an upper bound. You just need to calculate the average of $(1 - p)$ over all sites for which $p>0.95$ (you can check that this will indeed be smaller than 5%). Conversely, defining a target FDR of, say 5%, the idea would be to identify the threshold $p_0$ such that the average of $(1-p)$ over all sites for which $p>p_0$ is exactly 5% and then call "positively selected" all positions with $p>p_0$.

All this has already been said and done, including in the context of detecting positive selection across sites (Guindon et al, 2005). Yet I have the impression that this is not yet totally integrated by our research community. Many people still seem to think that posterior probabilities, even in an empirical Bayes context, are not to be interpreted in frequentist terms (because they are somehow subjective), and that, conversely, being a frequentist means that you should think within the frame of a "type I error" paradigm.

Of course, all this is true only asymptotically and under correct model specification. Concerning the asymptotic condition, simulations suggest that calibration does indeed deteriorate for very small alignments and under weak selective effects (Guindon et al, 2005). Concerning model specification, the conditions to get good calibration are in particular that that the law of effect sizes (e.g. distribution of dN/dS) under the null and under the alternative should be correctly modelled.

This may look like strict conditions. Yet, for me, this just means that we should move beyond gene-by-gene analyses and implement genome-wide studies instead, using hierarchical priors simultaneously accounting for gene-specific and site-specific effects. In this genome-wide context, we also start to get sufficient amounts of empirical signal for modelling the laws of effect sizes with more generality, using semi-parametric or non-parametric methods.

In brief, with genome-wide sequences, large computers and MPI parallelism, it now becomes possible to experiment the power and the frequentist properties of empirical Bayes in practically relevant situations.

---

Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science, 1–22.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.

Guindon, S., Black, M., & Rodrigo, A. (2006). Control of the false discovery rate applied to the detection of positively selected amino acid sites. Molecular biology and evolution, 23(5), 919-926.