## Friday, January 17, 2014

### The adventurous Bayesian and the cautious Frequentist

Although the idea of controlling for false discovery rate (FDR) was first proposed in a purely classical context (Benjamini and Hochberg, 1995), the FDR has a simple empirical Bayes interpretation (Efron, 2010), which I would like to illustrate here on a simple case.

Also, I find it useful to contrast the two approaches, classical and empirical Bayes, on the issue of controlling for FDR and, more generally, on hypothesis testing, for it reveals interesting things about the fundamental difference between them.

To fix the ideas, let us consider the classical textbook case of a blood test for detecting some particular disease. The test has a rate of type I error equal to $\alpha$ (an expected fraction $\alpha$ of the people not affected by the disease will show up as positive), and a rate of type II error of $\beta$ (a fraction $\beta$ of affected individuals will be missed by the test). The population is composed of a fraction $p_0$ of unaffected people, and a fraction $p_1 = 1 - p_0$ of individuals affected by the disease.

Conducting the test on a large sample of individuals taken at random, the total fraction of positive outcomes will be:

$\pi = p_0 \alpha + p_1 (1-\beta)$,

that is, the discoveries are made up of a fraction $\alpha$ of the unaffected individuals (the false positives) and a fraction $1-\beta$ of the affected individuals (the true positives). The rate of false discovery is then simply the proportion of all discoveries corresponding to false positives, i.e. the relative contribution of the first term:

$FDR = \frac{p_0 \alpha}{\pi}$.

Letting $H_0$ denote the hypothesis that a given individual is unaffected, and $H_1$ the hypothesis that the individual is affected, this can be rewritten as:

$\pi = p(positive) = p(H_0) p(positive \mid H_0) + p(H_1) p(positive \mid H_1)$,

$FDR = \frac{p(H_0) p(positive \mid H_0) } { p(positive) } = p(H_0 \mid positive)$,

which makes the Bayesian nature of the concept of FDR more than apparent: the FDR is just the posterior probability of not being affected given that the test was positive.
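As a quick numerical check of these formulas, here is a minimal sketch in Python. The function names and the numbers ($p_0 = 0.9$, $\alpha = 0.05$, $\beta = 0.20$) are hypothetical choices made for illustration only.

```python
def positive_rate(p0, alpha, beta):
    """Total fraction of positive outcomes: pi = p0*alpha + p1*(1 - beta)."""
    p1 = 1.0 - p0
    return p0 * alpha + p1 * (1.0 - beta)


def false_discovery_rate(p0, alpha, beta):
    """FDR = p(H0 | positive) = p0*alpha / pi, i.e. Bayes' rule."""
    return p0 * alpha / positive_rate(p0, alpha, beta)


# Hypothetical population: 90% unaffected, alpha = 0.05, beta = 0.20.
pi = positive_rate(0.9, 0.05, 0.20)
fdr = false_discovery_rate(0.9, 0.05, 0.20)
print(f"pi = {pi:.3f}, FDR = {fdr:.3f}")  # pi = 0.125, FDR = 0.360
```

With these numbers, more than a third of the positive tests are false discoveries even though the type I error rate is only 5%, simply because unaffected individuals are so much more numerous.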

An empirical Bayes approach typically tries to achieve an (asymptotically) exact control for the FDR, by estimating $p_0$ and $p_1$ directly from the data.
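To give a rough idea of what estimating $p_0$ from the data can look like, here is a deliberately simplified sketch. It assumes that both $\alpha$ and $\beta$ are known, which is a strong assumption that real empirical Bayes procedures do not rely on, and simply inverts $\pi = p_0 \alpha + (1 - p_0)(1-\beta)$ using the observed fraction of positives. The function names and the simulated population are hypothetical.

```python
import random


def simulate_positive_fraction(n, p0, alpha, beta, rng):
    """Simulate n test outcomes and return the observed fraction of positives."""
    positives = 0
    for _ in range(n):
        affected = rng.random() >= p0
        p_positive = (1.0 - beta) if affected else alpha
        positives += rng.random() < p_positive
    return positives / n


def estimate_p0(pi_hat, alpha, beta):
    """Invert pi = p0*alpha + (1 - p0)*(1 - beta) for p0, clipped to [0, 1].

    Assumes alpha and beta are both known (a simplification).
    """
    p0_hat = (1.0 - beta - pi_hat) / (1.0 - beta - alpha)
    return min(1.0, max(0.0, p0_hat))


rng = random.Random(0)
alpha, beta = 0.05, 0.20
# The true p0 = 0.9 is used only to generate the data, not to estimate.
pi_hat = simulate_positive_fraction(100_000, 0.9, alpha, beta, rng)
p0_hat = estimate_p0(pi_hat, alpha, beta)
fdr_hat = p0_hat * alpha / pi_hat
print(f"pi_hat = {pi_hat:.3f}, p0_hat = {p0_hat:.3f}, FDR_hat = {fdr_hat:.3f}")
```

The estimated FDR converges to the true one as the sample grows, which is the sense in which the control is asymptotically exact, but only as long as the assumed $\beta$ is correct.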

On the other hand, it is possible to control for FDR -- although now conservatively -- in a non-Bayesian context. The idea is very simple: the fraction $\pi$ is empirically observed, $\alpha$ is known by design, and $p_0 < 1$. Therefore:

$FDR = \frac{p_0 \alpha}{\pi} < \frac{\alpha}{\pi}$,

which gives an upper bound on the FDR. In plain words: at $\alpha=0.05$, I expect at most 5% of positive tests just by chance. I observe 15%. Therefore, at most one third of my discoveries are false.

If $p_1 \ll 1$, $p_0$ is very close to 1 and, therefore, the upper bound is tight (I expect only slightly less than 5% just by chance). Even if $p_1$ is not that small, as long as it is not very large (as long as we are not in a situation where the vast majority of the population is affected by the disease), the bound is still very useful. For instance, even if half of the population is affected, which is already a lot, and the upper bound $\alpha/\pi$ is 5%, then the true FDR is $p_0 \alpha/\pi = 2.5\%$, which is of the same order of magnitude as the nominal rate.
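The relation between the conservative bound and the true FDR can be sketched numerically. The helper names and the values of $\alpha$, $\beta$ and $p_1$ below are hypothetical; the point is only that the ratio between the true FDR and the bound $\alpha/\pi$ is exactly $p_0$.

```python
def fdr_upper_bound(alpha, pi_observed):
    """Conservative bound alpha / pi, which only uses the fact that p0 <= 1."""
    return min(1.0, alpha / pi_observed)


def compare(p1, alpha, beta):
    """Return (pi, true FDR, upper bound) for a given fraction p1 of affected."""
    p0 = 1.0 - p1
    pi = p0 * alpha + p1 * (1.0 - beta)
    return pi, p0 * alpha / pi, fdr_upper_bound(alpha, pi)


# The example from the text: alpha = 0.05 and 15% of positives observed.
print(f"bound = {fdr_upper_bound(0.05, 0.15):.3f}")  # one third

# Hypothetical test with alpha = 0.05 and beta = 0.20, for several p1.
for p1 in (0.01, 0.10, 0.50):
    pi, true_fdr, bound = compare(p1, 0.05, 0.20)
    print(f"p1 = {p1:.2f}: pi = {pi:.3f}, "
          f"true FDR = {true_fdr:.3f}, bound = {bound:.3f}")
```

The gap between the two columns only becomes substantial when a large share of the population is affected.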

What is also interesting here is that, in order to obtain this useful upper bound on the FDR, we do not need to assume anything about the test nor about the population, except that $p_1$ is small. Thus, to control for FDR, we do not need to take the trouble to explicitly model the whole situation by empirical Bayes: we get good bounds based on elementary arguments and direct counting of the number of positive outcomes that we observe.

On the other hand, if we care, not just about FDR, but more generally about the sensitivity-specificity tradeoff, then we have to look at the power of the test: are we detecting a sufficient proportion of the true cases?

The problem, however, is that power often depends on some underlying effect size (let us call it $\theta$). Typically, the null hypothesis is just $\theta = 0$ and the alternative is $\theta \ne 0$. In our example, $\theta$ could be the concentration of some antibody or viral antigen in the blood. The power of the test will generally increase with $\theta$: values of $\theta$ close to 0 will be less easily detected than very large values of $\theta$.

In this context, what matters for the sensitivity-specificity tradeoff is the average power of the test over the population under $H_1$. Statistically, this corresponds to an average over the prior distribution of $\theta$ conditional on $H_1$. Thus, we see that the realized power of a test also has a natural empirical Bayes interpretation, like the FDR.
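To make the notion of average power concrete, here is a small sketch under assumptions that are not part of the argument above: the test statistic is taken to be normal with mean $\theta$ and unit variance, the test is one-sided, and the prior of $\theta$ under $H_1$ is an exponential distribution with mean 2. All of these are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def power(theta, alpha=0.05):
    """Power of a one-sided z-test when the statistic is N(theta, 1)."""
    threshold = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(threshold - theta)


def average_power(thetas, alpha=0.05):
    """Average power over a sample of effect sizes drawn under H1."""
    return fmean(power(theta, alpha) for theta in thetas)


rng = random.Random(0)
# Hypothetical prior under H1: exponential effect sizes with mean 2.
thetas = [rng.expovariate(1 / 2.0) for _ in range(10_000)]

print(f"power at theta = 0: {power(0.0):.3f}")  # equals alpha
print(f"power at theta = 3: {power(3.0):.3f}")
print(f"average power under H1: {average_power(thetas):.3f}")
```

The average power plays the role of $1-\beta$ in the formula for $\pi$, so it depends on the whole distribution of $\theta$ and not only on the test.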

Again, typically, an empirical Bayes procedure will try to estimate the true law of $\theta$ under $H_1$. If it succeeds in this estimation, then the control of the sensitivity-specificity compromise is optimal: for a given level $\alpha$, we can obtain a good estimate of the exact FDR (not an upper bound), as well as a good estimate of the average power of the test, so we are in the best position to fine-tune the balance between specificity and sensitivity.

On the other hand, all this assumes that we are able to correctly estimate the true law of $\theta$ under $H_1$, which is not so trivial. Failure to do so may result in some discrepancy between the sensitivity-specificity balance that we think we achieve and the true performances of the procedure. In other words, the whole empirical Bayes enterprise is attractive, in terms of its theoretically optimal properties, but potentially risky if the practical implementation details turn out to be difficult to control.

As an alternative, we could try to guarantee good bounds on the power of the test using non-Bayesian arguments. In particular, in some cases, there may be a lower bound on $\theta$ under $H_1$ -- say, we are reasonably certain that almost all affected individuals will have an effect size (e.g. a viral load) larger than some $\theta_0$. If we are lucky, this lower bound $\theta_0$ may be sufficiently large to simultaneously guarantee a high power and a low FDR, even in the worst-case situation where the entire affected population would be characterized by an effect size $\theta$ no larger than $\theta_0$. Such an argument provides the basis for an acceptable sensitivity-specificity tradeoff that may not be optimal, but good enough and, more importantly, robust with respect to the exact distribution of effect sizes (as long as it is indeed bounded below by $\theta_0$).
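This worst-case reasoning can be sketched as follows, again under hypothetical assumptions: a one-sided z-test whose statistic is normal with mean $\theta$ and unit variance (so that power increases with $\theta$), and a known lower bound `p1_min` on the fraction of affected individuals. Neither the function names nor the numbers come from a specific application.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(theta, alpha):
    """Power of a one-sided z-test when the statistic is N(theta, 1)."""
    threshold = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(threshold - theta)


def worst_case_guarantees(theta0, p1_min, alpha):
    """Guaranteed power and FDR when theta >= theta0 and p1 >= p1_min.

    Power is increasing in theta, so its minimum is reached at theta0.
    The FDR is decreasing in both p1 and power, so its maximum is
    reached at p1 = p1_min and theta = theta0.
    """
    min_power = power(theta0, alpha)
    p0_max = 1.0 - p1_min
    max_fdr = p0_max * alpha / (p0_max * alpha + p1_min * min_power)
    return min_power, max_fdr


min_power, max_fdr = worst_case_guarantees(theta0=3.0, p1_min=0.1, alpha=0.05)
print(f"power >= {min_power:.3f}, FDR <= {max_fdr:.3f}")
```

Nothing here depends on the shape of the distribution of $\theta$ above $\theta_0$, which is what makes the guarantee robust, at the price of being pessimistic whenever typical effect sizes are much larger than $\theta_0$.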

There are situations, however, where things are not so easy. In particular, there are many cases where $\theta$ can take values arbitrarily close to 0 under $H_1$, in which case tight bounds cannot be found. There, if we really want to have a good control and a good sensitivity-specificity tradeoff, we might need to adopt a Bayesian perspective on the problem and explicitly model the statistical behavior of the effect size $\theta$ across individuals belonging to $H_1$.

All this illustrates the fundamental difference between classical frequentism and empirical Bayes. Both care about the operational properties of their statistical decision procedures in the long run (thus, in a sense, both are frequentist). The difference is that empirical Bayes is more confident in its ability to come up with a complete probabilistic model of the situation and to reliably estimate this model from the data. In principle, the resulting procedure is asymptotically optimal. However, this optimality critically depends on the overall validity of the model and the estimation procedure. If these conditions are not met, then the empirical Bayes approach may fail to keep its promise to deliver a reliable frequentist control.

In contrast, classical frequentism is more like a bet-hedging strategy, trying to guarantee some good properties even in the worst-case situations, provided a minimum set of critical, generally non-distributional, assumptions. In the case of FDR, it turns out that it is possible to get very useful bounds -- FDR is a case where classical worst-case thinking pays off. On the other hand, a good estimation of the power of a test, and therefore a good compromise between sensitivity and specificity, is typically more difficult to obtain using a worst-case driven statistical approach.

Both strategies are useful. In fact, they could be combined more systematically. Typically, the empirical Bayesian perspective on a given problem could be first contemplated, as the ideal approach, and then used as a starting point from which to derive more robust procedures by trying to obtain tight bounds that are less dependent on the exact prior distributions.

Efron, again: "One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution."

See also Bickel (2012), who elaborates on the idea of Bayesian and frequentist inference being opposite extremes along a "degree-of-caution" axis.

===

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

Bickel, D.R. (2012). Controlling the degree of caution in statistical inference with the Bayesian and frequentist approaches as opposite extremes. Electron. J. Statist. 6:1.

Efron, B. (2005). Bayesians, Frequentists, and Scientists. Journal of the American Statistical Association 100 (469), 1–5.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.