Thursday, February 27, 2014

Revised standards: p-values and false discoveries

Since we are again talking about p-values, evidence and scientific false discoveries (see my last post), I would like to come back to Valen Johnson's article of last fall (Johnson, 2013). Even if the main ideas are not entirely new (see e.g. Berger and Sellke, 1987), Johnson's article revives an important question, and does so at a very good moment (see also a similar point in Nuzzo, 2014).

The argument of Johnson's article relies on Bayes factors and uses the relatively technical concept of uniformly most powerful Bayesian tests. For that reason, it may give the impression that the whole question is about the choice between Bayes and frequentism, or between p-values and Bayes factors. Yet, for me, the whole problem could more directly be presented as a false discovery rate issue.

In this direction, what I propose here is a relatively simple derivation of Johnson's main point (at least as I understand it). Apart from emphasizing the false discovery aspect, this derivation does not rely on uniformly most powerful Bayesian tests and is therefore hopefully easier to understand.

Suppose you conduct a one-sided test: you observe $X$, normally distributed with mean $\theta$ and variance 1, and want to test $H_0$: $\theta=0$ against $H_1$: $\theta > 0$. For instance, you observe a log-ratio of expression levels for a given gene between two conditions, and want to determine whether the gene is over-expressed.

Imagine that you observe $x = 1.646$. This is slightly greater than $x_0 = 1.645$, the usual threshold for a type I error rate of 5%. Thus, you have obtained what is usually considered marginally significant evidence in favor of over-expression, with a p-value just slightly less than 0.05.
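To make the arithmetic concrete, here is a minimal sketch that computes the one-sided p-value for this observation from the standard normal tail. The function name is mine, and the only assumption is the unit-variance normal null stated above.

```python
import math


def one_sided_p_value(x):
    """Upper-tail probability P(X >= x) when X is standard normal (the null here)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


p_observed = one_sided_p_value(1.646)   # the observation in the example
p_threshold = one_sided_p_value(1.645)  # the usual one-sided 5% cutoff

print(f"p-value at x = 1.646: {p_observed:.5f}")
print(f"p-value at x = 1.645: {p_threshold:.5f}")
```

Both values sit just under 0.05, which is what "marginally significant" means here.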

Now, what does that imply in terms of the chances that this gene is indeed over-expressed?

This question sounds Bayesian: it asks for the posterior probability that $H_1$ is true conditional on the empirically observed value of $X$. But you can ask a similar question in a frequentist perspective: consider the collection of all one-sided normal tests (with known variance) that have been published over the last 10 years in evolutionary biology, and for which a marginal significance was reported. What fraction of them are false discoveries?

To answer that question, we need to contemplate the complete collection of all one-sided normal tests that were conducted (not just those that were published because they turned out to be significant): this represents a total of $N = N_0 + N_1$ tests, of which $N_0$ were true nulls and $N_1$ were true alternatives. Let us call $p_0 = N_0 / N$ and $p_1 = N_1 / N$ the fractions of true nulls and alternatives, and $\theta_i$, $i = 1, \dots, N_1$, the collection of true effect sizes across all true alternatives.

The probability of obtaining a marginally significant result (an observation falling in a small window of width $\delta x$ around $x_0$) for a null case is $p(X=x_0 \mid H_0) \delta x$. For a given non-null case with effect size $\theta_i$, it is $p(X=x_0 \mid \theta_i) \delta x$. Therefore, the expected total number of marginally significant discoveries over the $N$ tests is just:

$n = N_0  p(X = x_0 \mid H_0) \delta x + \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i) \delta x$

which, upon dividing by $N \delta x$, can be rewritten as:

$\frac{n}{N \delta x} = p_0  L_0 + p_1 \bar L_1$

where $L_0 = p(X=x_0 \mid H_0)$ and $\bar L_1$ is the average likelihood under the alternative cases:

$\bar L_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i)$.

The fraction of false discoveries is simply the contribution of the first term:

$fdr = p_0 L_0 / (p_0 L_0 + p_1 \bar L_1) = p_0 / (p_0 + p_1 B)$

where $B$ is:

$B = \bar L_1 / L_0$.

($B$ can be seen as an empirical version of the Bayes factor between the two hypotheses, but this is not essential for the present derivation.)
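The derivation can be checked with a small simulation. The sketch below is only an illustration under assumptions of my own: effect sizes under the alternative are drawn uniformly between 0 and 4 (a hypothetical choice), "marginally significant" means falling within `half_width` of $x_0$, and the seed and number of tests are arbitrary. It compares the fraction of nulls among the marginal results with the value predicted by $p_0 L_0 / (p_0 L_0 + p_1 \bar L_1)$.

```python
import math
import random


def normal_pdf(x, mean=0.0):
    """Density at x of a normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def simulate_marginal_fdr(n_tests=400_000, p0=0.5, x0=1.645,
                          half_width=0.05, seed=1):
    """Return (empirical fdr among marginal results, fdr predicted by the formula)."""
    rng = random.Random(seed)
    false_hits = 0          # true nulls landing in the window around x0
    true_hits = 0           # true alternatives landing in the window
    likelihood_sum = 0.0    # running sum of p(X = x0 | theta_i)
    n_alt = 0
    for _ in range(n_tests):
        is_null = rng.random() < p0
        if is_null:
            theta = 0.0
        else:
            theta = rng.uniform(0.0, 4.0)  # hypothetical effect-size distribution
            likelihood_sum += normal_pdf(x0, theta)
            n_alt += 1
        x = rng.gauss(theta, 1.0)
        if abs(x - x0) <= half_width:
            if is_null:
                false_hits += 1
            else:
                true_hits += 1
    empirical = false_hits / (false_hits + true_hits)
    n_null = n_tests - n_alt
    l0 = normal_pdf(x0)                 # L_0
    l1_bar = likelihood_sum / n_alt     # average likelihood under the alternatives
    predicted = n_null * l0 / (n_null * l0 + n_alt * l1_bar)
    return empirical, predicted


empirical, predicted = simulate_marginal_fdr()
print(f"empirical fdr: {empirical:.3f}, predicted fdr: {predicted:.3f}")
```

The two numbers should agree up to sampling noise and the finite width of the window; the width itself cancels out, exactly as $\delta x$ does in the derivation.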

The average likelihood under $H_1$, $\bar L_1$, is less than the maximum likelihood under $H_1$, $\hat L_1$, which is here attained for $\hat \theta = x_0$. Using the formula for a normal density gives:

$B < B_{max} = \frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$

or equivalently:

$fdr > fdr_{min} = \frac{p_0}{p_0 + p_1 B_{max}}$.
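The value of $B_{max}$ follows from writing out the unit-variance normal density at $x_0$ under each hypothesis:

$L_0 = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x_0^2}, \qquad \hat L_1 = p(X = x_0 \mid \theta = x_0) = \frac{1}{\sqrt{2\pi}}$

so that the normalizing constants cancel in the ratio:

$\frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$.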

Some numbers here. For $x_0 = 1.645$, $B_{max} = 3.87$. If $p_0 = p_1 = 0.5$, $fdr > 0.20$. In other words, at least 20% of your marginally significant discoveries are false. And still, this assumes that half of the tests were conducted on true alternatives, which is a generous assumption. If only 10% of the tested hypotheses are true ($p_1 = 0.1$), then the lower bound on the fdr rises to about 0.70.
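These numbers are easy to reproduce. Here is a minimal sketch, with function names of my own choosing, that evaluates $B_{max}$ and the corresponding lower bound on the fdr for the two values of $p_0$ discussed above.

```python
import math


def b_max(x0):
    """Upper bound on B: maximum likelihood under H1 divided by the null likelihood."""
    return math.exp(0.5 * x0 ** 2)


def fdr_min(p0, x0=1.645):
    """Lower bound on the local fdr for a fraction p0 of true nulls."""
    p1 = 1.0 - p0
    return p0 / (p0 + p1 * b_max(x0))


print(f"B_max = {b_max(1.645):.2f}")
for p0 in (0.5, 0.9):
    print(f"p0 = {p0}: fdr_min = {fdr_min(p0):.3f}")
```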

And all this is generous for yet another reason: it assumes that all your true alternatives are at $\theta=1.645$, the configuration that gives you the smallest possible local fdr. Reasonable distributions of effect sizes under the alternative can easily result in even higher false discovery rates. For example, if, under $H_1$, $\theta$ is distributed according to the positive half of a normal distribution of mean 0 and standard deviation 3, then $B$ is only about 2, which implies about 33% of false discoveries if half of the tested hypotheses are true, and more than 80% of false discoveries if 10% of the tested hypotheses are true.

I guess it is fair to say that most tested hypotheses are in fact false ($p_1$ is most probably less than 0.1) -- if most tested hypotheses were true, then this would mean that we already know most of what we are inquiring about, and thus research would just be an idle confirmatory exercise. It is also fair to say that the whole point of scientific hypothesis testing is to reverse this unfavorable proportion by filtering out the majority of non-findings and publishing an enriched set hopefully composed of a majority of true effects. Yet, as we can see here, this is not what happens, at least if we focus on marginal discoveries.

The entire argument above assumes that the variance of $X$ around $\theta$ is known. If it is unknown, or more generally if there are nuisance parameters under the null, things are a bit more complicated. The exact quantitative results also depend on the sampling model (normal or other). However, for the most part, the message is probably valid in many other circumstances and is very simple: what we call marginally significant findings are most often false discoveries.

Now, we can quibble over many details here, discuss the practical relevance of this result (do we really need to define a threshold for p-values?), object that significance is just one aspect of the problem (you can get very significant but weak effects), etc. Nevertheless, one should probably admit one thing: many of us (me included) have perhaps not adopted the correct psychological calibration in the face of p-values and have tended to over-estimate the significance of marginally significant findings.

In other words, it is probably fair to admit that we should indeed revise our standards for statistical evidence.

Also, we should more often think in terms of false discovery rate: it tells us important things that cannot be understood by just looking at p-values and type I errors.

Independently of this theoretical argument, it would certainly be interesting to conduct meta-studies here: based on the collection of reported p-values, and using standard algorithms like the one of Benjamini and Hochberg, 1995, one can retrospectively estimate the fraction of false discoveries across all published results in the context of molecular evolution, comparative or diversification studies, for instance (all of which have often relied on relatively weak effects), over the last ten years. I would be curious about the outcome of such meta-analyses.
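As a starting point for such a meta-study, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name and the list of p-values are made up for illustration. The procedure selects the discoveries that can be declared at a target false discovery rate $q$, assuming independent (or positively dependent) tests, and it would be only one ingredient of the retrospective analysis evoked above.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, q=0.05)
print(sum(example_rejected), "discoveries out of", len(example_p))
```

Note that in this made-up example several p-values below 0.05 are not retained, which is the same message as above seen from another angle.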


Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Nuzzo, R. (2014). Scientific method: statistical errors. Nature, 506:150. doi:10.1038/506150a

1 comment:

  1. The allegations that p-values exaggerate evidence against a null are all false, at least from the frequentist perspective. They rely on questionable, highly spiked priors on the nulls. This is discussed in several places on my blog.