Monday, October 27, 2014

Multiple testing: a damnation or a blessing?

Just to follow up on a post by Joe Pickrell. I cannot agree more with Joe on the idea that, at least if your aim is to control the false discovery rate (FDR), multiple testing is not a damnation: it is, in fact, a blessing.

Why don't people see that the 'problem of multiple testing' has entirely disappeared with the advent of FDR? Perhaps simply because FDR is still often presented as a method meant to 'correct for multiple testing', which I think is a very misleading way to see it.

As I already discussed, if you see FDR as a method correcting for multiple testing, then this conveys the (wrong) idea that the more tests you conduct, the more you have to fight against the combinatorial curse. But this is not what happens. FDR is just the expected fraction of true nulls among your rejected items, for a given threshold $\alpha$ on the p-values. Therefore, it depends only on the proportions of true nulls and true alternatives in your sample (and on the effect sizes among the alternatives), not on the absolute number of tests.
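To make the cancellation explicit, here is a back-of-the-envelope version under a simple two-group model. The notation is an assumption introduced for this illustration: $m$ tests, a fraction $\pi_0$ of true nulls with uniformly distributed p-values, and $\beta(\alpha)$ the probability that a true alternative has a p-value below $\alpha$. Taking the ratio of expected counts (which approximates the FDR when $m$ is large):

$$
\mathrm{FDR}(\alpha) \approx \frac{m \, \pi_0 \, \alpha}{m \, \pi_0 \, \alpha + m \, (1 - \pi_0) \, \beta(\alpha)} = \frac{\pi_0 \, \alpha}{\pi_0 \, \alpha + (1 - \pi_0) \, \beta(\alpha)}
$$

The factor $m$ appears in both the expected number of false discoveries and the expected number of true discoveries, so it drops out: only $\pi_0$, $\alpha$ and $\beta$ remain.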

If, among 100 000 SNPs, using a cutoff of p < 1e-5 gives you an expected 10% FDR, this means that, among all findings with p-values below 1e-5, 10% are expected to be false discoveries. Now, you may randomly subsample your dataset, say, take one out of every 10 of your original 100 000 SNPs: among those that have a p-value below 1e-5 in your subsample, you will still have, in expectation, 10% false discoveries. Thus, for a given threshold (here 1e-5), subsampling your original data does not decrease your FDR (or equivalently, for a given target FDR, you would not have to relax your threshold to analyze the subsample) -- which is however what you would expect if you were fighting against some combinatorial effect created by the sheer number of tests.
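The subsampling argument can be checked with a small simulation. This is only a sketch under assumptions that are mine, not taken from any real dataset: one-sided z-tests, 90% true nulls, a shift of 3 standard deviations for every true alternative, and a threshold of 0.01 (less extreme than 1e-5, so that the subsample still contains enough discoveries to count).

```python
import random
from statistics import NormalDist


def simulate_tests(m, pi0, effect, rng):
    """Return (p_value, is_null) pairs for m simulated one-sided z-tests."""
    std_normal = NormalDist()
    tests = []
    for _ in range(m):
        is_null = rng.random() < pi0
        z = rng.gauss(0.0 if is_null else effect, 1.0)
        tests.append((1.0 - std_normal.cdf(z), is_null))
    return tests


def false_discovery_proportion(tests, alpha):
    """Fraction of true nulls among the tests with p < alpha."""
    rejected = [is_null for p, is_null in tests if p < alpha]
    return sum(rejected) / len(rejected) if rejected else 0.0


rng = random.Random(0)
full = simulate_tests(100_000, pi0=0.9, effect=3.0, rng=rng)
subsample = rng.sample(full, len(full) // 10)  # keep one test out of 10

alpha = 0.01
fdp_full = false_discovery_proportion(full, alpha)
fdp_sub = false_discovery_proportion(subsample, alpha)
print(f"full data: {fdp_full:.3f}   subsample: {fdp_sub:.3f}")
```

Under these settings, the two proportions should agree up to sampling noise, the subsample being noisier simply because it contains fewer discoveries.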

So, the absolute number of tests is not, per se, a damnation.

Going further, an empirical Bayes perspective on the problem suggests that the FDR can be interpreted as the posterior probability that a rejected item is in fact a true null. This posterior probability is empirical, in the sense that it is based on priors (prior fraction of true nulls and prior distribution of effect sizes among true alternatives) that have been estimated directly on the dataset: these priors have a frequentist interpretation.

From this empirical Bayes perspective, it is immediately clear that the more items you have, the better your priors are estimated. So, it is in fact a good idea to work on large datasets: it improves the quality of your FDR control -- and this, without causing any deterioration of your power.
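As a rough illustration of priors being better estimated on larger datasets, the sketch below estimates the fraction of true nulls from the p-values alone and looks at how much the estimate fluctuates across replicate datasets of two sizes. The estimator is a Storey-type one, chosen here for simplicity (the argument above does not depend on any particular estimator): p-values above a cutoff `lam` come almost exclusively from true nulls, so their count, rescaled, estimates the null fraction. The simulation settings (90% nulls, shift of 3) are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()


def simulate_p_values(m, pi0, effect, rng):
    """p-values of m one-sided z-tests, a fraction pi0 of which are true nulls."""
    p_values = []
    for _ in range(m):
        shift = 0.0 if rng.random() < pi0 else effect
        p_values.append(1.0 - STD_NORMAL.cdf(rng.gauss(shift, 1.0)))
    return p_values


def estimate_pi0(p_values, lam=0.5):
    """Storey-type estimate of the null fraction from the p-values above lam."""
    n_large = sum(p > lam for p in p_values)
    return n_large / (len(p_values) * (1.0 - lam))


rng = random.Random(0)
spread = {}
for m in (200, 20_000):
    # 30 replicate datasets of size m, each giving one estimate of pi0.
    estimates = [estimate_pi0(simulate_p_values(m, 0.9, 3.0, rng)) for _ in range(30)]
    spread[m] = stdev(estimates)
    print(f"m = {m:>6}: mean estimate {mean(estimates):.3f}, spread {spread[m]:.4f}")
```

The spread of the estimate should shrink roughly like one over the square root of the number of tests, which is the sense in which more tests give better-calibrated FDR estimates.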

So, the absolute number of tests is, in fact, a blessing.

Now, of course, all this assumes that the proportion of true alternatives is the same, whatever the size of your dataset. So, it does not apply to situations where you would, for instance, work on a subset of SNPs that have been pre-selected based on some objective criterion (say, SNPs close to protein-coding genes), which you hope will increase the proportion of interesting items in your sample. Conversely, if working on large datasets generally means that you are working more indiscriminately (i.e. on sets that are likely to contain much more garbage), then, yes, of course, you will have to use a more stringent threshold to achieve the same FDR. However, this is not, in itself, because you are working on a larger number of cases, but only because you have swamped your interesting cases with a larger proportion of true nulls.
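The dilution effect can be put into numbers with a small calculation. The sketch below assumes a two-group model with one-sided z-tests and a common shift of 3 for all true alternatives (illustrative choices, not part of the argument above), and searches for the p-value threshold that gives a 10% FDR for two different proportions of true nulls. Note that the number of tests is not an argument of any of these functions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(alpha, effect):
    """P(p < alpha) for a one-sided z-test when the true shift is `effect`."""
    z_threshold = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_threshold - effect)


def fdr(alpha, pi0, effect):
    """Expected false discoveries over expected discoveries, per test."""
    false_rate = pi0 * alpha
    true_rate = (1.0 - pi0) * power(alpha, effect)
    return false_rate / (false_rate + true_rate)


def threshold_for(target, pi0, effect, lo=1e-12, hi=0.5):
    """Bisect on a log scale for the largest alpha with fdr() <= target."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5
        if fdr(mid, pi0, effect) > target:
            hi = mid
        else:
            lo = mid
    return lo


alpha_enriched = threshold_for(0.10, pi0=0.90, effect=3.0)
alpha_diluted = threshold_for(0.10, pi0=0.99, effect=3.0)
print(f"90% true nulls: alpha = {alpha_enriched:.2e}")
print(f"99% true nulls: alpha = {alpha_diluted:.2e}")
```

Going from 90% to 99% true nulls forces a markedly more stringent threshold for the same target FDR, and this happens whether the dataset contains a thousand tests or a million.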

In fact, once you are comfortable with the empirical Bayes perspective on the concept of FDR, you can see ways of obtaining the best of both worlds: work on a large and indiscriminate set of cases, while at the same time enjoying the power that you would get from first enriching your sample based on good heuristic pre-selection criteria.

The idea is to build an explicit empirical Bayes model in which your prior probability of being a true alternative now varies among items, in a way that depends systematically on the criteria that you would have used for pre-selection: essentially, fitting a logistic regression model for the (now item-specific) prior of being a true alternative, with covariates provided by external criteria. The parameters of this model would be estimated on the dataset, by empirical Bayes. Doing this will automatically capture the potential enrichment effects within certain interesting subsets and will thus lead to an increase in your global number of findings, compared to the standard, uniform version of FDR analysis, while still controlling for the same level of global FDR.
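Here is a minimal sketch of what such a model could look like, under strong simplifying assumptions of my own: z-scores instead of p-values, all true alternatives sharing one unknown shift `mu`, and a single hypothetical binary covariate `flagged` (think "close to a gene"). With one binary covariate, the logistic regression reduces to one prior per group (intercept = logit of the prior in the unflagged group, slope = difference of the two logits), which keeps the EM updates in closed form. Items are then rejected in order of increasing estimated local fdr, for as long as the average over the rejected set stays below the target.

```python
import math
import random


def simulate(m, rng, frac_flagged=0.1, prior_flagged=0.3, prior_other=0.02, effect=3.0):
    """Simulate z-scores plus a hypothetical binary covariate `flagged`."""
    z, flagged, is_alt = [], [], []
    for _ in range(m):
        f = rng.random() < frac_flagged
        alt = rng.random() < (prior_flagged if f else prior_other)
        z.append(rng.gauss(effect if alt else 0.0, 1.0))
        flagged.append(int(f))
        is_alt.append(alt)
    return z, flagged, is_alt


def posterior_alt(z, group, priors, mu):
    """Posterior probability that each item is a true alternative."""
    out = []
    for zi, g in zip(z, group):
        ratio = math.exp(mu * zi - 0.5 * mu * mu)  # density of N(mu,1) over N(0,1)
        out.append(priors[g] * ratio / (priors[g] * ratio + 1.0 - priors[g]))
    return out


def fit_local_fdr(z, group, n_iter=80):
    """EM for z ~ (1 - pi_g) N(0,1) + pi_g N(mu,1), with one prior pi_g per group."""
    n_groups = max(group) + 1
    priors, mu = [0.1] * n_groups, 2.0  # arbitrary starting values
    for _ in range(n_iter):
        resp = posterior_alt(z, group, priors, mu)  # E-step
        totals, sizes = [0.0] * n_groups, [0] * n_groups
        for r, g in zip(resp, group):
            totals[g] += r
            sizes[g] += 1
        priors = [t / s for t, s in zip(totals, sizes)]  # M-step: group priors
        mu = sum(r * zi for r, zi in zip(resp, z)) / sum(resp)  # M-step: shift
    local_fdr = [1.0 - r for r in posterior_alt(z, group, priors, mu)]
    return local_fdr, priors, mu


def select(local_fdr, target):
    """Indices of the largest set whose average local fdr is at most target."""
    order = sorted(range(len(local_fdr)), key=local_fdr.__getitem__)
    running, chosen = 0.0, 0
    for k, i in enumerate(order, start=1):
        running += local_fdr[i]
        if running / k > target:
            break
        chosen = k
    return order[:chosen]


rng = random.Random(1)
z, flagged, is_alt = simulate(20_000, rng)
results = {}
for name, group in (("uniform prior", [0] * len(z)), ("covariate prior", flagged)):
    local_fdr, priors, mu = fit_local_fdr(z, group)
    hits = select(local_fdr, target=0.10)
    fdp = sum(not is_alt[i] for i in hits) / max(len(hits), 1)
    results[name] = (len(hits), fdp)
    print(f"{name}: {len(hits)} discoveries, realized FDP {fdp:.3f}")
```

With the enrichment assumed here (30% true alternatives among flagged items versus 2% elsewhere), the covariate-aware fit should return more discoveries than the uniform one at the same target, because it can afford a more lenient cutoff inside the enriched group. A real analysis would need a more flexible distribution of effect sizes and a proper logistic regression for continuous or multiple covariates.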

Fundamentally, it is misleading to see FDR as a method for 'correcting for the devastating effects of multiple comparisons'. It is far more accurate to see it as a method that takes advantage of the multiple-comparison setting to derive a frequentist control that is much more useful than the control of type I error.

So, yes, I am definitely up for a multiple-testing party!
