Imagine that you want to test for the presence of positive selection in a given gene in the human lineage, based on an alignment of protein-coding sequences and using a codon model. Your data consist of aligned sequences from 6 mammals (human, chimp, macaque, mouse, rat and dog). You apply the branchwise test of codeml (Nielsen and Yang, 1998) on this alignment, taking the branch leading to humans after the divergence from chimps as the focal branch. The test returns a p-value of 0.01 against the null model (i.e. the model in which the human lineage behaves just like all the other branches making up the background).
With a p-value of 0.01, is it legitimate to consider that this gene is under positive selection in the human lineage? I guess that most of us would consider that it is.
Yet, in the context of a genome-wide scan for positive selection, based on the complete genomes of the same 6 mammalian species, Kosiol et al. (2008) found that, in order to control the false discovery rate (FDR) at 5% when detecting episodes of positive selection along the human lineage, one has to impose a significance threshold of p<0.000025. This threshold is 400 times smaller than 0.01. I have not made the exact calculation here, but I would not be surprised if a threshold of 0.01 in this context implied an FDR of more than 50%, and a local fdr (i.e. among findings with a p-value right at 0.01) of more than 80% -- for the purpose of the following discussion, let us assume that this is the case.
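To make this concrete, here is a minimal sketch (in Python, purely illustrative, and not the actual computation performed by Kosiol et al.) of how such a calibration works under the standard two-groups model: the tail-area FDR at a threshold t is roughly pi0 * t divided by the observed fraction of tests with p <= t, while the local fdr compares densities rather than tail areas around t.

    import numpy as np

    def tail_fdr(pvalues, t, pi0=1.0):
        """Estimated FDR among all tests called significant at threshold t.

        pi0 is the assumed fraction of true nulls (conservatively 1.0).
        Under the null, p-values are uniform, so the expected fraction of
        null tests with p <= t is pi0 * t; dividing by the observed
        fraction of tests with p <= t gives the tail-area FDR estimate.
        """
        pvalues = np.asarray(pvalues, dtype=float)
        frac_rejected = max(np.mean(pvalues <= t), 1.0 / len(pvalues))
        return min(pi0 * t / frac_rejected, 1.0)

    def local_fdr(pvalues, t, width=0.005, pi0=1.0):
        """Rough local fdr: expected fraction of true nulls among tests
        whose p-value falls in a small window around t."""
        pvalues = np.asarray(pvalues, dtype=float)
        lo, hi = t - width, t + width
        observed = max(np.mean((pvalues >= lo) & (pvalues <= hi)), 1.0 / len(pvalues))
        expected_null = pi0 * (hi - lo)   # uniform null density on [0, 1]
        return min(expected_null / observed, 1.0)

With the genome-wide collection of branch-test p-values in hand, one would simply call tail_fdr(pvalues, 0.01) and local_fdr(pvalues, 0.01) to check whether the 50% and 80% figures assumed above are in the right ballpark.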
Thus, if your gene of interest were analyzed, not in isolation, but in conjunction with all other protein-coding sequences across these 6 mammalian genomes, then it would not have been considered significantly under positive selection. Instead, one would have concluded that it was most probably a false discovery.
So, what should we do here? Are we supposed to adopt double standards and react differently when confronted with the same p-value, depending on whether it comes from a single-gene or a multiple-gene analysis? Or should we somehow transpose what we have obtained in the context of the FDR-controlled genome-wide scan to a new individual case? More fundamentally, what are we to conclude here concerning the presence of positive selection on our gene in the human lineage?
At first sight, one could argue that, technically, the two situations are not comparable. FDR is often presented as something meant to correct for multiple comparisons. When analyzing a gene in isolation, there are no multiple comparisons involved, and therefore no correction is needed. Thus, one should not transpose FDR results to a single case.
However, I think that this would be a fallacious argument. Fundamentally, what the FDR calibration tells you is that 50% of the genes for which p<0.01 are false discoveries in the present context. If there is nothing special about your gene of interest, then it is just like any gene taken at random among those with p<0.01, and therefore it has at least a 50% chance of being a false discovery (in fact, closer to 80%, since its p-value is right at 0.01). As far as I can see, it does not matter whether this gene was tested in isolation or in conjunction with other genes.
Personally, I think that it is misleading to see FDR just as a way to 'correct for multiple comparisons'. For me, FDR is fundamentally an empirical Bayes posterior probability: the probability that a case is a true null given that it was rejected. This posterior probability is empirical in the sense that it is based on a prior that has been empirically estimated on a large series of cases. But apart from that, it is just an evidential probability.
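In symbols, using the standard two-groups notation (my own gloss, not notation taken from the papers cited above), Bayes' theorem gives:

    \mathrm{FDR}(t) \;=\; P(\mathrm{null} \mid p \le t)
                    \;=\; \frac{\pi_0 \, P(p \le t \mid \mathrm{null})}{P(p \le t)}
                    \;=\; \frac{\pi_0 \, t}{F(t)}

where pi_0 is the prior probability that a gene is a true null and F(t) is the overall probability of obtaining p <= t. Both quantities are estimated from the genome-wide collection of p-values, which is precisely what makes this posterior probability 'empirical Bayes'.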
Seen from this perspective, it becomes clear that the only difference between the multiple-testing and the single-gene settings is that the multiple-testing framework gives you the means to estimate your priors -- and therefore, in the present case, the means to know that you are in an unfavorable situation, in which a p-value of 0.01 just does not represent sufficient evidence for positive selection. In contrast, the single-gene experiment merely deprives you of the means to know that a p-value of 0.01 does not mean much in the present case -- but of course, not being aware of a problem does not imply that the problem does not exist.
Now, you could still raise one objection against this argument: you could claim that a single-gene analysis differs from a genome-wide scan in that the gene under study has generally not been chosen uniformly at random among all possible genes. After all, one of the reasons why the FDR threshold is so stringent is that the overall probability that a gene taken at random along the genome is under positive selection in humans is just very low. It therefore takes strong empirical evidence to turn this very low prior probability into a reasonably high posterior probability. However, your gene may be different in this respect: you may have decided to study it precisely because you have good reason to believe that it is under positive selection -- for instance, a gene involved in the immune response against pathogens. If this is the case, then perhaps you could be a bit less stringent when interpreting the p-value returned by the test for that particular gene.
Maybe. But then, first, you should note that this is fundamentally a Bayesian argument, which amounts to invoking a special prior in the particular case of your gene of interest.
Second, assuming that you are willing to play this Bayesian game, it remains to be determined how you can justify your special prior. The typical subjective Bayes approach, relying on introspection and verbal arguments, may be fine in situations where one can show that the end result is not too sensitive to the prior anyway. But here, that is not the case: the prior is precisely the critical aspect of the question; it is the only factor that can make a difference between your gene and any other gene with p=0.01.
For me, if you really want to claim that immunity-related genes are different and more likely to be under positive selection, then you have to let the data speak for themselves on this issue. In practice, this means that, instead of searching your soul and coming up with 'your' prior probability for immunity-related genes to be under positive selection, you should use some well-accepted gene ontology, select the subset of genes that are involved in immunity, and redo the FDR analysis on this subset of genes. If genes involved in immunity are indeed more likely to be under positive selection, this will automatically show up in the form of a less stringent significance threshold compared to the global genome-wide analysis -- perhaps up to the point that your gene will now be included among the genes significantly under positive selection at an FDR of 5% or 10%.
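Here is a minimal sketch of what such a subset re-analysis could look like, using the Benjamini-Hochberg procedure (the function names and the dict/set inputs are mine, purely for illustration; the actual scan could of course use any FDR machinery):

    import numpy as np

    def bh_threshold(pvalues, fdr=0.05):
        """Largest p-value cutoff satisfying the Benjamini-Hochberg criterion
        (returns 0.0 if nothing can be rejected at the requested FDR)."""
        p = np.sort(np.asarray(pvalues, dtype=float))
        m = len(p)
        below = p <= fdr * np.arange(1, m + 1) / m
        return p[below].max() if below.any() else 0.0

    def subset_threshold(pvalue_by_gene, subset, fdr=0.05):
        """Redo the FDR calibration on a subset of genes.

        pvalue_by_gene: dict mapping gene name -> branch-test p-value
        subset: set of gene names (e.g. genes annotated with an
                immunity-related GO term)
        """
        pvals = [p for gene, p in pvalue_by_gene.items() if gene in subset]
        return bh_threshold(pvals, fdr)

Comparing subset_threshold on the immunity-related genes with bh_threshold applied to all genes directly tells you whether the immunity subset really earns itself a more liberal significance threshold.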
But as long as you do not let empirical data confirm your prior guess, the claim that your gene is different simply has not been substantiated. In that case, the global genome-wide FDR control represents, by default, our best calibration for assessing the chances that your gene, with p=0.01, is indeed under positive selection -- and based on what this FDR calibration says, these chances appear to be rather low.
So, in conclusion, once the fundamental meaning of FDR as an evidential (posterior) probability is understood, it becomes clear that ignoring FDR calibrations derived from genome-wide analyses when interpreting the results of small-scale experiments is not rational. For me, it is like refusing to use what has been learned in the past on a large series of examples when considering a new, similar but isolated, case.
And indeed, there tends to be a surprising disconnect in our collective statistical practice, between the rather stringent thresholds, usually in the range of p<1e-4 or even p<1e-6, most often imposed in typical genome-wide analyses in which FDR control is explicitly implemented (like GWAS, or genome-wide scans for positive selection), and the substantially more liberal attitude, in the range of p<5e-2, most often adopted in the context of small-scale studies in which FDR is not explicitly considered.
The only possible logical justification for this difference in our attitude would be that we are generally confident that our small-scale studies typically concern genes, or SNPs, for which the prior odds in favor of the existence of a true effect are typically 100x higher than for a random gene or SNP along the genome. Yet, I am not totally sure that we have such good reasons to be so confident in our pre-selected candidates.
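To see what such a 100-fold prior boost would buy you, here is the odds arithmetic, taking as a purely illustrative starting point the local fdr of about 80% assumed earlier for a random gene at p=0.01:

    # Illustrative arithmetic only; the 0.8 starting value is the local fdr
    # assumed earlier for a random gene at p = 0.01, not an estimated figure.
    genome_wide_fdr = 0.8                                    # P(null | p = 0.01), random gene
    odds_null = genome_wide_fdr / (1.0 - genome_wide_fdr)    # 4:1 in favor of the null

    prior_boost = 100.0                                      # candidate gene: prior odds of a
    odds_null_candidate = odds_null / prior_boost            # true effect taken to be 100x higher

    candidate_fdr = odds_null_candidate / (1.0 + odds_null_candidate)
    print(candidate_fdr)                                     # about 0.04

In other words, a candidate gene at p=0.01 would land at roughly a 4% chance of being a false discovery only if it really did deserve a 100-fold prior boost over a random gene.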
I am not saying that one should always conduct FDR calibrations. Sometimes it is just too difficult to find a good reference set. But if, each time we can find a reasonably good reference set (and genes or SNPs represent an easy case in this respect), we are surprised by how stringent we have to be in order to control for a decent rate of false discovery, then perhaps it means something about how excessively liberal we tend to be when we don't have the means to know our true FDR.
==
Nielsen R, Yang Z (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.
Kosiol C, Vinar T, da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A (2008) Patterns of positive selection in six mammalian genomes. PLoS Genet 4(8): e1000144. doi:10.1371/journal.pgen.1000144
==
Very nice post!

"The only possible logical justification for this difference in our attitude would be that we are generally confident that our small-scale studies typically concern genes, or SNPs, for which the prior odds in favor of the existence of a true effect are typically 100x higher than for a random gene or SNP along the genome."
Actually, this argument would make me conclude that a single-gene experiment giving p=0.05 is NOT significant. Given the strong prior, and given that this prior is not taken into consideration by the test, one would rather expect a very small p-value if the effect were real.
Does this make sense?