## Wednesday, January 8, 2014

### Empirical Bayes and calibration

An obvious case where Bayesian inference has good frequentist properties is when the prior itself has a clear interpretation in terms of empirical frequencies.

As a simple example to illustrate the point, imagine that we want to estimate the allele frequencies at a neutral bi-allelic locus, based on a random sample of individuals genotyped at this locus. Say we have sampled $n$ individuals, and $k$ had allele 1 and $n-k$ allele 0, and want to estimate $\pi$, the frequency of allele 1 in the population. We could of course use the usual estimator for the proportion of a binomial, which is simply the sample frequency, $k/n$, and calculate a confidence interval using some standard method for the binomial case.

However, under certain conditions (randomly mating population, constant population size), we have in fact very good prior knowledge about the site frequency spectrum of polymorphisms across neutral loci -- with symmetric mutation rates, it is a beta distribution of parameter $\theta = 4N u$, where $N$ is the population size and $u$ the mutation rate. Therefore, if we know the population size and the mutation rate, we can use this theoretical site frequency distribution as a prior, combine it with the binomial likelihood of our observations, so as to obtain a posterior distribution (the posterior turns out to be again a beta in the present case).

This example is the perfect case, where everyone, frequentists and Bayesians, would agree on using Bayes rule. This estimation method makes an optimal use of all of the available information about the unknown allele frequency. On average, the posterior mean estimate will be better (in quadratic error) than if we had used the standard maximum likelihood estimator for a binomial proportion. As for the credibility interval, it is an exact confidence interval, so we have frequentist calibration.

It is perhaps important to point out exactly what kind of calibration/coverage we have here: we do not have coverage if we draw one value of $\pi$, then repeatedly draw $k$ out of $n$, always from this constant value of $\pi$, and finally calculate our credible interval on each replicate and check whether it contains the true value. Instead, we get good frequentist coverage only if we redraw $\pi$ from the prior each time. In other words, we have good coverage across an imaginary infinite ensemble of independent neutral loci (for 95% of them, the credible interval will contain the true value).

--

Now, imagine that we do not know $\theta = 4Nu$. But we have sequence data from many (unlinked, neutral, bi-allelic) loci across the genome for $n$ individuals.

Here, we can design a two-level model: the unknown frequencies across loci, $\pi_i$, $i=1..P$, are i.i.d. from a beta prior, which is itself parameterized by an unknown $\theta$. With sufficiently many loci, we will have enough information across loci to reliably estimate $\theta$. As a result, our posterior credible intervals for each of the $\pi_i$'s will essentially be the same as if we had used the true value of $\theta$.

This is called empirical Bayes (EB) -- Bayes because we are using Bayes theorem to infer $\pi_i$, empirical because we have estimated the prior distribution on $\pi_i$ empirically, using the data at hand. It is also more specifically a case of parametric empirical Bayes (PEB), because we have assumed a parametric form for the prior (a beta distribution), and even more specifically, a particular application of PEB for which we have a mechanistic justification: we are not using a beta distribution because we find it convenient, but because population genetics theory leads us to indeed expect a beta in that case.

Note that we can use either maximum likelihood or Bayesian inference on $\theta$ (in both cases, however, we are being Bayesian with respect to the $\pi_i$'s). There are discussions about which is best, but asymptotically, this does not matter: they are asymptotically equivalent. For me, the choice between them is mostly driven by practical considerations (depending on what is computationally more convenient).

Thus, empirical Bayes means that we borrow strength across loci to estimate $\theta$, which in turn improves our local application, for each locus, of Bayes rule.

In terms of calibration: our dataset is essentially a finite-but-large version of the imaginary infinite ensemble of independent neutral loci mentioned above. In addition, empirical Bayes entails a finite-but-accurate estimation of $\theta$. Therefore, for sufficiently many loci, our collection of credible intervals across loci give us good coverage across loci -- in other words, under empirical Bayes, we have good group-wise calibration.

--

Finally, what if we do not trust the beta prior ? After all, this prior was derived based on fairly strong assumptions about the demography and the neutralilty of the loci. Our estimation may not be very robust to violations of these assumptions.

If we do not trust the beta prior, we can use a non-parametric empirical Bayes approach (NPEB). Essentially, the idea of NPEB is to estimate an arbitrary distribution of allele frequencies across loci directly from the data, without making the assumption that it belongs to some parametric family. One way to implement it is to explicitly invoke a prior over the space of all possible distributions for the $\pi_i$'s: e.g., a Dirichlet process mixture*.

Again, here, in principle, we get good group-wise calibration asymptotically. I say "in principle", because this depends on many technical conditions, essentially concerning asymptotic consistency of the non-parametric estimation of the unknown distribution.

--

In summary, parametric or non-parametric, the general idea of empirical Bayes is to borrow strength across observations (here, across loci) to somehow make Bayesian inference at the level of each observation (here, at the level of each locus) well justified by frequentist standards.

As Efron puts it, large datasets "carry with them their own prior".

Now, the next step is to identify the interesting problems currently faced by evolutionary biologists, which can be seen as particular instances of empirical Bayes, and for which we can therefore expect good frequentist properties for posterior probability evaluations. And there are in fact quite a few of them.

---

*In fact, here, with a binomial likelihood, there are issues about how much of this unknown distribution we can estimate, depending on the exact experimental settings. Also, to be accurate, the original non-parametric empirical Bayes idea of Robbins is a bit different -- it does not really attempt to estimate the distribution, it somehow eliminates it.