## Wednesday, January 8, 2014

### Empirical Bayes and shrinkage

One thing I did not mention in my last post is that empirical Bayes is fundamentally a shrinkage approach.

Shrinkage relates to the idea that you can improve an estimator by combining it with other information. In the example of the last post, when the population parameters are not known, estimating allele frequencies at a given locus can be improved by pooling similar problems (other unlinked loci), so as to estimate the population parameters across loci, and then use this information about the population for the estimation at each locus.

Shrinkage entails some tradeoff: you get better properties averaged over the group, at the expense of good frequentist properties for each item. The tradeoff is thus between point-wise and group-wise frequentist properties.

There is an interesting story about Stein's paradox, shrinkage and empirical Bayes (see Brad Efron and Carl Morris on the subject, references below).

In brief, here is the idea.

Suppose that we observe $X$, normally distributed with unknown mean $\theta$ and known variance $\sigma^2$: $X\sim N(\theta, \sigma^2)$. We want to estimate $\theta$. Obviously, $\hat \theta = X$ is our best estimator.

Suppose now that we observe $X_i \sim N(\theta_i, \sigma^2)$, for $i = 1, \dots, N$. We observe all of the $X_i$'s and want to estimate all of the $\theta_i$'s. Obviously, we can use $\hat \theta_i = X_i$ for each $i$ (since these are just independent copies of the same problem).

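Stein's paradox basically says that, for $N \ge 3$, our joint estimation is inadmissible, in the sense that some other estimator always has a lower mean squared error, summed over the $N$ components, whatever the true values of the $\theta_i$'s. We can do better, in particular, by shrinking toward the empirical mean of our observations, $\bar X$ (this version needs $N \ge 4$, since one degree of freedom is spent on estimating the mean), e.g.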

$\hat \theta_i = \bar X + \left( 1 - \frac{N-3}{N V} \sigma^2 \right) (X_i - \bar X)$

where $V = \frac{1}{N} \sum_i (X_i - \bar X)^2$ is the empirical variance of the observations.
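To make the formula concrete, here is a minimal Python sketch of this estimator. The function name `stein_shrink` and the toy numbers are mine, chosen only for illustration; the code simply transcribes the formula above, using the fact that $N V$ is the sum of squared deviations from $\bar X$.

```python
from statistics import fmean


def stein_shrink(xs, sigma2):
    """Shrink each observation toward the empirical mean.

    xs     : list of observations X_i
    sigma2 : known sampling variance sigma^2
    (illustrative helper, not taken from any library)
    """
    n = len(xs)
    xbar = fmean(xs)
    # N * V: sum of squared deviations from the empirical mean
    ss = sum((x - xbar) ** 2 for x in xs)
    if ss == 0.0:
        # all observations identical: nothing to shrink
        return list(xs)
    # factor multiplying (X_i - Xbar); equals 1 when N = 3
    factor = 1.0 - (n - 3) * sigma2 / ss
    return [xbar + factor * (x - xbar) for x in xs]


# Toy data: Xbar = 3, N * V = 10, so the factor is 1 - 2/10 = 0.8
print(stein_shrink([1.0, 2.0, 3.0, 4.0, 5.0], 1.0))
```

On these toy data, each observation is pulled 20% of the way back toward the mean of 3.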

Stein's shrinkage estimator can in fact be seen as an empirical Bayes procedure, under the following two-level model:

$X_i \sim N(\theta_i, \sigma^2)$

$\theta_i \sim N(\mu, \tau^2)$

Thus, from a Bayesian perspective, Stein's estimator amounts to assuming that the $\theta_i$'s are themselves normally distributed, with unknown mean and variance. This unknown mean and variance are estimated by pooling the observations, and are then used to shrink the estimate of each of the $\theta_i$'s.
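For what it is worth, here is a sketch of how the two-level model leads back to the shrinkage formula. The symbols $B$ and $S$ are introduced here only for convenience. If $\mu$ and $\tau^2$ were known, the posterior mean of $\theta_i$ would be

$E[\theta_i \mid X_i] = \mu + (1 - B) (X_i - \mu), \quad B = \frac{\sigma^2}{\sigma^2 + \tau^2}$

Marginally, $X_i \sim N(\mu, \sigma^2 + \tau^2)$, so $\bar X$ is a natural estimate of $\mu$, and

$S = \sum_i (X_i - \bar X)^2 = N V \sim (\sigma^2 + \tau^2) \, \chi^2_{N-1}$

Since $E[1/\chi^2_k] = 1/(k-2)$ for $k > 2$, we have, for $N \ge 4$,

$E\left[ \frac{N-3}{S} \right] = \frac{1}{\sigma^2 + \tau^2}$

so that $\hat B = \frac{N-3}{N V} \sigma^2$ is an unbiased estimate of $B$. Plugging $\bar X$ and $\hat B$ into the posterior mean gives back the estimator above.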

Why should shrinkage work, given that our $\theta_i$'s are not assumed to be generated by some common mechanism? Somehow, the fact that there is no condition on the $\theta_i$'s seems to imply that we will improve our estimation even by pooling things that have absolutely nothing to do with each other (this is why it is called a paradox).

I think I still need to understand this point, but in any case, a key property of Stein's shrinkage estimator is that the amount of shrinkage depends inversely on the empirical variance of the observations, $V$. Practically, this means that, if the $\theta_i$'s really have nothing to do with each other, then $V$ will typically be large, in which case the shrinkage effect essentially vanishes: for large $V$, $\hat \theta_i \simeq X_i$. [PS: this is not really an explanation, though, more like a Bayesian interpretation of a mathematical result...]
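A quick numerical illustration of this dependence on $V$, as a sketch with made-up numbers: the helper `shrinkage_factor` (a name of my own) returns the multiplier applied to $X_i - \bar X$ in the estimator, for one tightly clustered and one widely spread set of observations.

```python
from statistics import fmean


def shrinkage_factor(xs, sigma2):
    """Multiplier of (X_i - Xbar) in the estimator: 1 - (N-3) sigma^2 / (N V).

    Illustrative helper; assumes the observations are not all identical.
    """
    n = len(xs)
    xbar = fmean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)  # equals N * V
    return 1.0 - (n - 3) * sigma2 / ss


# Hypothetical observations, both with sigma^2 = 1
close = [-2.0, -1.0, 0.0, 1.0, 2.0]           # N * V = 10
spread = [-200.0, -100.0, 0.0, 100.0, 200.0]  # N * V = 100000

print(shrinkage_factor(close, 1.0))   # noticeably below 1: real shrinkage
print(shrinkage_factor(spread, 1.0))  # very close to 1: almost no shrinkage
```

With the clustered values the factor is 0.8, whereas with the spread-out values it is within $2 \times 10^{-5}$ of 1, so the estimates are essentially the raw observations.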

In any case, what Stein's paradox implies is just that, at least for normal means with the same known variance, shrinkage, if properly done, never costs you anything (in terms of group-wise performance). Stein's estimator has optimal self-tuning properties -- at worst, if not relevant in the context of the problem of interest, shrinkage automatically fades out. However, this is guaranteed only for normal means with the same known variance. For more complicated cases, it is a bit less obvious.
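This group-wise claim can be checked by simulation. The sketch below is only an illustration under assumptions of my own (the choice of $\theta_i$'s, $N = 10$, $\sigma = 1$, the number of replicates and the seed are all arbitrary): it compares the total squared error of the shrinkage estimator with that of the raw observations, once for identical means and once for widely separated ones.

```python
import random
from statistics import fmean


def stein_shrink(xs, sigma2):
    """Shrink observations toward their empirical mean (illustrative helper)."""
    n = len(xs)
    xbar = fmean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)  # equals N * V
    factor = 1.0 - (n - 3) * sigma2 / ss
    return [xbar + factor * (x - xbar) for x in xs]


def risk_ratio(thetas, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo ratio: squared error of shrinkage / squared error of X_i."""
    rng = random.Random(seed)
    loss_raw = 0.0
    loss_stein = 0.0
    for _ in range(reps):
        # simulate X_i ~ N(theta_i, sigma^2)
        xs = [rng.gauss(t, sigma) for t in thetas]
        est = stein_shrink(xs, sigma ** 2)
        loss_raw += sum((x - t) ** 2 for x, t in zip(xs, thetas))
        loss_stein += sum((e - t) ** 2 for e, t in zip(est, thetas))
    return loss_stein / loss_raw


similar = [0.0] * 10                       # all means identical
unrelated = [10.0 * i for i in range(10)]  # means far apart

print("similar means:  ", risk_ratio(similar))
print("unrelated means:", risk_ratio(unrelated))
```

With identical means the ratio should come out well below 1, and with widely separated means it should be very close to 1: shrinkage helps a lot when it is relevant and fades out when it is not.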

In practice, my feeling is that shrinkage is generally a good idea as long as we ensure good self-tuning properties for our shrinkage procedures. In fact, if you think about it, this is the whole point of Bayesian hierarchical priors: you can see them as sophisticated self-tuning devices trying to shrink the estimation in all possible directions -- across loci, across genes, across branches of the phylogeny, across quantitative traits, etc. They are self-tuning in the sense that, through their hyperparameters, they can detect and regulate the amount of shrinkage to impose on the estimation.

In a sense, the fundamental point of hierarchical Bayes is to push the idea of shrinkage as far as it can go -- and this, hopefully for the best. Conversely, however, all this also suggests that, once hierarchical priors have been proposed in the context of a particular estimation problem, then, ideally, the frequentist properties of the resulting shrinkage estimator should perhaps be more carefully assessed.

--

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press.

Efron, B. and Morris, C. (1977). Stein's paradox in statistics. Scientific American.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors -- an empirical Bayes approach. Journal of the American Statistical Association, 68:117-130.