Can we, under certain conditions, interpret posterior probabilities in frequentist terms? That is, can we ensure that, under controlled simulation settings, 95% of our 95% credible intervals turn out to contain the true value, or that 95% of the clades inferred to be monophyletic with a posterior probability of 0.95 turn out to be indeed monophyletic?

If our posterior probabilities have this property, then they are said to be *calibrated*.

Of course, if we conduct simulations under parameter values drawn from a given prior and reanalyze the dataset thus simulated by Bayesian inference under the same prior, then calibration will obtain, but this is trivial.
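
To make this trivial form of calibration concrete, here is a minimal sketch under assumptions of my own choosing: a toy model with normally distributed data of known standard deviation 1, a conjugate normal prior on the mean, and a hypothetical function name `coverage_under_prior`. None of this comes from a real phylogenetic analysis; it only illustrates that, when the true value is drawn from the very prior used in the analysis, the 95% credible interval should contain it about 95% of the time.

```python
import random
from statistics import NormalDist


def coverage_under_prior(n_obs=5, n_rep=20000, prior_mean=0.0, prior_sd=1.0,
                         level=0.95, seed=1):
    """Fraction of central credible intervals that contain the true mean,
    when the true mean is itself drawn from the prior used in the analysis.

    Toy model (an assumption of this sketch): x_i ~ Normal(theta, 1).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    prior_prec = 1.0 / prior_sd ** 2
    post_prec = prior_prec + n_obs             # each observation has precision 1
    post_sd = post_prec ** -0.5
    hits = 0
    for _ in range(n_rep):
        theta = rng.gauss(prior_mean, prior_sd)   # true value drawn from the prior
        xbar = rng.gauss(theta, n_obs ** -0.5)    # sample mean of n_obs observations
        post_mean = (prior_prec * prior_mean + n_obs * xbar) / post_prec
        if abs(theta - post_mean) <= z * post_sd:
            hits += 1
    return hits / n_rep


print(coverage_under_prior())
```

The printed frequency should be close to 0.95 even with only five observations per dataset, which is exactly why this version of calibration tells us little.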

A more interesting type of calibration is defined by the idea of conducting Bayesian inference on datasets simulated under *arbitrary* values of the parameters (but under the true model). Calibration then simply means that credible intervals are true frequentist confidence intervals, and that posterior probabilities are equivalent to local true discovery rates.

Note that there are often several ways to define what we mean by calibration. This is particularly the case for complex hierarchical models, where the limit between the likelihood part of the model, which should be matched between simulation and re-analysis, and the prior part of the model, which should not be simulated from because that would be too easy, can be ambiguous. But then this makes the question even more interesting.
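
The stronger notion of calibration, at an arbitrary fixed parameter value, can be sketched on the same kind of toy example. The model (normal data with known standard deviation 1, normal prior on the mean), the function name `coverage_at_fixed_theta` and the value `theta=3.0` are assumptions made purely for illustration. In this simple case the frequentist coverage of the 95% credible interval has a closed form, so no simulation is needed.

```python
from statistics import NormalDist


def coverage_at_fixed_theta(theta, n_obs, prior_mean=0.0, prior_sd=1.0,
                            level=0.95):
    """Exact frequentist coverage of the central credible interval for the
    mean, at a fixed true value theta, for data x_i ~ Normal(theta, 1)
    analyzed under a Normal(prior_mean, prior_sd) prior."""
    std = NormalDist()
    z = std.inv_cdf(0.5 + level / 2)
    prior_prec = 1.0 / prior_sd ** 2
    post_prec = prior_prec + n_obs
    post_sd = post_prec ** -0.5
    w = n_obs / post_prec                 # weight of the data in the posterior mean
    # Posterior mean = w * xbar + (1 - w) * prior_mean, with xbar ~ N(theta, 1/n),
    # so across datasets the interval center is itself normally distributed.
    center_mean = w * theta + (1 - w) * prior_mean
    center_sd = w * n_obs ** -0.5
    upper = (theta + z * post_sd - center_mean) / center_sd
    lower = (theta - z * post_sd - center_mean) / center_sd
    return std.cdf(upper) - std.cdf(lower)


# True value placed far from the prior mean, for increasing dataset sizes.
for n in (1, 10, 100, 10000):
    print(n, round(coverage_at_fixed_theta(theta=3.0, n_obs=n), 4))
```

With a true value three prior standard deviations away from the prior mean, the interval covers far less often than 95% for a single observation, and the coverage approaches 95% as the dataset grows. At a true value equal to the prior mean, the same function returns a coverage above 95% for small datasets.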

The attitude of many Bayesians is to dismiss the concept of calibration as irrelevant -- probabilities are just not meant for that. Yet, for me, calibrated posterior probabilities represent a worthy prospect. Calibrated Bayes would greatly facilitate the interpretation of probabilistic judgements reported in the scientific literature. It would make it possible to compare the uncertainties reported by Bayesian and non-Bayesian statistical approaches on an equal footing. It would make Bayesian inference look globally more convincing to many scientists. So, in my opinion, we should definitely care about calibration.

In general, posterior credible intervals, and more generally posterior probabilities, are not calibrated. Nevertheless, they can be approximately calibrated under certain conditions. In particular, ~~they~~ *posterior credible intervals* are asymptotically calibrated under very general conditions. [*P.S. I realize that results on asymptotic calibration are valid only for credible intervals on continuous parameters; I think it is not true in other cases*].

My current feeling is that, for many practical situations that we currently face in evolutionary biology, and under reasonable priors, posterior probabilities are reasonably well-calibrated. I think that this point tends to be under-estimated, or at least under-explored.

The more general question of the frequentist properties of Bayesian inference is the subject of an immense literature (a good entry point is Bayarri and Berger, 2004). However, most often, the accent is on asymptotic consistency, admissibility, minimaxity, minimization of some frequentist risk, etc, which are mostly properties of point estimates, or at least properties related to how *good* the estimation is. On the other hand, the calibration of posterior probabilities is a slightly different question, concerned with the *fairness* of the uncertainty attached to the estimation. This more specific point appears to be a bit less explored or, at least, not so often discussed and reviewed in non-technical papers.

One possible reason for this lack of emphasis on calibration is that, traditionally, Bayesians don't really want to care about it because it is not supposed to be the way you should interpret posterior probabilities. And classical frequentists, if they care about the question, tend to emphasize where posterior probabilities fail to be well-calibrated. Which is of course an important thing to be aware of, but still, one would also like to know the positive aspects.

Another problem is that theoretical papers tend to be fairly strict in what they deem to be good calibration properties. Yet, from a more practical standpoint, I think that one would be content with calibration properties that are perhaps not *very* good, but at least *reasonably* good.

For instance, I said above that posterior credible intervals are asymptotically calibrated. The convergence is in fact relatively slow, with a coverage error that decreases only as $1/\sqrt{n}$, where $n$ is the size of the dataset, but it can be improved by using so-called probability-matching priors (Kass and Wasserman, 1996). In one-dimensional cases, the matching prior turns out to be Jeffreys' prior, but in higher-dimensional settings, things get considerably more complicated, and a lot of work has been devoted to the question (Datta and Sweeting, 2005). However, all this work sounds disproportionately complicated compared to what we might need in practical situations. In a post-genomic era, one can easily expect to have tens of thousands of sites, thousands of genes, or hundreds of species or individuals to analyze. In this context, using standard diffuse priors on the global parameters of the model (we will probably never implement complicated probability-matching priors anyway) may be sufficient for us to reach effective asymptotic calibration.
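
The $1/\sqrt{n}$ rate can be seen on a toy example, again under assumptions of my own: normal data with known standard deviation 1, a one-sided 95% upper credible bound on the mean, and a hypothetical function `upper_bound_coverage`. An informative Normal(0, 1) prior is compared with a flat prior, obtained here as the limit of an infinite prior standard deviation, which is Jeffreys' prior for this location model. This is a special case in which the matching prior gives exact coverage at any $n$; in general a matching prior only improves the rate.

```python
from math import inf, sqrt
from statistics import NormalDist


def upper_bound_coverage(theta, n_obs, prior_sd, prior_mean=0.0, level=0.95):
    """Exact frequentist coverage of the one-sided upper credible bound for
    the mean, at a fixed true value theta, for data x_i ~ Normal(theta, 1).
    prior_sd = inf gives the flat prior."""
    std = NormalDist()
    z = std.inv_cdf(level)
    prior_prec = 1.0 / prior_sd ** 2      # evaluates to 0.0 when prior_sd is inf
    post_prec = prior_prec + n_obs
    post_sd = 1.0 / sqrt(post_prec)
    w = n_obs / post_prec
    # Upper bound U = w * xbar + (1 - w) * prior_mean + z * post_sd.
    # theta <= U holds exactly when xbar is at or above this threshold:
    threshold = (theta - (1 - w) * prior_mean - z * post_sd) / w
    # xbar ~ N(theta, 1/n), so standardize and take the upper tail.
    return 1.0 - std.cdf((threshold - theta) * sqrt(n_obs))


for n in (100, 10000):
    informative = upper_bound_coverage(theta=1.0, n_obs=n, prior_sd=1.0)
    flat = upper_bound_coverage(theta=1.0, n_obs=n, prior_sd=inf)
    print(n, 0.95 - informative, 0.95 - flat)
```

Under the informative prior, multiplying $n$ by 100 divides the coverage error by roughly 10, as expected for a $1/\sqrt{n}$ rate, and the error is already about one percentage point at $n = 100$. Under the flat prior the error is zero up to rounding.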

After all, in applied frequentist statistics, people often use quick-and-dirty methods for computing confidence intervals or p-values, and I think that everyone is content with that. Most such methods are valid only asymptotically, and even then, they may not be exact. In itself, claiming the right to make the same dirty deals as our frequentist neighbors is not necessarily the best argument. But I guess the most important point here is that approximate confidence measures are good enough for most practical purposes, as long as their *meaning* is the same across both Bayesian and non-Bayesian statistical activity and can be assessed by objective methods.

Of course, there are certainly also many practical situations where posterior probabilities are not reasonably, not even qualitatively, well-calibrated. But then, we should have a better idea of when this is the case, why it is the case, and possibly, what could be done in order to obtain more acceptable posterior probability evaluations by frequentist standards in such situations.

---

Bayarri, M. J., & Berger, J. O. (2004). The interplay between Bayesian and frequentist analysis. Statistical Science, 19:58-80.

Datta, G. S., & Sweeting, T. J. (2005). Probability matching priors. Handbook of Statistics.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91:1343-1370.