Monday, February 10, 2014

Parameter estimation: optimizing versus conditioning

Misunderstandings about how Bayesian inference works are very common. In particular, people often don't understand that, in Bayesian inference, you do not have the problem of parameter-rich models being inherently more prone to over-fitting.

The thing is, much of the currently accepted wisdom about statistics has been acquired from a nearly universal and exclusive use, over decades, of classical methods such as least squares or maximum likelihood approaches, all of which are based on the idea of optimizing a score (e.g. maximizing a likelihood or minimizing a sum of squares).

Bayesian inference, however, works differently. It does not rely on any optimization principle. In fact, strictly speaking, in the context of Bayesian inference, you do not fit a model to the data. Instead, you condition a model on the data. And this conditioning, as opposed to optimizing, means a radically different logic of inference.

I think that this logical difference is one of the main reasons why common intuition, being too reliant on a long-lasting experience of optimization-oriented paradigms, regularly fails when trying to understand some key aspects of Bayesian inference -- and in particular, totally fails when it comes to correctly visualizing how Bayesian inference deals with model complexity.

Optimization is, in some sense, an aggressive methodology, selecting the single parameter configuration that best explains the data. As a consequence, the more parameters you have, the more you can satisfy your greed -- which means in particular that optimizing an overly rich model will lead to fitting both the structural patterns (the signal) and the irrelevant aspects (the noise) in the data.

Conditioning, on the other hand, is more temperate. It works by eliminating, among all possible model configurations, those that are ruled out by the data, so as to keep all of those that are good enough (and not just the best one). One can see it very clearly in the context of approximate Bayesian computation: there, one repeatedly draws random model configurations from the prior, simulates data and discards those parameter values for which the simulation does not match the empirical data (up to a certain tolerance interval). Then, estimation, and more generally decision making, is typically done by averaging over all configurations remaining after this elimination step.
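To make the elimination step concrete, here is a minimal sketch of rejection-based approximate Bayesian computation. Everything specific in it is an assumption chosen for illustration and not part of the argument above: the model (normal data with known unit standard deviation), the uniform prior on the mean over [-5, 5], the use of the sample mean as the summary to match, the tolerance of 0.1, and the function name `abc_rejection`.

```python
import random
import statistics


def abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=0):
    """Keep every prior draw whose simulated data matches the observed data.

    Illustrative assumptions: data ~ Normal(mu, 1), prior mu ~ Uniform(-5, 5),
    and "matching" means the sample means differ by at most `tolerance`.
    """
    rng = random.Random(seed)
    n = len(observed)
    observed_mean = statistics.fmean(observed)
    kept = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)  # draw a configuration from the prior
        simulated = [rng.gauss(mu, 1.0) for _ in range(n)]  # simulate data
        # Discard configurations whose simulation does not match the data.
        if abs(statistics.fmean(simulated) - observed_mean) <= tolerance:
            kept.append(mu)
    return kept


# Synthetic "empirical" data, generated here with a true mean of 1.5.
data_rng = random.Random(42)
observed = [data_rng.gauss(1.5, 1.0) for _ in range(20)]

kept = abc_rejection(observed)

# Estimation averages over everything that survived the elimination step.
posterior_mean = statistics.fmean(kept)
print(len(kept), "configurations kept; posterior mean =", round(posterior_mean, 3))
```

Note that nothing in this loop searches for a best value of `mu`: a configuration is either ruled out or kept, and the estimate is an average over all those that were kept.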

Thus, optimization and conditioning could not be more different; a bit like positive versus negative selection. Optimization actively searches for the best fitting configuration and rules out everything else. Conditioning actively rules out what does not fit at all and keeps everything else.
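A toy case shows the contrast in numbers. The sketch below assumes a coin that came up heads in 3 flips out of 3, a uniform prior on the probability of heads, and a crude grid over that probability; all of these choices, and the variable names, are assumptions made for illustration only.

```python
# Toy data (an assumption for illustration): 3 heads out of 3 flips.
heads, flips = 3, 3

# Candidate values for the probability of heads, on a regular grid.
grid = [i / 1000 for i in range(1001)]

# Binomial likelihood of the data at each candidate value (the constant
# binomial coefficient is dropped, since it cancels in both estimates).
likelihood = [p**heads * (1 - p) ** (flips - heads) for p in grid]

# Optimization: keep the single best-fitting value, discard everything else.
best_index = max(range(len(grid)), key=likelihood.__getitem__)
p_optimized = grid[best_index]

# Conditioning under a uniform prior: weight every value by how well it
# survives the data, then average over all of them.
total = sum(likelihood)
p_conditioned = sum(p * w for p, w in zip(grid, likelihood)) / total

print("optimized:", p_optimized)
print("conditioned:", round(p_conditioned, 3))
```

The optimized estimate goes all the way to 1, the value that explains these three flips best, whereas the average over the retained configurations stays near 0.8, because many less extreme values have not been ruled out by so little data.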

Given such a fundamental difference between the two paradigms, one may expect that intuitions gained in one of the two contexts may be counter-productive in the other.

For instance, what happens under a parameter-rich model? In the context of an optimization-oriented approach, it is easy to tweak the parameters so as to make the model fit the irrelevant details of the data: you get over-fit. In contrast, in the context of Bayesian inference, such highly contrived parameter configurations will not be typical among the set of model configurations that have not been eliminated by conditioning, so that their impact on the final estimate or decision will be completely swamped by all other configurations that have been kept.

Therefore, the whole problem that rich models might over-fit irrelevant aspects of the data is just not present in a Bayesian framework. Or, to put it differently, the apparent propensity of parameter-rich models to over-fit is merely a consequence of aggressive optimization -- not an inherent problem of parameter-rich models.

Now, all this does not mean that Bayesian inference is necessarily better. In particular, Bayesian inference may well have replaced over-fitting problems by prior sensitivity issues (although this is exactly where hierarchical priors have something to propose).

But it does raise the following question (for you to ponder...): how much of what you consider to be obvious universal facts about statistics, such as the inherent ability of richer models to "have it easier" and their correlative propensity to over-fit, the fact that potentially useless extra parameters are necessarily harmful, the necessity of externally imposed penalizations on richer models, the bias-variance tradeoff, among other things, is in fact a consequence of the aggressive nature of one particular approach to statistics?

Again, this is a purely logical question, not a normative one. Some of the consequences of optimization (in particular, the possibility of playing with the bias-variance balance, which you cannot really do in a Bayesian context) may not be problematic at all. The point is just that one should perhaps not regard as "laws of nature" things that are in fact specific to one particular statistical paradigm.

More fundamentally, I find it interesting to exercise one's intuition by working out the differences between these two approaches to parameter estimation, by optimization or by conditioning. They offer complementary insights about statistical inference. Also, they can be combined (e.g. empirical Bayes by maximum marginal likelihood). Therefore, it is important to get used to both of them.

Statistical pluralism is not a problem, but an opportunity for us to diversify our thinking routines.


  1. This comment has been removed by the author.

  2. You can get overfitting in Bayesian models as well. In a practical setting, you are not operating in a fully Bayesian way. You are going to "optimize" the model hyper-parameters. It is true that you can sample and average over many hyper-parameters, but that is not feasible in many practical settings.

    As for the hierarchical priors, the matters are just delayed to a higher level. At some point, you are going to set your hyper-parameters and the hierarchy ends and you may be sensitive to the choices you need to make at the highest level.

    The deal breaker for both the optimization and the Bayesian methods is the choice of model. Bad model choices break both approaches. Given a good model, the inference mechanism typically has little practical impact.

    Speaking from experience, Bayesian methods are not good for operations. They are good for setting an idealized way of looking at a problem. That has its own value. And yes, the Bayesian theory is really beautiful, and impractical.

    There are exceptions, however: I could have used Gaussian processes and probabilistic message passing methods. But even then you have to optimize a cost function for the hyper-parameters that involves the logarithm of a determinant, which is horrible numerically.