Monday, January 27, 2014

Overcoming the fear of over-parameterization



I am always surprised to see how many phylogeneticists and evolutionary biologists are afraid of parameter-rich models. For some reason, nothing scares people more than the risk of overfitting the data. Sometimes, this goes to the point that, in spite of overwhelming evidence of model violations and systematic biases of all kinds, clearly suggesting that the models used are in fact awfully under-parameterized, people will nevertheless stick to rigid models, refusing to abandon them in favor of more flexible alternatives just because they are afraid of overfitting.

For me, shunning rich and flexible models out of fear of over-parameterization is pure superstition -- a bit like refraining from the pleasures of life out of fear of going to hell.

There are at least three different reasons why I think that this fear of over-parameterization is irrational.

First, although overfitting may indeed be a real concern in some contexts, in particular under classical maximum likelihood, its consequences are greatly overestimated. In fact, if one has to choose between under- and over-parameterization, the latter is, by far, the lesser of two evils. An over-parameterized model will typically result in high variance. In the context of a phylogenetic analysis, for instance, this will mean low bootstrap support values. Low bootstrap support may not always be good news, but at least one does not get strong support for false claims. In contrast, the use of an excessively rigid model often leads to bias, i.e. to strong support for wrong trees -- we have seen that all too often in applied phylogenetics over the last twenty years. Thus, in case of doubt, and if we want to be on the safe side, we should in fact prefer more, not less, parameter-rich model configurations.

Second, overfitting is much less likely to occur today than it used to be. Multi-gene, not to mention genome-wide, datasets almost certainly contain enough information to estimate even the richest models currently available in phylogenetics. Similar situations are probably encountered in many other domains of applied statistics today. I think that people have kept some Pavlovian reflexes from old times, when data used to be a scarce resource. Clearly, these reflexes are totally outdated.

Third, and more fundamentally, over-parameterization problems are avoidable -- parameter-rich models can easily be regularized. In particular, in a Bayesian framework, they can be regularized through the use of hierarchical priors.

The idea is very simple and can be illustrated by an analogy with the modeling of among-site rate variation in phylogenetic reconstruction. In this context, since maximizing the likelihood with respect to site-specific rates would result in overfitting, the likelihood is instead integrated over site-specific rates. The integral is taken over a distribution that is itself parameterized, so that its overall shape (in particular its variance) can be adjusted according to the information pooled across all sites. In practice, the distribution is classically chosen to be a gamma of mean 1, parameterized by its shape parameter $\alpha$, which tunes the variance of the distribution (Yang, 1993).
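To make this concrete, here is a minimal Python sketch of the discrete-gamma version of this integration. It assumes a placeholder function site_likelihood_given_rate(site, r), not shown here, that returns the likelihood of a single site given a relative rate r; the names and the number of categories are purely illustrative.

```python
# Minimal sketch of discrete-gamma averaging over site-specific rates.
# site_likelihood_given_rate(site, r) is a placeholder (not defined here)
# returning the likelihood of one site under relative rate r.

import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, ncat=4):
    """Median rates of ncat equal-probability categories of a
    Gamma(alpha, alpha) distribution (mean 1, variance 1/alpha)."""
    quantiles = (np.arange(ncat) + 0.5) / ncat
    rates = gamma.ppf(quantiles, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()   # renormalize so the mean rate is exactly 1

def site_log_likelihood(site, alpha, ncat=4):
    """Average the site likelihood over the rate categories, instead of
    maximizing over a free site-specific rate."""
    rates = discrete_gamma_rates(alpha, ncat)
    likes = np.array([site_likelihood_given_rate(site, r) for r in rates])
    return np.log(likes.mean())
```

The point is that only $\alpha$ (together with the topology and branch lengths of the full analysis) is estimated; the site-specific rates themselves never appear as free parameters.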

This model has a justification in the context of the classical maximum likelihood paradigm, as a random-effect model. But it can also be seen as a particular case of shrinkage. The amount of shrinkage is tuned by $\alpha$: if $\alpha$ is large, the variance of rates across sites is small, and shrinkage is strong. In turn, the $\alpha$ parameter is tuned by the data. In particular, if the sequences under study are such that there is no among site rate variation, then the estimated value of $\alpha$ will be very large, and rates across sites will be shrunk so strongly toward their mean that the model will effectively collapse into the uniform-rate model. In other words, shrinkage is self-tuned -- it automatically adjusts to what is needed by the data.
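Concretely, for a mean-one gamma the link between $\alpha$ and the amount of shrinkage can be made explicit:

$$ r_i \sim \mathrm{Gamma}(\alpha, \alpha), \qquad \mathbb{E}[r_i] = 1, \qquad \mathrm{Var}[r_i] = \frac{1}{\alpha}, $$

so letting $\alpha \to \infty$ drives the variance of the site-specific rates to zero and collapses them onto the common mean, which is exactly the uniform-rate model.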

Thanks to this self-tuning behavior, explicit model selection (here, between the uniform-rate and variable-rate models) is not even necessary: one can directly use the more complex model (with variable rates) by default, given that it will automatically reduce to the simpler one if needed*. Note how this contradicts accepted wisdom, according to which you are supposed to use the simpler model by default, unless it is firmly rejected by the data.

This idea of self-tuned shrinkage can be used much more generally. Consider for instance the situation where you want to reconstruct a phylogeny using ribosomal RNA sequences from many bacterial species. We know that there is substantial compositional variation among lineages. We also know that compositional variation is an important source of phylogenetic error, potentially leading to an artifactual grouping of species with similar GC composition. Thus, we want to accommodate variation in GC content over the tree. A simple phenomenological way to do that is to invoke a distinct equilibrium GC for the substitution process on each branch. Doing this entails quite a few additional parameters (one per branch), but we can shrink those GC parameters toward their mean over branches, e.g. by considering them i.i.d. from a distribution whose mean and variance are themselves hyperparameters of the model. The model will automatically regulate the level of shrinkage through the estimation of the variance parameter.
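As an illustration only (the names, the logit-normal choice for GC and the vague hyperpriors are my own assumptions, not a description of any particular program), the hierarchical part of such a model could be sketched as follows:

```python
# Sketch of the hierarchical (shrinkage) part of a branch-specific GC model.
# Distributional choices and names are illustrative assumptions.

import numpy as np
from scipy.stats import norm, expon
from scipy.special import expit   # inverse logit: maps the real line to (0, 1)

def log_prior(branch_logit_gc, mu, sigma):
    """Hierarchical log-prior density.

    branch_logit_gc : array with one logit-GC value per branch
    mu, sigma       : hyperparameters shared by all branches
    """
    # i.i.d. normal prior on the branch-specific logit-GC values:
    # a small sigma means strong shrinkage toward the common mean mu
    lp = norm.logpdf(branch_logit_gc, loc=mu, scale=sigma).sum()
    # vague hyperpriors on the hyperparameters themselves
    lp += norm.logpdf(mu, loc=0.0, scale=10.0)
    lp += expon.logpdf(sigma, scale=1.0)
    return lp

def branch_gc(branch_logit_gc):
    """Equilibrium GC of each branch, mapped back to (0, 1)."""
    return expit(branch_logit_gc)
```

In an MCMC run, sigma is sampled along with everything else: if the data show little compositional heterogeneity, its posterior concentrates near zero, the branch-specific GC values are shrunk toward the common mean, and the model effectively collapses onto a compositionally homogeneous one.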

Endless variations on that theme can be imagined. One can afford, not just an equilibrium GC, but more generally a distinct equilibrium frequency profile or even a distinct substitution matrix on each branch, for each gene, for each site, etc.

It is even possible to invoke several levels of shrinkage. For instance, in the context of a partitioned analysis, one would shrink branch lengths across partitions toward branch-specific means, with branch-specific variances that are themselves shrunk across branches through global means and variances. In this way, the empirical signal is ultimately funnelled onto a small set of hyperparameters, thus providing a very flexible tuning of the global structure and intensity of shrinkage.
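Again purely as a sketch, with made-up names and arbitrary distributional choices, the two-level version for branch lengths in a partitioned analysis might look like this:

```python
# Sketch of two-level shrinkage for branch lengths in a partitioned analysis.
# Names and distributional choices are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def log_prior_branch_lengths(log_bl, branch_mean, branch_sd,
                             global_mean, global_sd, sd_scale):
    """log_bl      : (n_partitions, n_branches) array of log branch lengths
    branch_mean : (n_branches,) branch-specific means (first level)
    branch_sd   : (n_branches,) branch-specific standard deviations
    global_mean, global_sd, sd_scale : top-level hyperparameters
    """
    # Level 1: partition-specific log branch lengths are shrunk toward a
    # branch-specific mean, with a branch-specific strength of shrinkage.
    lp = norm.logpdf(log_bl, loc=branch_mean, scale=branch_sd).sum()
    # Level 2: the branch-specific means and (log) standard deviations are
    # themselves shrunk toward global hyperparameters.
    lp += norm.logpdf(branch_mean, loc=global_mean, scale=global_sd).sum()
    lp += norm.logpdf(np.log(branch_sd), loc=np.log(sd_scale), scale=1.0).sum()
    return lp
```

All the empirical signal about how variable branch lengths are, across partitions and across branches, ends up funnelled into the three top-level hyperparameters.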

There are of course a few delicate issues here: how to choose the distributions, the architecture of the hierarchy, and the hyperpriors. In fact, in some cases, not doing these things properly can result in biases that might be worse than the excess variance that would have been incurred by boldly maximizing the likelihood without any consideration for shrinkage (although still not as bad as using a less parameter-rich model).

But in any case, what all this means is that over-parameterization is just not, in itself, a problem.

Now, I am not saying that parameter-rich models do not suffer from important limitations. In particular, complex models obviously raise substantial computational challenges. It is generally not very convenient to have as many substitution matrices as there are branches, genes or sites in a phylogenetic analysis. So, clearly, simpler models have an edge in terms of computational efficiency.

On a more philosophical note, it is also true that being able to capture the essence of a phenomenon through a compact model invoking as few parameters as possible is indeed an elegant approach to statistical modeling. I can of course see the value of such an ideal of conceptual parsimony.

But then, even if we sometimes want or need less parameter-rich models, this is for reasons other than just avoiding over-fitting. It is stupid to avoid parameter-rich models at all costs. Instead, one should learn to enjoy the freedom of choosing between rich and compact models, depending on what we want to do. Exactly as one can sometimes appreciate the idea of refraining from the pleasures of life, in praise of moderation and frugality -- but not out of fear of going to hell.

There are ways to deal with rich and flexible models. So, it is time to move on from our primitive fear of over-parameterization.

===

* The fact that the simpler model is obtained by setting $\alpha$ to infinity is a bit inconvenient, but this can easily be fixed by reparameterizing the model (e.g. in terms of $v = 1/\alpha$, so that $v=0$ is now the uniform-rate model).

Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution, 10(6), 1396–1401.

4 comments:

  1. I agree with you that there is a widespread fear of overparameterization in phylogenetics. Do you think much of it may be inherited from other fields where statistical inference is conducted primarily to predict things, as in machine learning? In such cases, overparameterization may be dangerous as it can lead to overfitting and poor predictive power. In phylogenetics however, as we rarely predict things but are interested in inferring parameters, I agree we should be much less worried than we are.

    1. I don't think the fear comes from machine learning. For example, Bayesian Learning for Neural Networks by Radford M. Neal, 1995, allows an infinite number of parameters, controlled by a prior.

      Statistical inference takes us from the known to an estimate about the unknown. If the known is the present, and the unknown is the future, we tend to call it 'prediction', but it is the same process. The main difference between machine learning and phylogenetics is that we rarely get to test predictions in phylogenetics, except on simulations, whereas in machine learning there is often a 'ground truth' for real data.

  2. Your argument is fully convincing, especially when considering "nuisance" parameters, whose true value/meaning is not of primary interest to the biologist. I think the main "historical" reason why people fear overparameterization appears when one wants to learn about the real world from parameter estimates. In this case, adding useless extra parameters can be harmful. A trivial example: fitting a mixture of two Gaussians to a data set that corresponds to a single Gaussian would return an irrelevant pair of (mean, sd) values. An analogy in the phylogenetic context: why use trees when we can use networks? Networks typically fit better - yet sometimes we want a tree. Along the same lines, wasn't your effort to turn CAT into a fixed-number-of-profiles model an attempt to learn about the molecular evolutionary process by reducing the number of parameters?
    Best wishes,
    Nicolas Galtier

    1. Hi Nicolas.

      Yes, I agree that what I am saying mostly concerns nuisance parameters. In phylogenetics, for instance, what I have in mind is the way we model all the heterogeneities of the substitution process (among genes, sites, branches, etc.), which are fundamentally nuisances when you want to estimate the topology of the tree.

      Concerning mixture models, the reasons for exploring finite mixtures were a bit more complicated: a mix of computational and statistical considerations. Also, as it turns out, finite mixtures don't work so well: my current experience is that the complete Dirichlet process version works better even on single-gene alignments. Thus, even in small scale analyses, flexible models, as long as they have good self-tuning properties, may outperform compact parametric alternatives.
