## Friday, October 31, 2014

### Algorithmic models of probabilistic inference

Rejection sampling is clearly not the most efficient Monte Carlo estimator. Even so, it is a conceptually beautiful idea: fundamentally, rejection sampling represents an algorithmic model of Bayes theorem.

The algorithm works as follows. Given some observed data $d$ and a model parameterized by $\theta$:
- randomly choose parameter $\theta$ from the prior
- simulate data $D$, using $\theta$ as your parameter vector
- reject if $D \neq d$; otherwise, set the sample aside

Repeat until exhaustion.
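In code, this loop can be written as a tiny generic function. A minimal sketch in Python (the function name and its arguments are mine, not from any library):

```python
import numpy as np

def rejection_sample(draw_prior, simulate, d, trials=100_000, rng=None):
    """Draw theta from the prior, simulate data D, keep theta only if D == d."""
    rng = rng or np.random.default_rng()
    kept = []
    for _ in range(trials):
        theta = draw_prior(rng)          # randomly choose theta from the prior
        if simulate(theta, rng) == d:    # simulate data D; reject if D != d
            kept.append(theta)
    return kept                          # approximate sample from p(theta | d)
```

With, say, a uniform prior on a coin's bias and a binomial simulator, the kept values approximate the corresponding Beta posterior.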

Rejection sampling represents an operational definition of what we mean by conditioning a distribution on the data: it identifies the posterior distribution with what you get by filtering out, from the random draws from the prior distribution, those parameter configurations that did not reproduce what you observed. We can get a quantitative derivation of this posterior by just reading out our algorithm: the probability of getting a particular parameter configuration $\theta$ in our final sample is proportional to the probability of first drawing this configuration from the prior, $p(\theta)$, multiplied by the probability of accepting it, which is just the probability that we simulate a dataset exactly equal to $d$, namely $p(d \mid \theta)$. Therefore, our final sample is distributed according to:

$p(\theta \mid d) \propto p(\theta) p(d \mid \theta)$.

Thus, rejection sampling gives us a simple intuitive 'proof' of Bayes theorem.

Personally, I find the idea of rejection sampling not just useful for explaining Bayesian inference in the context of an introductory course. I also use it regularly, reformulating my most complex problems of the moment in terms of this algorithmic representation, just to clarify my ideas. As I will try to show through some examples in later posts, rejection sampling and, more generally, algorithmic models of probabilistic settings, can sometimes be very helpful for clarifying the meaning of our models and sorting out the calculations.

But first, a few more preliminaries, in the form of three little examples of increasing complexity, illustrating how this works, and what this fundamentally means, in practice. Examples 1 and 2 are rather trivial, but they are interesting in that they represent two different epistemological meanings that Bayes theorem can have in practical situations. They also set the stage for example 3, which deals with conditional likelihoods.

Example 1. Imagine we want to infer the allele frequencies at a neutral bi-allelic locus, based on a random sample of chromosomes from a given population. Say we have sampled $n=5$ chromosomes, of which $k=3$ had allele 1 and $n-k=2$ allele 0, and we want to infer $\pi$, the frequency of allele 1 in the population. Of course, we could just take $k/n$ as a quick estimate, but if $n$ is small, there will be quite some uncertainty, which we also want to quantify.

If we are dealing with a randomly mating species, with known scaled mutation rate $\theta = 4 N_e u$, we know that the frequency distribution of neutral alleles in the population is a Beta distribution of parameter $\theta$ (figure below, thin line). We can use this distribution as our prior on $\pi$ and use the rejection sampling algorithm:

- randomly choose $\pi$ from prior
- simulate $n$ chromosomes, each with prob. $\pi$ of being of type 1.
- count number $K$ of chromosomes carrying allele 1.
- reject if $K \neq k$.

Or, more compactly:

- randomly choose $\pi$ from prior
- simulate $K \sim$ Binomial$(\pi, n)$
- reject if $K \neq k$.

Reading out this algorithm leads to the following posterior distribution (figure, thick line):

$p(\pi \mid k) \propto p(\pi) p(k \mid \pi)$.
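This can be checked numerically. A vectorized sketch, taking $\theta = 1$ for illustration (so the Beta prior is uniform); with $k = 3$ out of $n = 5$, the retained draws should then follow the Beta$(1+k, 1+n-k)$ posterior:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, n, k = 1.0, 5, 3
pis = rng.beta(theta, theta, size=200_000)  # draws from the Beta(theta, theta) prior
Ks = rng.binomial(n, pis)                   # simulate K ~ Binomial(pi, n)
posterior = pis[Ks == k]                    # rejection step: keep only the matches
```

The mean of `posterior` should come out close to $(1+k)/(2+n) = 4/7$, the mean of Beta(4, 3).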

This is a rather trivial case, although one for which the inferential semantics of the algorithm is particularly compelling: our locus of interest is just any neutral locus for which $k$ out of $n$ randomly chosen chromosomes happen to have allele 1. Therefore, simulating a large number $P$ of random instances of neutral loci, each with its own population frequency $\pi$, reproducing the data-acquisition protocol on each of them and then selecting only those instances that match our observation (conditioning step), gives us a typical set, of which our true observed instance is then supposed to be just an additional random member. In particular, if we could sort in ascending order the $P+1$ values of $\pi$ (the $P$ that we have simulated + the true one), then, since our unknown $\pi$ is indistinguishable from simulated replicates, it should have an equal chance of being the first, or the second, or more generally, of having rank $r$ for any $r=1..P+1$. Therefore, the unknown $\pi$ has equal chance of being contained within any one among the successive $P+1$ intervals of our simulated series (including the two open-ended intervals at both ends): this is the epistemological meaning of a posterior distribution.
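This rank argument can itself be tested by simulation: repeatedly let 'Nature' draw a true $\pi$ and the data, run the rejection sampler, and record the rank of the true $\pi$ among $P$ accepted draws. A sketch, again assuming $\theta = 1$ (the ranks should come out roughly uniform on $0..P$):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, P = 1.0, 5, 20
ranks = []
for _ in range(2_000):
    pi_true = rng.beta(theta, theta)           # Nature draws the true frequency
    k = rng.binomial(n, pi_true)               # Nature generates the data
    pis = rng.beta(theta, theta, size=5_000)   # our own prior draws
    Ks = rng.binomial(n, pis)
    post = pis[Ks == k][:P]                    # keep the first P accepted draws
    ranks.append(int((post < pi_true).sum()))  # rank of the truth among them
# 'ranks' should look uniform over 0..P
```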

Example 2. As another slightly less trivial problem, suppose now that we have genotyped, not just one, but $M$, unlinked neutral bi-allelic loci (or sites) in our $n$ individuals. We now want to infer $\theta = 4 N_e u$, based on the allele counts $k_i$, $i=1..M$, across loci. Let us call $k$ the vector of observations. We have prior information about $\theta$, summarized in some prior distribution $p(\theta)$. In that case, the algorithm would run as follows:

- randomly choose $\theta$ from $p(\theta)$
- for each site $i=1..M$:
  - randomly choose $\pi_i$ from $p(\pi_i \mid \theta)$
  - simulate $K_i \sim$ Binomial$(\pi_i, n)$
- reject unless $K_i = k_i$ for all $i=1..M$.

Here, the $\pi_i$’s are now just temporary random variables. The algorithm can therefore be written more compactly as:

- randomly choose $\theta$ from $p(\theta)$
- for each $i=1..M$:
  - simulate $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) \, d\pi_i$
- reject unless $K_i = k_i$ for all $i=1..M$.

Here, we have averaged over all possible ways we could have sampled the $\pi_i$'s. The 'for' loop leads to a product, which gives us the following posterior distribution:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta)$.
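As a numerical sketch of this second algorithm (with assumptions of my own: a Uniform$(0.1, 5)$ prior on $\theta$, a symmetric Beta$(\theta, \theta)$ allele-frequency distribution, and made-up counts at $M = 3$ sites):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
k = np.array([3, 1, 4])                        # toy observed counts at M = 3 sites
M = k.size
thetas = rng.uniform(0.1, 5.0, size=400_000)   # 'I' draw theta from my prior
# 'Nature' draws one pi_i per site, then simulates the allele counts
pis = rng.beta(thetas[:, None], thetas[:, None], size=(400_000, M))
Ks = rng.binomial(n, pis)
accepted = thetas[(Ks == k).all(axis=1)]       # keep theta only if every site matches
```

The `accepted` values approximate the posterior $p(\theta \mid k)$; note how rare full matches already are for $M = 3$, which hints at why plain rejection does not scale with the size of the data.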

Still fairly simple. One little detail, however, concerning the epistemological meaning of the whole procedure in this new case: thus far, the formulation of the algorithm has been slightly incomplete. When I say ‘choose’, perhaps I should be more specific about who exactly ‘chooses’ here. In the case of problem 1:

- Nature chooses $\pi$ from the neutral allele frequency distribution $p(\pi \mid \theta)$
- Nature simulates $K \sim p(K \mid \pi)$
- I observe, and reject if $K \neq k$.

The fact that random draws are entirely under Nature's control is precisely why inference by conditioning is so compelling in that case. But for problem (2), it is a bit different:

- I randomly choose $\theta$ from my prior $p(\theta)$ and imagine it is the true $\theta$
- for each $i=1..M$:
  - Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
  - Nature simulates $K_i \sim p(K_i \mid \pi_i)$
- I reject unless $K_i = k_i$ for all $i=1..M$.

This (somewhat less compelling) application of rejection sampling is more specifically what we call Bayesian inference: happily mixing subjective priors ('I choose') with distributions that are meant to represent objective processes ('Nature chooses'). Epistemologically, what we are now doing is this: we are filtering out, from our prior beliefs about $\theta$, those possible values that don't easily reproduce the observed data — which still makes sense, although this assumes that we are comfortable with translating beliefs into probability distributions and vice versa. I won’t discuss here the philosophical implications of this methodological choice. I will in due time, but for the moment, I just wanted to be as accurate (and as honest...) as possible in my formulation. Making this distinction between the subjective and the objective components of the model can be useful in less trivial contexts.

Example 3. Now, a more complicated problem: I still want to estimate $\theta = 4 N_e u$ based on $M$ unlinked loci. But the crucial difference with the previous example is that I have removed all the constant sites of my dataset, so as to consider only segregating sites: that is, only loci that happen to show at least one variant in my sample of $n$ individuals. So, by design, for all $i=1..M$, one has $0 < k_i < n$ (whereas in the previous case, one had instead that $0 \le k_i \le n$). Let us denote this additional constraint by $S$: the sample is entirely made of segregating sites. For $i=1..M$, $S_i$ stands for the constraint that site $i$ is segregating.

I should really take into account this new mode of data acquisition in my inference. If I do not, I will greatly overestimate $\theta$ (since it would take very high values of $\theta$ to make it likely that $M$ randomly chosen neutral sites were all segregating). Thus, I clearly need to work out the probabilistic calculations for my constrained model.

Since sites are unlinked and, therefore, independent, collecting $M$ segregating sites among a large collection containing a majority of non-segregating sites is just like letting Nature ‘run’ the process, site by site, intercepting only those that turn out to be segregating, until obtaining a total number of $M$ accepted sites (and then I decide whether or not to reject the whole sample). This leads to the following algorithmic representation:

- I randomly choose $\theta$ from $p(\theta)$ and imagine it is the true $\theta$
- for each $i=1..M$:
  - repeat:
    - Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
    - Nature randomly simulates $K_i \sim p(K_i \mid \pi_i)$
  - until $0 < K_i < n$
- I reject unless $K_i = k_i$ for all $i=1..M$.

Note that there are now two levels of rejection: one that is ‘for free’ (in the repeat-until loop), and one that is not (the final step). By ‘free’, I mean: not accounted for in our total acceptance rate -- and, therefore, not accounted for in our total probability.
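In code, the two levels of rejection look like this (same toy assumptions as before: a Uniform$(0.1, 5)$ prior, Beta$(\theta, \theta)$ frequencies, made-up counts). Only the final comparison counts against the acceptance rate; the inner loop silently redraws non-segregating sites:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
k = [3, 1, 4]                        # toy counts, all segregating by design
accepted = []
for _ in range(50_000):
    theta = rng.uniform(0.1, 5.0)    # my subjective draw of theta
    ok = True
    for ki in k:
        while True:                  # 'free' rejection: Nature redraws...
            K = rng.binomial(n, rng.beta(theta, theta))
            if 0 < K < n:            # ...until the site is segregating
                break
        if K != ki:                  # costly rejection: must match the data
            ok = False
            break
    if ok:
        accepted.append(theta)
```

The `accepted` values approximate the posterior under the conditional likelihood of equation (2) below; replacing the inner `while` loop by a single draw would instead target the (wrong) unconditional posterior of example 2.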

Now, if we want to write down the posterior, as above, we can first eliminate the $\pi_i$'s:

- randomly choose $\theta$ from $p(\theta)$
- for each $i=1..M$:
  - repeat:
    - Nature randomly simulates $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) \, d\pi_i$
  - until $0 < K_i < n$
- reject unless $K_i = k_i$ for all $i=1..M$.

Then, for a given $\theta$ and a given $i$, carefully interpreting the repeat-until loop leads to the following probability for $k_i$, which is now conditioned on the constraint $S_i$ being fulfilled:

(1) $p(k_i \mid \theta, S_i) = p(k_i \mid \theta) / p(S_i \mid \theta)$.

This conditional likelihood is defined only over $k_i = 1..n-1$ (now excluding 0 and n, since I won't exit the repeat-until loop as long as these values are being produced) and has been suitably normalized, by dividing it by the probability of getting a value for $k_i$ in the allowed interval (i.e. by indeed getting a segregating site given $\theta$):

$p(S_i \mid \theta) = \sum_{l=1}^{n-1} p(K_i = l \mid \theta)$.
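Equation (1) can be verified numerically: simulate the repeat-until loop for a fixed $\theta$ and compare the empirical frequencies of $k_i$ among segregating sites to the beta-binomial pmf renormalized over $1..n-1$. A sketch using `scipy.stats.betabinom` (again assuming the allele-frequency distribution is Beta$(\theta, \theta)$):

```python
import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(5)
theta, n = 1.0, 5
pis = rng.beta(theta, theta, size=500_000)
Ks = rng.binomial(n, pis)
seg = Ks[(Ks > 0) & (Ks < n)]               # the 'free' repeat-until filter
emp = np.array([(seg == j).mean() for j in range(1, n)])
# analytic version of equation (1): p(k | theta) / p(S | theta)
pmf = betabinom.pmf(np.arange(n + 1), n, theta, theta)
cond = pmf[1:n] / pmf[1:n].sum()
```

The empirical frequencies `emp` and the renormalized pmf `cond` should agree up to Monte Carlo noise.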

Then, according to the algorithm, I just need to take the product over the $M$ sites (the 'for' loop) and combine this with my prior:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta, S_i)$,

or, for short:

(2)    $p(\theta \mid k) \propto p(\theta) p(k \mid \theta, S)$.

This particular problem illustrates one general rule:

filtering the data $\implies$ repeat-until loops $\implies$ conditional likelihoods.

However, it is also interesting for another reason: there is no simple and safe way we can have $S$ appear at the correct place, in equation 2, by just blindly manipulating symbolic equations, as we often do in applied Bayesian inference, shuffling symbols around, sometimes to the left, sometimes to the right of the conditioning sign. We could try to write, for instance:

$p(\theta \mid k, S) \propto p(\theta \mid S) p(k \mid \theta, S)$,

but this looks different from equation 2. We could then be tempted to expand the first factor:

$p(\theta \mid S) \propto p(\theta) p(S \mid \theta)$

and combine it with the expanded form of the conditional likelihood (equation 1), leading to:

$p(\theta \mid k, S) \propto p(\theta) p(S \mid \theta) p(k \mid \theta) / p(S \mid \theta) \propto p(\theta) p(k \mid \theta)$,

but then, we are back to the simple posterior of problem 2, which is obviously wrong. One main problem here is that we have an ambiguity in the meaning of $p(\theta \mid k, S)$: usually, this would mean ‘the probability of $\theta$, given the observed data $k$ and given that $S$ is true’. However, what we really want in the present case is: ‘the probability of $\theta$, given the observed data $k$ and given that my data acquisition protocol is constrained to deliver something compatible with $S$’, which is different. Symbolic notation in terms of '$p(A \mid B)$' is just too sloppy.

We could try to come up with better notations and still play the game of automatic symbolic calculus. However, when facing complex problems, perhaps it is just more helpful, and more insightful, to directly rely on a semantic model of our probabilities. Rejection sampling, and more generally algorithmic models of our sampling protocol, provide such a semantic model. As a result, they represent a relatively foolproof method for deriving the correct version of our posterior distribution (or of our likelihood): we just need to spell out each of the probabilities, as operationally defined by the algorithm, which gives us the solution.

P.S. (Nov 1st):  In hindsight, I realize that there is something fallacious in this final argument. You can have a good semantic model and good notations. Here, I should perhaps be more careful and denote the probability that $A$ is true given that $B$ is true and given that $C$ will always be fulfilled owing to the data acquisition protocol, by $p(A \mid B;C)$. A correct form for Bayes theorem in the present case would then be:

$p(\theta \mid k; S) \propto p(\theta ; S) p(k \mid \theta; S)$

Since the data acquisition protocol does not influence the prior:

$p(\theta; S) = p(\theta)$

As for our likelihood, under this protocol, as suggested by the algorithmic argument above, it is equal to:

(3)    $p(k \mid \theta; S) = p(k \mid \theta, S) = p(k \mid \theta) / p(S \mid \theta)$

and therefore, equation 2 should be more properly written as:

(2b)    $p(\theta \mid k; S) \propto p(\theta) p(k \mid \theta, S)$.