Comparing Models

Alexei Gilchrist

3 Probability Inference

Model comparison is one of the principal tasks on inference. Given some data, how does the plausibility of different models change? Does the data select out a particular model as being better?

❦

1 Introduction

A common inference task is when we have a number of competing hypotheses or models and we want to know which one is most plausible given some data. This task is just an application of the sum and product rules of probability just as every inference task will be.

Naively we could just use Bayes’ rule and expand \(P(D|I)\) in terms of \(H\) and \(\bar{H}\) which form a mutually exclusive and exhaustive set:

\[\begin{equation*}P(H|DI)=\frac{P(H|I)P(D|HI)}{P(D|I)}=\frac{P(H|I)P(D|HI)}{P(H|I)P(D|HI)+(1-P(H|I))P(D|\bar{H}I)}.\end{equation*}\]

We hit a problem in trying to evaluate the last term in the denominator: \(P(D|\bar{H}I)\), that is, what is the probability of the data given not \(H\)? Unless the problem is particularly simple, this is extraordinarily difficult to reason about. For any given specific proposition \(H\) there could be a vast number of propositions that are not \(H\) for which the data is plausible.

A closer look at updating probabilities via Bayes' rule. Expanding the evidence usually requires well formed alternatives.

2 Pairwise comparison

If we directly compare the probability of two models by forming a ratio, then the troublesome denominator cancels:

\[\begin{equation*}R=\frac{P(H_1|DI)}{P(H_2|DI)} = \frac{P(H_1|I)}{P(H_2|I)}\frac{P(D|H_1I)}{P(D|H_2I)}.\end{equation*}\]

It’s common to express probabilities as odds, especially in gambling. `Odds for something’ is given either as two numbers or as the value of a ratio, in both cases it’s a comparison between the probability of something \(P(A|I)=n/(n+m)\) against the probability of its converse \(P(\bar{A}|I)=1-P(A|I)=m/(n+m)\):

\[\begin{equation*}\mathrm{odds} = \frac{P(A|I)}{P(\bar{A}|I)} = \frac{n/(n+m)}{m/(n+m)} = \frac{n}{m} \equiv n:m.\end{equation*}\]

Since we have only two models \(P(H_2|I)=1-P(H_1|I)\) and \(P(H_2|DI)=1-P(H_1|DI)\) so \(R\) is naturally enough called the odds. From Bayes’ rule the odds is a product of the prior odds and the Bayes factor—the ratio of likelihoods. An odds \(R>1\) indicates the data supports \(H_1\) over \(H_2\), \(R<1\) indicates the data supports \(H_2\) over \(H_1\).

N.B. In common usage there is fairly widespread confusion over odds and probabilities. To add to this, in Bayesian communities it’s common to call \(R\) the `odds ratio’ while in medical (?) communities the `odds ratio’ refers to, well, a ratio of odds, somehting that would better describe the Bayes’ factor... Sigh.

Two competing hypotheses can be directly compared by the ratio of the posterior probabilities.

3 Importance of priors and alternatives

When dealing with small probabilities it’s sometimes convenient to express them in decibels [\(dB\)]:

\[\begin{equation*}P(A|I) \;\;[\text{in }dB] \equiv 10 \log_{10} P(A|I).\end{equation*}\]

On this scale, 0 denotes certainty and as a proposition tends to impossible the probability in decibels tends to \(-\infty\). A probability of 1/1,000 is -30\(dB\), one of 1/1,000,000 is -60\(dB\) and so on.

In \eqref{eq:bayes} we expanded \(P(D|I)\) in terms of \(H\) and \(\bar{H}\) but this is very simplistic. In any realistic situation there are always alternative hypotheses or models \(H_n\) that could be considered, so really the expansion should be more like

\[\begin{equation*}P(D|I)=P(D|HI)P(H|I)+\sum_n P(D|H_nI)P(H_n|I).\end{equation*}\]

In fact if we were to rank the hypotheses \(H_n\) by the prior probability we would find that as the probability dropped there were more and more unlikely hypothesis possible. Alternatives such as faulty apparatus, cheating, sabotage, little green aliens, etc. The lower the prior the more unlikely the hypothesis can be. We can see this as a sea of alternatives sitting somewhere at low probability, say -40 \(dB\) or -50 \(dB\) (see fig). It may be really difficult to even put an order of magnitude on the prior probability of these alternatives but that doesn’t mean they are not there.

Probabilities measured on a decibel scale. In any realistic situation there is a sea of alternative hypotheses at low probability

A great way to deceive using probability theory is to pick something that would be deep in this sea of alternatives and compare it against a plausible hypothesis as if they are the only two contenders. Even better if the whole issue of prior probabilities is ignored. Then, any small-prior hypothesis like faster-than-light communication, or clairvoyance, can be promoted into the limelight against the larger-prior hypothesis `it was due to chance’ with just some moderately surprising data.

Interestingly, I think we naturally do the more complete modelling. Imagine you are invited to witness a clairvoyant in action. As the show proceeds and they get more and more things right, at least my instinct is not to suddenly start believing in clairvoyance but instead to start looking for mirrors, secret communication with an accomplice etc. The more they get right the more sophisticated the cheat would seem. To really convince me of clairvoyance, every alternative explanation sitting higher in this sea would have to be ruled out—extraordinary claims require extraordinary experimental design.

4 Example: multi-sided die

This example is inspired by this post.

Say a friend has a bag with a 4-sided, 6-sided, 8-sided, and a 12-sided die. They select one in secret and roll it, then tell us the outcome. Which die did they pick?

Here we have a simple example with four competing models \(H\in\{H_4, H_6, H_8, H_{12}\}\) where \(n\) in \(H_n\) is the number of sides. \(D\) will represent the data such as “the first roll was a 1”, “the second roll was a 6”, and so on. So, given the observed data we want to determine the probability of each of the competing models \(H\). That is

\[\begin{equation*}P(H|DI) = \frac{P(H|I)P(D|HI)}{P(D|I)}.\end{equation*}\]

Going through each of the terms, first of all \(P(H|I)\) is the probability of each of the models before we know the data. Since the selection was done in secret we have no reason to choose one die over another so should assign them all the same probability of 1/4 since there are four possibilities. Secondly \(P(D|I)\) does not depend on the \(H\) and will not affect the relative probabilities of the models so we can just ignore it if we stick to making relative comparisons. In this case though, the problem is so small that it’s easy to calculate, which we can do by normalising the probabilities at the end, i.e. require

\[\begin{equation*}\sum_H P(H|DI) = \frac{1}{P(D|I)} \sum_H P(H|I)P(D|HI) = 1,\end{equation*}\]

\[\begin{equation*}P(D|I) = \sum_H P(H|I)P(D|HI) = \frac{1}{4}\sum_H P(D|HI).\end{equation*}\]

Notice that `normalising at the end’ is the same as marginalising over the models.

That finally leaves us with \(P(D|HI)\). This is the probability that a particular model gives the observed data. Now, if we’re told the outcome of a roll was \(r\), there are two possibilities. If \(r\) is higher than the number of sides of a die then we can immediately rule out that die since it couldn’t have been the one. So for instance a 6 would rule out \(H_4\), and an 8 would rule out both \(H_4\) and \(H_6\). If \(r\) isn’t larger than the number of sides it could have been rolled by that die but what probability should we assign? Again, assuming we have no reason to suspect the die are weighted, each side should be assigned equal probability so

\[\begin{equation*} P(D=r|H_nI) = \begin{cases} 0,\;\; r>n \\ 1/n,\;\; \mathrm{otherwise}. \end{cases}\end{equation*}\]

Actually there is a final question we need to ask ourselves. Are the rolls independent of each other? Assuming they are, and labelling the \(j^\mathrm{th}\) outcome in \(m\) rolls as \(D_j\), then \(P(D_1,\ldots,D_m|X)=P(D_1|X)P(D_2|X)\ldots P(D_m|X)\). Here \(X\) could stand for either \(I\) or \(HI\) as independence with one doesn’t imply the other. Using our particular expansion of the denominator though, we only require independence given \(HI\). Putting everything together we have

\[\begin{equation*}P(H_n|DI) = \frac{\prod_j P(D_j|H_nI)}{\sum_H \prod_k P(D_k|HI)}.\end{equation*}\]

In the figure below we can see the effect on being told the rolls {1, 2, 6, 5, 6, 6, 8}. Each result shifts the probability amongst the models and occasionally rules one out if the result is outside the die’s range. Once a die has been ruled out it stays out.

The probabilities of the different die given we're told the result of a series of rolls. As we acquire data the probabilities reflect our changing confidence in the models. Note that a 6 rules out a 4-sided die on the 3rd roll, and an 8 in the last roll rules out the 6-sided die.