# Comparing Models

## Alexei Gilchrist

### 1 Introduction

A common inference task arises when we have a number of competing hypotheses or models and we want to know which one is most plausible given some data. As with every inference task, this is just an application of the sum and product rules of probability.

Naively we could just use Bayes’ rule and expand \(P(D|I)\) in terms of \(H\) and \(\bar{H}\) which form a mutually exclusive and exhaustive set:

\[
P(H|DI) = \frac{P(D|HI)\,P(H|I)}{P(D|HI)\,P(H|I)+P(D|\bar{H}I)\,P(\bar{H}|I)}.
\]

The trouble is the term \(P(D|\bar{H}I)\): what is the probability of the data given *not* \(H\)? Unless the problem is particularly simple, this is extraordinarily difficult to reason about. For any given specific proposition \(H\) there could be a vast number of propositions that are not \(H\) for which the data is plausible.

### 2 Pairwise comparison

If we directly compare the probability of two models by forming a ratio, then the troublesome denominator cancels:

\[
\frac{P(H_1|DI)}{P(H_2|DI)} = \frac{P(D|H_1I)\,P(H_1|I)}{P(D|H_2I)\,P(H_2|I)}.
\]

Such a ratio of probabilities is known as *odds*, especially in gambling. `Odds for something’ is given either as two numbers, \(n\!:\!m\), or as the value of the ratio \(n/m\); in both cases it’s a comparison between the probability of something, \(P(A|I)=n/(n+m)\), and the probability of its converse, \(P(\bar{A}|I)=1-P(A|I)=m/(n+m)\):

\[
O(A|I) = \frac{P(A|I)}{P(\bar{A}|I)} = \frac{n}{m}.
\]

Taking the two propositions to be the competing models \(H_1\) and \(H_2\) given the data, the ratio above is the posterior *odds*. From Bayes’ rule the odds is a product of the prior odds and the *Bayes factor*, the ratio of likelihoods:

\[
R = \frac{P(H_1|DI)}{P(H_2|DI)} = \frac{P(H_1|I)}{P(H_2|I)}\times\frac{P(D|H_1I)}{P(D|H_2I)}.
\]

An odds \(R>1\) indicates the data supports \(H_1\) over \(H_2\); \(R<1\) indicates the data supports \(H_2\) over \(H_1\).
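The odds update is easy to sketch in Python. Here the prior odds and Bayes factor are made-up numbers purely for illustration, along with two small helper functions for converting between odds and probabilities:

```python
def probability_to_odds(p):
    """Convert a probability P(A|I) into odds P(A|I)/P(not-A|I)."""
    return p / (1 - p)

def odds_to_probability(o):
    """Convert odds back into a probability o/(1+o)."""
    return o / (1 + o)

# Hypothetical numbers: prior odds of 1:4 for H1 over H2, and data that is
# 10 times more likely under H1 than under H2.
prior_odds = 0.25
bayes_factor = 10.0

# The posterior odds is the product of the prior odds and the Bayes factor.
R = prior_odds * bayes_factor
print(R)                       # posterior odds R = 2.5
print(odds_to_probability(R))  # posterior probability of H1 ~ 0.714
```

Note how a modest prior against \(H_1\) is overcome by data that favours it strongly enough.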

*N.B. In common usage there is fairly widespread confusion over odds and probabilities. To add to this, in Bayesian communities it’s common to call \(R\) the `odds ratio’, while in medical (?) communities the `odds ratio’ refers to, well, a ratio of odds, something that would better describe the Bayes factor... Sigh.*

### 3 Importance of priors and alternatives

When dealing with small probabilities it’s sometimes convenient to express them in *decibels* [\(dB\)]:

\[
R_{dB} = 10\log_{10} R.
\]
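A minimal conversion in Python (the odds values here are arbitrary, chosen to show the symmetry of the scale):

```python
from math import log10

def odds_in_decibels(R):
    """Express an odds R on the decibel scale: 10 * log10(R)."""
    return 10 * log10(R)

print(odds_in_decibels(100))   # 20 dB in favour
print(odds_in_decibels(0.01))  # -20 dB: equally strong support the other way
```

The decibel scale turns products of odds into sums, and makes very lopsided odds easy to compare at a glance.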

When we first applied Bayes’ rule we expanded \(P(D|I)\) in terms of \(H\) and \(\bar{H}\), but this is very simplistic. In any realistic situation there are always alternative hypotheses or models \(H_n\) that could be considered, so really the expansion should be more like

\[
P(D|I) = \sum_n P(D|H_nI)\,P(H_n|I),
\]

where the sum runs over the whole sea of alternatives we could conceive of.

A great way to deceive using probability theory is to pick something that would be deep in this sea of alternatives and compare it against a plausible hypothesis as if they are the only two contenders. Even better if the whole issue of prior probabilities is ignored. Then, any small-prior hypothesis like faster-than-light communication, or clairvoyance, can be promoted into the limelight against the larger-prior hypothesis `it was due to chance’ with just some moderately surprising data.

Interestingly, I think we naturally do the more complete modelling. Imagine you are invited to witness a clairvoyant in action. As the show proceeds and they get more and more things right, at least my instinct is not to suddenly start believing in clairvoyance but instead to start looking for mirrors, secret communication with an accomplice etc. The more they get right the more sophisticated the cheat would seem. To really convince me of clairvoyance, *every* alternative explanation sitting higher in this sea would have to be ruled out—extraordinary claims require *extraordinary experimental design*.

### 4 Example: multi-sided die

This example is inspired by this post.

Say a friend has a bag with a 4-sided, 6-sided, 8-sided, and a 12-sided die. They select one in secret and roll it, then tell us the outcome. Which die did they pick?

Here we have a simple example with four competing models \(H\in\{H_4, H_6, H_8, H_{12}\}\), where \(n\) in \(H_n\) is the number of sides. \(D\) will represent the data such as “the first roll was a 1”, “the second roll was a 6”, and so on. So, given the observed data we want to determine the probability of each of the competing models \(H\). That is

\[
P(H|DI) = \frac{P(D|HI)\,P(H|I)}{P(D|I)}.
\]

First we need the prior \(P(H|I)\): the probability of each model *before* we know the data. Since the selection was done in secret we have no reason to choose one die over another, so we should assign them all the same probability of 1/4, since there are four possibilities. Secondly, \(P(D|I)\) does not depend on \(H\) and will not affect the *relative* probabilities of the models, so we can just ignore it if we stick to making relative comparisons. In this case though, the problem is so small that it’s easy to calculate, which we can do by normalising the probabilities at the end, i.e. requiring

\[
\sum_{H} P(H|DI) = 1.
\]

That finally leaves us with \(P(D|HI)\): the probability that a particular model gives the observed data. Now, if we’re told the outcome of a roll was \(r\), there are two possibilities. If \(r\) is higher than the number of sides of a die then we can immediately rule out that die, since it couldn’t have been the one. So for instance a 6 would rule out \(H_4\), and an 8 would rule out both \(H_4\) and \(H_6\). If \(r\) isn’t larger than the number of sides it *could* have been rolled by that die, but what probability should we assign? Again, assuming we have no reason to suspect the dice are weighted, each side should be assigned equal probability, so

\[
P(r|H_nI) = \begin{cases} 1/n & r \le n \\ 0 & r > n. \end{cases}
\]

In the figure below we can see the effect of being told the rolls {1, 2, 6, 5, 6, 6, 8}. Each result shifts the probability amongst the models and occasionally rules one out if the result is outside the die’s range. Once a die has been ruled out it stays out.
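The whole calculation fits in a few lines of Python. This is a sketch of the sequential update just described: start from the uniform prior, multiply in the likelihood of each roll, and renormalise:

```python
# Four competing models H_n, where n is the number of sides, each with prior 1/4.
dice = [4, 6, 8, 12]
posterior = {n: 0.25 for n in dice}

def update(posterior, r):
    """One Bayesian update: likelihood is 1/n if r <= n, else 0, then normalise."""
    new = {n: p * (1.0 / n if r <= n else 0.0) for n, p in posterior.items()}
    total = sum(new.values())
    return {n: p / total for n, p in new.items()}

for roll in [1, 2, 6, 5, 6, 6, 8]:
    posterior = update(posterior, roll)
    print(roll, {n: round(p, 3) for n, p in posterior.items()})
```

The first 6 eliminates \(H_4\) and the final 8 eliminates \(H_6\), leaving only \(H_8\) and \(H_{12}\); \(H_8\) ends up favoured at roughly 0.94, since it assigns each compatible roll a higher probability than \(H_{12}\) does.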