Bayes’ rule

Alexei Gilchrist

Some consequences of the product rule are explored, including the famous Bayes’ rule.

1 Probabilities

We are taking a probability to be a numerical measure of plausibility that must obey the following:

1. $$0\le P(A|I) \le 1$$ with $$P(A|I)=0$$ meaning that $$A$$ is impossible given $$I$$, and $$P(A|I)=1$$ meaning that $$A$$ is certain given $$I$$.
2. Product rule: $$P(AB|C) = P(A|BC)P(B|C)$$.
3. Sum rule: $$P(A|B)+P(\bar{A}|B)=1$$.

2 Bayes’ rule

For any two propositions $$A$$ and $$B$$, the plausibility of ($$A$$ and $$B$$) is the same as the plausibility of ($$B$$ and $$A$$), since conjunction is commutative. We can therefore write the product rule of probability in two different ways, which must be equal to each other:

\begin{align}P(AB|C) &= P(A|BC)P(B|C) \\ &=P(B|AC)P(A|C).\end{align}
A simple rearrangement then gives Bayes’ rule:
$$$P(A|BC) = \frac{P(A|C)P(B|AC)}{P(B|C)}.$$$
Note that each term is a probability and nothing magical has happened; the result is a direct consequence of the product rule. The power of Bayes’ rule lies in its interpretation. Imagine $$I$$ represents some background information, $$H$$ is a hypothesis, and $$D$$ is some data or observations. We can then write Bayes’ rule in a much more suggestive form:
$$$P(H|DI) = \frac{P(H|I)P(D|HI)}{P(D|I)}.$$$
Now the rule is revealed as a learning algorithm. Initially, the probability we assign to $$H$$ is the prior probability $$P(H|I)$$. We then receive some new data $$D$$, and the rule provides an update mechanism that incorporates the data into a new estimate of the probability of $$H$$, namely the posterior probability $$P(H|D I)$$.
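As a minimal sketch of this update, the function below computes $$P(H|DI)$$ from a prior and two likelihoods. It assumes $$P(D|I)$$ can be expanded over $$H$$ and $$\bar{H}$$ using the sum and product rules; the function name `bayes_update` and all the numbers are hypothetical, chosen only for illustration.

```python
def bayes_update(prior, p_d_given_h, p_d_given_not_h):
    """Return the posterior P(H|DI) from P(H|I), P(D|HI) and P(D|not-H I)."""
    # P(D|I), expanded over H and not-H using the sum and product rules.
    evidence = p_d_given_h * prior + p_d_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(D|I) is zero: the data is impossible given I")
    # Bayes' rule: prior times likelihood, divided by the probability of the data.
    return prior * p_d_given_h / evidence


# Hypothetical numbers, chosen only for illustration.
prior = 0.01
posterior = bayes_update(prior, p_d_given_h=0.9, p_d_given_not_h=0.05)
print(f"prior = {prior:.3f}, posterior = {posterior:.3f}")
```

With these illustrative numbers the data is eighteen times more probable under $$H$$ than under $$\bar{H}$$, and the probability of $$H$$ rises from 0.01 to roughly 0.15.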

3 Chaining

Interpreting Bayes’ rule as a learning algorithm, what happens if we obtain data in bits and pieces ($$D_0$$, $$D_1$$, $$\ldots$$) and update the probability of some hypothesis $$H$$ along the way? At first, before seeing any data, the probability of $$H$$ is just $$P(H|I)$$. After obtaining $$D_0$$, the probability gets updated to

$$$P(H|D_0I) = \frac{P(H|I)P(D_0|HI)}{P(D_0|I)}.$$$
Next, after obtaining $$D_1$$ the above posterior probability becomes the new prior probability in a new update:
$$$\label{eq:2ndupdate} P(H|D_1D_0I) = \frac{P(H|D_0I)P(D_1|HD_0I)}{P(D_1|D_0I)}.$$$
Alternatively, if we had received $$D_0D_1$$ all in one go, a single update would yield
$$$P(H|D_1D_0I) = \frac{P(H|I)P(D_0D_1|HI)}{P(D_0D_1|I)}.$$$
Though this looks different, we can use the product rule to expand $$P(D_0D_1|HI)$$ and $$P(D_0D_1|I)$$, and then apply Bayes’ rule to the factor $$P(D_0|HI)$$ that appears, to give
\begin{align}P(H|D_1D_0I) &= \frac{P(H|I)P(D_0D_1|HI)}{P(D_0D_1|I)} \\ & = \frac{P(H|I)P(D_1|D_0HI)P(D_0|HI)}{P(D_1|D_0I)P(D_0|I)} \\ & = \frac{P(H|I)P(D_1|D_0HI)}{P(D_1|D_0I)P(D_0|I)}\frac{P(D_0|I)P(H|D_0I)}{P(H|I)}\\ & = \frac{P(D_1|D_0HI)P(H|D_0I)}{P(D_1|D_0I)},\end{align}
which is the same as Eq. \eqref{eq:2ndupdate}. So receiving data sequentially or all in one go makes no difference in the end. Neither does the order or any grouping, as $$D_0D_1D_2 \equiv D_1D_2D_0 \equiv (D_0D_2)D_1$$ etc. This is really part of the consistency desiderata: multiple paths to the same outcome should give the same probability assignment, as illustrated in the figure below.
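The equivalence can also be checked numerically. The sketch below assumes a small, entirely hypothetical joint distribution over three binary propositions $$H$$, $$D_0$$ and $$D_1$$ (the weights are arbitrary and make the two pieces of data dependent). It compares updating on $$D_0$$ then $$D_1$$, on $$D_1$$ then $$D_0$$, and on both at once. The helper names `prob` and `both` are illustrative only.

```python
from itertools import product

# Hypothetical joint distribution P(H, D0, D1 | I) over three binary propositions.
# The weights are arbitrary and make D0 and D1 dependent.
outcomes = list(product([True, False], repeat=3))  # tuples (h, d0, d1)
weights = [3, 1, 2, 2, 1, 4, 2, 5]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}


def prob(event, given=lambda h, d0, d1: True):
    """P(event | given, I), computed by summing entries of the joint table."""
    den = sum(p for o, p in joint.items() if given(*o))
    num = sum(p for o, p in joint.items() if given(*o) and event(*o))
    return num / den


def both(a, b):
    """Conjunction of two propositions."""
    return lambda h, d0, d1: a(h, d0, d1) and b(h, d0, d1)


H = lambda h, d0, d1: h
D0 = lambda h, d0, d1: d0
D1 = lambda h, d0, d1: d1

# Sequential updates: D0 first, then D1 with the first posterior as the new prior.
after_d0 = prob(H) * prob(D0, given=H) / prob(D0)
after_d0_d1 = after_d0 * prob(D1, given=both(H, D0)) / prob(D1, given=D0)

# The same data in the opposite order.
after_d1 = prob(H) * prob(D1, given=H) / prob(D1)
after_d1_d0 = after_d1 * prob(D0, given=both(H, D1)) / prob(D0, given=D1)

# A single update with both pieces of data at once.
batch = prob(H) * prob(both(D0, D1), given=H) / prob(both(D0, D1))

print(after_d0_d1, after_d1_d0, batch)
```

All three routes agree up to floating-point rounding. Note that the second update must use likelihoods conditioned on the data already seen, such as $$P(D_1|HD_0I)$$; dropping that conditioning is only valid when the data are independent given $$H$$.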

4 Exploring Bayes’ Rule

The following is a neat demonstration of the relationship between the various terms in Bayes’ rule:

It takes a little bit to unpack how it works. The prior probability $$P(H|I)$$ is set on the left-hand vertical axis and the posterior probability $$P(H|D I)$$ is read off the right-hand vertical axis. Along the top and bottom axes you can set the probability of getting the data assuming the hypothesis is true and assuming it’s false, respectively. First concentrate on the blue dot in the interior of the graph and the dotted blue lines. The horizontal position of the blue dot is the probability of the data $$P(D|I)$$. It is the weighted average of the probability of the data given that the hypothesis is true and given that it is false:

$$$P(D|I) = P(D|H I)P(H|I)+P(D|\bar{H} I)P(\bar{H}|I).$$$
The vertical position of the blue dot is the probability of the hypothesis (which you can set in the demo). This can also be thought of as a weighted average of the probability of the hypothesis given the data, $$P(H|D I)$$ (red dot), and the probability of the hypothesis given that the data is false, $$P(H|\bar{D} I)$$ (intercept of red dotted line and left vertical axis):
$$$P(H|I)=P(H|D I)P(D|I)+P(H|\bar{D} I)P(\bar{D}|I).$$$
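Both weighted averages can be verified with a few lines of arithmetic. The sketch below uses hypothetical values for the three quantities that can be set in the demo, and checks that averaging $$P(H|D I)$$ and $$P(H|\bar{D} I)$$ recovers the prior.

```python
# Hypothetical inputs, corresponding to the three adjustable quantities.
p_h = 0.3              # P(H|I)
p_d_given_h = 0.8      # P(D|HI)
p_d_given_not_h = 0.2  # P(D|not-H I)

# Horizontal position of the blue dot: P(D|I) as a weighted average.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' rule for the data being true and for the data being false.
p_h_given_d = p_h * p_d_given_h / p_d
p_h_given_not_d = p_h * (1 - p_d_given_h) / (1 - p_d)

# Vertical position of the blue dot: the prior, recovered as a weighted average.
p_h_check = p_h_given_d * p_d + p_h_given_not_d * (1 - p_d)

print(f"P(D|I) = {p_d:.3f}")
print(f"P(H|DI) = {p_h_given_d:.3f}, P(H|not-D I) = {p_h_given_not_d:.3f}")
print(f"recovered P(H|I) = {p_h_check:.3f}")
```

The prior always lies between $$P(H|\bar{D} I)$$ and $$P(H|D I)$$, which is why the blue dot sits on the line joining those two values.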

Some things to explore: if $$P(D|H I)>P(D|\bar{H} I)$$, the data supports the hypothesis more than not-the-hypothesis and the posterior probability will be increased; conversely, if $$P(D|H I)<P(D|\bar{H} I)$$, the posterior probability will be reduced. The bigger the gap between $$P(D|H I)$$ and $$P(D|\bar{H} I)$$, the more dramatic the change in the probability. If $$P(D|I)$$ is small (i.e. the data is surprising regardless of the hypothesis), then small changes in $$P(D|H I)$$ or $$P(D|\bar{H} I)$$ can have a dramatic effect.
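These observations can be reproduced without the demo. The sketch below uses a hypothetical helper `posterior` and illustrative numbers. It first checks the direction of the update, then compares the effect of the same 0.01 change in $$P(D|H I)$$ when the data is unsurprising and when it is surprising.

```python
def posterior(prior, p_d_given_h, p_d_given_not_h):
    """P(H|DI), with P(D|I) expanded over H and not-H."""
    evidence = p_d_given_h * prior + p_d_given_not_h * (1 - prior)
    return prior * p_d_given_h / evidence


prior = 0.5  # hypothetical prior

# Direction of the update depends on which likelihood is larger.
up = posterior(prior, 0.7, 0.3)    # P(D|HI) > P(D|not-H I)
down = posterior(prior, 0.3, 0.7)  # P(D|HI) < P(D|not-H I)
print(f"supporting data: {up:.3f}, opposing data: {down:.3f}")

# Sensitivity: raise P(D|HI) by 0.01 in two regimes.
# Unsurprising data, P(D|I) around 0.5.
shift_common = posterior(prior, 0.53, 0.51) - posterior(prior, 0.52, 0.51)
# Surprising data, P(D|I) around 0.015.
shift_rare = posterior(prior, 0.03, 0.01) - posterior(prior, 0.02, 0.01)
print(f"shift when data is common: {shift_common:.4f}")
print(f"shift when data is rare:   {shift_rare:.4f}")
```

The shift is much larger in the second regime because what matters is the ratio of the two likelihoods, and a fixed absolute change moves that ratio far more when both likelihoods are small.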

Here is another demonstration with an alternative way of looking at the relationship between the terms:

$$$P(H|DI) = \frac{P(H|I)P(D|HI)}{P(D|I)} = \frac{P(HD|I)}{P(HD|I)+P(\bar{H}D|I)}.$$$
In the demonstration below you can set the prior probability $$P(H|I)$$ and also $$P(D|HI)$$ and $$P(D|\bar{H}I)$$. The posterior probability will be the red portion of the bar at the bottom. The sizes of the coloured portions are simply the products of the portions at the top and the corresponding conditional probabilities, e.g. $$P(HD|I)=P(H|I)P(D|H I)$$.
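The same bookkeeping can be sketched in code. The example below assumes hypothetical values for the three inputs and builds the four joint probabilities with the product rule. It then reads off the posterior as the share of $$P(HD|I)$$ among the outcomes in which $$D$$ is true. The segment labels are illustrative and may not match the layout of the demonstration.

```python
# Hypothetical inputs.
p_h = 0.25             # P(H|I)
p_d_given_h = 0.6      # P(D|HI)
p_d_given_not_h = 0.2  # P(D|not-H I)

# Joint probabilities from the product rule, e.g. P(HD|I) = P(H|I) P(D|HI).
segments = {
    "H and D": p_h * p_d_given_h,
    "H and not-D": p_h * (1 - p_d_given_h),
    "not-H and D": (1 - p_h) * p_d_given_not_h,
    "not-H and not-D": (1 - p_h) * (1 - p_d_given_not_h),
}

# Posterior: the fraction of the D-is-true outcomes in which H is also true.
p_d = segments["H and D"] + segments["not-H and D"]
posterior = segments["H and D"] / p_d

for label, value in segments.items():
    print(f"P({label}|I) = {value:.3f}")
print(f"P(H|DI) = {posterior:.3f}")
```

Here the two segments with $$D$$ true happen to be equal, so the posterior is one half even though the prior was one quarter.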