Probability is basic to the Bayesian way of thinking. We use probability directly in our computations, and our conclusions are expressed as probabilities.

### What is probability?

The modern theory of probability is based on work by Andrey Kolmogorov, published in German in 1933 with the English translation appearing in 1950 [Kolmogorov, A.N. (1950) *Foundations of the theory of probability*, 2nd English edition, 1956. Chelsea Publishing Company, New York].

**Basic definition**: Probability is a number attached to an **event** which has three properties:

a. It is between 0 and 1.

b. Addition: If events are **mutually exclusive**, we can add the probabilities.

c. Multiplication: If events are **independent**, we can multiply the probabilities.

**Event**: An event may be something which occurs, such as rain tomorrow, or a die coming up 6, or Manchester United winning the Premier League next year. We can also relate it to a statement which can be true or false, for example “It will rain tomorrow” or “There are 231 tigers in Malaysia”.

**Mutually exclusive**: A die cannot come up with 6 and 1 at the same time, so we can add; the probability of each with a fair 6-sided die is 1/6, so the probability of 1 or 6 = 1/6 + 1/6 = 1/3. But for drawing cards from a deck, aces and hearts are not mutually exclusive, because it’s possible to draw the ace of hearts. The probability of drawing an ace is 4/52 and the probability of a heart is 13/52, but we can’t just add them up and say the probability of “ace” or “heart” is 17/52: that would count the ace of hearts twice.
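The die and card examples can be checked by brute-force counting. The following is only an illustrative sketch: the deck encoding and the helper `prob` are assumptions made for this example, not part of the original text. It also shows the standard correction for overlapping events, subtracting the probability of the overlap.

```python
from fractions import Fraction
from itertools import product

# Fair six-sided die: "1" and "6" cannot both happen, so the probabilities add.
p_one = Fraction(1, 6)
p_six = Fraction(1, 6)
p_one_or_six = p_one + p_six

# Standard 52-card deck (hypothetical encoding: (rank, suit) tuples).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    """Probability that one card drawn at random satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

p_ace = prob(lambda c: c[0] == "A")
p_heart = prob(lambda c: c[1] == "hearts")
p_ace_and_heart = prob(lambda c: c[0] == "A" and c[1] == "hearts")
p_ace_or_heart = prob(lambda c: c[0] == "A" or c[1] == "hearts")

print("P(1 or 6)        =", p_one_or_six)
print("naive P(A)+P(H)  =", p_ace + p_heart)
print("P(ace or heart)  =", p_ace_or_heart)
# Subtracting the overlap recovers the counted value.
print("P(A)+P(H)-P(A,H) =", p_ace + p_heart - p_ace_and_heart)
```

Counting gives 16 cards out of 52 that are an ace or a heart, one fewer than the naive sum suggests.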

**Independent**: If I roll two dice, what’s the probability they will both show 6? The score on one die does not affect the score on the other, so we can multiply: 1/6 × 1/6 = 1/36. If I draw two cards from the same deck, what’s the probability they will both be aces? If the first card is an ace, there are now only three aces left among the remaining 51 cards, so the second draw is affected by the first draw. The draws are not independent, and we can’t just multiply 1/13 × 1/13.
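The same counting approach works for the two-dice and two-card questions. This is a sketch under assumptions: cards are encoded as `(rank, suit)` integer pairs with rank 1 standing for the ace, and the two cards are drawn without replacement.

```python
from fractions import Fraction
from itertools import permutations, product

# Two dice: all 36 ordered outcomes are equally likely.
rolls = list(product(range(1, 7), repeat=2))
p_double_six = Fraction(sum(1 for a, b in rolls if a == 6 and b == 6), len(rolls))

# Two cards drawn without replacement: ordered pairs of distinct cards.
deck = [(rank, suit) for rank in range(1, 14) for suit in range(4)]  # rank 1 = ace
draws = list(permutations(deck, 2))
p_two_aces = Fraction(
    sum(1 for first, second in draws if first[0] == 1 and second[0] == 1),
    len(draws),
)

# Multiplying as if the draws were independent gives a different answer.
p_naive = Fraction(1, 13) * Fraction(1, 13)

print("P(two sixes)         =", p_double_six)
print("P(two aces), counted =", p_two_aces)
print("P(two aces), naive   =", p_naive)
```

The counted value agrees with multiplying the probability of a first ace, 4/52, by the probability of a second ace given the first, 3/51.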

### Probability with two variables

When we are working with two or more variables, we need additional terminology: marginal, conditional and joint (or conjoint) probabilities. We’ll use an example to explain these:

To test the effect of vitamin C on the probability of getting a common cold during winter, 818 volunteers were recruited and randomly assigned to receive vitamin C or a placebo. After the winter, each was interviewed to determine if they had had a cold [data from Anderson, Reid and Beaton (1972) Vitamin C and the common cold, *Canadian Medical Association Journal* 107:503-508]. The results are shown below:

| | cold | no cold | totals |
|---|---|---|---|
| vitamin C | 302 | 105 | 407 |
| no vitamin C | 335 | 76 | 411 |
| totals | 637 | 181 | 818 |


What’s the probability that if we take one volunteer at random they had a cold **and** took vitamin C? This is the **joint probability**. We can calculate that as $${\rm I\!P}(cold \cap vitamin) = \frac {302}{818} = 0.37$$ Here we are using the symbol $\cap$ for intersection, i.e., the people who belong to both the vitamin group and the cold group.

What’s the probability that a person who took vitamin C had a cold? Now we are only interested in the 407 volunteers who took the vitamin, and we calculate the **conditional probability** as $${\rm I\!P}(cold | vitamin) = \frac {302}{407} = 0.74$$ The vertical bar, |, means “given that”.

What’s the probability that a volunteer got a cold? We don’t care if they took the vitamin or not, just whether they got a cold. Here we need the number in the bottom margin, 637, and calculate the **marginal probability** with $${\rm I\!P}(cold) = \frac {637}{818} = 0.78$$
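The three kinds of probability can be computed directly from the four cell counts of the trial. A minimal sketch, with the dictionary layout and variable names chosen purely for illustration:

```python
# Cell counts from the vitamin C trial: (treatment, outcome) -> volunteers.
counts = {
    ("vitamin", "cold"): 302,
    ("vitamin", "no cold"): 105,
    ("no vitamin", "cold"): 335,
    ("no vitamin", "no cold"): 76,
}

total = sum(counts.values())
n_vitamin = sum(n for (treatment, _), n in counts.items() if treatment == "vitamin")
n_cold = sum(n for (_, outcome), n in counts.items() if outcome == "cold")

# Joint: cold AND vitamin, out of everyone.
p_joint = counts[("vitamin", "cold")] / total
# Conditional: cold, out of those who took the vitamin.
p_cold_given_vitamin = counts[("vitamin", "cold")] / n_vitamin
# Marginal: cold, out of everyone, ignoring treatment.
p_cold = n_cold / total

print(round(p_joint, 2), round(p_cold_given_vitamin, 2), round(p_cold, 2))
# prints: 0.37 0.74 0.78
```

The only difference between the joint and the conditional probability is the denominator: all 818 volunteers versus the 407 who took the vitamin.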

### Deriving Bayes’ Rule

Notice that we can calculate the joint probability from the marginal and conditional probabilities. We look at the people who got colds and the proportion of those who took vitamin C: $${\rm I\!P}(vitamin\cap cold) = {\rm I\!P}(vitamin | cold) {\rm I\!P}(cold)$$ $$ = \frac {302}{637} \frac {637}{818} = \frac {302}{818} = 0.37$$

We can do it the other way around too: $${\rm I\!P}(vitamin\cap cold) = {\rm I\!P}(cold \cap vitamin) = {\rm I\!P}(cold| vitamin) {\rm I\!P}(vitamin)$$ Hence $$ {\rm I\!P}(vitamin | cold) {\rm I\!P}(cold) = {\rm I\!P}(cold| vitamin) {\rm I\!P}(vitamin)$$ And with a little bit of algebra we have $$ {\rm I\!P}(vitamin | cold) = \frac{ {\rm I\!P}(cold| vitamin) {\rm I\!P}(vitamin)}{ {\rm I\!P}(cold) }$$
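As a numerical check on the algebra, the final identity can be evaluated with the counts from the trial and compared with the value read directly off the table, 302/637. This is a sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the vitamin C trial.
n_total = 818
n_vitamin = 407
n_cold = 637
n_vitamin_and_cold = 302

p_cold_given_vitamin = Fraction(n_vitamin_and_cold, n_vitamin)
p_vitamin = Fraction(n_vitamin, n_total)
p_cold = Fraction(n_cold, n_total)

# Both factorisations give the same joint probability.
joint_via_vitamin = p_cold_given_vitamin * p_vitamin
joint_via_cold = Fraction(n_vitamin_and_cold, n_cold) * p_cold

# Rearranged form: P(vitamin | cold) from the other three quantities.
p_vitamin_given_cold = p_cold_given_vitamin * p_vitamin / p_cold

print("joint:", joint_via_vitamin, "=", joint_via_cold)
print("P(vitamin | cold) =", p_vitamin_given_cold,
      "or about", round(float(p_vitamin_given_cold), 2))
```

With exact fractions the two routes agree exactly, so there is no rounding to obscure the identity.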

Let’s replace vitamin and cold with the usual things we are interested in: parameter values $\theta$ and data: $$ {\rm I\!P}(\theta| data) = \frac{ {\rm I\!P}(data| \theta) {\rm I\!P}(\theta)}{ {\rm I\!P}(data) }$$ This is the usual form of Bayes’ Rule, and it can be derived directly from the axioms of probability theory.

- ${\rm I\!P}(\theta| data)$ is the **posterior** probability, based on the likelihood and the prior.
- ${\rm I\!P}(data| \theta)$ is the **likelihood**, the probability of observing the data for given values of $\theta$.
- ${\rm I\!P}(\theta)$ is the **prior** probability for $\theta$ before considering the data.
- ${\rm I\!P}(data)$ is the marginal likelihood, the probability of observing the data ignoring the values of $\theta$.

The marginal likelihood is usually impossible to calculate for models with multiple parameters, but for a given data set it does not depend on $\theta$. We can then use the simpler form: $$ {\rm I\!P}(\theta| data) \propto {\rm I\!P}(data| \theta) {\rm I\!P}(\theta) $$ We can still get proper posterior distributions, since we know that probabilities must add (or, for continuous $\theta$, integrate) to one.
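To see how the proportional form still yields a proper posterior, here is a small sketch under stated assumptions. We take $\theta$ to be the probability that a volunteer on vitamin C gets a cold, restrict it to a grid of 99 candidate values, use a flat prior, and assume a binomial likelihood for the 302 colds among 407 volunteers. None of these modelling choices come from the text above; they are hypothetical, chosen only to illustrate the normalisation step.

```python
import math

# Data for the vitamin C group (from the trial table).
colds, n = 302, 407

# Assumption: a discrete grid of candidate theta values, 0.01 to 0.99.
thetas = [i / 100 for i in range(1, 100)]
# Assumption: flat prior (unnormalised; any constant would do).
prior = [1.0 for _ in thetas]

def log_likelihood(theta):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return colds * math.log(theta) + (n - colds) * math.log(1 - theta)

# Unnormalised log-posterior = log-likelihood + log-prior.
log_unnorm = [log_likelihood(t) + math.log(p) for t, p in zip(thetas, prior)]

# Subtract the maximum before exponentiating to avoid numerical underflow.
shift = max(log_unnorm)
unnorm = [math.exp(v - shift) for v in log_unnorm]

# Normalise: divide by the sum so the posterior probabilities add to one.
norm_const = sum(unnorm)
posterior = [u / norm_const for u in unnorm]

best = thetas[posterior.index(max(posterior))]
print("sum of posterior:", sum(posterior))
print("most probable grid value of theta:", best)
```

The constants that were dropped (the binomial coefficient, the scale of the prior, the shift) all cancel when dividing by the sum, which is why the marginal likelihood never has to be computed explicitly here.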