10.4 Conditional Probabilities and Bayes’ Rule
In order to understand Bayesian statistics we first need to understand Bayes’ rule and conditional probability. So let’s look at an example.
Example 10.1 The following table shows results from a survey of people’s views on the economy and whether or not they voted for the President in the last election. Percentages are reported as decimals. Probability tables are a clever way to organize this information.
| Probability | Optimistic view on economy | Pessimistic view on economy | Total |
|---|---|---|---|
| Voted for the President | 0.20 | 0.20 | 0.40 |
| Did not vote for the President | 0.15 | 0.45 | 0.60 |
| Total | 0.35 | 0.65 | 1.00 |
Solution. Based on the probability table, we define the following probabilities:
- The probability you voted for the President and have an optimistic view on the economy is 0.20
- The probability you did not vote for the President and have an optimistic view on the economy is 0.15
- The probability you voted for the President and have a pessimistic view on the economy is 0.20
- The probability you did not vote for the President and have a pessimistic view on the economy is 0.45
We calculate the probability of having an optimistic view on the economy by adding the probabilities of an optimistic view, whether or not the person voted for the President. For this example, this probability is 0.20 + 0.15 = 0.35. Similarly, the probability you have a pessimistic view on the economy is 0.20 + 0.45 = 0.65. Notice that these two probabilities (optimistic and pessimistic views on the economy) sum to 1, or 100% of the outcomes.
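Although the arithmetic here is simple enough to do by hand, a short code sketch can make the bookkeeping explicit. The snippet below (in Python, with variable names chosen purely for illustration and not taken from the text) stores the joint probabilities from the table and recovers the row and column totals:

```python
# Joint probabilities from the survey table in Example 10.1.
# The dictionary keys and variable names are illustrative, not from the text.
joint = {
    ("voted", "optimistic"): 0.20,
    ("voted", "pessimistic"): 0.20,
    ("did not vote", "optimistic"): 0.15,
    ("did not vote", "pessimistic"): 0.45,
}

# Marginal (total) probabilities: sum over the outcome we are not interested in.
pr_optimistic = sum(p for (vote, view), p in joint.items() if view == "optimistic")
pr_pessimistic = sum(p for (vote, view), p in joint.items() if view == "pessimistic")
pr_voted = sum(p for (vote, view), p in joint.items() if vote == "voted")

print(round(pr_optimistic, 2))   # 0.35
print(round(pr_pessimistic, 2))  # 0.65
print(round(pr_voted, 2))        # 0.4
print(round(pr_optimistic + pr_pessimistic, 2))  # 1.0
```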
10.4.1 Conditional probabilities
A conditional probability is the probability of one outcome given that another outcome has occurred, written \(\mbox{Pr} (A | B)\), where Pr means “probability of an outcome” and \(A\) and \(B\) are two different outcomes or events. In probability theory you might study the following law of conditional probability:
\[\begin{equation} \begin{split} \mbox{Pr}(A \mbox { and } B) &= \mbox{Pr} (A \mbox{ given } B) \cdot \mbox{Pr}(B) \\ &= \mbox{Pr} (A | B) \cdot \mbox{Pr}(B) \\ &= \mbox{Pr} (B | A) \cdot \mbox{Pr}(A) \end{split} \tag{10.5} \end{equation}\]
Typically when expressing conditional probabilities we shorten the notation: we write \(\mbox{Pr}(A \mbox{ and } B)\) as \(\mbox{Pr}(AB)\) and \(\mbox{Pr}(A \mbox{ given } B)\) as \(\mbox{Pr}(A|B)\).
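For instance, using the table from Example 10.1: among the 0.40 who voted for the President, 0.20 hold an optimistic view, so \(\mbox{Pr}(\mbox{Optimistic} \, | \, \mbox{Voted}) = 0.20/0.40 = 0.50\), and Equation (10.5) recovers the joint probability

\[\begin{equation} \mbox{Pr}(\mbox{Voted and Optimistic}) = \mbox{Pr}(\mbox{Optimistic} \, | \, \mbox{Voted}) \cdot \mbox{Pr}(\mbox{Voted}) = 0.50 \cdot 0.40 = 0.20, \end{equation}\]

which matches the corresponding entry in the probability table.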
Solution. The probability you voted for the President given that you have an optimistic view of the economy follows from rearranging Equation (10.5):
\[\begin{equation} \begin{split} \mbox{Pr(Voted for President | Optimistic View on Economy)} = \\ \frac{\mbox{Pr(Voted for President and Optimistic View on Economy)}}{\mbox{Pr(Optimistic View on Economy)}} = \\ \frac{0.20}{0.35} = 0.57 \end{split} \tag{10.6} \end{equation}\]
The probability computed in Equation (10.6) seems telling. Contrast it with the overall probability that you voted for the President, which is 0.40. Perhaps your view of the economy does indeed influence whether or not you would vote to re-elect the President.
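As a quick numerical check of Equations (10.5) and (10.6), the sketch below (again in Python, with illustrative variable names) reproduces these values directly from the table:

```python
# Values taken from the probability table in Example 10.1.
pr_voted_and_optimistic = 0.20
pr_optimistic = 0.35
pr_voted = 0.40

# Equation (10.6): Pr(Voted | Optimistic) = Pr(Voted and Optimistic) / Pr(Optimistic)
pr_voted_given_optimistic = pr_voted_and_optimistic / pr_optimistic
print(round(pr_voted_given_optimistic, 2))  # 0.57

# Consistency with Equation (10.5): Pr(A and B) = Pr(A | B) * Pr(B)
print(round(pr_voted_given_optimistic * pr_optimistic, 2))  # 0.2, the table entry

# The conditional probability exceeds the unconditional probability of voting.
print(pr_voted_given_optimistic > pr_voted)  # True
```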
10.4.2 Bayes’ Rule: Application of Conditional Probabilities
How could we systematically incorporate prior information into a parameter estimation problem? We are going to introduce Bayes’ Rule, which is a rearrangement of the rule for conditional probability (Equation (10.5)):
\[\begin{equation} \mbox{Pr} (A | B) = \frac{ \mbox{Pr} (B | A) \cdot \mbox{Pr}(A)}{\mbox{Pr}(B) } \end{equation}\]
It turns out Bayes’ Rule is a really helpful way to understand how we can systematically incorporate prior information into the likelihood function (and, by extension, the cost function). For data assimilation problems our goal is to estimate parameters given data, so we can think of Bayes’ Rule in terms of parameters and data:
\[\begin{equation} \mbox{Pr}( \mbox{ parameters } | \mbox{ data }) = \frac{\mbox{Pr}( \mbox{ data } | \mbox{ parameters }) \cdot \mbox{ Pr}( \mbox{ parameters }) }{\mbox{Pr}(\mbox{ data }) }. \end{equation}\]
Here are a few observations from that last equation:
- The term \(\mbox{Pr}( \mbox{ data } | \mbox{ parameters })\) is similar to the model data residual, or the standard likelihood function.
- The term \(\mbox{Pr}( \mbox{ parameters })\) shows that prior information acts as a multiplicative factor on the likelihood function - this is good news! You will demonstrate in the homework that the log-likelihood is related to the cost function, so when we added that additional term to form \(\tilde{S}(b)\), we accounted for the prior information correctly.
- The expression \(\mbox{Pr}( \mbox{ parameters } | \mbox{ data })\) is the start of a framework for a probability density function, which should integrate to unity. (You will explore this more if you study probability theory.) In many cases we select parameters that optimize a likelihood or cost function, and the denominator \(\mbox{Pr}(\mbox{ data })\) does not change the location of those optimum values. This denominator term is called a normalizing constant. The sketch after this list illustrates both points.
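To make the last two observations concrete, here is a minimal sketch in Python. The normal likelihood, the normal prior, the made-up data, and the parameter grid are all assumptions chosen only for illustration; they are not part of the text. The sketch evaluates \(\mbox{Pr}(\mbox{ data } | \mbox{ parameters }) \cdot \mbox{Pr}(\mbox{ parameters })\) on a grid of values of a single parameter \(b\) and shows that dividing by the normalizing constant \(\mbox{Pr}(\mbox{ data })\) does not move the optimum, while it does make the result integrate to unity.

```python
import numpy as np

# Hypothetical setup (not from the text): observations assumed to come from a
# normal distribution with unknown mean b and known standard deviation 1.
data = np.array([1.2, 0.8, 1.5, 1.1])
b_grid = np.linspace(-2, 4, 601)  # candidate values of the parameter b

# Pr(data | parameters): the likelihood, a product of normal densities.
likelihood = np.array(
    [np.prod(np.exp(-0.5 * (data - b) ** 2) / np.sqrt(2 * np.pi)) for b in b_grid]
)

# Pr(parameters): an assumed normal prior on b, centered at 0 with standard deviation 1.
prior = np.exp(-0.5 * b_grid ** 2) / np.sqrt(2 * np.pi)

# Numerator of Bayes' Rule: the prior multiplies the likelihood.
unnormalized_posterior = likelihood * prior

# Pr(data): the normalizing constant, approximated here by numerical integration.
normalizing_constant = np.trapz(unnormalized_posterior, b_grid)
posterior = unnormalized_posterior / normalizing_constant

# Dividing by a constant does not move the location of the maximum ...
print(b_grid[np.argmax(unnormalized_posterior)])  # same value ...
print(b_grid[np.argmax(posterior)])               # ... as this one
# ... but it does make the posterior integrate to unity.
print(np.trapz(posterior, b_grid))  # approximately 1
```

Because the prior enters multiplicatively, taking logarithms turns it into an additive term alongside the log-likelihood, which is why adding that extra term to form \(\tilde{S}(b)\) accounts for prior information.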
In the following sections we will explore Bayes’ Rule in action and see how to utilize it for different types of cost functions - but wow, we made some significant progress in our conceptual understanding of how to combine models and data.