Probability is plausibility

In the fall of 2021, I taught a seminar on good reasoning. A central aim of the course was to teach students to think about probability as an extension of propositional logic. Although I failed miserably at teaching this perspective, I continue to believe that there’s value in the approach, which was championed by the physicist Edwin Thompson Jaynes, but goes back to the origins of probability theory with Laplace.

My notes from this section of the course are summarized below. Needless to say, none of this constitutes a formal proof of the rationality of probability theory. Rather, I wanted to present ‘intuitive’ justifications for the equations that I use myself to help me remember the rules.

Plausibility

When you hear the word “probability,” you probably think about events that you can count, such as coin tosses or dice rolls. But I instead want you to think about “plausibility.”

Take a true/false statement like “Donald Trump will win the 2024 US presidential election.” If this event happens, it will only happen once. Yet, intuitively, we can still reason about how plausible it is. At one extreme, we might think it is impossible. At the other extreme, we might think it is inevitable. Or perhaps we have no idea whether it will happen or not — that is, we think it is equally likely to happen or not happen. However, you probably appreciate that it is most reasonable to take a slightly different attitude. After all, for Trump to win the presidential election, he also needs to win the Republican primary, stay in the race, survive until election day, and so on. Even if he makes it on the final ballot, we know from past elections that the outcome is likely to be close to a coin toss. And so, with all our further uncertainty about the events leading up to this potential moment, it seems reasonable to expect that the statement “Donald Trump will win the 2024 US presidential election” is more likely to be false than true.

Of course, this line of thinking doesn’t get us to a precise understanding of probability in a numerical sense, yet hopefully a picture is emerging of how we might represent the plausibility of this statement as a number. For starters, this number must be bounded on both the left and the right because we cannot be more than certain of the statement’s truth or falsehood. By convention, we assign a plausibility of 0 to a statement that we believe is definitely false and a plausibility of 1 to a statement that we believe is definitely true. If we regard the statement as no more likely to be true than false — a coin flip — we take the middle plausibility of 0.5. Yet our strength of belief can also vary smoothly between 0 and 1. Indeed, we’ve already reasoned that the plausibility of Donald Trump’s winning the 2024 election should probably be somewhere between 0 and 0.5.

Notation

Let \(X\) be a true/false statement like the one about Donald Trump above. Then we write \(P(X)\) to indicate the plausibility we assign to that statement’s truth, which can range from 0 to 1.

Technically speaking, these plausibility values are always determined in the presence of some background information that we assume to be true. For example, I know that in November 2022 Trump announced his bid for the White House. Learning this information changed my belief about how plausible it is that he will be our next president. More generally, I have a broad set of relevant knowledge \(K\) that I draw upon to reason about the truth of \(X\). I take this knowledge as given when I’m assigning my plausibility. In notation, I write the things I take to be given to the right of a vertical line like so: \(P(X \mid K)\).

Oftentimes we take this knowledge as implicitly given and just write \(P(X)\). But we may also want to imagine how our beliefs would change if we knew some other information that we don’t currently know. For example, suppose I had the knowledge \(Y\) that Donald Trump will win the Republican primary. Clearly, the plausibility of \(X\) should go up; i.e.,

\[ P(X \mid K, Y) > P(X \mid K). \]

The ability to condition on knowledge we don’t yet have will prove critical to deriving some of the most important rules we discuss below. Moreover, we can engage in this exercise of imagination even if we think \(Y\) is unlikely to be true.

Negation

In basic propositional logic, we can flip the truth value of a proposition by negating it. That is, if it’s true that Donald Trump will win the next presidential election (\(X\)), it must be false that he will not win the next presidential election (a statement we can denote “\(\neg X\)”), and vice versa. What does this imply about our beliefs in \(X\) and \(\neg X\)?

Let’s start with the easy cases. Suppose we’re certain of \(X\), i.e., \(P(X)=1\). What follows about \(P(\neg X)\)? Clearly, if we’re certain that \(X\) is true, we should be certain that \(\neg X\) is false, i.e., \(P(\neg X) = 0\). Meanwhile, if we have no idea whether \(X\) is true — i.e., \(P(X) = 0.5\) — then it stands to reason that we also have no idea whether \(\neg X\) is true, i.e., \(P(\neg X) = 0.5\). Lastly, if we negate a statement twice, we should get back our original plausibility because \(X\) is logically equivalent to \(\neg \neg X\).

So, mathematically, we want a function that takes in a belief about \(X\) and outputs a reasonable belief about \(\neg X\) that satisfies the constraints above. Moreover, we want an incremental rise in \(P(X)\) to be accompanied by an equivalent incremental fall in \(P(\neg X)\). Putting this all together, the function we want is

\[ P(\neg X) = 1 - P(X). \]

This function maps 0 to 1 and 1 to 0, is equal to its input only when \(P(X) = 0.5\), and associates an incremental shift in \(P(X)\) with an equal and opposite shift in \(P(\neg X)\). Moreover, we can see that the rule of double negation holds as expected:

\[ P(\neg \neg X) = 1 - P(\neg X) = 1 - (1 - P(X)) = P(X). \]
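
If it helps to see this in code, here’s a minimal sketch (Python, with a made-up plausibility value) checking that the negation rule behaves as described:

```python
def negate(p):
    """Plausibility of ¬X given plausibility p of X."""
    return 1 - p

assert negate(1.0) == 0.0   # certain X is true  -> certain ¬X is false
assert negate(0.0) == 1.0   # certain X is false -> certain ¬X is true
assert negate(0.5) == 0.5   # total ignorance about X is total ignorance about ¬X

p = 0.25                    # some plausibility we might assign to X
assert negate(negate(p)) == p   # double negation recovers the original belief
```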

Conjunction

Suppose we are jointly contemplating the truth of two (or more) statements, such as

  • \(L\): Adam’s left eye is blue.

  • \(R\): Adam’s right eye is blue.

How should we assign a plausibility to the proposition that both of these statements are true, i.e., \(L \land R\)?

Can we write \(P(L \land R)\) simply in terms of our beliefs about each of these statements on their own, i.e., \(P(L)\) and \(P(R)\)? No! We’d be ignoring the possible dependence between the truth of \(L\) and the truth of \(R\). If a person’s left eye color always matched his right eye color1 — \(P(R \mid L) = 1\) — then we should be just as confident in \(L\) as we are in \(L \land R\), i.e.,

\[ P(L \land R) = P(L). \]

In contrast, if a person’s left eye color could never match his right eye color — \(P(R \mid L) = 0\) — then we know that, even if each statement is plausible on its own, \(L\) and \(R\) can’t both be true, i.e., \(P(L \land R) = 0\).

So we want to find a mathematical expression that scales \(P(L \land R)\) between 0 and \(P(L)\) based on how strongly we should believe in \(R\) once we know \(L\), aka \(P(R \mid L)\). As with negation, we want the strength of this conditional plausibility to smoothly interpolate between these bounds, such that if we knew that someone’s right eye color matched his left eye color 50% of the time, then \(P(L \land R)\) would be half of \(P(L)\). Hopefully by now you see that the mathematical function we’re after is multiplication:

\[ P(L \land R) = P(L)P(R \mid L). \]

In other words, we can evaluate the plausibility of two statements together by sequentially evaluating the plausibility of the first statement and then evaluating the plausibility of the second statement given that the first statement is true.
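
To make the scaling concrete, here’s a minimal sketch in Python; the plausibility values are made up purely for illustration:

```python
p_L = 0.3   # made-up plausibility that Adam's left eye is blue

# The conjunction scales P(L) by how strongly we believe R once we know L.
for p_R_given_L, note in [(1.0, "right eye always matches"),
                          (0.0, "right eye never matches"),
                          (0.5, "right eye matches half the time")]:
    print(note, "-> P(L and R) =", p_L * p_R_given_L)
# The result ranges from P(L) itself down to 0, passing through P(L)/2.
```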

Moreover, the order in which we evaluate these statements shouldn’t matter. In propositional logic, we know that the truth value of \(L \land R\) is the same as the truth value of \(R \land L\). So we should believe in them to the same degree, i.e., \(P(L \land R) = P(R \land L)\). This implies that we could just as well write the above expression by evaluating the plausibility of \(R\) first and then the plausibility of \(L\) given \(R\):

\[ P(L \land R) = P(R)P(L \mid R). \]

In turn, this implies that

\[ P(L)P(R \mid L) = P(R)P(L\mid R). \]

If we move \(P(L)\) to the right-hand side, we derive one of the most important formulas in probability theory, Bayes’ theorem:

\[ P(R \mid L) = \frac{P(R)P(L \mid R)}{P(L)}. \]

In the case of eye color, this theorem is not so useful. But it’s incredibly useful if \(R\) is some hypothesis we want to evaluate and \(L\) is some data that we’ve recently learned about — a topic for another time.
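
Still, for a toy illustration of the update, here’s a minimal sketch with made-up plausibilities for the eye-color statements; the point is just the arithmetic of the theorem:

```python
# Made-up plausibilities for the eye-color statements; illustration only.
p_R = 0.3            # prior plausibility that Adam's right eye is blue
p_L_given_R = 0.9    # if his right eye is blue, his left eye is very likely blue too
p_L = 0.3            # prior plausibility that Adam's left eye is blue

# Bayes' theorem: update our belief in R after learning L.
p_R_given_L = p_R * p_L_given_R / p_L
print(round(p_R_given_L, 3))   # 0.9 -- learning L raised our belief in R from 0.3 to 0.9
```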

(In)dependence

In deriving the plausibility of a conjunction of two statements, we’ve seen that it is essential to consider the informational dependence between these conjuncts. We can’t just consider \(P(L)\) and \(P(R)\) separately; we need to know \(P(R \mid L)\) or \(P(L \mid R)\), depending on the order in which we’re evaluating statements. Sometimes, though, knowing the truth of one statement doesn’t change our belief about the truth of the other statement, e.g., \(P(R) = P(R \mid L)\). For example, knowing my left eye color probably doesn’t change our belief about some other random person’s left eye color. When this is the case, we say that \(L\) and \(R\) are independent.

But the concept of independence is a bit more nuanced. Recall that the plausibility we assign to a statement is always at least implicitly assessed in the context of some background knowledge \(K\). Thus, we should think of independence as dependent on this knowledge. \(L\) and \(R\) are conditionally independent if

\[ P(R \mid K) = P(R \mid L,K). \]

But this relationship could change if our background knowledge changes. For example, suppose I’m on an alien planet on which I observe that one of the aliens has an orange left eye. Does this increase my belief that another random alien’s left eye will also be orange? Yes: I don’t know anything about the general composition of these aliens’ eye colors, so learning that one alien has an orange eye should increase my belief that other aliens do too. In contrast, here on planet Earth, my knowledge of the statistics of human eye colors renders my eye color and the eye color of some other random person conditionally independent.

Even weirder, sometimes new information can create conditional dependence. For example, perhaps you start with the belief that my left eye color and Cindy’s left eye color are independent, but later learn that we’re both in a special club in which everyone is required to have the same eye color. Now if you learn that my left eye is blue, you should come to believe that it’s very likely that Cindy’s left eye is also blue. That is, these statements are highly dependent once we condition on knowledge about the special club.
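
If you like, you can check this with a toy joint distribution. In the sketch below (Python, made-up numbers), the first table encodes background knowledge under which the two eye colors carry no information about each other, and the second encodes the same-eye-color club scenario:

```python
# Made-up joint plausibilities over (my left eye is blue, Cindy's left eye is blue).

# Background knowledge K: nothing connects our eye colors.
joint_plain = {(True, True): 0.09, (True, False): 0.21,
               (False, True): 0.21, (False, False): 0.49}

# Background knowledge K': we're both in the same-eye-color club,
# so mismatched combinations get zero plausibility.
joint_club = {(True, True): 0.30, (True, False): 0.00,
              (False, True): 0.00, (False, False): 0.70}

def p_cindy_blue(joint):
    return sum(p for (_, cindy), p in joint.items() if cindy)

def p_cindy_blue_given_mine_blue(joint):
    p_mine_blue = sum(p for (mine, _), p in joint.items() if mine)
    return joint[(True, True)] / p_mine_blue

for label, joint in [("without club knowledge", joint_plain),
                     ("with club knowledge", joint_club)]:
    print(label,
          round(p_cindy_blue(joint), 2),
          round(p_cindy_blue_given_mine_blue(joint), 2))
# Without the club, learning my eye color leaves P(Cindy blue) at 0.3;
# with the club, it jumps from 0.3 to 1.0 -- dependence created by K'.
```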

Disjunction

Technically speaking, we can now define the plausibility of any compound logical statement in terms of the plausibility of each of its individual statements. Once we’ve learned rules for how to deal with “and” and “not,” we can write any statement with “or” in terms of these other two operations. Specifically,

\[ A \lor B \equiv \neg(\neg A \land \neg B), \]

i.e., “A or B” is true if and only if it’s not the case that both A is false and B is false.2 Armed with this knowledge, we can derive a logical plausibility for disjunction based on our earlier rules:

\[ P(A \lor B) = 1 - P(\neg A \land \neg B). \]

Although the above formula is often useful, I leave it as an exercise to the reader to show how you can further reduce this formula to the following well-known expression:

\[P(A \lor B) = P(A) + P(B) - P(A \land B).\]

It is often helpful to think about this equation in terms of the visual metaphor of a Venn diagram where our belief in \(A \lor B\) is the combined area of two circles that represent \(P(A)\) and \(P(B)\), with their overlapping region \(P(A \land B)\) subtracted out so that it is not double-counted.
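
Here’s a minimal numerical check (Python, with a made-up joint distribution over the four ways \(A\) and \(B\) could turn out) that both routes to \(P(A \lor B)\) agree:

```python
import math

# Made-up joint plausibilities over the four ways A and B could turn out.
joint = {(True, True): 0.10, (True, False): 0.25,
         (False, True): 0.30, (False, False): 0.35}

p_A             = joint[(True, True)] + joint[(True, False)]
p_B             = joint[(True, True)] + joint[(False, True)]
p_A_and_B       = joint[(True, True)]
p_notA_and_notB = joint[(False, False)]

# Route 1: via negation and conjunction.
p_A_or_B_v1 = 1 - p_notA_and_notB
# Route 2: the well-known inclusion-exclusion form.
p_A_or_B_v2 = p_A + p_B - p_A_and_B

assert math.isclose(p_A_or_B_v1, p_A_or_B_v2)
print(p_A_or_B_v1)   # 0.65
```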

Mutual exclusivity

If it’s impossible for both \(A\) and \(B\) to be true at the same time — that is, they are mutually exclusive, \(P(A \land B) = 0\) — then we can simplify the above equation even more:

\[ P(A \lor B) = P(A) + P(B). \]

In other words, in the case of mutually exclusive possibilities, to derive our belief about a disjunction of statements we can conveniently just add up our beliefs about the individual statements.
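
For instance, the faces of a single die roll are mutually exclusive, so their plausibilities simply add (a minimal sketch):

```python
# The six faces of a die roll are mutually exclusive, so plausibilities just add.
p_face = {face: 1/6 for face in range(1, 7)}

p_one_or_two = p_face[1] + p_face[2]
print(round(p_one_or_two, 3))   # 0.333 -- no overlap term to subtract
```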

To take a special case, a statement \(A\) and its negation \(\neg A\) are mutually exclusive, as the truth of one entails the falsehood of the other. This implies

\[ P(A \lor \neg A) = P(A) + P(\neg A) = P(A) + (1 - P(A)) = 1, \]

which makes sense: we should be certain in the tautology that either \(A\) or \(\neg A\) is true.

Marginalization

So far, we’ve focused on rules that allow us to derive the plausibility of a complex expression in terms of the plausibility of simpler statements. But sometimes we want to do the opposite: figure out a reasonable plausibility for a simple statement based on information we have about more complex statements. One very useful trick for doing this is known as “marginalization.”

From logic, we know that any proposition \(A\) is equivalent to \(A \land (B \lor \neg B)\), since \(B \lor \neg B\) is a tautology, as noted earlier. Borrowing again from the rules of propositional logic, we can distribute out the conjunction to rewrite this expression as \((A \land B) \lor (A \land \neg B)\). Since logically equivalent expressions should be given the same plausibility, we therefore know that

\[P(A) = P((A \land B) \lor (A \land \neg B)).\]

Further note that \(A \land B\) cannot be true if \(A \land \neg B\) is true; that is, they are mutually exclusive, which allows us to simply add the plausibility of one expression to the plausibility of the other:

\[P(A) = P(A \land B) + P(A \land \neg B).\]

Finally, we can simplify this expression further by applying our rules for conjunction:

\[P(A) = P(B)P(A \mid B) + P(\neg B)P(A \mid \neg B).\]

In other words, we have learned that the plausibility we assign to \(A\) must be a weighted average of the plausibility we would assign to \(A\) in the world in which \(B\) is true and the plausibility we would assign to \(A\) in the world in which \(B\) is false.

Of course, this trick is only useful if (a) we do not know \(P(A)\) directly, and (b) we know (or can estimate) all the terms on the right-hand side of the equation above. For example, suppose we’re trying to assess whether Jim murdered his wife (\(A\)). If he was home at the time of the murder (\(B\)), we know from other background information that it’s quite likely that he’s the murderer, i.e., \(P(A \mid B)\) is close to 1. If Jim wasn’t home at that time (\(\neg B\)), it’s much less likely he’s the murderer, i.e., \(P(A \mid \neg B)\) is close to 0. A couple of eyewitnesses recall seeing Jim near his house a few minutes before the murder was believed to have taken place. Based on these eyewitness accounts, we can estimate \(P(B)\) (and \(P(\neg B)\), which is just \(1 - P(B)\)). Thus, the marginalization formula above allows us to use all of this information to form an overall impression of whether Jim is guilty.
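
Here’s how such numbers might combine in code; the specific values below are made up purely for illustration:

```python
# Purely illustrative numbers for the Jim example.
p_guilty_if_home     = 0.9    # P(A | B): very likely guilty if he was home
p_guilty_if_not_home = 0.1    # P(A | not B): much less likely otherwise
p_home               = 0.7    # P(B): estimated from the eyewitness accounts

p_guilty = p_home * p_guilty_if_home + (1 - p_home) * p_guilty_if_not_home
print(round(p_guilty, 2))     # 0.66 -- a weighted average of the two conditional beliefs
```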

There’s another more philosophical implication of this theorem. Suppose that in the future we expect to get some evidence that will bear on Jim’s guilt. Perhaps we will learn whether or not blood from the crime scene matches Jim’s DNA (call this evidence \(E\)). If the DNA matches, we will come to believe \(P(A \mid E)\), and if it doesn’t match, we will come to believe \(P(A \mid \neg E)\). The marginalization theorem tells us that what we should believe now about Jim’s guilt, \(P(A)\), must be the average of what we will believe once we get this evidence, weighted by how likely it is that this evidence turns up positive or negative:

\[ P(A) = P(E)P(A \mid E) + P(\neg E)P(A \mid \neg E). \]

Simply put, today’s belief is what we expect3 to believe tomorrow. Can you see why this makes logical sense?
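
To see it numerically, here’s a minimal sketch starting from a made-up joint distribution over Jim’s guilt and the DNA evidence; the two possible future beliefs, weighted by how likely each evidence outcome is, recover today’s belief exactly:

```python
import math

# Made-up joint plausibilities over (A: Jim is guilty, E: the DNA matches).
joint = {(True, True): 0.40, (True, False): 0.05,
         (False, True): 0.05, (False, False): 0.50}

p_A    = joint[(True, True)] + joint[(True, False)]   # today's belief in guilt
p_E    = joint[(True, True)] + joint[(False, True)]   # how likely the DNA is to match
p_notE = 1 - p_E

p_A_given_E    = joint[(True, True)] / p_E            # belief if the DNA matches
p_A_given_notE = joint[(True, False)] / p_notE        # belief if it doesn't

expected_future_belief = p_E * p_A_given_E + p_notE * p_A_given_notE
assert math.isclose(expected_future_belief, p_A)      # today's belief = expected future belief
```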

Lastly, note that this theorem can generalize to more than two possibilities. Let \(E_1, E_2, \dots E_n\) be a set of mutually exclusive and exhaustive ways the evidence could turn out. Then,

\[ P(A) = \sum_{i=1}^{n}P(A \land E_i)=\sum_{i=1}^{n}P(E_i)P(A \mid E_i). \]
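
As a toy illustration with three made-up evidence outcomes (say the DNA test could come back a match, no match, or inconclusive):

```python
# Made-up numbers for three mutually exclusive, exhaustive evidence outcomes.
evidence = {
    "match":        {"p_E": 0.40, "p_A_given_E": 0.90},
    "no match":     {"p_E": 0.35, "p_A_given_E": 0.05},
    "inconclusive": {"p_E": 0.25, "p_A_given_E": 0.45},
}

p_A = sum(e["p_E"] * e["p_A_given_E"] for e in evidence.values())
print(round(p_A, 2))   # 0.49 -- the prior is a weighted average over all n outcomes
```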


  1. I am a counterexample to this rule.↩︎

  2. I’m using the inclusive “or” here. I’ll leave it to you to figure out how to define the exclusive “or.”↩︎

  3. I mean “expect” in the mathematical sense of expected value (i.e., taking an average).↩︎

Adam Bear
Research/Data Scientist