class: center, middle, inverse, title-slide

# Bayes’ Rule

## PSYC 573

### University of Southern California

### January 25, 2022

---

# Inverse Probability

Conditional probability: `\(P(A \mid B) = \dfrac{P(A, B)}{P(B)}\)`

which yields `\(P(A, B) = P(A \mid B) P(B)\)` (joint = conditional `\(\times\)` marginal)

--

On the other side, `\(P(B \mid A) = \dfrac{P(B, A)}{P(A)}\)`

---

# Bayes' Theorem

`$$P(B \mid A) = \dfrac{P(A \mid B) P(B)}{P(A)}$$`

which says how to go from `\(P(A \mid B)\)` to `\(P(B \mid A)\)`

--

Consider `\(B_i\)` `\((i = 1, \ldots, n)\)` as one of `\(n\)` mutually exclusive and exhaustive events:

`\begin{align}
  P(B_i \mid A) & = \frac{P(A \mid B_i) P(B_i)}{P(A)} \\
  & = \frac{P(A \mid B_i) P(B_i)}{\sum_{k = 1}^n P(A \mid B_k) P(B_k)}
\end{align}`

---

class: clear

> A police officer stops a driver *at random* and administers a breathalyzer test. The breathalyzer is known to detect true drunkenness 100% of the time, but in **1%** of cases it gives a *false positive* when the driver is sober. We also know that, in general, for every **1,000** drivers passing through that spot, **one** is driving drunk. Suppose that the breathalyzer shows positive for the driver. What is the probability that the driver is truly drunk?

---
exclude: true
class: clear

.pull-left[

`\(P(\text{positive} \mid \text{drunk}) = 1\)`

`\(P(\text{positive} \mid \text{sober}) = 0.01\)`

]

.pull-right[

`\(P(\text{drunk}) = 1 / 1000\)`

`\(P(\text{sober}) = 999 / 1000\)`

]

--
exclude: true

Using Bayes' theorem,

.font70[

`\begin{align}
  & \quad\; P(\text{drunk} \mid \text{positive}) \\
  & = \frac{P(\text{positive} \mid \text{drunk}) P(\text{drunk})}
           {P(\text{positive} \mid \text{drunk}) P(\text{drunk}) + P(\text{positive} \mid \text{sober}) P(\text{sober})} \\
  & = \frac{1 \times 0.001}{1 \times 0.001 + 0.01 \times 0.999} \\
  & = 100 / 1099 \approx 0.091
\end{align}`

]

--
exclude: true

So there is less than a 10% chance that the driver is drunk, even though the breathalyzer shows positive.

---
exclude: true
class: clear

> A. Even with the breathalyzer showing positive, it is still very likely that the driver is not drunk

> B. On the other hand, before the breathalyzer result, the person has only a 0.1% chance of being drunk. The breathalyzer result increases that probability to 9.1% (i.e., 91 times larger)

--
exclude: true

Both (A) and (B) are true. It just means that there is still much uncertainty after one positive test

???

Having a second test may be helpful, assuming that what causes a false positive in the first test does not guarantee a false positive in the second test (otherwise, the second test is useless). That's one reason for not having consecutive tests too close in time.
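---
exclude: true
class: clear

### The breathalyzer example in R

The same numbers can be reproduced with a few lines of R. This is a minimal sketch; the variable names are mine, not from any package:

```r
# Known quantities from the problem
p_pos_drunk <- 1        # P(positive | drunk)
p_pos_sober <- 0.01     # P(positive | sober), the false positive rate
p_drunk <- 1 / 1000     # P(drunk), the prior
p_sober <- 1 - p_drunk  # P(sober)
# Marginal probability of a positive test (the denominator)
p_pos <- p_pos_drunk * p_drunk + p_pos_sober * p_sober
# Bayes' theorem: posterior probability of being drunk
p_pos_drunk * p_drunk / p_pos  # 100 / 1099, approximately 0.091
```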
---

# Gigerenzer (2004)

`\(p\)` value = `\(P\)`(data | hypothesis), not `\(P\)`(hypothesis | data)

--

Consider:

- `\(H_0\)`: the person is sober (not drunk)
- data: breathalyzer result

`\(p\)` = `\(P\)`(positive | sober) = 0.01 `\(\rightarrow\)` reject `\(H_0\)` at the .05 level

--

However, as we have seen, because `\(P(H_0)\)` is high to begin with, `\(P(H_0 \mid \text{data})\)` is still high (`\(\approx .909\)`), despite the small `\(p\)` value

---
class: inverse, middle, center

# Bayesian Data Analysis

---

# Bayes' Theorem in Data Analysis

- Bayesian statistics
    * more than applying Bayes' theorem
    * a way to quantify the plausibility of every possible value of some parameter `\(\theta\)`
    * E.g., population mean, regression coefficient, etc.
    * Goal: **update one's belief about `\(\theta\)` based on the observed data `\(D\)`**

---
class: clear

### Going back to the example

Goal: Find the probability that the person is drunk, given the test result

Parameter (`\(\theta\)`): drunkenness status (possible values: drunk, sober)

Data (`\(D\)`): test result (possible values: positive, negative)

--

Bayes' theorem: `\(\underbrace{P(\theta \mid D)}_{\text{posterior}} = \underbrace{P(D \mid \theta)}_{\text{likelihood}} \underbrace{P(\theta)}_{\text{prior}} / \underbrace{P(D)}_{\text{marginal}}\)`

---
class: clear

Usually, the marginal is not given, so

`$$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{\sum_{\theta^*} P(D \mid \theta^*)P(\theta^*)}$$`

- `\(P(D)\)` is also called the *evidence*, or the *prior predictive distribution*
    * E.g., the probability of a positive test, regardless of drunkenness status

---

# Example 2

```r
shiny::runGitHub("plane_search", "marklhc")
```

- Try choosing different priors. How does your choice affect the posterior?
- Try adding more data. How does the number of data points affect the posterior?

---
class: clear

The posterior is a synthesis of two sources of information: the prior and the data (likelihood)

Generally speaking, a narrower distribution (i.e., smaller variance) carries more/stronger information

- Prior: narrower = more informative/stronger
- Likelihood: narrower = more data/more informative

---
exclude: true

Exercise:

- Shiny app with a parameter (fixed)
- Ask students to formulate a prior distribution
- Flip a coin, and compute the posterior by hand (with R)
- Use the posterior as the prior, flip again, and obtain the posterior again
- Compare to using the original prior with two coin flips (both numbers and plots)
- Flip 10 times, and show how the posterior changes (using animation in `knitr`)

---

# Setting Priors

.font70[

- Flat, noninformative, vague
- Weakly informative: common sense, logic
- Informative: publicly agreed facts or theories

]

<img src="bayes_rule_files/figure-html/unnamed-chunk-3-1.png" width="95%" style="display: block; margin: auto;" />

> *Prior beliefs used in data analysis must be admissible by a skeptical scientific audience (Kruschke, 2015, p. 115)*

---

# Likelihood/Model/Data

`\(P(D \mid \theta, M)\)`

> Probability of observing the data **as a function of the parameter(s)**

.font70[

- Also written as `\(L(\theta \mid D)\)` or `\(L(\theta; D)\)` to emphasize that it is a function of `\(\theta\)`
- Also depends on a chosen model `\(M\)`: `\(P(D \mid \theta, M)\)`

]

<img src="bayes_rule_files/figure-html/unnamed-chunk-4-1.png" width="70%" style="display: block; margin: auto;" />
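---
class: clear

### Sketch: evaluating a likelihood in R

As an illustration of "probability of the data as a function of `\(\theta\)`", here is a minimal sketch that plots a likelihood curve; the data (one head in four coin flips) are hypothetical and anticipate the Bernoulli example coming up:

```r
# Likelihood of theta for hypothetical binary data: 1 head, 3 tails
theta <- seq(0, 1, by = 0.01)   # grid of candidate parameter values
lik <- theta^1 * (1 - theta)^3  # Bernoulli likelihood for y = (1, 0, 0, 0)
plot(theta, lik,
     type = "l",
     xlab = expression(theta), ylab = "Likelihood")
```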
---

# Likelihood of Multiple Data Points

1. Given `\(D_1\)`, obtain the *posterior* `\(P(\theta \mid D_1)\)`
2. Use `\(P(\theta \mid D_1)\)` as the *prior*; given `\(D_2\)`, obtain the posterior `\(P(\theta \mid D_1, D_2)\)`

The posterior is the same as the one from observing `\(D_2\)` first and then `\(D_1\)`, or `\(D_1\)` and `\(D_2\)` together, if

- **data-order invariance** is satisfied, which means
- `\(D_1\)` and `\(D_2\)` are **exchangeable**

---
class: clear

# Exchangeability

The joint distribution of the data does not depend on the order of the data

E.g., `\(P(D_1, D_2, D_3) = P(D_2, D_3, D_1) = P(D_3, D_2, D_1)\)`

Examples of non-exchangeable data:

- First child = male, second = female vs. first = female, second = male
- `\(D_1, D_2\)` from School 1; `\(D_3, D_4\)` from School 2 vs. `\(D_1, D_3\)` from School 1; `\(D_2, D_4\)` from School 2

---
class: inverse, middle, center

# An Example With Binary Outcomes

---

# Coin Flipping

Q: Estimate the probability that a coin gives a head

- `\(\theta\)`: parameter, the probability of a head

Flip a coin once; it shows a head

- `\(y = 1\)` for a head

> How do you obtain the likelihood?

---

# Bernoulli Likelihood

The likelihood depends on the probability model chosen

- Some models are commonly used, for historical/computational/statistical reasons

One natural choice for binary outcomes is the **Bernoulli model**

`$$\begin{align}
  P(y = 1 \mid \theta) & = \theta \\
  P(y = 0 \mid \theta) & = 1 - \theta
\end{align}$$`

--

The above requires separating `\(y = 1\)` and `\(y = 0\)`. A more compact way is

`$$P(y \mid \theta) = \theta^y (1 - \theta)^{(1 - y)}$$`

---

# Multiple Observations

Assuming the flips are exchangeable given `\(\theta\)`,

`$$\begin{align}
  P(y_1, \ldots, y_N \mid \theta) & = \prod_{i = 1}^N P(y_i \mid \theta) \\
  & = \theta^{\sum_{i = 1}^N y_i} (1 - \theta)^{\sum_{i = 1}^N (1 - y_i)} \\
  & = \theta^z (1 - \theta)^{N - z}
\end{align}$$`

`\(z\)` = # of heads; `\(N\)` = # of flips

> Note: the likelihood depends only on the number of heads, not on the particular sequence of observations
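---
class: clear

### Sketch: the order does not matter

A quick check of the note above in R, using made-up sequences with the same number of heads (`\(z = 3\)`, `\(N = 5\)`) and an arbitrary `\(\theta\)`:

```r
theta <- 0.6            # an arbitrary value of theta, for illustration
y1 <- c(1, 0, 0, 1, 1)  # H, T, T, H, H
y2 <- c(1, 1, 1, 0, 0)  # same z = 3 heads, different order
prod(theta^y1 * (1 - theta)^(1 - y1))  # 0.03456
prod(theta^y2 * (1 - theta)^(1 - y2))  # 0.03456, identical
theta^3 * (1 - theta)^2  # compact form theta^z (1 - theta)^(N - z)
```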
---

# Posterior

### Same posterior, two ways to think about it

Prior belief, weighted by the likelihood

`$$P(\theta \mid y) \propto \underbrace{P(y \mid \theta)}_{\text{weights}} P(\theta)$$`

--

Likelihood, weighted by the strength of prior belief

`$$P(\theta \mid y) \propto \underbrace{P(\theta)}_{\text{weights}} P(y \mid \theta)$$`

---

# Posterior

Say `\(N\)` = 4 and `\(z\)` = 1

E.g., `\(P(\theta \mid y_1 = 1) \propto P(y_1 = 1 \mid \theta) P(\theta)\)`

.font70[

For pedagogical purposes, we'll discretize `\(\theta\)` into [.05, .15, .25, ..., .95]

- Also called **grid approximation**

]

.pull-left[

<img src="bayes_rule_files/figure-html/prior-likelihood-1.png" width="95%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="bayes_rule_files/figure-html/product-1.png" width="95%" style="display: block; margin: auto;" />

]

---

# How About the Denominator?

Numerator: relative posterior plausibility of the `\(\theta\)` values

We can avoid computing the denominator because

- the posterior probabilities must sum to 1

--

So, for **discrete** parameters:

- Posterior probability = relative plausibility / sum of relative plausibilities

--

However, the denominator is useful for computing the *Bayes factor*

---

# Summarizing a Posterior Distribution

.font70[

Simulate (i.e., draw samples) from the posterior distribution

]

.panelset[
.panel[.panel-name[R code]

```r
th <- seq(.05, .95, by = .10)  # grid of theta values
pth <- c(.01, .055, .10, .145, .19, .19, .145, .10, .055, .01)  # prior
py_th <- th^1 * (1 - th)^4  # likelihood: 1 head, 4 tails
pth_y_unscaled <- pth * py_th  # prior x likelihood
pth_y <- pth_y_unscaled / sum(pth_y_unscaled)  # normalize to sum to 1
post_samples <- sample(th,
    size = 1000, replace = TRUE,
    prob = pth_y
)
```

]
.panel[.panel-name[Summary]

<img src="bayes_rule_files/figure-html/summarize-post-draws-1.png" width="60%" style="display: block; margin: auto;" />

```
>#  mean median    sd   mad  ci.1  ci.9
># 0.360  0.350 0.143 0.148 0.150 0.550
```

]
]

---
class: clear

### Influence of more samples

`\(N\)` = 40, `\(z\)` = 10

.pull-left[

<img src="bayes_rule_files/figure-html/prior-likelihood-2-1.png" width="95%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="bayes_rule_files/figure-html/product-2-1.png" width="95%" style="display: block; margin: auto;" />

]

---
class: clear

### Influence of more informative priors

`\(N\)` = 4, `\(z\)` = 1

.pull-left[

<img src="bayes_rule_files/figure-html/prior-likelihood-3-1.png" width="95%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="bayes_rule_files/figure-html/product-3-1.png" width="95%" style="display: block; margin: auto;" />

]

--

The prior needs to be well justified

---

# Prior Predictive Distribution

> Bayesian models are **generative**

Simulate data from the prior distribution to check whether the simulated data fit our intuition

- Clearly impossible values/patterns?
- Overly restrictive?

---
class: clear

`\(P(y) = \int P(y \mid \theta^*) P(\theta^*) d \theta^*\)`: Simulate a `\(\theta^*\)` from the prior, then simulate data based on `\(\theta^*\)`

<img src="images/prior_predictive.png" width="80%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Criticism of Bayesian Methods

---

# Criticism of "Subjectivity"

Main controversy: subjectivity in choosing a prior

- Two people with the same data can get different results because they choose different priors

---

# Counters to the Subjectivity Criticism

- With enough data, different priors hardly make a difference (see the sketch on the last slide)
- Prior: just a way to express the degree of ignorance
    * One can choose a weakly informative prior so that the influence of subjective belief is small

---

# Counters to the Subjectivity Criticism 2

Subjectivity in choosing a prior is the same as subjectivity in choosing a model, which is also done in frequentist statistics

- A relatively strong prior needs to be justified
    * open to critique from other researchers
- Inter-subjectivity `\(\rightarrow\)` Objectivity

---

# Counters to the Subjectivity Criticism 3

The prior is a way to incorporate previous research efforts to accumulate scientific evidence

> Why should we ignore all previous literature every time we conduct a new study?
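---
class: clear

### Sketch: priors matter less with more data

To illustrate the first counterargument, here is a minimal grid-approximation sketch with made-up data (`\(z\)` = 250 heads in `\(N\)` = 1,000 flips); both priors below are invented for illustration:

```r
theta <- seq(.05, .95, by = .10)  # same grid as before
prior_flat <- rep(.10, 10)        # flat prior
prior_skew <- c(.01, .02, .04, .08, .15, .30, .20, .10, .06, .04)  # skewed prior
lik <- dbinom(250, size = 1000, prob = theta)  # binomial likelihood
post_flat <- prior_flat * lik / sum(prior_flat * lik)
post_skew <- prior_skew * lik / sum(prior_skew * lik)
round(cbind(theta, post_flat, post_skew), 3)
# both posteriors put essentially all of their mass on theta = .25
```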