Conditional probability: P(A∣B) = P(A,B) / P(B)
which yields P(A,B) = P(A∣B)P(B) (joint = conditional × marginal)
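As a quick numeric check (an illustrative example, not from the slides): roll a fair die, with A = "even number" and B = "greater than 3". Then P(A, B) = P({4, 6}) = 2/6 and P(B) = 3/6, so

```latex
P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{2/6}{3/6} = \frac{2}{3}
```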
On the other hand, P(B∣A) = P(B,A) / P(A)
Substituting P(A,B) = P(A∣B)P(B) gives P(B∣A) = P(A∣B)P(B) / P(A)
which shows how to go from P(A∣B) to P(B∣A)
Consider B_i (i = 1, …, n), one of n mutually exclusive and exhaustive events
P(B_i∣A) = P(A∣B_i)P(B_i) / P(A) = P(A∣B_i)P(B_i) / ∑_{k=1}^{n} P(A∣B_k)P(B_k)
A police officer stops a driver at random and administers a breathalyzer test. The breathalyzer detects true drunkenness 100% of the time, but in 1% of cases it gives a false positive when the driver is sober. We also know that, in general, 1 of every 1,000 drivers passing through that spot is driving drunk. Suppose the breathalyzer shows positive for the driver. What is the probability that the driver is truly drunk?
p value = P(data | hypothesis), not P(hypothesis | data)
p = P(positive | sober) = 0.01 → reject H0 at .05 level
However, as we have seen, because the prior probability of being drunk is small (1 in 1,000), P(drunk∣positive) is still small (≈ .09), even though the small p value suggests rejecting H0
Goal: Find the probability that the person is drunk, given the test result
Parameter (θ): drunk (values: drunk, sober)
Data (D): test (possible values: positive, negative)
Bayes' theorem: P(θ∣D) = P(D∣θ) P(θ) / P(D), i.e., posterior = likelihood × prior / marginal
Usually, the marginal is not given, so
P(θ∣D) = P(D∣θ)P(θ) / ∑_{θ*} P(D∣θ*)P(θ*)
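Applying this to the breathalyzer example, with θ ∈ {drunk, sober}, P(drunk) = .001, P(positive∣drunk) = 1, and P(positive∣sober) = .01:

```latex
P(\text{drunk} \mid \text{positive})
  = \frac{P(\text{positive} \mid \text{drunk})\, P(\text{drunk})}
         {P(\text{positive} \mid \text{drunk})\, P(\text{drunk})
          + P(\text{positive} \mid \text{sober})\, P(\text{sober})}
  = \frac{1 \times .001}{1 \times .001 + .01 \times .999}
  \approx .091
```

So even after a positive test, the driver is sober with probability about .91, because drunk drivers are rare at that spot.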
shiny::runGitHub("plane_search", "marklhc")
The posterior is a synthesis of two sources of information: prior and data (likelihood)
Generally speaking, a narrower distribution (i.e., smaller variance) means more/stronger information
Prior beliefs used in data analysis must be admissible by a skeptical scientific audience (Kruschke, 2015, p. 115)
Probability of observing the data as a function of the parameter(s)
The posterior is the same whether we update on D2 first and then D1, on D1 first and then D2, or on D1 and D2 together, if
Joint distribution of the data does not depend on the order of the data
E.g., P(D1,D2,D3)=P(D2,D3,D1)=P(D3,D2,D1)
Example of non-exchangeable data: observations ordered in time, where the order itself carries information (e.g., later trials affected by practice or fatigue)
Q: Estimate the probability that a coin gives a head
Flip a coin once; it shows a head
How do you obtain the likelihood?
The likelihood depends on the probability model chosen
One natural way is the Bernoulli model: P(y = 1∣θ) = θ, P(y = 0∣θ) = 1 − θ
The above requires separating y = 1 and y = 0. A more compact way is P(y∣θ) = θ^y (1 − θ)^{1−y}
Assume the flips are exchangeable given θ,
P(y_1, …, y_N∣θ) = ∏_{i=1}^{N} P(y_i∣θ) = θ^{∑_{i=1}^{N} y_i} (1 − θ)^{∑_{i=1}^{N} (1 − y_i)} = θ^z (1 − θ)^{N−z}
z = # of heads; N = # of flips
Note: the likelihood only depends on the number of heads, not the particular sequence of observations
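For example, if we observe z = 1 head in N = 4 flips, the likelihood is

```latex
P(y_1, \ldots, y_4 \mid \theta) = \theta^{1} (1 - \theta)^{3}
```

which is maximized at θ = z/N = .25.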
Prior belief, weighted by the likelihood:
P(θ∣y) ∝ P(y∣θ) × P(θ), where the likelihood P(y∣θ) provides the weights
Likelihood, weighted by the strength of prior belief:
P(θ∣y) ∝ P(θ) × P(y∣θ), where the prior P(θ) provides the weights
Say N = 4 and z = 1
E.g., P(θ∣y1=1)∝P(y1=1∣θ)P(θ)
For pedagogical purposes, we'll discretize θ into [.05, .15, .25, …, .95]
Numerator: relative posterior plausibility of the θ values
We can avoid computing the denominator because it is a constant that does not depend on θ
So, for discrete parameters: P(θ∣y) = P(y∣θ)P(θ) / ∑_{θ*} P(y∣θ*)P(θ*), obtained by normalizing the numerator so that the probabilities sum to 1
However, the denominator is useful for computing the Bayes factor
Simulate (i.e., draw samples) from the posterior distribution
th <- seq(.05, .95, by = .10)  # discretized theta values
pth <- c(.01, .055, .10, .145, .19, .19, .145, .10, .055, .01)  # prior P(theta)
py_th <- th^1 * (1 - th)^4  # likelihood theta^z * (1 - theta)^(N - z)
pth_y_unscaled <- pth * py_th  # numerator: prior x likelihood
pth_y <- pth_y_unscaled / sum(pth_y_unscaled)  # normalize to get the posterior
post_samples <- sample(th, size = 1000, replace = TRUE, prob = pth_y)  # draw from the posterior
#>  mean median    sd   mad  ci.1  ci.9
#> 0.360  0.350 0.143 0.148 0.150 0.550
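As a cross-check, the same grid computation can be done without sampling by taking the exact mean of the discrete posterior. Here is a sketch in Python (illustrative only; the grid, prior weights, and likelihood mirror the R code above):

```python
# Exact discrete posterior on a grid, mirroring the R code above
th = [0.05 + 0.10 * i for i in range(10)]                      # grid of theta values
pth = [.01, .055, .10, .145, .19, .19, .145, .10, .055, .01]   # prior P(theta)
py_th = [t ** 1 * (1 - t) ** 4 for t in th]                    # likelihood (same as the R code)
unscaled = [p * l for p, l in zip(pth, py_th)]                 # numerator: prior x likelihood
total = sum(unscaled)                                          # normalizing constant
pth_y = [u / total for u in unscaled]                          # normalized posterior
post_mean = sum(t * p for t, p in zip(th, pth_y))              # exact posterior mean
print(round(post_mean, 3))                                     # prints 0.362
```

The exact posterior mean (≈ .362) agrees closely with the sampled mean of .360 shown above.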
[Figures: posterior distributions for N = 4, z = 1 and for N = 40, z = 10]
The prior needs to be well justified
Bayesian models are generative
Simulate data from the prior distribution to check whether the data fit our intuition
Clearly impossible values/patterns?
Overly restrictive?
P(y)=∫P(y|θ∗)P(θ∗)dθ∗: Simulate a θ∗ from the prior, then simulate data based on θ∗
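This prior predictive simulation can be sketched as follows (in Python for concreteness; the discrete grid prior is the one used above, and N = 4 flips is an assumption for illustration):

```python
import random

random.seed(1)
th = [0.05 + 0.10 * i for i in range(10)]                      # grid of theta values
pth = [.01, .055, .10, .145, .19, .19, .145, .10, .055, .01]   # prior P(theta)

def prior_predictive(n_sims=1000, n_flips=4):
    """Simulate data sets from the prior: draw theta* from the prior,
    then simulate z heads out of n_flips based on theta*."""
    sims = []
    for _ in range(n_sims):
        theta_star = random.choices(th, weights=pth, k=1)[0]   # theta* ~ prior
        z = sum(random.random() < theta_star for _ in range(n_flips))  # heads count
        sims.append(z)
    return sims

z_sims = prior_predictive()
# Inspect the simulated data against intuition: any clearly impossible
# values/patterns? Is the prior overly restrictive?
print(min(z_sims), max(z_sims))
```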
Main controversy: subjectivity in choosing a prior
With enough data, different priors hardly make a difference
Prior: just a way to express the degree of ignorance
Subjectivity in choosing a prior is no different from subjectivity in choosing a model, which is also done in frequentist statistics
A relatively strong prior needs to be justified
Inter-subjectivity → Objectivity
The prior is a way to incorporate previous research efforts to accumulate scientific evidence
Why should we ignore all previous literature every time we conduct a new study?