class: center, middle, inverse, title-slide

# Markov Chain Monte Carlo
## PSYC 573
### University of Southern California
### February 10, 2022

---
class: inverse, center, middle

# Monte Carlo

---

# Monte Carlo (MC) Methods

.pull-left[

![](https://upload.wikimedia.org/wikipedia/commons/3/36/Real_Monte_Carlo_Casino.jpg)

]

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Real_Monte_Carlo_Casino.jpg)

--

.pull-right[

.font70[

- 1930s and 40s: used to answer questions in nuclear physics that were not solvable with conventional mathematical methods
    * Key figures: Stanislaw Ulam, John von Neumann, Nicholas Metropolis
- A central element of the Manhattan Project and the development of the hydrogen bomb

]

]

---

# MC With One Unknown

`rbeta()`, `rnorm()`, `rbinom()`: generate values that imitate *independent samples* from known distributions

- use *pseudorandom* numbers

.font70[

E.g., `rbeta(n, shape1 = 15, shape2 = 10)`

]

<img src="mcmc_files/figure-html/sim-beta-1.png" width="60%" style="display: block; margin: auto;" />

---
class: clear

With a large number of draws (`\(S\)`),

- the sample density `\(\to\)` the target distribution
- most sample statistics (e.g., mean, quantiles) `\(\to\)` the corresponding characteristics of the target density

<img src="mcmc_files/figure-html/sample-stat-converge-1.png" width="70%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Markov Chain Monte Carlo

---

# MCMC

The main problem in Bayesian inference: there is generally no way to draw independent samples from the posterior

`$$P(\theta \mid y) = \frac{\mathrm{e}^{-(\theta - 1 / 2)^2} \theta^y (1 - \theta)^{n - y}} {\int_0^1 \mathrm{e}^{-(\theta^* - 1 / 2)^2} {\theta^*}^y (1 - {\theta^*})^{n - y} d\theta^*}$$`

MCMC: draw *dependent (correlated)* samples without evaluating the integral in the denominator

---
class: clear

- Some commonly used algorithms:
    * the Metropolis algorithm (also called *random-walk* Metropolis)
    * Gibbs sampling (used in BUGS and JAGS)
    * Hamiltonian Monte Carlo (and the No-U-Turn Sampler; used in Stan)

---
class: inverse, middle, center

# The Metropolis Algorithm

---
exclude: TRUE
class: clear

![](images/Metropolis_algorithm.png)

---

# An Analogy

.pull-left[

![](https://upload.wikimedia.org/wikipedia/commons/2/29/LA_districts_map.png)

]

.pull-right[

.font70[

Your task: tour all regions in LA County, with the time you spend in each region proportional to its popularity

However, you don't know which region is the most popular

Each day, you decide whether to stay in the current region or move to a neighboring region

You have a tour guide who tells you whether region A is more or less popular than region B, and by how much

> How would you proceed?

]

]

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:LA_districts_map.png)

---
class: clear

### Using the Metropolis Algorithm

1. On each day, randomly select a new region
2. If the *proposed* region is *more popular* than the current one, definitely go to the new region
3. If the *proposed* region is *less popular* than the current one, go to the new region with `\(P(\text{accept the new region}) = \frac{\text{proposed region popularity}}{\text{current region popularity}}\)`
    - e.g., by spinning a wheel

In the long run, the distribution of time spent in each region = the distribution of popularity across regions (see the R sketch on the next slide)
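---
class: clear

### The Region-Hopping Rule in R

A minimal sketch of the three steps above (not the course's demo code): the seven regions and their popularity values are made up for illustration, and a proposal that falls off the map is treated as having zero popularity, so it is never accepted.

```r
set.seed(573)  # hypothetical seed, for reproducibility
popularity <- 1:7   # made-up popularity of regions 1 to 7
n_days <- 50000
visits <- numeric(n_days)
current <- 4        # start somewhere in the middle
for (day in seq_len(n_days)) {
    # Step 1: randomly propose the left or right neighbor
    proposed <- current + sample(c(-1, 1), size = 1)
    # popularity of the proposal (zero if off the map)
    pop_proposed <- if (proposed >= 1 && proposed <= 7) popularity[proposed] else 0
    # Steps 2-3: move with probability min(1, proposed / current popularity)
    if (runif(1) < pop_proposed / popularity[current]) current <- proposed
    visits[day] <- current
}
# long-run proportion of time spent in each region
round(table(visits) / n_days, 2)
```

With enough days, the output should closely match `popularity / sum(popularity)`.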
---

# Demonstration

```r
shiny::runGitHub("metropolis_demo", "marklhc")
```

---
class: inverse, middle, center

# Example 1: Estimating the Number of People Taking the Metro

---
class: clear

Data from LABarometer (by the USC Dornsife Center for Economic and Social Research)

Of 338 first-gen immigrants surveyed, 86 used the metro in the previous year

Question:

> What proportion of first-gen immigrants uses the metro in a year?

.footnote[

Press release: https://dornsife.usc.edu/news/stories/3164/labarometer-mobility-in-los-angeles-survey/

]

---

# Analytic Method

Beta(1.5, 2) prior `\(\to\)` Beta(87.5, 254) posterior

1,000 independent draws from the posterior:

<img src="mcmc_files/figure-html/performance-rbeta-1.png" width="70%" style="display: block; margin: auto;" />

---

# With the Metropolis Algorithm

Proposal density: `\(N(0, 0.1)\)` (i.e., normal jumps with SD = 0.1); starting value: `\(\theta^{(1)} = 0.1\)`

<img src="mcmc_files/figure-html/performance-metropolis-1.png" width="70%" style="display: block; margin: auto;" />

.footnote[

R code for running the algorithm can be found in the note; a simplified sketch follows on the next slide

]
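---
class: clear

### A Sketch of the Metropolis Algorithm for This Example

A minimal sketch of random-walk Metropolis for the metro example, not necessarily the exact code in the note. The Beta(1.5, 2) prior, the data (86 of 338), the proposal SD of 0.1, and the starting value of 0.1 are from the slides; the seed, the number of iterations, and working on the log scale are choices made here for illustration.

```r
set.seed(573)  # hypothetical seed
y <- 86; n <- 338
# unnormalized log posterior: log prior + log likelihood
log_post <- function(th) {
    if (th <= 0 || th >= 1) return(-Inf)  # outside the support
    dbeta(th, 1.5, 2, log = TRUE) + dbinom(y, n, th, log = TRUE)
}
S <- 10000
theta <- numeric(S)
theta[1] <- 0.1  # starting value
for (s in 2:S) {
    proposed <- theta[s - 1] + rnorm(1, mean = 0, sd = 0.1)  # random-walk jump
    log_ratio <- log_post(proposed) - log_post(theta[s - 1])
    # accept with probability min(1, posterior ratio)
    theta[s] <- if (log(runif(1)) < log_ratio) proposed else theta[s - 1]
}
mean(theta[-(1:(S / 2))])  # posterior mean, discarding the first half as warm-up
```

The posterior mean should be close to the analytic value, `\(87.5 / (87.5 + 254) \approx .256\)`.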
---
class: clear

With enough *iterations*, the Metropolis algorithm will simulate samples from the target distribution

It is *less efficient* than `rbeta()` because the draws are *dependent*

--

> Pros:
> - does not require solving the integral
> - can use non-conjugate priors
> - easy to implement

> Cons:
> - not efficient; does not scale well to complex models
> - requires tuning the proposal SD

---
class: inverse, center, middle

# MCMC Diagnostics

---

# Markov Chain

Markov chain: a sequence of iterations, `\(\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}\}\)`

- the "state" `\(\theta^{(s)}\)` depends on `\(\theta^{(s - 1)}\)`
    * where to travel next depends on the current region

Based on *ergodic* theorems, a well-behaved chain will reach a *stationary distribution*

- after that point, every draw is a sample from the stationary distribution

---
exclude: TRUE
class: clear

Exercise:

1. Try different proposal SDs: 0.02, 0.1, 0.3, 1
2. For each, draw 1,000 samples
3. Does the sample distribution look like the target posterior?
4. What is the relationship between the proposal SD, the acceptance rate, and the autocorrelation?
5. Which proposal SD seems to give the best effective sample size?

---

# Warm-up

It can take anywhere from a few to a few hundred thousand iterations for a chain to reach its stationary distribution

Therefore, a common practice is to discard the first `\(S_\text{warm-up}\)` iterations (e.g., the first half)

- Also called *burn-in*

<img src="mcmc_files/figure-html/plot-warmup-1.png" width="70%" style="display: block; margin: auto;" />

---
class: clear

## When Can We Use MCMC Draws to Approximate the Posterior?

1. The draws need to be *representative* of the posterior
2. The draws need to contain sufficient information to *accurately* describe the posterior

--

### Tools

- Trace plots/rank histograms
- `\(\hat R\)`
- Effective sample size (ESS)

---

# Representativeness

The chain should not get stuck

*Mixing*: multiple chains cross each other

<img src="mcmc_files/figure-html/trace-poor-good-1.png" width="70%" style="display: block; margin: auto;" />

---

# Representativeness

For more robust diagnostics (Vehtari et al., 2021, doi: [10.1214/20-BA1221](https://doi.org/10.1214/20-BA1221)),

- the rank histograms should look like uniform distributions

<img src="mcmc_files/figure-html/rank-hist-poor-good-1.png" width="60%" style="display: block; margin: auto;" />

---

# Representativeness

`$$\hat{R} = \frac{\text{Between-chain variance} + \text{within-chain variance}} {\text{within-chain variance}}$$`

- aka the *Gelman-Rubin statistic* or the *potential scale reduction factor*

--

When the chains converge, each should be exploring the same stationary distribution

- No between-chain differences `\(\Rightarrow\)` `\(\hat{R} \to 1\)`
- Vehtari et al. (2021) recommended `\(\hat{R} < 1.01\)` for convergence

---
class: clear

In the previous examples,

- `\(\hat R\)` = 2.044 for the poor-mixing graph
- `\(\hat R\)` = 1.033 for the good-mixing graph

---

# Effective Sample Size (ESS)

MCMC draws are dependent, so they contain less information about the target posterior distribution

> What is the equivalent number of draws if the draws were independent?

- E.g., ESS = 98.289 for the good-mixing example
    * ~5,087 dependent draws would be needed to get the same amount of information as 1,000 independent samples

---
class: clear

### Heuristics for ESS

- ESS (bulk and tail) > 400 to interpret `\(\hat R\)` (Vehtari et al., 2021)
- ESS > 1,000 for a stable summary of the posterior
    * Kruschke (2015) recommended 10,000

---

# Sample Convergence Paragraph

.font70[

> We used Markov chain Monte Carlo (MCMC), specifically a Metropolis algorithm implemented in R, to approximate the posterior distribution of the model parameters. We used two chains, each with 10,000 draws. The first 5,000 draws in each chain were discarded as warm-up. Trace plots of the posterior samples (Figure X) showed good mixing, and `\(\hat R\)` statistics (Vehtari et al., 2021) were < 1.01 for all model parameters, indicating good convergence of the MCMC chains. The effective sample sizes were > 2,376 for all model parameters, so the MCMC draws are sufficient for summarizing the posterior distributions.

]

---

# Sample Results

> The model estimated that 25.569% (posterior SD = 2.328%, 90% CI [21.813%, 29.467%]) of first-generation immigrants took the metro in the year 2019.

---

# Things to Remember

- MCMC draws dependent/correlated samples to approximate a posterior distribution
    * ESS < `\(S\)`
- It needs warm-up iterations to reach a stationary distribution
- Check for representativeness
    * trace/rank plots and `\(\hat{R}\)`
- A large ESS is needed to describe the posterior accurately
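---
class: clear

### Computing `\(\hat R\)` and ESS in R

As a closing sketch, the diagnostics above can be computed with the `posterior` package, which implements the Vehtari et al. (2021) versions. The two chains below are made-up independent normal draws, just to show the function calls; in practice, you would pass in your MCMC draws as an iterations `\(\times\)` chains matrix.

```r
library(posterior)  # install.packages("posterior") if needed

set.seed(573)  # hypothetical seed
# two made-up "chains" of 1,000 draws each (iterations x chains)
draws <- cbind(chain1 = rnorm(1000), chain2 = rnorm(1000))

rhat(draws)      # rank-normalized split R-hat; want < 1.01
ess_bulk(draws)  # bulk ESS; want > 400 (together with tail ESS)
ess_tail(draws)  # tail ESS; matters for quantile-based summaries
```

Because these draws are independent, `\(\hat R\)` should be near 1 and both ESS values near the total number of draws (2,000); dependent MCMC draws would give smaller ESS values.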