Hierarchical Models

Mark Lai
February 17, 2022
library(tidyverse)
library(here)  # for project-relative file paths
library(rstan)
rstan_options(auto_write = TRUE)  # save compiled Stan object
library(shinystan)  # graphical exploration
library(posterior)  # for summarizing draws
library(bayesplot)  # for plotting
theme_set(theme_classic() +
    theme(panel.grid.major.y = element_line(color = "grey92")))

Although many statistical models can be fitted using Bayesian or frequentist methods, some models are more naturally used in the Bayesian framework. One class of such models is the family of hierarchical models. Consider situations where the data contain clusters, such as multiple data points from each of many participants, or multiple participants in each of several treatment conditions. While it is possible to run \(J\) Bayesian analyses for the \(J\) subsets of the data, it is usually more efficient to pool the data together, such that cluster \(j\) has its own parameters \(\theta_j\), and these \(J\) \(\theta\) values themselves come from a common distribution. This is the same idea as multilevel modeling.

In this note, you will see two examples, one from the textbook with a hierarchical Bernoulli/binomial model, and another from a classic data set with eight schools, modelled by a hierarchical normal model.

Hierarchical Bernoulli/Binomial

Previously, we have seen the Bernoulli model for \(N\) outcomes, such as multiple coin flips from the same coin: \[y_i \sim \text{Bern}(\theta), \text{ for } i = 1, \ldots, N\] We assumed exchangeability, as the flips come from the same coin.

Alternative Parameterization of Beta

Previously, we used the Beta(a, b) prior for a Bernoulli outcome, such that \[P(\theta \mid a, b) \propto \theta^{a - 1} (1 - \theta)^{b - 1}.\] However, for the hierarchical models discussed below, it is beneficial to express the Beta distribution in terms of the prior mean, \(\mu = a / (a + b)\), \(\mu \in [0, 1]\), and the concentration, \(\kappa = a + b\), \(\kappa \in (0, \infty)\). Instead of the formula above, we can write \[P(\theta \mid \mu, \kappa) \propto \theta^{\mu \kappa - 1} (1 - \theta)^{(1 - \mu) \kappa - 1}.\] The two expressions represent exactly the same distribution, just with parameters that carry different meanings; they are therefore referred to as different parameterizations of the Beta distribution.
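To make the conversion concrete, below is a minimal sketch of a random number generator under the mean parameterization (the helper name rbeta2 is my own, mirroring the Beta2 notation used later in this note):

# Sketch: draw from Beta2(mu, kappa), i.e., Beta(a, b) with
# a = mu * kappa and b = (1 - mu) * kappa
rbeta2 <- function(n, mu, kappa) {
    rbeta(n, shape1 = mu * kappa, shape2 = (1 - mu) * kappa)
}
# Beta2(0.5, 10) is the same distribution as Beta(5, 5)
draws <- rbeta2(10000, mu = 0.5, kappa = 10)
mean(draws)  # should be close to 0.5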

Multiple Bernoulli = Binomial

With \(N\) exchangeable Bernoulli observations, an equivalent but more efficient way to code the model is to use the binomial distribution. Let \(z = \sum_{i = 1}^N y_i\), then \[z \sim \mathrm{Bin}(N, \theta)\]
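To see the equivalence numerically, note that the two likelihoods differ only by the constant \(\binom{N}{z}\), which does not involve \(\theta\) and therefore does not change the posterior. A quick sketch (the flips below are made up):

y <- c(1, 0, 1, 1, 0)  # hypothetical flips
z <- sum(y)
N <- length(y)
theta <- 0.6
# product of N Bernoulli likelihoods
prod(dbinom(y, size = 1, prob = theta))
# binomial likelihood divided by the constant choose(N, z)
dbinom(z, size = N, prob = theta) / choose(N, z)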

Multiple Binomial Observations

Now, consider the situation with multiple coins, perhaps each with some noticeable differences, as discussed in chapter 9.2 of the textbook. Say we have \(J = 3\) coins and \(N_j\) flips for the \(j\)th coin. If we believe that the coins all have the same bias, we could consider the model \[z_j \sim \mathrm{Bin}(N_j, \theta),\] which still contains only one parameter, \(\theta\). However, if we have reason to believe that the coins have different biases, then we should have \[z_j \sim \mathrm{Bin}(N_j, \theta_{\color{red}{j}}),\] with parameters \(\theta_1, \ldots, \theta_J\).

We can assign a prior to each individual \(\theta_j\). However, if our prior belief is that there is something common among the different coins, say they come from the same factory and thus from the same distribution, we can give the \(\theta\)s a common prior distribution: \[\theta_j \sim \mathrm{Beta2}(\mu, \kappa),\] where I use Beta2 to denote the mean parameterization. Here, we express the prior belief that the mean bias of the different coins is \(\mu\), and how much each coin differs from the mean depends on \(\kappa\). Now, \(\mu\) and \(\kappa\) are hyperparameters. We could assign fixed values to \(\mu\) and \(\kappa\), as if we knew what the average bias is. However, the power of the hierarchical model is that we can instead put priors (or hyperpriors) on \(\mu\) and \(\kappa\) and obtain their posterior distributions, based on what the data say.

What priors should we use for \(\mu\) and \(\kappa\)? \(\mu\) is relatively easy: because it is the mean bias, and we would use a Beta prior for a bias, we can again use a Beta prior for the mean bias. \(\kappa\) is more challenging: a larger \(\kappa\) means that the biases of the coins are more similar to each other. We can perform a prior predictive check to see what kind of data a given prior implies. As a starting point, chapter 9 of the textbook suggests Gamma(0.01, 0.01). So the full model in our case, with a weak Beta(1.5, 1.5) prior on \(\mu\), is

Model: \[ \begin{aligned} z_j & \sim \mathrm{Bin}(N_j, \theta_j) \\ \theta_j & \sim \mathrm{Beta2}(\mu, \kappa) \end{aligned} \] Prior: \[ \begin{aligned} \mu & \sim \mathrm{Beta}(1.5, 1.5) \\ \kappa & \sim \mathrm{Gamma}(0.01, 0.01) \end{aligned} \]

Therapeutic Touch Example

See the description of the research problem from chapter 9.2.4 of the textbook.

# Data file from https://github.com/boboppie/kruschke-doing_bayesian_data_analysis/blob/master/2e/TherapeuticTouchData.csv
tt_dat <- read.csv("data_files/TherapeuticTouchData.csv")
# Get aggregated data by summing the counts
tt_agg <- tt_dat %>%
    group_by(s) %>%
    summarise(y = sum(y),  # total number of correct
              n = n())
# Plot proportion correct distribution
p1 <- ggplot(tt_agg, aes(x = y / n)) +
    geom_histogram(binwidth = .1) +
    labs(x = "Proportion Correct")

Stan Code

data {
  int<lower=0> J;         // number of clusters (e.g., studies, persons)
  int y[J];               // number of "1"s in each cluster
  int N[J];               // sample size for each cluster
}
parameters {
  vector<lower=0, upper=1>[J] theta;   // cluster-specific probabilities
  real<lower=0, upper=1> mu;     // overall mean probability
  real<lower=0> kappa;           // overall concentration
}
model {
  y ~ binomial(N, theta);              // each observation is binomial
  theta ~ beta_proportion(mu, kappa);  // prior; Beta2 dist
  mu ~ beta(1.5, 1.5);           // weak prior
  kappa ~ gamma(.01, .01);       // prior recommended by Kruschke
}

Prior predictive

You can use Stan to sample from the prior by commenting out the likelihood line (y ~ binomial(N, theta)) in the model block; here I show how to do it in R:

set.seed(1706)
plist <- vector("list", 12L)
plist[[1]] <- p1 +
    labs(x = "Observed data") +
    theme(axis.title.x = element_text(color = "red"))
num_subjects <- 28
for (s in 1:11) {
    # Get prior values of mu and kappa
    mu_s <- rbeta(1, shape1 = 1.5, shape2 = 1.5)
    kappa_s <- rgamma(1, shape = 0.01, rate = 0.01)
    # Generate theta
    theta <- rbeta(num_subjects,
                   shape1 = mu_s * kappa_s, shape2 = (1 - mu_s) * kappa_s)
    # Generate data
    new_y <- rbinom(num_subjects, size = tt_agg$n, prob = theta)
    plist[[s + 1]] <-
        p1 %+% mutate(tt_agg, y = new_y) +
        labs(x = paste("Simulated data", s)) +
        theme(axis.title.x = element_text(color = "black"))
}
gridExtra::grid.arrange(grobs = plist, nrow = 3)

The prior on \(\kappa\) is actually not very realistic, because it tends to push the biases toward either 0 or 1. Something like Gamma(0.1, 0.1) or Gamma(2, 0.01) may be a bit more reasonable (you can try it out yourself; a quick comparison is sketched below).
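Here is a rough comparison of what the two Gamma priors imply for the \(\theta\)s, reusing the prior-predictive logic above (the seed and the number of draws are arbitrary):

set.seed(2255)
num_draws <- 1000
mu_s <- rbeta(num_draws, shape1 = 1.5, shape2 = 1.5)
kappa1 <- rgamma(num_draws, shape = 0.01, rate = 0.01)  # Kruschke's suggestion
kappa2 <- rgamma(num_draws, shape = 2, rate = 0.01)     # a milder alternative
tibble(
    theta = c(rbeta(num_draws, mu_s * kappa1, (1 - mu_s) * kappa1),
              rbeta(num_draws, mu_s * kappa2, (1 - mu_s) * kappa2)),
    prior = rep(c("kappa ~ Gamma(0.01, 0.01)", "kappa ~ Gamma(2, 0.01)"),
                each = num_draws)
) %>%
    ggplot(aes(x = theta)) +
    geom_histogram(binwidth = 0.05) +
    facet_wrap(~ prior)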

Calling rstan

tt_fit <- stan(
    file = here("stan", "hierarchical_bin.stan"),
    data = list(J = nrow(tt_agg),
                y = tt_agg$y,
                N = tt_agg$n),
    seed = 1716,  # for reproducibility
    chains = 4,
    iter = 4000,  # with the default (2000), rstan warns that more iterations are needed
    # may need higher adapt_delta (default = .8) for hierarchical models
    control = list(adapt_delta = 0.8)
)

You can explore convergence and the posterior distributions using the shinystan package.
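For example, this opens the interactive app in a browser:

launch_shinystan(tt_fit)  # explore convergence and posteriors interactively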

Table of coefficients

tt_fit %>%
    # Convert to `draws` object to work with the `posterior` package
    as_draws() %>%
    # Get summary
    summarize_draws() %>%
    # Use `knitr::kable()` for tabulation
    knitr::kable(digits = 2)
variable      mean   median     sd    mad      q5      q95  rhat  ess_bulk  ess_tail
theta[1]      0.37     0.38   0.08   0.08    0.22     0.49  1.00   2053.33   3524.75
theta[2]      0.39     0.40   0.08   0.07    0.25     0.51  1.00   2511.75   3398.59
theta[3]      0.41     0.41   0.08   0.07    0.28     0.53  1.00   4815.11   4070.49
theta[4]      0.41     0.42   0.08   0.07    0.28     0.53  1.00   4101.83   4049.36
theta[5]      0.41     0.41   0.08   0.07    0.28     0.54  1.00   4772.51   3885.97
theta[6]      0.41     0.41   0.08   0.07    0.28     0.53  1.00   4047.04   3824.50
theta[7]      0.41     0.41   0.08   0.07    0.28     0.53  1.00   4640.84   4110.38
theta[8]      0.41     0.42   0.08   0.07    0.28     0.53  1.00   5153.56   3501.47
theta[9]      0.41     0.42   0.08   0.07    0.28     0.53  1.00   4419.30   3629.47
theta[10]     0.41     0.42   0.08   0.07    0.28     0.53  1.00   4456.35   4172.13
theta[11]     0.43     0.43   0.08   0.07    0.31     0.56  1.00   5807.17   4867.03
theta[12]     0.43     0.43   0.08   0.07    0.31     0.56  1.00   6574.03   4769.95
theta[13]     0.43     0.43   0.07   0.07    0.31     0.55  1.00   5961.52   5136.70
theta[14]     0.43     0.43   0.08   0.07    0.31     0.56  1.00   6730.63   4665.01
theta[15]     0.43     0.43   0.07   0.07    0.31     0.56  1.00   6650.31   4776.49
theta[16]     0.45     0.45   0.08   0.07    0.33     0.58  1.00   5540.38   4309.31
theta[17]     0.45     0.45   0.08   0.07    0.33     0.58  1.00   5923.26   4321.04
theta[18]     0.45     0.45   0.08   0.07    0.33     0.59  1.00   5847.91   4571.55
theta[19]     0.45     0.45   0.08   0.07    0.33     0.58  1.00   6091.28   4497.21
theta[20]     0.45     0.45   0.08   0.07    0.34     0.58  1.00   5986.57   5008.05
theta[21]     0.45     0.45   0.08   0.07    0.34     0.58  1.00   5802.48   4252.67
theta[22]     0.45     0.45   0.08   0.07    0.33     0.58  1.00   6348.07   4875.46
theta[23]     0.47     0.47   0.08   0.07    0.35     0.62  1.00   4740.90   3684.04
theta[24]     0.47     0.47   0.08   0.07    0.35     0.61  1.00   4619.95   3751.58
theta[25]     0.50     0.49   0.08   0.08    0.38     0.64  1.00   3260.56   4164.74
theta[26]     0.50     0.49   0.08   0.08    0.37     0.64  1.00   3150.89   3983.83
theta[27]     0.50     0.49   0.08   0.08    0.37     0.64  1.00   3107.08   4280.18
theta[28]     0.52     0.51   0.09   0.08    0.39     0.68  1.00   2300.07   3538.60
mu            0.44     0.44   0.03   0.03    0.39     0.50  1.00   1982.77   3741.91
kappa        64.85    42.77  65.48  33.83   12.37   189.87  1.01    614.23   1036.57
lp__       -198.32  -198.72  10.33  10.87 -214.29  -180.15  1.01    550.59    935.98

Derived coefficients

One nice thing about MCMC is that it is straightforward to obtain posterior distributions of quantities that are functions of the parameters. For example, even though we only sampled from the posteriors of the \(\theta\)s, we can ask whether there is evidence for a nonzero difference in \(\theta\) between, say, persons 1, 14, and 28.

as_draws_df(tt_fit) %>%
    mutate_variables(
        theta1_minus14 = `theta[1]` - `theta[14]`,
        theta1_minus28 = `theta[1]` - `theta[28]`,
        theta14_minus28 = `theta[14]` - `theta[28]`
    ) %>%
    subset(variable = c("theta1_minus14", "theta1_minus28",
                        "theta14_minus28")) %>%
    summarise_draws()
#> # A tibble: 3 × 10
#>   variable      mean  median    sd    mad     q5    q95  rhat ess_bulk
#>   <chr>        <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>
#> 1 theta1_mi… -0.0646 -0.0550 0.105 0.0963 -0.251 0.0883  1.00    4669.
#> 2 theta1_mi… -0.148  -0.126  0.133 0.123  -0.398 0.0280  1.00    1532.
#> 3 theta14_m… -0.0836 -0.0697 0.112 0.103  -0.289 0.0760  1.00    3427.
#> # … with 1 more variable: ess_tail <dbl>

Conclusion

As 0.5 is included in the 95% CI of \(\theta\) for every participant (the table above shows 90% intervals, i.e., q5 to q95; the 95% intervals are slightly wider), there is insufficient evidence that people can sense “therapeutic touch.”

Shrinkage

mcmc_intervals(tt_fit,
               # plot only parameters matching "theta"
               regex_pars = "theta") +
    geom_point(
        data = tibble(
            parameter = paste0("theta[", 1:28, "]"),
            x = tt_agg$y / tt_agg$n
        ),
        aes(x = x, y = parameter),
        col = "red"
    ) +
    xlim(0, 1)

As can be seen, the posterior distributions are closer to the center than the data (in red). This pooling is a result of the belief that the participants have something in common.

Multiple Comparisons?

Another benefit of a Bayesian hierarchical model is that you don’t need to worry about multiple comparisons. There are several ways to see why this is the case, but the basic answer is that the common prior distribution builds in the prior belief that the clusters/groups are likely to be similar. See discussion here and here.

Hierarchical Normal Model

Eight Schools Example

This is a classic data set first analyzed by Rubin (1981). It is also the example used in the RStan Getting Started page. The data contain the effects of coaching from randomized experiments in eight schools. The numbers shown (labelled y) are the mean differences (i.e., effect sizes) in performance between the treatment and control groups on SAT-V scores.

schools_dat <- list(J = 8,
                    y = c(28, 8, -3,  7, -1, 1, 18, 12),
                    sigma = c(15, 10, 16, 11, 9, 11, 10, 18))

In the above data, some numbers are positive and some are negative. Because the sample sizes differ across schools, the data also contain the standard errors (labelled sigma) of the effect sizes; generally speaking, a larger sample size corresponds to a smaller standard error. The research questions are

  1. What is the average treatment effect of coaching?
  2. Are the treatment effects similar across schools?

Model

Model: \[ \begin{aligned} d_j & \sim N(\theta_j, s_j) \\ \theta_j & \sim N(\mu, \tau) \end{aligned} \] Prior: \[ \begin{aligned} \mu & \sim N(0, 100) \\ \tau & \sim t^+_4(0, 100) \end{aligned} \]

Given the SAT score range, it is unlikely that a coaching program will improve scores by 100 or so, so we use a prior of \(\mu \sim N(0, 100)\) and \(\tau \sim t^+_4(0, 100)\).
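As a rough check of what these priors imply, here is a sketch that samples from them in R (the half-\(t\) is obtained by taking the absolute value of a scaled \(t_4\) variate):

set.seed(8)
mu_prior <- rnorm(1000, mean = 0, sd = 100)  # mu ~ N(0, 100)
tau_prior <- 100 * abs(rt(1000, df = 4))     # tau ~ t+_4(0, 100)
quantile(mu_prior, probs = c(.05, .95))
quantile(tau_prior, probs = c(.05, .95))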

Note: The model above is the same as one used in a random-effect meta-analysis. See this paper for an introduction.

Non-Centered Parameterization

Hierarchical models are known to create issues in MCMC sampling, such that the posterior draws tend to be highly correlated, even with more advanced techniques like HMC. One way to alleviate this is to reparameterize the model using what is called the non-centered parameterization. The basic idea is that, instead of treating the \(\theta\)s as parameters, one treats the standardized deviations from the mean as parameters. You can think of it as converting the \(\theta\)s into \(z\) scores and sampling the \(z\) scores instead of the original \(\theta\)s.

Model: \[ \begin{aligned} d_j & \sim N(\theta_j, s_j) \\ \theta_j & = \mu + \tau \eta_j \\ \eta_j & \sim N(0, 1) \end{aligned} \]

data {
  int<lower=0> J;          // number of schools 
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // s.e. of effect estimates 
}
parameters {
  real mu;                 // overall mean
  real<lower=0> tau;       // between-school SD
  vector[J] eta;           // standardized deviation (z score)
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;   // non-centered parameterization
}
model {
  eta ~ std_normal();       // same as eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
  // priors
  mu ~ normal(0, 100);
  tau ~ student_t(4, 0, 100);
}
fit <- stan(
    file = here("stan", "hierarchical_norm.stan"),
    data = schools_dat,
    seed = 1804,  # for reproducibility
    # higher adapt_delta due to divergent transitions
    pars = c("mu", "tau", "theta"),  # skip the etas
    control = list(adapt_delta = 0.95)
)

Treatment effect estimates of individual schools (\(\theta\)), average treatment effect (\(\mu\)), and treatment effect heterogeneity (\(\tau\)).

print(fit, pars = c("theta", "mu", "tau"))
#> Inference for Stan model: hierarchical_norm.
#> 4 chains, each with iter=2000; warmup=1000; thin=1; 
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#> 
#>           mean se_mean   sd   2.5%  25%   50%   75% 97.5% n_eff Rhat
#> theta[1] 11.40    0.14 8.23  -1.77 5.94 10.22 15.41 31.58  3572    1
#> theta[2]  7.96    0.09 6.31  -4.29 3.97  7.97 11.87 20.68  5296    1
#> theta[3]  6.18    0.12 7.75 -11.16 2.14  6.57 10.94 20.45  4147    1
#> theta[4]  7.69    0.09 6.43  -4.91 3.69  7.66 11.66 20.83  5536    1
#> theta[5]  5.15    0.10 6.27  -8.74 1.38  5.65  9.44 16.10  3929    1
#> theta[6]  6.04    0.10 6.73  -9.24 2.29  6.37 10.33 18.24  4413    1
#> theta[7] 10.36    0.10 6.63  -1.44 5.89  9.81 14.15 25.47  4439    1
#> theta[8]  8.32    0.12 7.47  -6.83 4.05  8.08 12.53 24.24  3823    1
#> mu        7.79    0.13 5.11  -2.20 4.69  7.89 11.01 17.54  1605    1
#> tau       6.28    0.14 5.32   0.23 2.38  5.00  8.75 19.55  1436    1
#> 
#> Samples were drawn using NUTS(diag_e) at Thu Apr 28 11:40:14 2022.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

On average, based on the 95% CI, coaching improved SAT-V scores by somewhere between -2.20 and 17.54 points. There was also substantial heterogeneity of the treatment effect across schools.

We can also get the probability that the treatment effect was > 0:

# Obtain draws
mu_draws <- as.matrix(fit, pars = "mu")
mean(mu_draws > 0)
#> [1] 0.94025

Here are the individual-school treatment effects:

mcmc_areas(fit, regex_pars = "theta")

Prediction Interval

We can also obtain the posterior distribution of the true effect size in a new study, \(\tilde \theta\), which gives a prediction interval:

# Prediction Interval (can also be done in Stan)
extract(fit, pars = c("mu", "tau")) %>%
    as_draws_array() %>%
    mutate_variables(theta_tilde = rnorm(4000, mean = mu, sd = tau)) %>%
    summarise_draws()
#> # A tibble: 3 × 10
#>   variable     mean median    sd   mad     q5   q95  rhat ess_bulk
#>   <chr>       <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
#> 1 mu           7.79   7.89  5.11  4.68 -0.356  15.9  1.00    4075.
#> 2 tau          6.28   5.00  5.32  4.45  0.463  16.5  1.00    4033.
#> 3 theta_tilde  7.87   7.84  9.79  7.13 -6.44   22.8  1.00    3816.
#> # … with 1 more variable: ess_tail <dbl>

The posterior interval for \(\tilde \theta\) indicates the plausible range of the treatment effect in a new study.
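As the comment in the code above notes, the same predictive draw can be generated inside Stan. A minimal sketch of a generated quantities block that could be appended to hierarchical_norm.stan:

generated quantities {
  // true effect size for a hypothetical new study
  real theta_tilde = normal_rng(mu, tau);
}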
