Skip to main content Link Menu Expand (external link) Document Search Copy Copied

Machine learning estimators of causal effects. Exercise

This exercise uses causal forests to find effect heterogeneity using real-world experimental data. Thanks to Abby Sachar for designing this exercise.

These methods were developed in Wager & Athey 2018 and generalized in Athey, Tibshirani, & Wager 2019. Those papers are very technical—if you want to learn more about application of this method, we would recommend the tutorial page for the accompanying grf software package for R.

This page walks through an application of causal forests using data from an actual randomized experiment.

Hainmueller, Jens et al. 2018. A randomized controlled design reveals barriers to citizenship for low-income immigrants. PNAS 115:939-944. https://doi.org/10.1073/pnas.1714254115.

# loading relevant packages
library(tidyverse)
library(haven)
library(grf)

Why are randomized controlled designs useful for estimating causal effects?

In an observational study, the people who receive treatment and do not receive treatment may differ in many ways long before treatment occurs. When the treatment is randomized, the two groups are the same in expectation along all variables up until the treatment. Differences that arise after the treatment can therefore be understood to be caused by the treatment; therefore, researchers can estimate causal treatment effects on the outcome without concerns about confounding.

Example: fee-waiver effect on naturalization rates for low-income immigrants

Researchers have suggested that eligible immigrants who desire citizenship face many hurdles including the cost of the citizenship application, language barriers, and others.

This study tests the effect of offering a voucher for the naturalization application fee on naturalization rates for low-income lawful permanent residents who are eligible to apply for citizenship. They use a randomized controlled design in which they randomly assign a voucher that removes the financial barrier and pays for the application fee.

The authors used simple OLS to model the conditional average treatment effects in different subpopulations, but we are going to use a causal forest to estimate these effects.

Access the data here. The code below will prepare these data for one of the authors’ analyses.

nny <- read_dta("nny_replication_data_deidentified.dta")

dat_replied <- nny %>%
  filter(outcome == "registered") %>% # only participants of voucher study
  select(-c(treatfw, starts_with("treat_"), ocdistance)) %>% # cols for voucher study
  group_by(ocblock) %>%
  mutate(ocblock_means = mean(winners)) %>%
  ungroup() %>%
  mutate(group_weights = (1/ocblock_means) * winners + (1/(1-ocblock_means))*(1-winners)) %>%
  filter(replied == 1) # in paper, only reported results from respondents who answered yes

a) Step 1: Setup and training a causal forest

  1. Set up your arguments:
  • X is a matrix containing the covariates: gender_f, educ_HS, educ_somecollege, educ_BA,lang_Eng,lang_Span, marital_married, marital_single, all of the variables containing country of origin, age, hhinc_cap, and yrs_greencard
  • Y is a vector containing the outcome: submitted
  • W should be a vector containing the treatment: any_voucher
  • sample.weights is a vector containing the weights given to each sample in estimation: group_weights
# Step 1: set up the arguments

X <- as.matrix( dat_replied %>% select(gender_f, educ_HS, educ_somecollege, 
                                       educ_BA,lang_Eng, lang_Span, 
                                       marital_married, marital_single, 
                                       starts_with("origin"), age, 
                                       hhinc_cap, yrs_greencard) )

Y <- dat_replied %>%
  select(submitted) %>%
  as_vector() %>%
  unname()

W <- dat_replied %>%
  select(any_voucher) %>%
  as_vector() %>%
  unname()

group_weights <- dat_replied %>%
  select(group_weights) %>%
  as_vector() %>%
  unname()
  1. Use causal_forest() to estimate the forest. Store the model in a variable called estimated_forest.
# Step 2: estimate the forest

set.seed(1) # make results reproducible

estimated_forest <- causal_forest(X = X,
                                  Y = Y,
                                  W = W,
                                  sample.weights = group_weights)
  1. Use predict() to estimate the treatment effects for all observations in dat_replied. Store the results in a vector called tau.hat.
# Step 3: predict the treatment effect for all observations

tau.hat <- predict(estimated_forest)$predictions

b) Step 2: Estimating causal effects in different subgroups

Find the average treatment effect for the entire population using the average_treatment_effect() function. Find the conditional average treatment effect for different subgroups of interest using the subset argument. What surprises you?

# average treatment effect for population

ate <- average_treatment_effect(estimated_forest)

# finding conditional average treatment effects in different subgroups

ate.male <- average_treatment_effect(estimated_forest, 
                                     subset = X[,"gender_f"] == 0)
ate.female <- average_treatment_effect(estimated_forest, 
                                       subset = X[,"gender_f"] == 1)

# do this for any other subgroup of interest to you!

Here are some additional examples you could have chosen.

ate.college <- average_treatment_effect(estimated_forest,
                                        subset = (X[,"educ_somecollege"] == 1 |
                                                  X[,"educ_BA"] == 1))
ate.HS <- average_treatment_effect(estimated_forest,
                                   subset = (X[,"educ_HS"] == 1 |
                                               (X[,"educ_HS"] == 0 &
                                                  X[,"educ_somecollege"] == 0 &
                                                  X[,"educ_BA"] == 0)))
ate.lang_Eng <- average_treatment_effect(estimated_forest,
                                         subset = X[,"lang_Eng"] == 1)
ate.lang_Span <- average_treatment_effect(estimated_forest,
                                          subset = X[,"lang_Span"] == 1)
ate.origin_DR <- average_treatment_effect(estimated_forest,
                                          subset = X[,"origin_DR"] == 1)
ate.origin_Ecuador <- average_treatment_effect(estimated_forest,
                                               subset = X[,"origin_Ecuador"] == 1)
ate.origin_Colombia <- average_treatment_effect(estimated_forest,
                                                subset = X[,"origin_Colombia"] == 1)

# creating a data frame containing the subgroup, the estimate for the
# conditional average treatment effect, and the standard error of the
# conditional average treatment effect

cate <- data.frame(subgroup = c("Full Sample", "Male", "Female",
                                "Some College or Higher", "High School or Less",
                                "Registration in English", "Registration in Spanish",
                                "Dominican Republic", "Ecuador", "Colombia"),
                   rbind(ate, ate.male, ate.female, ate.college, ate.HS, 
                         ate.lang_Eng, ate.lang_Span, ate.origin_DR, 
                         ate.origin_Ecuador, ate.origin_Colombia))

c) Step 3: Visualizing Conditional Average Treatment Effects

Create a plot of the heterogeneous conditional average treatment effects in different subpopulations of interest. Try using the standard error to include confidence intervals. For which subgroups does the voucher have the largest effect on the naturalization rate? Why might this be important?

# plotting results
cate %>%
  ggplot(aes(x = estimate, y = fct_rev( fct_inorder(subgroup) ))) +
  geom_point() +
  geom_errorbar(data = cate,
                aes(xmin = estimate - 2*std.err, xmax = estimate + 2*std.err),
                width = .2) + 
  ggtitle("Effect of Offering Voucher") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  scale_x_continuous(labels = scales::percent) +
  xlab("Change in Naturalization Rate") +
  ylab("Subgroup")