Using weights

Surveys often oversample some subgroups so that estimates for those subgroups are precise enough to report. For example, in the Current Population Survey (CPS):

- A goal is to estimate unemployment in each state
- The survey needs a large enough sample in each state, even the small ones

As a result, the probability of being in the sample is higher for those residing in states with small populations.

An example: California and Wyoming

In 2022, California had 14,822 respondents to the ASEC (the CPS Annual Social and Economic Supplement) out of a population of 39,029,342. Wyoming had 2,199 ASEC respondents out of 581,381 residents. The average probability that a CA resident was sampled was about 0.04 percent, whereas the same probability in WY was about 0.4 percent. You are about 10 times more likely to be sampled for the ASEC if you live in Wyoming.
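The arithmetic above can be checked in a few lines of R. This is a minimal sketch that uses only the counts quoted in this example; the object names (respondents, population, p_sampled) are made up for illustration.

```r
# Respondent counts and population sizes quoted above (2022 ASEC)
respondents <- c(CA = 14822, WY = 2199)
population  <- c(CA = 39029342, WY = 581381)

# Average probability of being sampled in each state
p_sampled <- respondents / population
round(100 * p_sampled, 2)   # expressed in percent

# How many times more likely a WY resident is to be sampled than a CA resident
ratio <- unname(p_sampled["WY"] / p_sampled["CA"])
ratio
```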

How oversamples affect data analysis

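Oversamples are great for subgroup estimates. But for national-level estimates, they can bias our results if we ignore them: relative to the population, the raw sample contains too many people from Wyoming and too few from California. To correct this, we need to down-weight respondents from Wyoming and up-weight respondents from California.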
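A small simulation can make the problem concrete. This is only a sketch under made-up assumptions: the two states, their population sizes, and their rates are invented for illustration and do not come from the CPS. It uses weighted.mean() from base R.

```r
set.seed(2022)

# Made-up setup: a large state and a small state with different true rates.
# None of these numbers come from the CPS.
pop_size  <- c(big = 1e6, small = 1e4)
true_rate <- c(big = 0.04, small = 0.08)
n_sampled <- c(big = 2000, small = 2000)  # the small state is heavily oversampled

# True national rate: population-weighted average of the state rates
national_truth <- sum(pop_size * true_rate) / sum(pop_size)

# Draw a simulated sample of 0/1 outcomes from each state
y <- c(rbinom(n_sampled[["big"]], 1, true_rate[["big"]]),
       rbinom(n_sampled[["small"]], 1, true_rate[["small"]]))
state <- rep(c("big", "small"), times = n_sampled)

# Inverse probability of sampling: population size / sample size
w <- (pop_size / n_sampled)[state]

mean(y)              # unweighted: pulled toward the oversampled small state
weighted.mean(y, w)  # weighted: should land near national_truth
national_truth
```

In this setup the unweighted mean treats the two states as equally large, while the weighted mean restores each state's share of the population.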

How much to upweight? Inverse probability of sampling

Weights tell us how many people in the population a given observation represents. To calculate a sampling weight, take 1 / (probability this person was chosen for the sample). In practice, weighting is more complicated than this because survey administrators also adjust weights for differential nonresponse across population subgroups (for example, through post-stratification).

Example redux: California and Wyoming

Suppose Californians are sampled with probability 0.0004. Then each sampled Californian represents 1 / 0.0004 = 2,500 people, so each should receive a weight of 2,500. Working out the same math for Wyoming, where the probability is 0.004, each sampled Wyoming resident should receive a weight of 1 / 0.004 = 250. The total weight on these two samples will then be approximately proportional to the sizes of these two populations.
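The same calculation in R, as a minimal sketch. The object names are made up for illustration, and the respondent counts are the 2022 ASEC figures quoted earlier.

```r
# Rounded sampling probabilities used in the example above
p_sampled <- c(CA = 0.0004, WY = 0.004)

# Sampling weight = 1 / probability of being sampled
w <- 1 / p_sampled
w

# Total weight in each state's sample: number of respondents times weight.
# Compare with the actual populations (39,029,342 and 581,381). The totals
# are close but not exact because the probabilities were rounded.
respondents <- c(CA = 14822, WY = 2199)
total_weight <- respondents * w
total_weight
```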

Exercise: Creating a weighted.quantile() function

We will let survey administrators create weights. But we need to be able to use them, and base R does not have a canned function for weighted quantiles. We will write one!

Write a function that accepts three arguments

x, a numeric vector
q, a numeric value between 0 and 1 for the quantile to be estimated (e.g. 0.5 for the median)
w, a numeric vector of sampling weights

Your function will look something like this

weighted.quantile <- function(x, q, w) {
	# Carry out operations on x, q, w
	# Produce an estimate
	return(estimate)
}

Simulated data

As you work on your function, you might want some simple simulated data to practice with. You can generate data with the code below.

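library(tidyverse)
set.seed(1)  # optional: makes the simulated draws reproducible
# x is uniform on (0, 1); larger values of x receive larger weights
sim <- data.frame(x = runif(100)) %>%
  mutate(w = sqrt(x))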

Tips: Possible function structure

To produce the estimate, you might consider the following steps:

Create a data frame with x and w as columns
Arrange the data frame by the values of x. See arrange()
Create a new column cdf for the cumulative distribution function
- Calculate the cumulative sum of weight at each observation. See cumsum()
- Calculate the total sum of the weight. See sum()
- Divide the cumulative sum by the total sum
Filter to the first case where cdf > q
Return the x value of that case
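To see what the cdf column should contain, here is a tiny hand-checkable illustration of the sorting and cumulative-sum steps only. It is a sketch, not a solution: the toy values are made up, and it uses base R's order() where the tips suggest arrange(). Filtering on q and wrapping everything in a function are left for the exercise.

```r
# Toy data, made up for illustration: four observations with unequal weights
toy <- data.frame(x = c(30, 10, 40, 20),
                  w = c(1, 2, 3, 4))

# Sort by x (a base R stand-in for dplyr's arrange(x))
toy <- toy[order(toy$x), ]

# Weighted CDF: cumulative weight divided by total weight
toy$cdf <- cumsum(toy$w) / sum(toy$w)
toy
```

After sorting, the cumulative weights are 2, 6, 7, and 10 out of a total of 10, so 60 percent of the total weight sits at or below x = 20.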

Finished with your function?

Now go back to the class exercise from last week. Replace quantile with your custom weighted.quantile, using the weight asecwt.

Challenge if you finish

Could you write a function that accepts raw data, prepares the data, and returns the plot we created? Why might we want such a function?