Using weights
Surveys often oversample some subgroups in order to produce reliable estimates for those subgroups. For example, in the Current Population Survey (CPS):
- The goal is to estimate unemployment in each state
- They need enough sample size in each state, even the small ones
As a result, the probability of being in the sample is higher for those residing in states with small populations.
An example: California and Wyoming
In 2022, California had 14,822 ASEC respondents out of a population of 39,029,342. Wyoming had 2,199 ASEC respondents out of 581,381 residents. The average probability that a CA resident was sampled was about 0.04 percent, whereas the same probability in WY was about 0.4 percent. You are about 10 times as likely to be sampled for the ASEC if you live in Wyoming.
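As a quick check of these figures, the sampling rates can be computed directly in R. This is a minimal sketch that uses only the counts quoted above; the variable names are illustrative.

```r
# Counts quoted in the text (2022 ASEC respondents and state populations)
ca_sample <- 14822
ca_pop    <- 39029342
wy_sample <- 2199
wy_pop    <- 581381

# Average probability of being sampled in each state
p_ca <- ca_sample / ca_pop
p_wy <- wy_sample / wy_pop

100 * p_ca   # about 0.04 percent
100 * p_wy   # about 0.4 percent
p_wy / p_ca  # about 10
```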
How oversamples affect data analysis
Oversamples are great for subgroup estimates. But for national-level estimates, they can mess us up if we are not careful. In this case, we need to down-weight respondents from Wyoming and up-weight respondents from California.
How much to upweight? Inverse probability of sampling
Weights tell us how many people a given observation represents. To calculate a sampling weight, take 1 / (probability that this person was chosen for the sample). In practice, weighting is more complicated than this because survey administrators also adjust weights for differential nonresponse across population subgroups (a method called post-stratification).
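To see why inverse probability weights matter for a pooled estimate, here is a small sketch. The sample sizes and populations are the California and Wyoming figures from the earlier example; the state-level rates in `y` are invented purely for illustration and are not real estimates.

```r
# Sample sizes and populations from the California / Wyoming example
n   <- c(CA = 14822, WY = 2199)
pop <- c(CA = 39029342, WY = 581381)

# Hypothetical state-level rates, invented for illustration only
y <- c(CA = 0.05, WY = 0.03)

# Inverse probability weight per respondent: 1 / (n / pop)
w <- pop / n

# Pooled estimates across both states
unweighted <- sum(n * y) / sum(n)          # about 0.0474, pulled toward Wyoming
weighted   <- sum(n * w * y) / sum(n * w)  # about 0.0497
truth      <- sum(pop * y) / sum(pop)      # population-level rate, about 0.0497
```

In this toy setup the weighted estimate matches the population-level rate, while the unweighted estimate gives Wyoming more influence than its population share warrants.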
Example redux: California and Wyoming
Suppose Californians are sampled with probability 0.0004. Then each sampled Californian represents 1 / 0.0004 = 2,500 people, so each should receive a weight of 2,500. Working out the same math for Wyoming, where the sampling probability is 0.004, each sampled Wyoming resident should receive a weight of 1 / 0.004 = 250. The total weight on these two samples will then be proportional to the sizes of these two populations.
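The same arithmetic in R, as a rough sketch. It assumes the rounded probabilities 0.0004 for California and 0.004 for Wyoming (0.4 percent, from the earlier example), so the totals only approximately recover the state populations.

```r
# Rounded sampling probabilities (assumed for illustration)
p <- c(CA = 0.0004, WY = 0.004)

# Inverse probability weights
w <- 1 / p
w  # CA: 2500, WY: 250

# Total weight in each state = sample size times weight
n <- c(CA = 14822, WY = 2199)
total_weight <- n * w
total_weight  # CA: 37,055,000; WY: 549,750 (close to the state populations)
```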
Exercise: Creating a weighted.quantile() function
We will let survey administrators create weights. But we need to be able to use them, and base R does not have a canned function for weighted quantiles. We will write one!
Write a function that accepts three arguments:
- x, a numeric vector
- q, a numeric value for the quantile to be estimated
- w, a numeric vector of sampling weights
Your function will look something like this:
weighted.quantile <- function(x, q, w) {
  # Carry out operations on x, q, w
  # Produce an estimate
  return(estimate)
}
Simulated data
As you work on your function, you might want some simple simulated data to practice with. You can generate data with the code below.
library(tidyverse)
# Simulate 100 draws of x from Uniform(0, 1), with weight w = sqrt(x)
sim <- data.frame(x = runif(100)) %>%
  mutate(w = sqrt(x))
Tips: Possible function structure
To produce the estimate, you might consider the following steps:
- Create a data frame with these variables
- Arrange the data frame by the values of x. See arrange()
- Create a new column cdf for the cumulative distribution function
  - Calculate the cumulative sum of weight at each observation. See cumsum()
  - Calculate the total sum of the weight. See sum()
  - Divide the cumulative sum by the total sum
- Filter to the first case where cdf > q
- Return the x value of that case
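If you get stuck, the steps above can be sketched as follows. This is one possible implementation, not the only one: it swaps in base R's order() and which() for arrange() and filter() so that it runs without any packages, and it assumes q is less than 1 and that x and w contain no missing values. The small test vectors are invented for illustration.

```r
weighted.quantile <- function(x, q, w) {
  # Sort the observations (and their weights) by the value of x
  ord <- order(x)
  x_sorted <- x[ord]
  w_sorted <- w[ord]

  # Weighted CDF: cumulative weight divided by total weight
  cdf <- cumsum(w_sorted) / sum(w_sorted)

  # x value of the first case where cdf > q
  estimate <- x_sorted[which(cdf > q)[1]]
  return(estimate)
}

# Hand-checkable example (values invented for illustration)
x <- c(10, 20, 30, 40)
weighted.quantile(x, q = 0.5, w = c(1, 1, 1, 5))  # 40: the heavy weight on 40 pulls the median up
weighted.quantile(x, q = 0.5, w = c(1, 1, 1, 1))  # 30 with equal weights
```

Trying a tiny example like this by hand is a useful way to check your own version before applying it to real data.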
Finished with your function?
Now go back to the class exercise from last week. Replace quantile with your custom weighted.quantile, using the weight asecwt.
Challenge if you finish
Could you write a function that accepts raw data, prepares the data, and returns the plot we created? Why might we want such a function?