Problem Set 1: Visualization

Due: 5pm on Wednesday, January 31.

Before you start

Create an anonymous identifier for yourself here! Want to see how you’ll be evaluated? Check out the rubric

Student identifer: [type your anonymous identifier here]

This problem set involves both data analysis and reading.

Data analysis

This problem set uses the data lifeCourse.csv.

lifeCourse <- read_csv("")

The data contain life course earnings profiles for four cohorts of American workers: those born in 1940, 1950, 1960, and 1970. Each row contains a summary of the annual earnings distribution for a particular birth cohort at a particular age, among the subgroup with a particular level of education. To prepare these data, we aggregated microdata from the Current Population Survey, provided through the Integrated Public Use Microdata Series.

The data contain five variables.

  1. quantity is the metric by which the earnings distribution is summarized: 10th, 50th, or 90th percentile
  2. education is the educational subgroup being summarized: College Degree, Less than College
  3. cohort is the cohort (people with a given birth year) to which these data apply: 1940, 1950, 1960, 1970
  4. age is the age at which earnings were measured: 30–45
  5. income is the value for the given earnings percentile in the given subgroup. Income values are provided in 2022 dollars

1. Visualize (25 points)

Use ggplot to visualize these data. To denote the different trajectories,

  • make your plot using geom_point() or geom_line()
  • use the x-axis for age
  • use the y-axis for income
  • use color for quantity
  • use facet_grid to make a panel of facets where each row is an education value and each column is a cohort value

You should prepare the graph as though you were going to publish it. Modify the axis titles so that a reader would know what is on the axis. Use appropriate capitalization in all labels. Try using the label_dollar() function from the scales package so that the y-axis uses dollar values.

Your code should be well-formatted as defined by R4DS. In your produced PDF, no lines of code should run off the page.

Many different graphs can be equally correct. You will be evaluated by

  • having publication-ready graph aesthetics
  • code that follows style conventions
# your code goes here

2. Interpret (10 points)

Write 2-3 sentences summarizing the trends that you see in the data.

2.1 (3 points). Focus on those born in 1970. For those with a college degree, how do the top and bottom of the income distribution change over the life course?

Type your answer here.

2.2 (3 points). Focus on those born in 1970. How does the pattern differ for those without college degres differ from your answer in 2.1?

Type your answer here.

2.3 (4 points). How do the patterns you identified in 2.1 and 2.2 change from the 1940 to the 1970 cohort?

Type your answer here.

3. Connect to reading (15 points)

Read p. 1–7 of following paper. Stop before the section “Analytic Framework for Decomposing Inequality.”

Cheng, Siwei. 2021. The shifting life course patterns of wage inequality.. Social Forces 100(1):1–28.

Our data are not the same as Cheng’s. But our analysis is able to reproduce many of her findings. Answer each question in two sentences or less.

Cheng discusses period trends, cohort trends, and age trends.

3.1 (3 points) Which dimension of your graph shows a cohort trend?

Type your answer here.

3.2 (3 points) Which dimension of your graph shows an age trend?

Type your answer here.

3.3 (3 points) Cheng discusses education-based cumulative advantage. Describe how you see this in your graph.

Type your answer here.

3.4 (3 points) Cheng discusses within-education trajectory heterogeneity. Describe how your graph shows heterogeneity of outcomes within educational categories.

Type your answer here.

3.5 (3 points) Cheng discusses wage volatility: how wages rise and fall over time for a given person. Why is our data (the Current Population Survey) the wrong dataset to study wage volatility?

Type your answer here.

Grad question: Model-based estimates

This question assumes familiarity with Ordinary Least Squares.

  • For graduate students, this question is worth 20 points.
  • For undergraduate students, this question is optional and worth 0 points.

The data contain nonparametric estimates that contain some noise: the data points provided partly reflect random variation because they are estimatd in a sample.

Model-based estimates reduce noise by pooling information across observations, at the cost of introducing assumptions. Fit an OLS model to the data using age as a numeric variable and education, cohort, and quantity as factor variables. Interact all of these predictors with each other.

fit <- lm(
  income ~ age * factor(cohort) * education * quantity,
  data = lifeCourse

Effectively, this estimates a best-fit line through each set of points depicted in your original figure. For each observation, store a prediction from this model (see predict()).

Re-create your plot from (1) using

  • geom_point() for the nonparametric estimates (as above)
  • geom_line() for your model-based predicted values

Computing environment

Leave this at the bottom of your file, and it will record information such as your operating system, R version, and package versions. This is helpful for resolving any differences in results across people.

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.2.1    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0  
 [5] dplyr_1.1.3     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
 [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] bit_4.0.5         gtable_0.3.4      jsonlite_1.8.7    crayon_1.5.2     
 [5] compiler_4.3.2    tidyselect_1.2.0  parallel_4.3.2    yaml_2.3.7       
 [9] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    knitr_1.44       
[13] htmlwidgets_1.6.2 munsell_0.5.0     pillar_1.9.0      tzdb_0.4.0       
[17] rlang_1.1.1       utf8_1.2.3        stringi_1.7.12    xfun_0.40        
[21] bit64_4.0.5       timechange_0.2.0  cli_3.6.1         withr_2.5.1      
[25] magrittr_2.0.3    digest_0.6.33     grid_4.3.2        vroom_1.6.4      
[29] rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.3   vctrs_0.6.4      
[33] evaluate_0.22     glue_1.6.2        fansi_1.0.5       colorspace_2.1-0 
[37] rmarkdown_2.25    tools_4.3.2       pkgconfig_2.0.3   htmltools_0.5.6.1
Back to top