library(tidyverse)
library(scales)
lifeCourse <- read_csv("https://info3370.github.io/data/lifeCourse.csv")
Problem Set 1: Visualization
Due: 5pm on Wednesday, January 31.
Student identifier: [type your anonymous identifier here]
- Use this .qmd template to complete the problem set
- In Canvas, you will upload the PDF produced by your .qmd file
- Put your identifier above, not your name! We want anonymous grading to be possible
This problem set involves both data analysis and reading.
Data analysis
This problem set uses the data lifeCourse.csv.
The data contain life course earnings profiles for four cohorts of American workers: those born in 1940, 1950, 1960, and 1970. Each row contains a summary of the annual earnings distribution for a particular birth cohort at a particular age, among the subgroup with a particular level of education. To prepare these data, we aggregated microdata from the Current Population Survey, provided through the Integrated Public Use Microdata Series.
The data contain five variables.
- quantity is the metric by which the earnings distribution is summarized: 10th, 50th, or 90th percentile
- education is the educational subgroup being summarized: College Degree, Less than College
- cohort is the cohort (people with a given birth year) to which these data apply: 1940, 1950, 1960, 1970
- age is the age at which earnings were measured: 30–45
- income is the value for the given earnings percentile in the given subgroup. Income values are provided in 2022 dollars
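As a rough illustration of this layout, the sketch below builds a toy data frame with the same five columns and counts the rows in each trajectory. The name toy, the exact quantity labels, and the income values are assumptions invented for illustration; they are not taken from lifeCourse.csv.

```r
library(tidyverse)

# Toy stand-in with the five columns described above.
# The income numbers are invented placeholders, NOT real estimates,
# and the quantity labels are assumed spellings.
toy <- expand_grid(
  quantity = c("10th percentile", "50th percentile", "90th percentile"),
  education = c("College Degree", "Less than College"),
  cohort = c(1940, 1950, 1960, 1970),
  age = 30:45
) |>
  mutate(income = 20000 + 1000 * (age - 30))

# One row per quantity x education x cohort x age combination
nrow(toy)

# Each trajectory (one quantity within one education group and cohort)
# has one row per age
trajectory_sizes <- toy |>
  count(cohort, education, quantity)

trajectory_sizes
```

Running the same count() call on the real lifeCourse object is one way to check that every trajectory covers the expected ages before plotting.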
1. Visualize (25 points)
Use ggplot to visualize these data. To denote the different trajectories,
- make your plot using geom_point() or geom_line()
- use the x-axis for age
- use the y-axis for income
- use color for quantity
- use facet_grid to make a panel of facets where each row is an education value and each column is a cohort value
You should prepare the graph as though you were going to publish it. Modify the axis titles so that a reader would know what is on the axis. Use appropriate capitalization in all labels. Try using the label_dollar() function from the scales package so that the y-axis uses dollar values.
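The pieces named above can be combined roughly as in the following sketch. It runs on a toy data frame with invented incomes rather than on lifeCourse, and the object names toy and p as well as the axis titles are illustrative assumptions, not the required answer.

```r
library(tidyverse)
library(scales)

# Toy data with invented incomes; swap in lifeCourse for the real analysis.
# parse_number() pulls 10, 50, or 90 out of the assumed quantity labels
# so that the three toy trajectories do not overlap.
toy <- expand_grid(
  quantity = c("10th percentile", "50th percentile", "90th percentile"),
  education = c("College Degree", "Less than College"),
  cohort = c(1940, 1950, 1960, 1970),
  age = 30:45
) |>
  mutate(income = 1000 * (age - 10) * parse_number(quantity) / 10)

p <- toy |>
  ggplot(aes(x = age, y = income, color = quantity)) +
  geom_line() +
  # Formula interface: rows are education values, columns are cohort values
  facet_grid(education ~ cohort) +
  # label_dollar() returns a function that formats axis breaks as dollars
  scale_y_continuous(labels = label_dollar()) +
  labs(
    x = "Age",
    y = "Annual Income (2022 Dollars)",
    color = "Income Percentile"
  )

p
```

Passing label_dollar() to the labels argument changes only how the axis breaks are printed; the underlying income values are untouched.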
Your code should be well-formatted as defined by R4DS. In your produced PDF, no lines of code should run off the page.
Many different graphs can be equally correct. You will be evaluated on
- publication-ready graph aesthetics
- code that follows style conventions
# your code goes here
2. Interpret (10 points)
Write 2-3 sentences summarizing the trends that you see in the data.
2.1 (3 points). Focus on those born in 1970. For those with a college degree, how do the top and bottom of the income distribution change over the life course?
Type your answer here.
2.2 (3 points). Focus on those born in 1970. How does the pattern for those without college degrees differ from your answer in 2.1?
Type your answer here.
2.3 (4 points). How do the patterns you identified in 2.1 and 2.2 change from the 1940 to the 1970 cohort?
Type your answer here.
3. Connect to reading (15 points)
Read pp. 1–7 of the following paper. Stop before the section “Analytic Framework for Decomposing Inequality.”
Cheng, Siwei. 2021. The shifting life course patterns of wage inequality. Social Forces 100(1):1–28.
Our data are not the same as Cheng’s. But our analysis is able to reproduce many of her findings. Answer each question in two sentences or less.
Cheng discusses period trends, cohort trends, and age trends.
3.1 (3 points) Which dimension of your graph shows a cohort trend?
Type your answer here.
3.2 (3 points) Which dimension of your graph shows an age trend?
Type your answer here.
3.3 (3 points) Cheng discusses education-based cumulative advantage. Describe how you see this in your graph.
Type your answer here.
3.4 (3 points) Cheng discusses within-education trajectory heterogeneity. Describe how your graph shows heterogeneity of outcomes within educational categories.
Type your answer here.
3.5 (3 points) Cheng discusses wage volatility: how wages rise and fall over time for a given person. Why are our data (from the Current Population Survey) the wrong data for studying wage volatility?
Type your answer here.
Grad question: Model-based estimates
This question assumes familiarity with Ordinary Least Squares.
- For graduate students, this question is worth 20 points.
- For undergraduate students, this question is optional and worth 0 points.
The data contain nonparametric estimates that contain some noise: the data points provided partly reflect random variation because they are estimated in a sample.
Model-based estimates reduce noise by pooling information across observations, at the cost of introducing assumptions. Fit an OLS model to the data using age as a numeric variable and education, cohort, and quantity as factor variables. Interact all of these predictors with each other.
fit <- lm(
  income ~ age * factor(cohort) * education * quantity,
  data = lifeCourse
)
Effectively, this estimates a best-fit line through each set of points depicted in your original figure. For each observation, store a prediction from this model (see predict()).
Re-create your plot from (1) using
- geom_point() for the nonparametric estimates (as above)
- geom_line() for your model-based predicted values
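One possible way to store the predictions and layer them under the points is sketched below. It uses a toy data frame with invented, noisy incomes in place of lifeCourse; the names toy, toy_predicted, predicted, and p are hypothetical. The sketch also assumes no rows have missing values, so that predict() returns exactly one value per row.

```r
library(tidyverse)

# Toy data: invented linear trajectories plus random noise.
# Swap in lifeCourse for the real analysis.
set.seed(1)
toy <- expand_grid(
  quantity = c("10th percentile", "50th percentile", "90th percentile"),
  education = c("College Degree", "Less than College"),
  cohort = c(1940, 1950, 1960, 1970),
  age = 30:45
) |>
  mutate(
    income = 1000 * (age - 10) * parse_number(quantity) / 10 +
      rnorm(n(), sd = 2000)
  )

# Fully interacted OLS model: a separate line for each trajectory
fit <- lm(
  income ~ age * factor(cohort) * education * quantity,
  data = toy
)

# With no newdata argument, predict() returns the fitted value
# for each row used to fit the model
toy_predicted <- toy |>
  mutate(predicted = predict(fit))

p <- toy_predicted |>
  ggplot(aes(x = age, color = quantity)) +
  # Points show the noisy nonparametric estimates
  geom_point(aes(y = income)) +
  # Lines show the model-based predictions
  geom_line(aes(y = predicted)) +
  facet_grid(education ~ cohort)

p
```

Mapping y separately inside each geom keeps the observed and predicted values on the same panels while sharing the x and color mappings.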
Computing environment
Leave this at the bottom of your file, and it will record information such as your operating system, R version, and package versions. This is helpful for resolving any differences in results across people.
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_1.2.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0
[5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[9] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] bit_4.0.5 gtable_0.3.4 jsonlite_1.8.7 crayon_1.5.2
[5] compiler_4.3.2 tidyselect_1.2.0 parallel_4.3.2 yaml_2.3.7
[9] fastmap_1.1.1 R6_2.5.1 generics_0.1.3 knitr_1.44
[13] htmlwidgets_1.6.2 munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[17] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12 xfun_0.40
[21] bit64_4.0.5 timechange_0.2.0 cli_3.6.1 withr_2.5.1
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.2 vroom_1.6.4
[29] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.4
[33] evaluate_0.22 glue_1.6.2 fansi_1.0.5 colorspace_2.1-0
[37] rmarkdown_2.25 tools_4.3.2 pkgconfig_2.0.3 htmltools_0.5.6.1