Problem Set 3: Income Prediction Challenge

Due: 5pm on Wednesday, March 6.

Note

Want to see how you’ll be evaluated? Check out the rubric

Student identifier: [type your anonymous identifier here]

Use this .qmd template to complete the problem set.
In Canvas, upload the PDF produced by your .qmd file.
Put your identifier above, not your name, so that grading can be anonymous.

This problem set is connected to the PSID Income Prediction Challenge from discussion.

Income Prediction Challenge

Collaboration note. This question is an individual write-up connected to your group work from discussion. The approach you describe may be the same as that of your other group members, but your answers to these questions should be in your own words.

1.1 (5 points) How did you choose the predictor variables you used? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.

1.2 (5 points) What learning algorithms or models did you consider, and how did you choose one? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.

1.3 (20 points) Split the learning data randomly into train and test. Your split can be 50-50 or another ratio. Learn in the train set and make predictions in the test set. What do you estimate for your out-of-sample mean squared error? There is no written answer here; the answer is the code and result.
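The split-and-evaluate workflow in 1.3 can be sketched as follows. This is a minimal illustration on simulated data, not the challenge data: the data frame `learning`, the columns `x` and `y`, and the use of `lm()` as the learner are all assumptions made for the example. Substitute your own learning data, predictors, and model.

```r
# Simulated stand-in for the learning data; all names here are hypothetical.
set.seed(1)
n <- 500
learning <- data.frame(x = rnorm(n))
learning$y <- 2 + 3 * learning$x + rnorm(n)

# Randomly assign half of the rows to the train set (a 50-50 split).
train_rows <- sample(seq_len(n), size = n / 2)
train <- learning[train_rows, ]
test <- learning[-train_rows, ]

# Learn in the train set only.
fit <- lm(y ~ x, data = train)

# Predict in the test set, then average the squared prediction errors.
test_pred <- predict(fit, newdata = test)
mse <- mean((test$y - test_pred)^2)
mse
```

Because the test rows play no role in fitting, `mse` estimates how well the model predicts cases it has not seen. Setting a seed makes the random split reproducible when the document is re-rendered.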

Create a new task

The predictability of life outcomes is not likely to be the same in every setting. Imagine you were designing a challenge like this one in a new setting, to study how outcomes change over the life course or across generations.

2.1 (5 points) From what population would you draw your sample?

2.2 (5 points) What outcome would you study?

2.3 (5 points) What predictors would you include?

2.4 (5 points) Why would it be interesting in your setting if predictions were accurate? Why would it be interesting if predictions were inaccurate?

Grad. Machine learning versus statistics

This question is required for grad students. It is optional for undergrads, and worth no extra credit.

20 points. This question is about the relative gain in this problem as we move from no model to a statistical model to a machine learning model.

First, use your train set to estimate 3 learners and predict in your test set.

(a) No model. For every test observation, predict the mean of the train outcomes.
(b) Ordinary Least Squares. Choose a set of predictors \(\vec{X}\). For every test observation, predict using a linear model (lm()) fit to the train set with the predictors \(\vec{X}\).
(c) Machine learning. Use the same set of predictors \(\vec{X}\). For every test observation, predict using a machine learning model fit to the train set with the predictors \(\vec{X}\). Your machine learning model could be a Generalized Additive Model (gam()), a decision tree (rpart()), or some other machine learning approach.
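One way to organize the three learners is sketched below. It runs on simulated data, so the data frame `sim`, the predictors `x1` and `x2`, and the helper `mse()` are hypothetical names chosen for illustration. The decision tree is just one of the machine learning options listed above, and the sketch assumes the rpart package is installed (it is usually included with R).

```r
library(rpart)  # decision trees; assumed to be installed

# Simulated data with a nonlinear pattern; all names here are hypothetical.
set.seed(2)
n <- 1000
sim <- data.frame(x1 = runif(n, -2, 2), x2 = runif(n, -2, 2))
sim$y <- sim$x1^2 + sim$x2 + rnorm(n)

# Random 50-50 split into train and test.
train_rows <- sample(seq_len(n), size = n / 2)
train <- sim[train_rows, ]
test <- sim[-train_rows, ]

# Helper: mean squared error of a vector of predictions in the test set.
mse <- function(pred) mean((test$y - pred)^2)

# (a) No model: predict the train mean for every test observation.
pred_mean <- rep(mean(train$y), nrow(test))

# (b) Ordinary Least Squares fit to the train set.
fit_ols <- lm(y ~ x1 + x2, data = train)

# (c) Machine learning: a decision tree with the same predictors.
fit_tree <- rpart(y ~ x1 + x2, data = train)

# Out-of-sample mean squared error for each approach.
results <- c(
  no_model = mse(pred_mean),
  ols = mse(predict(fit_ols, newdata = test)),
  tree = mse(predict(fit_tree, newdata = test))
)
results
```

Keeping the same train and test sets and the same predictors across all three learners means that differences in `results` reflect the learning approach rather than the split or the inputs.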

Report your out-of-sample mean squared error estimates for each approach. How did mean squared error change from (a) to (b)? From (b) to (c)?

Interpret what you found. To what degree does machine learning improve predictability, beyond what can be achieved by Ordinary Least Squares?