Problem Set 3: Income Prediction Challenge
Due: 5pm on Wednesday, March 6.
Want to see how you’ll be evaluated? Check out the rubric
Student identifier: [type your anonymous identifier here]
- Use this .qmd template to complete the problem set
- In Canvas, you will upload the PDF produced by your .qmd file
- Put your identifier above, not your name! We want anonymous grading to be possible
This problem set is connected to the PSID Income Prediction Challenge from discussion.
Income Prediction Challenge
Collaboration note. This question is an individual write-up connected to your group work from discussion. We expect that the approach you describe might be the same as that of your other group members, but your answers to these questions should be in your own words.
1.1 (5 points) How did you choose the predictor variables you used? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.
1.2 (5 points) What learning algorithms or models did you consider, and how did you choose one? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.
1.3 (20 points) Split the learning data randomly into train and test. Your split can be 50-50 or another ratio. Learn in the train set and make predictions in the test set. What do you estimate for your out-of-sample mean squared error? There is no written answer here; the answer is the code and result.
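As a rough illustration of the workflow in 1.3, the sketch below splits a data frame 50-50, fits a model in train, and computes mean squared error in test. It is not the challenge data: the data frame `learning` here is simulated, and the column names `x1`, `x2`, and `y` are hypothetical placeholders. The choice of `lm()` is also only an assumption for the sake of the example; your own model may differ.

```r
# Simulated stand-in for the learning data (hypothetical columns, NOT the PSID file).
set.seed(14850)
n <- 1000
learning <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)
)
learning$y <- 1 + 2 * learning$x1 - learning$x2 + rnorm(n)

# Randomly assign half of the rows to train; the remaining rows form test.
train_rows <- sample(seq_len(n), size = floor(0.5 * n))
train <- learning[train_rows, ]
test <- learning[-train_rows, ]

# Learn in the train set only.
fit <- lm(y ~ x1 + x2, data = train)

# Predict in the test set and average the squared prediction errors.
test$yhat <- predict(fit, newdata = test)
mse_out_of_sample <- mean((test$y - test$yhat)^2)
mse_out_of_sample
```

Because the model never sees the test rows while learning, the final number estimates how well predictions would perform on new cases drawn from the same population.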
Create a new task
The predictability of life outcomes is not likely to be the same in every setting. Imagine you were designing a challenge like this one in a new setting, to study how outcomes change over the life course or across generations.
2.1 (5 points) From what population would you draw your sample?
2.2 (5 points) What outcome would you study?
2.3 (5 points) What predictors would you include?
2.4 (5 points) Why would it be interesting in your setting if predictions were accurate? Why would it be interesting if predictions were inaccurate?
Grad. Machine learning versus statistics
This question is required for grad students. It is optional for undergrads, and worth no extra credit.
20 points. This question is about the relative gain in this problem as we move from no model to a statistical model to a machine learning model.
First, use your train set to estimate 3 learners and predict in your test set.
- (a) No model. For every test observation, predict the mean of the train outcomes.
- (b) Ordinary Least Squares. Choose a set of predictors \(\vec{X}\). For every test observation, predict using a linear model lm() fit to the train set with the predictors \(\vec{X}\).
- (c) Machine learning. Use the same set of predictors \(\vec{X}\). For every test observation, predict using a machine learning model fit to the train set with the predictors \(\vec{X}\). Your machine learning model could be a Generalized Additive Model (gam()), a decision tree (rpart()), or some other machine learning approach.
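The three learners listed above might be compared along the following lines. This is a minimal sketch on simulated data, not the challenge data: the data frame `sim`, its columns `x1`, `x2`, and `y`, and the helper `mse()` are all hypothetical names introduced for illustration. A decision tree from `rpart()` is used as the machine learning model only as one possible choice.

```r
library(rpart)  # decision trees; assumed to be installed (it ships with most R installations)

# Simulated data with a nonlinear relationship (hypothetical, NOT the PSID file).
set.seed(14850)
n <- 2000
sim <- data.frame(x1 = runif(n, -2, 2), x2 = rnorm(n))
sim$y <- sin(2 * sim$x1) + 0.5 * sim$x2 + rnorm(n, sd = 0.5)

# Random 50-50 split into train and test.
train_rows <- sample(seq_len(n), size = n / 2)
train <- sim[train_rows, ]
test <- sim[-train_rows, ]

# Hypothetical helper: mean squared error of a vector of predictions.
mse <- function(truth, prediction) mean((truth - prediction)^2)

# No model: predict the train mean for every test observation.
pred_mean <- rep(mean(train$y), nrow(test))

# Ordinary Least Squares: linear model fit to the train set.
fit_ols <- lm(y ~ x1 + x2, data = train)
pred_ols <- predict(fit_ols, newdata = test)

# Machine learning: decision tree fit to the train set with the same predictors.
fit_tree <- rpart(y ~ x1 + x2, data = train)
pred_tree <- predict(fit_tree, newdata = test)

# Out-of-sample mean squared error for each approach.
results <- c(
  no_model = mse(test$y, pred_mean),
  ols = mse(test$y, pred_ols),
  tree = mse(test$y, pred_tree)
)
results
```

Differences such as `results["no_model"] - results["ols"]` and `results["ols"] - results["tree"]` then summarize the change at each step. In simulated data the ranking depends entirely on how the data were generated, so nothing here indicates what you will find in the real data.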
Report your out-of-sample mean squared error estimates for each approach. How did mean squared error change from (a) to (b)? From (b) to (c)?
Interpret what you found. To what degree does machine learning improve predictability, beyond what can be achieved by Ordinary Least Squares?