# Problem Set 3: Income Prediction Challenge

**Due: 5pm on Wednesday, March 6.**

Want to see how you’ll be evaluated? Check out the rubric.

Student identifier: [type your anonymous identifier here]

- Use this .qmd template to complete the problem set
- In Canvas, you will upload the PDF produced by your .qmd file
- Put your identifier above, not your name! We want anonymous grading to be possible

This problem set is connected to the PSID Income Prediction Challenge from discussion.

## Income Prediction Challenge

**Collaboration note.** This question is an individual write-up connected to your group work from discussion. We expect that the approach you describe may be the same as that of your other group members, but your answers to these questions should be in your own words.

**1.1 (5 points)** How did you choose the predictor variables you used? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.

**1.2 (5 points)** What learning algorithms or models did you consider, and how did you choose one? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.

**1.3 (20 points)** Split the `learning` data randomly into `train` and `test`. Your split can be 50-50 or another ratio. Learn in the `train` set and make predictions in the `test` set. What do you estimate for your out-of-sample mean squared error? There is no written answer here; the answer is the code and result.
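As a rough illustration of the mechanics only, the sketch below runs a 50-50 split and computes an out-of-sample mean squared error on simulated data. The simulated `learning` data frame and its columns `x1`, `x2`, and `y` are hypothetical stand-ins, not the PSID variables, and `lm()` is used only as a placeholder learner; your own predictors and model will differ.

```r
# Hypothetical stand-in for the real `learning` data:
# simulated columns x1, x2 (predictors) and y (outcome).
set.seed(1)
n <- 1000
learning <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
learning$y <- 1 + 2 * learning$x1 - learning$x2 + rnorm(n)

# Randomly pick half of the row indices for train; the rest form test.
train_rows <- sample(nrow(learning), size = floor(0.5 * nrow(learning)))
train <- learning[train_rows, ]
test <- learning[-train_rows, ]

# Learn in train only (placeholder model), then predict in test.
fit <- lm(y ~ x1 + x2, data = train)
test_pred <- predict(fit, newdata = test)

# Out-of-sample mean squared error: average squared prediction error in test.
mse <- mean((test$y - test_pred)^2)
mse
```

Calling `set.seed()` before `sample()` makes the random split reproducible each time the document is rendered.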

## Create a new task

The predictability of life outcomes is not likely to be the same in every setting. Imagine you were designing a challenge like this one in a new setting, to study how outcomes change over the life course or across generations.

**2.1 (5 points)** From what population would you draw your sample?

**2.2 (5 points)** What outcome would you study?

**2.3 (5 points)** What predictors would you include?

**2.4 (5 points)** Why would it be interesting in your setting if predictions were accurate? Why would it be interesting if predictions were inaccurate?

## Grad. Machine learning versus statistics

This question is required for grad students. It is optional for undergrads, and worth no extra credit.

**20 points.** This question is about the relative gain in this problem as we move from no model to a statistical model to a machine learning model.

First, use your `train` set to estimate 3 learners and predict in your `test` set.

- (a) No model. For every `test` observation, predict the mean of the `train` outcomes.
- (b) Ordinary Least Squares. Choose a set of predictors \(\vec{X}\). For every `test` observation, predict using a linear model `lm()` fit to the `train` set with the predictors \(\vec{X}\).
- (c) Machine learning. Use the same set of predictors \(\vec{X}\). For every `test` observation, predict using a machine learning model fit to the `train` set with the predictors \(\vec{X}\). Your machine learning model could be a Generalized Additive Model (`gam()`), a decision tree (`rpart()`), or some other machine learning approach.

Report your out-of-sample mean squared error estimates for each approach. How did mean squared error change from (a) to (b)? From (b) to (c)?

Interpret what you found. To what degree does machine learning improve predictability, beyond what can be achieved by Ordinary Least Squares?
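One possible way to organize the three-learner comparison is sketched below on simulated data. The data frame, the columns `x1`, `x2`, `y`, and the helper `mse()` are hypothetical names introduced for illustration, and `rpart()` is just one of the machine learning options mentioned above. The simulated outcome is nonlinear by construction, so the numbers it produces say nothing about what you will find in the real data.

```r
library(rpart)  # decision trees; ships with standard R installations

# Hypothetical simulated data standing in for `learning`.
set.seed(2)
n <- 2000
learning <- data.frame(x1 = runif(n, -2, 2), x2 = rnorm(n))
learning$y <- sin(2 * learning$x1) + 0.5 * learning$x2 + rnorm(n, sd = 0.5)

# 50-50 random split into train and test.
train_rows <- sample(n, size = n / 2)
train <- learning[train_rows, ]
test <- learning[-train_rows, ]

# Hypothetical helper: mean squared error of predictions in the test set.
mse <- function(pred) mean((test$y - pred)^2)

# No model: predict the train mean for every test observation.
pred_mean <- rep(mean(train$y), nrow(test))

# Ordinary Least Squares with a chosen set of predictors.
ols_fit <- lm(y ~ x1 + x2, data = train)

# Machine learning: a decision tree with the same predictors.
tree_fit <- rpart(y ~ x1 + x2, data = train)

# Collect the out-of-sample MSE of each approach in one named vector.
results <- c(
  no_model = mse(pred_mean),
  ols = mse(predict(ols_fit, newdata = test)),
  tree = mse(predict(tree_fit, newdata = test))
)
results
```

Keeping the split and the predictor set fixed across all three learners means any difference in mean squared error reflects the learner rather than the data each one saw.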