Machine learning: Nonlinear smooths with mgcv

Standard linear models assume that the response is a linear, additive function of predictors. Splines relax the first part of that assumption (linearity) while keeping additivity: for example, we might want to model children’s incomes as a nonlinear function of their parents’ and grandparents’ incomes.

The mgcv package supports this type of estimation. Start by preparing the environment.

library(tidyverse)
library(mgcv)
learning <- read_csv("learning.csv")
holdout_public <- read_csv("holdout_public.csv")

Now, fit a model with gam() from the mgcv package. This example asks R to predict respondent income (g3_log_income) as a smooth but potentially nonlinear function of parent income (g2_log_income), wrapped in s(), plus a parametric term for respondent education (g3_educ), which enters without smoothing.

fit <- gam(g3_log_income ~ s(g2_log_income) + g3_educ,
           data = learning)
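One way to see what the smooth has learned is to inspect the fitted object. The sketch below is a minimal illustration on simulated data, because learning.csv is not reproduced here. The data frame sim, its education categories, and the coefficients used to generate it are hypothetical stand-ins rather than the course data.

```r
library(mgcv)

# Simulated stand-in for the learning data (all values are hypothetical)
set.seed(1)
n <- 500
sim <- data.frame(
  g2_log_income = rnorm(n, mean = 10.5, sd = 0.8),
  g3_educ = factor(sample(c("High school", "Some college", "College"),
                          n, replace = TRUE))
)
sim$g3_log_income <- 10 + 0.5 * sin(sim$g2_log_income) +
  0.2 * (sim$g3_educ == "College") + rnorm(n, sd = 0.5)

# Same model formula as in the text, fit to the simulated data
sim_fit <- gam(g3_log_income ~ s(g2_log_income) + g3_educ, data = sim)

# Table for the smooth term: an effective degrees of freedom (edf)
# close to 1 suggests the estimated curve is nearly a straight line
summary(sim_fit)$s.table

# Plot the estimated smooth with a shaded confidence band
plot(sim_fit, shade = TRUE)
```

With the real data, the same two calls applied to fit would show how far the estimated curve departs from a line.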

Predict in the holdout set exactly as you would with OLS: call predict() and pass the holdout data as newdata.

fitted <- holdout_public %>%
  mutate(g3_log_income = predict(fit, newdata = holdout_public))
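If you also want a sense of how uncertain each prediction is, predict() for gam objects accepts se.fit = TRUE. The following is a sketch on simulated data; sim_learning and sim_holdout are hypothetical stand-ins for the course files, and treating g3_educ as a three-level factor is an assumption.

```r
library(mgcv)

# Simulated stand-ins for the learning and holdout data (hypothetical values)
set.seed(2)
educ_levels <- c("High school", "Some college", "College")
make_data <- function(n) {
  data.frame(
    g2_log_income = rnorm(n, mean = 10.5, sd = 0.8),
    g3_educ = factor(sample(educ_levels, n, replace = TRUE),
                     levels = educ_levels)
  )
}
sim_learning <- make_data(400)
sim_learning$g3_log_income <- 10 + 0.5 * sin(sim_learning$g2_log_income) +
  0.2 * (sim_learning$g3_educ == "College") + rnorm(400, sd = 0.5)
sim_holdout <- make_data(100)

sim_fit <- gam(g3_log_income ~ s(g2_log_income) + g3_educ,
               data = sim_learning)

# se.fit = TRUE returns a list with point predictions and standard errors
pred <- predict(sim_fit, newdata = sim_holdout, se.fit = TRUE)
sim_holdout$g3_log_income <- as.numeric(pred$fit)
sim_holdout$se <- as.numeric(pred$se.fit)

head(sim_holdout)
```

The holdout data must contain every predictor used in the formula, under the same names, for predict() to work.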

In this case, the nonlinearity we detect might be mostly noise; it is possible that OLS is in fact the better algorithm!
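One way to check whether the smooth helps is to compare out-of-sample error for the GAM and for OLS. The sketch below does this with a simple train/test split of simulated data. The data, the 70/30 split, and the helper mse() are all assumptions for illustration; with the course data you would split learning instead, since the public holdout file may not include the outcome.

```r
library(mgcv)

# Simulated stand-in for the learning data (hypothetical values)
set.seed(3)
n <- 1000
sim <- data.frame(
  g2_log_income = rnorm(n, mean = 10.5, sd = 0.8),
  g3_educ = factor(sample(c("High school", "Some college", "College"),
                          n, replace = TRUE))
)
sim$g3_log_income <- 4 + 0.6 * sim$g2_log_income +
  0.2 * (sim$g3_educ == "College") + rnorm(n, sd = 0.5)

# Split into a training set (70%) and a test set (30%)
train_rows <- sample(n, size = 0.7 * n)
train <- sim[train_rows, ]
test <- sim[-train_rows, ]

# Fit both candidate models on the training set only
gam_fit <- gam(g3_log_income ~ s(g2_log_income) + g3_educ, data = train)
ols_fit <- lm(g3_log_income ~ g2_log_income + g3_educ, data = train)

# Mean squared prediction error on the test set
mse <- function(model) {
  predicted <- as.numeric(predict(model, newdata = test))
  mean((test$g3_log_income - predicted)^2)
}

results <- c(gam = mse(gam_fit), ols = mse(ols_fit))
results
```

Whichever model has the lower test-set error is the better predictor on that split; a single split is noisy, so small differences should not be over-interpreted.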

To learn more, I would recommend:

- typing library(mgcv) and then ?gam in your R console
- Wood, Simon. 2006. Generalized Additive Models: An Introduction with R.
- Wood’s website of supporting materials

