`<- read_csv("https://info3370.github.io/data/baseball_with_record.csv") population `

# Statistical Learning

The reading with this class is Berk 2020 Ch 1 p. 1–5, stopping at paragraph ending “…is nonlinear.” Then p. 14–17 “Model misspecification…” through “…will always be in play.”

Statistical learning is a term that captures a broad set of ideas in statistics and machine learning. This page focuses on one sense of statistical learning: using data on a sample to learn a subgroup mean in the population.

As an example, we continue to use the data on baseball salaries, with a small twist. The file `baseball_with_record.csv`

contains the following variables

`player`

is the player name`team`

is the team name`salary`

is the 2023 salary`record`

is the 2022 proportion of games won by that team`target_subgroup`

is coded`TRUE`

for the L.A. Dodgers and`FALSE`

for all others

Our goal: using a sample, estimate the mean salary of all Dodger players in 2023. Because we have the population, we know the true mean is $6.23m.

A sparse sample will hinder our ability to accomplish the goal. We will work with samples containing many MLB players, but only a few Dodgers. We will use statistical learning strategies to pool information from those other teams’ players to help us make a better estimate of the Dodger mean salary.

Our predictor will be the `record`

from the previous year. We assume that teams with similar win percentages in 2022 might have similar salaries in 2023.

## Prepare our data environment

For illustration, draw a sample of 5 players per team

```
<- population |>
sample group_by(team) |>
sample_n(5) |>
ungroup()
```

Construct a tibble with the observation to be predicted: the Dodgers.

```
<- population |>
to_predict filter(target_subgroup) |>
distinct(team, record)
```

## Ordinary least squares

We could model salary next year as a linear function of team record by Ordinary Least Squares. In math, OLS produces a prediction \[\hat{Y}_i = \hat\alpha + \hat\beta X_i\] with \(\hat\alpha\) and \(\hat\beta\) chosen to minimize the sum of squared errors, \(\sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2\). Visually, it minimizes all the line segments below.

Here is how to estimate an OLS model using R.

`<- lm(salary ~ record, data = sample) model `

Then we could predict the mean salary for the Dodgers.

```
|>
to_predict mutate(predicted = predict(model, newdata = to_predict))
```

```
# A tibble: 1 × 3
team record predicted
<chr> <dbl> <dbl>
1 L.A. Dodgers 0.685 6668053.
```

Our model-based estimate compares to the true population mean of $6.23m.

## Penalized regression

Penalized regression is just like OLS, except that it prefers coefficient estimates that are closer to 0. This can reduce sampling variability. One penalized regression is ridge regression, which penalizes the sum of squared coefficients. In our example, it estimates the parameters to minimize

\[\underbrace{\sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2}_\text{Squared Error} + \underbrace{\lambda\beta^2}_\text{Penalty}\]

where the positive scalar penalty \(\lambda\) encodes our preference for coefficients to be near zero. Otherwise, penalized regression is just like OLS!

The `gam()`

function in the `mgcv`

package will allow you to fit a ridge regression as follows.

`library(mgcv)`

```
<- gam(
model ~ s(record, bs = "re"),
salary data = sample
)
```

Predict the Dodger mean salary just as before,

```
|>
to_predict mutate(predicted = predict(model, newdata = to_predict))
```

```
# A tibble: 1 × 3
team record predicted
<chr> <dbl> <dbl[1d]>
1 L.A. Dodgers 0.685 6252449.
```

## Splines

We may want to allow a nonlinear relationship between the predictor and the outcome. One way to do that is with splines, which estimate part of the model locally within regions of the predictor space separated by **knots**. The code below uses a linear spline with knots at 0.4 and 0.6.

```
library(splines)
<- lm(
model ~ bs(record, degree = 1, knots = c(.4,.6)),
salary data = sample
)
```

We can predict the Dodger mean salary just as before!

```
|>
to_predict mutate(predicted = predict(model, newdata = to_predict))
```

```
# A tibble: 1 × 3
team record predicted
<chr> <dbl> <dbl>
1 L.A. Dodgers 0.685 3269869.
```

## Trees

Perhaps our response surface is bumpy, and poorly approximated by a smooth function. Decision trees search the predictor space for discrete places where the outcome changes, and assume that the response is flat within those regions.

```
library(rpart)
<- rpart(salary ~ record, data = sample) model
```

Predict as in the other strategies.

```
|>
to_predict mutate(predicted = predict(model, newdata = to_predict))
```

```
# A tibble: 1 × 3
team record predicted
<chr> <dbl> <dbl>
1 L.A. Dodgers 0.685 4253845.
```

## Conclusion

Statistical learning in this framing is all about

- we have a subgroup with few sampled units (the Dodgers)
- we want to use other units to help us learn
- our goal is to predict the population mean in the subgroup