library(tidyverse)

generate_data <- function(sample_size = 1000, num_features = 100) {
  # Predictors are independent Normal
  X <- replicate(num_features, rnorm(sample_size))
  colnames(X) <- paste0("x", 1:num_features)
  as_tibble(X) |>
    # Outcome is an independent Normal
    mutate(y = rnorm(n()))
}
Sample Splitting
A predictive model is an input-output function:
- the input is a set of features \(\vec{x}\)
- the output is a predicted outcome \(\hat{y}\)
The performance of a model can be measured by how well the output \(\hat{y}\) corresponds to the true outcome \(y\). This page considers three ways to assess that performance, all of which score predictions by mean squared error (defined just after the list).
- In-sample prediction
  - learn a model using our sample
  - predict in that same sample
  - evaluate mean squared error
- Out-of-sample prediction
  - learn a model in one sample
  - predict in a new sample from the same data generating process
  - evaluate mean squared error in the new sample
- Split-sample prediction
  - split our sampled cases randomly into training and testing sets
  - learn a model in the training set
  - predict in the testing set
  - evaluate mean squared error in the testing set
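For \(n\) cases with true outcomes \(y_i\) and predictions \(\hat{y}_i\), mean squared error is
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\]
so lower values indicate better predictive performance.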
Often, the purpose of a predictive model is to accomplish out-of-sample prediction (2). But when learning the model, often only one sample is available. Evaluation by split-sample prediction (3) is therefore often desirable because it most closely mimics this task.
Simulated setting
When a model is evaluated by in-sample prediction, there is a danger: even if the features have no predictive value in the population, a model might discover patterns that exist in the training sample due to random variation.
To illustrate this, we generate a simulation with 100 features \(x_1, \ldots, x_{100}\) and one outcome \(y\), all of which are independent normal variables. We know from the beginning that \(x_1, \ldots, x_{100}\) should be useless predictors: they contain no information about \(y\).
After loading tidyverse and writing the generate_data() function above, we apply it to generate one sample.
data <- generate_data(sample_size = 1000, num_features = 100)
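As a quick sanity check (added here for illustration, not part of the original walkthrough), the generated tibble should have 1,000 rows and 101 columns: the 100 features plus y.

dim(data)
# [1] 1000  101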
In-sample prediction
We then estimate a linear regression model, where the . in the formula y ~ . includes all features other than \(y\) (here \(x_1, \ldots, x_{100}\)) as predictors.
model <- lm(y ~ ., data = data)
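As a check on what the . expands to (a small aside added here for illustration), the fitted model should have 101 coefficients: one intercept plus one slope for each of the 100 features.

length(coef(model))
# [1] 101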
We also fit a benchmark of no model, which includes only an intercept.
no_model <- lm(y ~ 1, data = data)
We know from the simulation that the model is useless: the \(x\) variables contain no information about \(y\). But if we make predictions in-sample, we will see that the mean squared error of the model is surprisingly lower (better) than that of no model.
data |>
  mutate(
    predicted_model = predict(model),
    predicted_no_model = predict(no_model),
    squared_error_model = (y - predicted_model) ^ 2,
    squared_error_no_model = (y - predicted_no_model) ^ 2
  ) |>
  select(starts_with("squared")) |>
  summarize_all(.funs = mean)
# A tibble: 1 × 2
squared_error_model squared_error_no_model
<dbl> <dbl>
1 0.902 1.00
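Equivalently (a quick check added here, not part of the original walkthrough), the in-sample mean squared error is just the average squared residual of each fit, which reproduces the two numbers above.

mean(residuals(model) ^ 2)     # matches squared_error_model
mean(residuals(no_model) ^ 2)  # matches squared_error_no_model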
Out-of-sample prediction
The problem is that the model has fit noise in the data. With 100 noise predictors and 1,000 observations, ordinary least squares will explain roughly 10% of the in-sample variance by chance alone, which matches the drop in mean squared error from about 1.00 to about 0.90. We can see the overfitting with an out-of-sample performance assessment. First, we generate a new dataset of out-of-sample data.
out_of_sample <- generate_data()
Then, we use the model learned in data and predict in out_of_sample. By this evaluation, the model's predictions are now worse (higher mean squared error) than no model.
out_of_sample |>
  mutate(
    predicted_model = predict(model, newdata = out_of_sample),
    predicted_no_model = predict(no_model, newdata = out_of_sample),
    squared_error_model = (y - predicted_model) ^ 2,
    squared_error_no_model = (y - predicted_no_model) ^ 2
  ) |>
  select(starts_with("squared")) |>
  summarize_all(.funs = mean)
# A tibble: 1 × 2
squared_error_model squared_error_no_model
<dbl> <dbl>
1 1.05 0.981
Split-sample prediction
In practice, we often do not have a second sample. We can therefore mimic the out-of-sample task with a sample split, which we can create using the initial_split function in the rsample package,
library(rsample)
split <- initial_split(data, prop = .5)
which randomly assigns the data into training and testing sets of equal size. We can extract those data by typing training(split) and testing(split).
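As a small check (added here for illustration), with prop = .5 and 1,000 cases, each set should contain 500 rows.

nrow(training(split))  # 500 training cases
nrow(testing(split))   # 500 testing cases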
The strategy is to learn on training(split)
model <- lm(y ~ ., data = training(split))
no_model <- lm(y ~ 1, data = training(split))
and evaluate performance on testing(split).
testing(split) |>
  mutate(
    predicted_model = predict(model, newdata = testing(split)),
    predicted_no_model = predict(no_model, newdata = testing(split)),
    squared_error_model = (y - predicted_model) ^ 2,
    squared_error_no_model = (y - predicted_no_model) ^ 2
  ) |>
  select(starts_with("squared")) |>
  summarize_all(.funs = mean)
# A tibble: 1 × 2
squared_error_model squared_error_no_model
<dbl> <dbl>
1 1.29 1.01
Just like the out-of-sample prediction, this shows that the model is worse than no model at all. By sample splitting, we can learn this even when we have only one sample.
An important caveat is that sample splitting has a cost: the number of cases available for training is smaller once we split the sample. This can mean that the sample-split predictions will have worse performance than predictions trained on the full sample and evaluated out-of-sample.
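One way to manage this tradeoff, sketched here purely as an illustration (the split_80_20 object below is hypothetical, not from the original analysis), is to reserve a larger share of cases for training, accepting a smaller testing set and therefore a noisier performance estimate.

# Hypothetical alternative: keep 80% of cases for training, 20% for testing
split_80_20 <- initial_split(data, prop = .8)
nrow(training(split_80_20))
nrow(testing(split_80_20))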
Repeating this many times
If we repeat the above many times, we can see the distribution of these performance evaluation strategies across repeated samples.
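Below is a minimal sketch of one way to do this, assuming the packages and generate_data() function above plus the foreach package, which the page loads at this point. The object names and the number of repetitions are illustrative, not the page's exact code.

library(foreach)

# Repeat the full exercise and record the model's mean squared error under
# each evaluation strategy (the no-model benchmark is near 1 by construction).
# 100 repetitions chosen for illustration.
results <- foreach(rep = 1:100, .combine = rbind) %do% {
  sim <- generate_data(sample_size = 1000, num_features = 100)
  new_sim <- generate_data(sample_size = 1000, num_features = 100)
  sim_split <- initial_split(sim, prop = .5)

  full_model <- lm(y ~ ., data = sim)
  split_model <- lm(y ~ ., data = training(sim_split))

  tibble(
    in_sample = mean((sim$y - predict(full_model)) ^ 2),
    out_of_sample = mean((new_sim$y - predict(full_model, newdata = new_sim)) ^ 2),
    split_sample = mean((testing(sim_split)$y -
                           predict(split_model, newdata = testing(sim_split))) ^ 2)
  )
}

summary(results)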
By the gold standard of out-of-sample prediction, no model is better than a model. In-sample prediction yields the misleading appearance that a model is better than no model. Split-sample prediction successfully mimics the out-of-sample behavior when only one sample is available.
Closing thoughts
Sample splitting is an art as much as a science. In particular applications, the gain from sample splitting is not always clear and must be balanced against the reduction in cases available for training. It is important to remember that out-of-sample prediction remains the gold standard, and sample splitting is one way to approximate that when only one sample is available.