class: center, middle, inverse, title-slide

# Cross validation
### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan <i>adapted from slides by Hastie & Tibshirani</i>
</span>
</div>

---

## Cross validation

### 💡 Big idea

* We have determined that it is sensible to use a _test_ set to calculate metrics like prediction error

--

.question[
Why?
]

---

## Cross validation

### 💡 Big idea

* We have determined that it is sensible to use a _test_ set to calculate metrics like prediction error

.question[
How have we done this so far?
]

---

## Cross validation

### 💡 Big idea

* We have determined that it is sensible to use a _test_ set to calculate metrics like prediction error
* What if we don't have a separate data set to test our model on?

--

* 🎉 We can use **resampling** methods to **estimate** the test-set prediction error

---

## Training error versus test error

.question[
What is the difference? Which is typically larger?
]

--

* The **training error** is calculated by using the same observations that were used to fit the statistical learning model

--

* The **test error** is calculated by using a statistical learning method to predict the response of **new** observations

--

* The **training error rate** typically _underestimates_ the true prediction error rate

---

![](img/05/model-complexity.png)

---

## Estimating prediction error

* Best case scenario: we have a large data set to test our model on

--

* This is not always the case!

--

💡 Let's instead estimate the test error by holding out a subset of the training observations from the model fitting process, and then applying the statistical learning method to those held-out observations

---

## Approach #1: Validation set

* Randomly divide the available set of samples into two parts: a **training set** and a **validation set**

--

* Fit the model on the **training set**, then calculate the prediction error on the **validation set**

--

.question[
If we have a **quantitative outcome**, what metric would we use to calculate this test error?
]

--

* Often we use Mean Squared Error (MSE)

---

## Approach #1: Validation set

* Randomly divide the available set of samples into two parts: a **training set** and a **validation set**
* Fit the model on the **training set**, then calculate the prediction error on the **validation set**

.question[
If we have a **qualitative outcome** (classification), what metric would we use to calculate this test error?
]

--

* Often we use the misclassification rate

---

## Approach #1: Validation set

<img src="05-cv_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

--

`$$\Large\color{orange}{MSE_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}[y_i-\hat{f}(x_i)]^2}$$`

--

`$$\Large\color{orange}{Err_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}I[y_i\neq \mathcal{\hat{C}}(x_i)]}$$`

---

## Approach #1: Validation set

Auto example:

* We have 392 observations
* Trying to predict `mpg` from `horsepower`
* We can split the data in half and use 196 observations to fit the model and 196 to test it

![](05-cv_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
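---

## Approach #1: Validation set (in code)

A minimal sketch of this split in R, assuming the `Auto` data come from the **ISLR** package; the seed and which rows land in each half are arbitrary choices:

```r
# Validation-set approach for the Auto example: predict mpg from horsepower
library(ISLR)

set.seed(1)
train_id <- sample(nrow(Auto), nrow(Auto) / 2)        # 196 rows for training

fit <- lm(mpg ~ horsepower, data = Auto[train_id, ])  # fit on the training half

pred <- predict(fit, newdata = Auto[-train_id, ])     # predict the held-out half
mean((Auto$mpg[-train_id] - pred)^2)                  # validation-set MSE
```

A different seed gives a different split, and therefore a different MSE estimate — that variability is what the next slides show.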
---

## Approach #1: Validation set

<img src="05-cv_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split}}}\)`
]

---

## Approach #1: Validation set

Auto example:

* We have 392 observations
* Trying to predict `mpg` from `horsepower`
* We can split the data in half and use 196 observations to fit the model and 196 to test it - **what if we did this many times?**

![](05-cv_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

## Approach #1: Validation set (Drawbacks)

* The validation estimate of the test error can be highly variable, depending on which observations are included in the training set and which are included in the validation set

--

* In the validation approach, only a subset of the observations (those in the training set rather than the validation set) are used to fit the model

--

* Therefore, the validation set error may tend to **overestimate** the test error for the model fit on the entire data set

---

## Approach #2: K-fold cross validation

💡 The idea is to do the following:

* Randomly divide the data into `\(K\)` equal-sized parts

--

* Leave out part `\(k\)`, fit the model to the other `\(K - 1\)` parts (combined)

--

* Obtain predictions for the left-out `\(k\)`th part

--

* Do this for each part `\(k = 1, 2, \dots, K\)`, and then combine the results

---

## K-fold cross validation

<img src="05-cv_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split-1}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split-2}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split-3}}}\)`
]

--

<img src="05-cv_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
.right[
`\(\color{orange}{MSE_{\texttt{test-split-4}}}\)`
]

--

.right[
**Take the mean of the `\(K\)` MSE values**
]

---

## Estimating prediction error (quantitative outcome)

* Split the data into `\(K\)` parts, where `\(C_1, C_2, \dots, C_K\)` indicate the indices of observations in part `\(k\)`

`$$CV_{(K)} = \sum_{k=1}^K\frac{n_k}{n}MSE_k$$`

--

* `\(MSE_k = \sum_{i \in C_k} (y_i - \hat{y}_i)^2/n_k\)`

--

* `\(n_k\)` is the number of observations in group `\(k\)`
* `\(\hat{y}_i\)` is the fit for observation `\(i\)` obtained from the data with part `\(k\)` removed

--

* If we set `\(K = n\)`, we have `\(n\)`-fold cross validation, which is the same as **leave-one-out cross validation** (LOOCV)
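---

## Approach #2: K-fold cross validation (in code)

A sketch of `\(CV_{(K)}\)` for the Auto example, again assuming the **ISLR** package; the choice of `\(K = 5\)`, the seed, and the fold assignment are arbitrary:

```r
# K-fold cross validation for mpg ~ horsepower in the Auto data
library(ISLR)

set.seed(1)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(Auto)))  # randomly assign each row to a fold

mse_k <- sapply(1:K, function(k) {
  fit  <- lm(mpg ~ horsepower, data = Auto[folds != k, ])  # fit on the other K - 1 parts
  pred <- predict(fit, newdata = Auto[folds == k, ])       # predict the left-out part
  mean((Auto$mpg[folds == k] - pred)^2)                    # MSE_k
})

sum(table(folds) / nrow(Auto) * mse_k)  # CV_(K): weighted average of the K fold MSEs
```

For models fit with `glm()`, the `cv.glm()` function in the **boot** package automates this calculation.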
---

## Leave-one-out cross validation

<img src="05-cv_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

--

<img src="05-cv_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

--

`$$\vdots$$`

<img src="05-cv_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

<img src="05-cv_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

<!-- ## Special Case! -->
<!-- * With _linear_ regression, you can actually calculate the LOOCV error without having to iterate! -->
<!-- `$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$$` -->
<!-- -- -->
<!-- * `\(\hat{y}_i\)` is the `\(i\)`th fitted value from the linear model -->
<!-- -- -->
<!-- * `\(h_i\)` is the diagonal of the "hat" matrix (remember that! 🎩) -->
<!-- --- -->

## Picking `\(K\)`

* `\(K\)` can vary from 2 (splitting the data in half each time) to `\(n\)` (LOOCV)

--

* LOOCV is sometimes useful, but the estimates from each fold are usually highly correlated, so their average can have **high variance**

--

* A better choice tends to be `\(K = 5\)` or `\(K = 10\)`

---

## Bias variance trade-off

* Since each training set is only `\((K - 1)/K\)` as big as the original training set, the estimates of prediction error will typically be **biased** upward

--

* This bias is minimized when `\(K = n\)` (LOOCV), but this estimate has **high variance**

--

* `\(K = 5\)` or `\(K = 10\)` provides a nice compromise for the bias-variance trade-off

---

## Approach #2: K-fold Cross Validation

Auto example:

* We have 392 observations
* Trying to predict `mpg` from `horsepower`

![](05-cv_files/figure-html/fig1-1.png)<!-- -->

---

## Estimating prediction error (qualitative outcome)

* The premise is the same as cross validation for quantitative outcomes
* Split the data into `\(K\)` parts, where `\(C_1, C_2, \dots, C_K\)` indicate the indices of observations in part `\(k\)`

`$$CV_K = \sum_{k=1}^K\frac{n_k}{n}Err_k$$`

--

* `\(Err_k = \sum_{i\in C_k}I(y_i\neq\hat{y}_i)/n_k\)` (misclassification rate)

--

* `\(n_k\)` is the number of observations in group `\(k\)`
* `\(\hat{y}_i\)` is the fit for observation `\(i\)` obtained from the data with part `\(k\)` removed
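---

## Estimating prediction error (qualitative outcome, in code)

A sketch of `\(CV_K\)` with a misclassification-rate metric, using a hypothetical binary outcome built from the Auto data (whether `mpg` is above its median); the **ISLR** package, `\(K = 5\)`, the seed, and the 0.5 cutoff are all assumptions for illustration:

```r
# K-fold cross validation for a classification model
library(ISLR)

set.seed(1)
Auto$high_mpg <- as.numeric(Auto$mpg > median(Auto$mpg))  # hypothetical binary outcome

K     <- 5
folds <- sample(rep(1:K, length.out = nrow(Auto)))

err_k <- sapply(1:K, function(k) {
  fit  <- glm(high_mpg ~ horsepower, data = Auto[folds != k, ], family = binomial)
  phat <- predict(fit, newdata = Auto[folds == k, ], type = "response")
  yhat <- as.numeric(phat > 0.5)                 # classify using a 0.5 cutoff
  mean(Auto$high_mpg[folds == k] != yhat)        # Err_k: misclassification rate
})

sum(table(folds) / nrow(Auto) * err_k)  # CV_K: weighted average of the K fold error rates
```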