
tidymodels

Dr. D’Agostino McGowan


tidymodels

lm_spec <-
  linear_reg() %>%            # pick linear regression
  set_engine(engine = "lm")   # set the engine

lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm

lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
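To get predictions from the fitted model, a minimal sketch (predict() on a parsnip fit returns a tibble with a .pred column):

lm_fit %>%
  predict(new_data = Auto)   # one .pred per row of Auto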

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
## <Analysis/Assess/Total>
## <196/196/392>

  • Extract the training and testing data

training(Auto_split)
testing(Auto_split)

Validation set approach

Auto_train <- training(Auto_split)
Auto_train
## # A tibble: 196 x 9
##      mpg cylinders displacement horsepower weight acceleration  year origin
##    <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl>
##  1    18         8          307        130   3504         12      70      1
##  2    18         8          318        150   3436         11      70      1
##  3    16         8          304        150   3433         12      70      1
##  4    17         8          302        140   3449         10.5    70      1
##  5    15         8          429        198   4341         10      70      1
##  6    15         8          390        190   3850          8.5    70      1
##  7    14         8          340        160   3609          8      70      1
##  8    15         8          400        150   3761          9.5    70      1
##  9    24         4          113         95   2372         15      70      3
## 10    22         6          198         95   2833         15.5    70      1
## # … with 186 more rows, and 1 more variable: name <fct>
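A minimal sketch of finishing the validation set approach by hand, assuming the yardstick metrics that tidymodels loads: fit on the training half, predict on the held-out half, and compute RMSE:

Auto_test <- testing(Auto_split)
lm_fit <- fit(lm_spec, mpg ~ horsepower, data = Auto_train)
lm_fit %>%
  predict(new_data = Auto_test) %>%    # predictions in .pred
  bind_cols(Auto_test) %>%             # attach the observed mpg
  rmse(truth = mpg, estimate = .pred)  # yardstick::rmse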

A faster way!

  • You can use last_fit() and specify the split
  • This will automatically train the model on the training data from the split
  • Instead of specifying which metric to calculate (with rmse, as before), you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)
lm_fit %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.87
## 2 rsq     standard       0.625

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
## #  5-fold cross-validation
## # A tibble: 5 x 2
##   splits           id
##   <list>           <chr>
## 1 <split [313/79]> Fold1
## 2 <split [313/79]> Fold2
## 3 <split [314/78]> Fold3
## 4 <split [314/78]> Fold4
## 5 <split [314/78]> Fold5
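Each row of the splits column is an rsample split object; a quick sketch of peeking at one fold with analysis() and assessment() (both from rsample):

first_fold <- Auto_cv$splits[[1]]
nrow(analysis(first_fold))    # 313 rows used for fitting
nrow(assessment(first_fold))  # 79 rows held out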

What if we wanted to do some preprocessing

  • For the shrinkage methods we discussed, it was important to scale the variables

What does this mean?

What would happen if we scale before doing cross-validation? Will we get different answers?

What if we wanted to do some preprocessing

Auto_scaled <- Auto %>%
  mutate(horsepower = scale(horsepower))
sd(Auto_scaled$horsepower)
## [1] 1

Auto_cv_scaled <- vfold_cv(Auto_scaled, v = 5)
map_dbl(Auto_cv_scaled$splits,
        function(x) {
          dat <- as.data.frame(x)$horsepower  # analysis rows of this fold
          sd(dat)                             # SD within the fold
        })
## [1] 1.0115202 1.0025849 0.9834936 0.9733806 1.0293404

  • Scaling was done on the full data, so within each fold the SD is no longer exactly 1: information from the assessment rows leaked into the preprocessing of the analysis rows

What if we wanted to do some preprocessing

  • recipe()!
  • Using the recipe() function along with step_*() functions, we can specify preprocessing steps and R will automagically apply them to each fold appropriately.

rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)
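To peek at what the recipe will produce, a sketch using prep() and bake() from recipes (in current releases bake(new_data = NULL) returns the prepped training data; older versions used juice()):

rec %>%
  prep(training = Auto) %>%  # estimate the scaling factor from Auto
  bake(new_data = NULL)      # mpg unchanged, horsepower scaled to SD 1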

Where do we plug in this recipe?

  • The recipe gets plugged into the fit_resamples() function

Auto_cv <- vfold_cv(Auto, v = 5)
rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)
results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
results %>%
  collect_metrics()
## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.88      5  0.317
## 2 rsq     standard   0.613     5  0.0249
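collect_metrics() averages over the five folds; if you want the fold-by-fold values, a sketch using the summarize argument of tune's collect_metrics():

results %>%
  collect_metrics(summarize = FALSE)  # one row per fold per metric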

What if we want to predict mpg with more variables

  • Now we still want to add a step to scale predictors
  • We could either write out all predictors individually to scale them
  • OR we could use the all_predictors() shorthand.

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())

Putting it together

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())
results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
results %>%
  collect_metrics()
## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.22      5  0.272
## 2 rsq     standard   0.709     5  0.0153

Ridge, Lasso, and Elastic net

  • When specifying your model, you can indicate whether you would like to use ridge, lasso, or elastic net. We can write a general equation to minimize:

$$RSS + \lambda\left((1-\alpha)\sum_{j=1}^p\beta_j^2+\alpha\sum_{j=1}^p|\beta_j|\right)$$

lm_spec <- linear_reg() %>%
  set_engine("glmnet")

  • First specify the engine. We'll use glmnet
  • The linear_reg() function has two additional parameters, penalty and mixture
  • penalty is \(\lambda\) from our equation.
  • mixture is a number between 0 and 1 representing \(\alpha\)
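Plugging in the endpoints of \(\alpha\) recovers the two penalties we have seen:

$$\alpha = 0:\quad RSS + \lambda\sum_{j=1}^p\beta_j^2 \;\;(\text{ridge}) \qquad\qquad \alpha = 1:\quad RSS + \lambda\sum_{j=1}^p|\beta_j| \;\;(\text{lasso})$$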

Ridge, Lasso, and Elastic net

$$RSS + \lambda\left((1-\alpha)\sum_{j=1}^p\beta_j^2+\alpha\sum_{j=1}^p|\beta_j|\right)$$

What would we set mixture to in order to perform Ridge regression?

ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>%
  set_engine("glmnet")

Ridge, Lasso, and Elastic net

$$RSS + \lambda\left((1-\alpha)\sum_{j=1}^p\beta_j^2+\alpha\sum_{j=1}^p|\beta_j|\right)$$

ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>%
  set_engine("glmnet")
lasso_spec <- linear_reg(penalty = 5, mixture = 1) %>%
  set_engine("glmnet")
enet_spec <- linear_reg(penalty = 60, mixture = 0.7) %>%
  set_engine("glmnet")
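As a sanity check (a sketch, assuming tidy() support for glmnet fits in your parsnip version), we could fit one of these specs and inspect the shrunken coefficients at its penalty:

fit(ridge_spec,
    mpg ~ horsepower + displacement + weight,
    data = Auto) %>%
  tidy()  # coefficient estimates at penalty = 100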

Okay, but we wanted to look at 3 different models!

ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>%
  set_engine("glmnet")
results <- fit_resamples(ridge_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

lasso_spec <- linear_reg(penalty = 5, mixture = 1) %>%
  set_engine("glmnet")
results <- fit_resamples(lasso_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
elastic_spec <- linear_reg(penalty = 60, mixture = 0.7) %>%
  set_engine("glmnet")
results <- fit_resamples(elastic_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

  • 😱 this looks like copy + pasting!

tune 🎶

penalty_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

  • Notice the code above has tune() for the penalty and the mixture. Those are the things we want to vary!
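Printing the spec shows the placeholders (output sketched here, following the format of the lm_spec printout earlier):

penalty_spec
## Linear Regression Model Specification (regression)
##
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
##
## Computational engine: glmnet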

tune 🎶

  • Now we need to create a grid of potential penalties (\(\lambda\)) and mixtures (\(\alpha\)) that we want to test
  • Instead of fit_resamples() we are going to use tune_grid()

grid <- expand_grid(penalty = seq(0, 100, by = 10),
                    mixture = seq(0, 1, by = 0.2))
results <- tune_grid(penalty_spec,
                     preprocessor = rec,
                     grid = grid,
                     resamples = Auto_cv)

  • The grid has 11 penalties × 6 mixtures = 66 combinations, so with two metrics we get 132 rows of results

tune 🎶

results %>%
  collect_metrics()
## # A tibble: 132 x 7
##    penalty mixture .metric .estimator  mean     n std_err
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
##  1       0     0   rmse    standard   4.23      5  0.280
##  2       0     0   rsq     standard   0.708     5  0.0166
##  3       0     0.2 rmse    standard   4.22      5  0.273
##  4       0     0.2 rsq     standard   0.709     5  0.0154
##  5       0     0.4 rmse    standard   4.22      5  0.273
##  6       0     0.4 rsq     standard   0.709     5  0.0154
##  7       0     0.6 rmse    standard   4.22      5  0.273
##  8       0     0.6 rsq     standard   0.709     5  0.0154
##  9       0     0.8 rmse    standard   4.22      5  0.273
## 10       0     0.8 rsq     standard   0.709     5  0.0153
## # … with 122 more rows

Subset results

results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)
## # A tibble: 66 x 7
##    penalty mixture .metric .estimator  mean     n std_err
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
##  1       0     0.2 rmse    standard    4.22     5   0.273
##  2       0     0.6 rmse    standard    4.22     5   0.273
##  3       0     0.4 rmse    standard    4.22     5   0.273
##  4       0     0.8 rmse    standard    4.22     5   0.273
##  5       0     1   rmse    standard    4.22     5   0.273
##  6       0     0   rmse    standard    4.23     5   0.280
##  7      10     0   rmse    standard    4.73     5   0.308
##  8      20     0   rmse    standard    5.29     5   0.313
##  9      10     0.2 rmse    standard    5.37     5   0.316
## 10      30     0   rmse    standard    5.70     5   0.314
## # … with 56 more rows

  • Since this is a data frame, we can do things like filter and arrange!

Which would you choose?
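Rather than eyeballing the sorted tibble, tune also has helpers; a sketch with show_best() and select_best():

results %>%
  show_best(metric = "rmse")     # top candidates by mean RMSE
best <- results %>%
  select_best(metric = "rmse")   # a one-row tibble of penalty + mixture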
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(penalty, mean, color = factor(mixture), group = factor(mixture))) +
  geom_line() +
  geom_point() +
  labs(y = "RMSE")

[Plot from the code above: mean RMSE versus penalty, one line per mixture value]

Putting it all together

  • Often we can use a combination of all of these tools together
  • First split our data
  • Do cross-validation on just the training data to tune the parameters
  • Use last_fit() with the selected parameters, specifying the split data so that it is evaluated on the held-out test sample

Putting it all together

auto_split <- initial_split(Auto, prop = 0.5)
auto_train <- training(auto_split)
auto_cv <- vfold_cv(auto_train, v = 5)
rec <- recipe(mpg ~ horsepower + displacement + weight, data = auto_train) %>%
  step_scale(all_predictors())
tuning <- tune_grid(penalty_spec,
                    rec,
                    grid = grid,
                    resamples = auto_cv)
tuning %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)
## # A tibble: 66 x 7
##    penalty mixture .metric .estimator  mean     n std_err
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
##  1       0     0   rmse    standard    4.48     5   0.195
##  2       0     1   rmse    standard    4.49     5   0.223
##  3       0     0.8 rmse    standard    4.49     5   0.223
##  4       0     0.6 rmse    standard    4.49     5   0.223
##  5       0     0.4 rmse    standard    4.51     5   0.228
##  6       0     0.2 rmse    standard    4.51     5   0.228
##  7      10     0   rmse    standard    4.90     5   0.170
##  8      20     0   rmse    standard    5.44     5   0.203
##  9      10     0.2 rmse    standard    5.52     5   0.216
## 10      30     0   rmse    standard    5.84     5   0.228
## # … with 56 more rows

Putting it all together

final_spec <- linear_reg(penalty = 0, mixture = 0) %>%
  set_engine("glmnet")
fit <- last_fit(final_spec,
                rec,
                split = auto_split)
fit %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.07
## 2 rsq     standard       0.714
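A last_fit() object also stores the test-set predictions; a sketch of pulling them out:

fit %>%
  collect_predictions()  # .pred and mpg for the held-out test rows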
