
tidymodels

Dr. D’Agostino McGowan

1 / 27

tidymodels

  • tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
  • All packages share an underlying philosophy and a common grammar.
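
A quick aside (my addition, not on the original slide): loading the meta-package attaches the core packages used throughout this deck.

library(tidymodels)   # attaches parsnip, rsample, yardstick, broom, ggplot2, dplyr, ...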
2 / 27

Step 1: Specify the model

  • Pick the model
  • Set the engine
3 / 27

Specify the model

linear_reg() %>%
  set_engine("lm")
4 / 27

Specify the model

linear_reg() %>%
  set_engine("glmnet")
5 / 27

Specify the model

linear_reg() %>%
  set_engine("spark")
6 / 27

Specify the model

rand_forest() %>%
  set_engine("ranger")
7 / 27

Specify the model

  • All available models:

https://www.tidymodels.org
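
A quick way to see this from R (my addition; show_engines() is a parsnip helper):

show_engines("linear_reg")   # lists the available engines and modes for linear_reg()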

8 / 27

Specify Model

Using tidymodels, write a pipe that specifies a linear regression model with lm() as the engine. Save it as lm_spec and print the object. What does it return?

Hint: you'll need https://www.tidymodels.org

9 / 27

lm_spec <-
  linear_reg() %>%            # Pick linear regression
  set_engine(engine = "lm")   # Set the engine
lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
10 / 27

Fit the data

  • You can train your model using the fit() function
fit(lm_spec,
    mpg ~ horsepower,
    data = Auto)
## parsnip model object
##
## Fit time: 7ms
##
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
##
## Coefficients:
## (Intercept) horsepower
## 39.9359 -0.1578
11 / 27

Fit Model

Fit the model:

library(ISLR)
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit

Does this give the same results as the following?

lm(mpg ~ horsepower, data = Auto)
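
One way to check (a sketch added here, not part of the original exercise): the parsnip object stores the underlying lm fit in lm_fit$fit, so the coefficients can be compared directly.

coef(lm_fit$fit)                          # coefficients from the parsnip fit
coef(lm(mpg ~ horsepower, data = Auto))   # coefficients from plain lm()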
12 / 27

Get predictions

lm_fit %>%
  predict(new_data = Auto)
  • Uses the predict() function
  • ‼️ new_data has an underscore
  • 😄 This automagically creates a data frame
13 / 27

Get predictions

lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)
## # A tibble: 392 x 10
## .pred mpg cylinders displacement horsepower weight acceleration year
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.4 18 8 307 130 3504 12 70
## 2 13.9 15 8 350 165 3693 11.5 70
## 3 16.3 18 8 318 150 3436 11 70
## 4 16.3 16 8 304 150 3433 12 70
## 5 17.8 17 8 302 140 3449 10.5 70
## 6 8.68 15 8 429 198 4341 10 70
## 7 5.21 14 8 454 220 4354 9 70
## 8 6.00 14 8 440 215 4312 8.5 70
## 9 4.42 14 8 455 225 4425 10 70
## 10 9.95 15 8 390 190 3850 8.5 70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>
14 / 27

Get predictions

Edit the code below to add the original data to the predicted data.

mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  ---
15 / 27

Get predictions

mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)
mpg_pred
## # A tibble: 392 x 10
## .pred mpg cylinders displacement horsepower weight acceleration year
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.4 18 8 307 130 3504 12 70
## 2 13.9 15 8 350 165 3693 11.5 70
## 3 16.3 18 8 318 150 3436 11 70
## 4 16.3 16 8 304 150 3433 12 70
## 5 17.8 17 8 302 140 3449 10.5 70
## 6 8.68 15 8 429 198 4341 10 70
## 7 5.21 14 8 454 220 4354 9 70
## 8 6.00 14 8 440 215 4312 8.5 70
## 9 4.42 14 8 455 225 4425 10 70
## 10 9.95 15 8 390 190 3850 8.5 70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>
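
As an optional visual check (my addition; ggplot2 is attached with tidymodels), the fitted line can be plotted against the observed values:

ggplot(mpg_pred, aes(x = horsepower)) +
  geom_point(aes(y = mpg)) +                     # observed mpg
  geom_line(aes(y = .pred), colour = "blue") +   # model predictions
  labs(y = "mpg")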
16 / 27

Calculate the error

  • Root mean square error
mpg_pred %>%
  rmse(truth = mpg, estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.89

What is this estimate? (training error? testing error?)

17 / 27

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
## <Analysis/Assess/Total>
## <196/196/392>
  • Extract the training and testing data
training(Auto_split)
testing(Auto_split)
18 / 27

Validation set approach

Auto_train <- training(Auto_split)
Auto_train
## # A tibble: 196 x 9
## mpg cylinders displacement horsepower weight acceleration year origin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 8 454 220 4354 9 70 1
## 2 15 8 383 170 3563 10 70 1
## 3 14 8 340 160 3609 8 70 1
## 4 14 8 455 225 3086 10 70 1
## 5 24 4 113 95 2372 15 70 3
## 6 18 6 199 97 2774 15.5 70 1
## 7 21 6 200 85 2587 16 70 1
## 8 25 4 110 87 2672 17.5 70 2
## 9 24 4 107 90 2430 14.5 70 2
## 10 25 4 104 95 2375 17.5 70 2
## # … with 186 more rows, and 1 more variable: name <fct>
19 / 27

Validation Set

Copy the code below and fill in the blanks to fit a model on the training data, then calculate the test RMSE.

set.seed(100)
Auto_split <- ________
Auto_train <- ________
Auto_test <- ________
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = ________)
mpg_pred <- ________ %>%
  predict(new_data = ________) %>%
  bind_cols(________)
rmse(________, truth = ________, estimate = ________)
20 / 27

A faster way!

  • You can use last_fit() and specify the split
  • This will automatically train the model on the training data from the split
  • Instead of specifying which metric to calculate (with rmse() as before), you can use collect_metrics() to automatically calculate the metrics on the test data from the split
set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)
lm_fit %>%
  collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.87
## 2 rsq standard 0.625
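
A related helper (my addition, not on the original slide): last_fit() also keeps the held-out predictions, which collect_predictions() extracts.

lm_fit %>%
  collect_predictions()   # test-set predictions from the last_fit() result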
21 / 27

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [313/79]> Fold1
## 2 <split [313/79]> Fold2
## 3 <split [314/78]> Fold3
## 4 <split [314/78]> Fold4
## 5 <split [314/78]> Fold5
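
To peek inside one resample (my addition; analysis() and assessment() are rsample accessors), pull a split out of the splits column:

first_fold <- Auto_cv$splits[[1]]
analysis(first_fold)     # rows used for fitting in this fold
assessment(first_fold)   # rows held out in this fold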
22 / 27

What about cross validation?

fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)
23 / 27

What about cross validation?

fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [313/79]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
## 2 <split [313/79]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
## 3 <split [314/78]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
## 4 <split [314/78]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
## 5 <split [314/78]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
24 / 27

What about cross validation?

How do we get the metrics out? With collect_metrics() again!

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)
results %>%
  collect_metrics()
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 rmse standard 4.93 5 0.0779
## 2 rsq standard 0.611 5 0.0277
25 / 27

K-fold cross validation

Edit the code below to get the 5-fold cross validation error rate for the following model:

$$mpg = \beta_0 + \beta_1 \text{horsepower} + \beta_2 \text{horsepower}^2 + \epsilon$$

Auto_cv <- vfold_cv(Auto, v = 5)
results <- fit_resamples(lm_spec,
                         ----,
                         resamples = ---)
results %>%
  collect_metrics()
  • What do you think rsq is?
26 / 27