tidymodels

Dr. D’Agostino McGowan

tidymodels

tidymodels.org

tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
All packages share an underlying philosophy and a common grammar.

Dr. Lucy D'Agostino McGowan  adapted from Alison Hill's Introduction to ML with the Tidyverse

Step 1: Specify the model

Pick the model
Set the engine

Specify the model

linear_reg() %>%
  set_engine("lm")

Specify the model

linear_reg() %>%
  set_engine("glmnet")

Specify the model

linear_reg() %>%
  set_engine("spark")

Specify the model

decision_tree() %>%
  set_engine("rpart")

Specify the model

All available models:

https://www.tidymodels.org
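
The engines registered for a model type can also be listed from within R. The following is a minimal sketch, assuming the tidymodels meta-package is installed and that the installed parsnip version provides show_engines(); the object names are illustrative.

```r
library(tidymodels)

# List the engines parsnip has registered for two model types.
# Engines supplied by extension packages only appear once those packages are loaded.
lm_engines   <- show_engines("linear_reg")
tree_engines <- show_engines("decision_tree")

lm_engines
tree_engines
```

Each result is a tibble with one row per engine and mode combination, so it can be filtered like any other data frame.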

`Specify Model`

Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?

Hint: you'll need https://www.tidymodels.org

lm_spec <- 
  linear_reg() %>% # Pick linear regression
  set_engine(engine = "lm") # set engine
lm_spec

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Fit the data

You can train your model using the fit() function

fit(lm_spec,
    mpg ~ horsepower,
    data = Auto)

## parsnip model object
## 
## Fit time:  7ms 
## 
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
## 
## Coefficients:
## (Intercept)   horsepower  
##     39.9359      -0.1578

`Fit Model`

Fit the model:

library(ISLR)
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit

Does this give the same results as the following?

lm(mpg ~ horsepower, data = Auto)
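
One way to check is to compare the coefficients directly. This is a sketch, assuming the ISLR package is installed and that the underlying lm object is stored in the `fit` element of the parsnip model object; the names parsnip_coefs and base_coefs are illustrative.

```r
library(tidymodels)
library(ISLR)

lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_fit <- fit(lm_spec, mpg ~ horsepower, data = Auto)

# The underlying lm object sits in the `fit` element of the parsnip model
parsnip_coefs <- coef(lm_fit$fit)
base_coefs    <- coef(lm(mpg ~ horsepower, data = Auto))

# TRUE when both approaches estimate the same intercept and slope
all.equal(parsnip_coefs, base_coefs)
```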

Get predictions

lm_fit %>%
  predict(new_data = Auto)

Uses the predict() function
‼️ new_data has an underscore
😄 This automagically creates a data frame
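
The same call works for rows the model has never seen. A small sketch, assuming ISLR is installed; the new_cars data and its horsepower values are made up for illustration.

```r
library(tidymodels)
library(ISLR)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ horsepower, data = Auto)

# Hypothetical new observations; only the predictor column is needed
new_cars <- tibble(horsepower = c(100, 150))

# Returns a tibble with one row per row of new_data and a single .pred column
preds <- predict(lm_fit, new_data = new_cars)
preds
```

Because the result always has one row per input row, it can be safely bound back onto the data it was predicted from.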

Get predictions

lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)

## # A tibble: 392 x 10
##    .pred   mpg cylinders displacement horsepower weight acceleration  year
##  * <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1 19.4     18         8          307        130   3504         12      70
##  2 13.9     15         8          350        165   3693         11.5    70
##  3 16.3     18         8          318        150   3436         11      70
##  4 16.3     16         8          304        150   3433         12      70
##  5 17.8     17         8          302        140   3449         10.5    70
##  6  8.68    15         8          429        198   4341         10      70
##  7  5.21    14         8          454        220   4354          9      70
##  8  6.00    14         8          440        215   4312          8.5    70
##  9  4.42    14         8          455        225   4425         10      70
## 10  9.95    15         8          390        190   3850          8.5    70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>

`Get predictions`

Edit the code below to add the original data to the predicted data.

mpg_pred <- lm_fit %>% 
  predict(new_data = Auto) %>% 
  ________

Get predictions

mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)
mpg_pred

## # A tibble: 392 x 10
##    .pred   mpg cylinders displacement horsepower weight acceleration  year
##  * <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1 19.4     18         8          307        130   3504         12      70
##  2 13.9     15         8          350        165   3693         11.5    70
##  3 16.3     18         8          318        150   3436         11      70
##  4 16.3     16         8          304        150   3433         12      70
##  5 17.8     17         8          302        140   3449         10.5    70
##  6  8.68    15         8          429        198   4341         10      70
##  7  5.21    14         8          454        220   4354          9      70
##  8  6.00    14         8          440        215   4312          8.5    70
##  9  4.42    14         8          455        225   4425         10      70
## 10  9.95    15         8          390        190   3850          8.5    70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>

Calculate the error

Root mean square error

mpg_pred %>%
  rmse(truth = mpg, estimate = .pred)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.89

What is this estimate? (training error? testing error?)
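
For reference, the value rmse() reports can be reproduced by hand as the square root of the mean squared difference between the observed and predicted values. A sketch, assuming ISLR is installed; manual_rmse and yardstick_rmse are illustrative names.

```r
library(tidymodels)
library(ISLR)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ horsepower, data = Auto)

mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)

# RMSE by hand: square root of the mean squared difference
manual_rmse <- sqrt(mean((mpg_pred$mpg - mpg_pred$.pred)^2))

# RMSE from yardstick, pulled out of the one-row tibble
yardstick_rmse <- mpg_pred %>%
  rmse(truth = mpg, estimate = .pred) %>%
  pull(.estimate)

c(manual = manual_rmse, yardstick = yardstick_rmse)
```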

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

## <Analysis/Assess/Total>
## <196/196/392>

Extract the training and testing data

training(Auto_split)
testing(Auto_split)
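
The split is random, so setting a seed beforehand makes it reproducible. A minimal sketch, assuming ISLR is installed; the seed value and the object names split_a, split_b, train_a and test_a are illustrative.

```r
library(tidymodels)
library(ISLR)

# Setting the seed first makes the random split reproducible
set.seed(100)
split_a <- initial_split(Auto, prop = 0.5)

set.seed(100)
split_b <- initial_split(Auto, prop = 0.5)

train_a <- training(split_a)
test_a  <- testing(split_a)

# Same seed, same rows in the training set
identical(train_a, training(split_b))

# Every row lands in exactly one of the two sets
nrow(train_a) + nrow(test_a) == nrow(Auto)
```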

Validation set approach

Auto_train <- training(Auto_split)

Auto_train

## # A tibble: 196 x 9
##      mpg cylinders displacement horsepower weight acceleration  year origin
##    <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl>
##  1    14         8          454        220   4354          9      70      1
##  2    15         8          383        170   3563         10      70      1
##  3    14         8          340        160   3609          8      70      1
##  4    14         8          455        225   3086         10      70      1
##  5    24         4          113         95   2372         15      70      3
##  6    18         6          199         97   2774         15.5    70      1
##  7    21         6          200         85   2587         16      70      1
##  8    25         4          110         87   2672         17.5    70      2
##  9    24         4          107         90   2430         14.5    70      2
## 10    25         4          104         95   2375         17.5    70      2
## # … with 186 more rows, and 1 more variable: name <fct>

`Validation Set`

Copy the code below and fill in the blanks to fit a model on the training data, then calculate the test RMSE.

set.seed(100)
Auto_split  <- ________
Auto_train  <- ________
Auto_test   <- ________
lm_fit      <- fit(lm_spec, 
                   mpg ~ horsepower, 
                   data = ________)
mpg_pred  <- ________ %>% 
  predict(new_data = ________) %>% 
  bind_cols(________)
rmse(________, truth = ________, estimate = ________)

A faster way!

You can use last_fit() and specify the split
This will automatically train the model on the training data from the split
Instead of naming the metric to calculate (with rmse() as before), you can use collect_metrics(), which returns the metrics calculated on the test data from the split

set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)
lm_fit %>%
  collect_metrics()

## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.87 
## 2 rsq     standard       0.625
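
To see that last_fit() is a shortcut for the step-by-step approach, the two can be run on the same split and compared. This is a sketch under the assumption that ISLR is installed; manual_rmse and auto_rmse are illustrative names.

```r
library(tidymodels)
library(ISLR)

lm_spec <- linear_reg() %>%
  set_engine("lm")

set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)

# Long way: fit on the training half, then predict and score on the testing half
manual_rmse <- fit(lm_spec, mpg ~ horsepower, data = training(Auto_split)) %>%
  predict(new_data = testing(Auto_split)) %>%
  bind_cols(testing(Auto_split)) %>%
  rmse(truth = mpg, estimate = .pred) %>%
  pull(.estimate)

# Short way: last_fit() does the same using the split object
auto_rmse <- last_fit(lm_spec, mpg ~ horsepower, split = Auto_split) %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)

all.equal(manual_rmse, auto_rmse)
```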

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv

## #  5-fold cross-validation 
## # A tibble: 5 x 2
##   splits           id   
##   <list>           <chr>
## 1 <split [313/79]> Fold1
## 2 <split [313/79]> Fold2
## 3 <split [314/78]> Fold3
## 4 <split [314/78]> Fold4
## 5 <split [314/78]> Fold5
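
Each row of this object holds one split, and the two numbers in brackets are the sizes of the part used for fitting and the part held out. A sketch of how to pull both parts out of a single fold, assuming ISLR is installed; the seed and the names first_split, fold_train and fold_test are illustrative.

```r
library(tidymodels)
library(ISLR)

set.seed(100)  # assumed seed, only so the folds are reproducible
Auto_cv <- vfold_cv(Auto, v = 5)

first_split <- Auto_cv$splits[[1]]

# Rows used to fit the model for this fold
fold_train <- analysis(first_split)
# Rows held out to evaluate the model for this fold
fold_test <- assessment(first_split)

c(analysis = nrow(fold_train), assessment = nrow(fold_test))
```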

What about cross validation?

fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)

## #  5-fold cross-validation 
## # A tibble: 5 x 4
##   splits           id    .metrics         .notes          
##   <list>           <chr> <list>           <list>          
## 1 <split [313/79]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
## 2 <split [313/79]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
## 3 <split [314/78]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
## 4 <split [314/78]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
## 5 <split [314/78]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>

What about cross validation?

How do we get the metrics out? With collect_metrics() again!

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)
results %>%
  collect_metrics()

## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.93      5  0.0779
## 2 rsq     standard   0.611     5  0.0277
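
The mean column is the average of the per-fold metrics. The sketch below reproduces the cross-validated RMSE with an explicit loop over the folds, assuming ISLR is installed; fold_rmse is a hypothetical helper written for this illustration, and the seed is arbitrary.

```r
library(tidymodels)
library(ISLR)

lm_spec <- linear_reg() %>%
  set_engine("lm")

set.seed(100)  # assumed seed, only so the folds are reproducible
Auto_cv <- vfold_cv(Auto, v = 5)

# Hypothetical helper: fit on one fold's analysis set,
# then return the RMSE on that fold's assessment set
fold_rmse <- function(split) {
  fit(lm_spec, mpg ~ horsepower, data = analysis(split)) %>%
    predict(new_data = assessment(split)) %>%
    bind_cols(assessment(split)) %>%
    rmse(truth = mpg, estimate = .pred) %>%
    pull(.estimate)
}

per_fold    <- map_dbl(Auto_cv$splits, fold_rmse)
manual_mean <- mean(per_fold)

cv_rmse <- fit_resamples(lm_spec, mpg ~ horsepower, resamples = Auto_cv) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

all.equal(manual_mean, cv_rmse$mean)
```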

`K-fold cross validation`

Edit the code below to get the 5-fold cross validation error rate for the following model:

$\text{mpg} = \beta_0 + \beta_1 \text{horsepower} + \beta_2 \text{horsepower}^2 + \epsilon$

Auto_cv <- vfold_cv(Auto, v = 5)
results <- fit_resamples(lm_spec,
                         ________,
                         resamples = ________)
results %>%
  collect_metrics()

What do you think rsq is?
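
A note on formula syntax that may help here: inside an R formula, `^` denotes term crossing rather than arithmetic, so a squared predictor has to be wrapped in I() or generated with poly(). The sketch below uses base lm() and assumes ISLR is installed; fit_I and fit_poly are illustrative names.

```r
library(ISLR)

# Two ways to write the same quadratic model in an R formula
fit_I    <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
fit_poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)

# poly() uses an orthogonal basis by default, so the coefficients differ,
# but the fitted values agree
all.equal(fitted(fit_I), fitted(fit_poly))
```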
