tidymodels

Dr. D’Agostino McGowan

tidymodels

library(tidymodels)
library(ISLR) # assumed source of the Auto data

lm_spec <- 
  linear_reg() %>% # Pick linear regression
  set_engine(engine = "lm") # Set the computational engine
lm_spec

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
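
Once the model is fit, it can help to look at what came out of it. The sketch below is one way to inspect the coefficients and get predictions from a parsnip fit. It assumes the Auto data come from the ISLR package, and `new_cars` is a hypothetical set of horsepower values made up for illustration.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

lm_spec <- linear_reg() %>%
  set_engine("lm")
lm_fit <- fit(lm_spec, mpg ~ horsepower, data = Auto)

# Coefficient table: one row for the intercept, one for horsepower
coefs <- tidy(lm_fit)
coefs

# Predicted mpg for a few hypothetical horsepower values
new_cars <- tibble(horsepower = c(75, 100, 150))
preds <- predict(lm_fit, new_data = new_cars)
preds
```

predict() on a parsnip fit returns a tibble with a `.pred` column, one row per row of `new_data`.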

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

## <Analysis/Assess/Total>
## <196/196/392>

Extract the training and testing data

training(Auto_split)
testing(Auto_split)

Validation set approach

Auto_train <- training(Auto_split)

Auto_train

## # A tibble: 196 x 9
##      mpg cylinders displacement horsepower weight acceleration  year origin
##    <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl>
##  1    18         8          307        130   3504         12      70      1
##  2    18         8          318        150   3436         11      70      1
##  3    16         8          304        150   3433         12      70      1
##  4    17         8          302        140   3449         10.5    70      1
##  5    15         8          429        198   4341         10      70      1
##  6    15         8          390        190   3850          8.5    70      1
##  7    14         8          340        160   3609          8      70      1
##  8    15         8          400        150   3761          9.5    70      1
##  9    24         4          113         95   2372         15      70      3
## 10    22         6          198         95   2833         15.5    70      1
## # … with 186 more rows, and 1 more variable: name <fct>
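
To finish the validation set approach by hand, one option is to fit on the training half and compute the RMSE on the testing half. This is a minimal sketch, not necessarily how the course did it: the seed is arbitrary, Auto is assumed to come from ISLR, and `lm_fit_train` and `test_rmse` are names introduced here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

set.seed(100) # arbitrary seed so the split is reproducible
Auto_split <- initial_split(Auto, prop = 0.5)
Auto_train <- training(Auto_split)
Auto_test  <- testing(Auto_split)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit on the training half only
lm_fit_train <- fit(lm_spec, mpg ~ horsepower, data = Auto_train)

# Predict on the held-out half and compare with the observed mpg
test_rmse <- lm_fit_train %>%
  predict(new_data = Auto_test) %>%
  bind_cols(Auto_test) %>%
  rmse(truth = mpg, estimate = .pred)
test_rmse
```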

A faster way!

You can use last_fit() and pass it the split
It will automatically fit the model on the training data from the split
Instead of computing a specific metric yourself (as we did before with rmse), you can just call collect_metrics() and it will calculate the default metrics (here rmse and rsq) on the test data from the split

set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)
lm_fit %>%
  collect_metrics()

## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.87 
## 2 rsq     standard       0.625
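
To see where those numbers come from, the test-set predictions can be pulled out with collect_predictions() and the RMSE recomputed by hand. This is a sketch under the same assumptions as before (Auto from ISLR, seed 100). The names `lm_last`, `test_preds`, `manual_rmse`, and `reported_rmse` are introduced here for illustration.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

lm_spec <- linear_reg() %>%
  set_engine("lm")

set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_last <- last_fit(lm_spec,
                    mpg ~ horsepower,
                    split = Auto_split)

# One row per test-set observation, with the prediction and the true mpg
test_preds <- collect_predictions(lm_last)

# RMSE recomputed from the predictions
manual_rmse <- rmse(test_preds, truth = mpg, estimate = .pred)

# RMSE as reported by collect_metrics()
reported_rmse <- collect_metrics(lm_last) %>%
  filter(.metric == "rmse")

all.equal(manual_rmse$.estimate, reported_rmse$.estimate)
```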

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv

## #  5-fold cross-validation 
## # A tibble: 5 x 2
##   splits           id   
##   <list>           <chr>
## 1 <split [313/79]> Fold1
## 2 <split [313/79]> Fold2
## 3 <split [314/78]> Fold3
## 4 <split [314/78]> Fold4
## 5 <split [314/78]> Fold5
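
Each row of that tibble holds one split. As a rough sketch, analysis() and assessment() pull out the two pieces of a fold, much like training() and testing() do for an initial split. The seed is arbitrary, and `first_fold`, `fold_train`, and `fold_test` are names made up here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

set.seed(1) # arbitrary seed so the folds are reproducible
Auto_cv <- vfold_cv(Auto, v = 5)

# Take the first fold and extract its two pieces
first_fold <- Auto_cv$splits[[1]]
fold_train <- analysis(first_fold)   # rows used to fit the model
fold_test  <- assessment(first_fold) # rows held out for evaluation

nrow(fold_train)
nrow(fold_test)

# Across the 5 folds, the assessment sets together cover every row once
assess_sizes <- map_int(Auto_cv$splits, function(s) nrow(assessment(s)))
sum(assess_sizes)
```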

Dr. Lucy D'Agostino McGowan  adapted from Alison Hill's Introduction to ML with the Tidyverse

What if we wanted to do some preprocessing

For the shrinkage methods we discussed it was important to scale the variables

What does this mean?

What would happen if we scale before doing cross-validation? Will we get different answers?

What if we wanted to do some preprocessing

# Scale horsepower once, using the full data set
Auto_scaled <- Auto %>%
  mutate(horsepower = scale(horsepower))
sd(Auto_scaled$horsepower)

## [1] 1

# Standard deviation of horsepower within each fold's analysis set
Auto_cv_scaled <- vfold_cv(Auto_scaled, v = 5)
map_dbl(Auto_cv_scaled$splits,
        function(x) {
          dat <- as.data.frame(x)$horsepower
          sd(dat)
        })

## [1] 1.0115202 1.0025849 0.9834936 0.9733806 1.0293404
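
The standard deviations above are not exactly 1 because the scaling used all of the data, not just the rows in each fold's analysis set. A minimal sketch of the alternative is to estimate the standard deviation inside each fold and apply it to both pieces of that fold. Note that this version only divides by the standard deviation (no centering), the seed is arbitrary, and `fold_sds` is a name introduced here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

set.seed(1) # arbitrary seed
Auto_cv <- vfold_cv(Auto, v = 5)

fold_sds <- map_dbl(Auto_cv$splits, function(split) {
  train <- analysis(split)
  test  <- assessment(split)

  # Estimate the scaling factor from the analysis set only
  train_sd <- sd(train$horsepower)

  # Apply the same factor to both pieces of the fold
  train_scaled <- train$horsepower / train_sd
  test_scaled  <- test$horsepower / train_sd # scaled with the analysis-set SD

  # SD of the scaled analysis data
  sd(train_scaled)
})
fold_sds
```

Scaled this way, every analysis set has a standard deviation of 1 (up to floating point error), and the assessment rows never influence the scaling.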

What if we wanted to do some preprocessing

recipe()!
Using the recipe() function along with step_*() functions, we can specify preprocessing steps, and R will automagically apply them to each fold appropriately: each step is estimated on the fold's analysis set and then applied to its assessment set.

rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)

You can find all of the potential preprocessing steps here: https://tidymodels.github.io/recipes/reference/index.html
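
To peek at what a recipe does on a single fold, it can be prepped and baked by hand. This is only a sketch of the idea: prep() estimates the steps on the data it is given, and bake() applies them to new data. The seed is arbitrary, and `first_fold`, `rec_prepped`, `train_baked`, and `test_baked` are names made up for illustration.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

set.seed(1) # arbitrary seed
Auto_cv <- vfold_cv(Auto, v = 5)
first_fold <- Auto_cv$splits[[1]]

rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)

# Estimate the scaling factor from this fold's analysis set only
rec_prepped <- prep(rec, training = analysis(first_fold))

# Apply the estimated step to the analysis and assessment sets
train_baked <- bake(rec_prepped, new_data = NULL)
test_baked  <- bake(rec_prepped, new_data = assessment(first_fold))

sd(train_baked$horsepower) # 1 up to floating point error
sd(test_baked$horsepower)  # close to 1, but generally not exactly 1
```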

Where do we plug in this recipe?

The recipe gets plugged into the fit_resamples() function

Auto_cv <- vfold_cv(Auto, v = 5)
rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)
results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
results %>%
  collect_metrics()

## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.88      5  0.317 
## 2 rsq     standard   0.613     5  0.0249
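
Another way to plug in the recipe, not shown in these slides, is to bundle the model and recipe into a workflow first and hand that to fit_resamples(). A sketch, assuming an arbitrary seed and Auto from ISLR. `wf`, `results_wf`, and `wf_metrics` are names introduced here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

lm_spec <- linear_reg() %>%
  set_engine("lm")

set.seed(1) # arbitrary seed
Auto_cv <- vfold_cv(Auto, v = 5)
rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower)

# Bundle the preprocessing and the model into one object
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

# fit_resamples() accepts the workflow in place of spec + preprocessor
results_wf <- fit_resamples(wf, resamples = Auto_cv)
wf_metrics <- collect_metrics(results_wf)
wf_metrics
```

Keeping the two pieces together can be convenient when the same recipe and model are reused across several resampling schemes.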

What if we want to predict mpg with more variables

We still want to add a step to scale the predictors
We could either write out every predictor individually to scale them
OR we could use the all_predictors() shorthand.

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())
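
As a quick check that the two options are equivalent, the sketch below writes the recipe both ways and compares the processed data. It assumes Auto comes from ISLR. `rec_explicit`, `rec_short`, and the `baked_*` objects are names made up here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

# Option 1: name every predictor
rec_explicit <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(horsepower, displacement, weight)

# Option 2: use the all_predictors() selector
rec_short <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())

# Estimate each recipe on Auto and return the processed data
baked_explicit <- rec_explicit %>% prep() %>% bake(new_data = NULL)
baked_short    <- rec_short %>% prep() %>% bake(new_data = NULL)

all.equal(as.data.frame(baked_explicit), as.data.frame(baked_short))
```

The selector version has the advantage that it keeps working if predictors are added to or removed from the formula.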

Putting it together

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())
results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
results %>%
  collect_metrics()

## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.22      5  0.272 
## 2 rsq     standard   0.709     5  0.0153
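
The table above averages over the 5 folds. To look at the fold-by-fold values behind those averages, collect_metrics() can be called with summarize = FALSE. A minimal sketch, with an arbitrary seed and Auto assumed to come from ISLR. `fold_metrics`, `fold_rmse`, and `summary_rmse` are names introduced here.

```r
library(tidymodels)
library(ISLR) # assumption: Auto is taken from the ISLR package

lm_spec <- linear_reg() %>%
  set_engine("lm")

set.seed(1) # arbitrary seed
Auto_cv <- vfold_cv(Auto, v = 5)
rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>%
  step_scale(all_predictors())
results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

# One row per fold and metric instead of one row per metric
fold_metrics <- collect_metrics(results, summarize = FALSE)
fold_rmse <- fold_metrics %>%
  filter(.metric == "rmse")
fold_rmse

# The summarized RMSE is the mean of the per-fold RMSEs
summary_rmse <- collect_metrics(results) %>%
  filter(.metric == "rmse")
all.equal(mean(fold_rmse$.estimate), summary_rmse$mean)
```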
