class: center, middle, inverse, title-slide

# tidymodels
### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan <i> adapted from Alison Hill's Introduction to ML with the Tidyverse</i>
</span>
</div>

---
## tidymodels

.pull-left[
![](img/02/tidymodels.png)
]

.pull-right[

.center[
[tidymodels.org](https://www.tidymodels.org/)
]

- tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
- All packages share an underlying philosophy and a common grammar.
]

---
## Step 1: Specify the model

* Pick the **model**

--

* Set the **engine**

---
## Specify the model

```r
linear_reg() %>%
  set_engine("lm")
```

---
## Specify the model

```r
linear_reg() %>%
  set_engine("glmnet")
```

---
## Specify the model

```r
linear_reg() %>%
  set_engine("spark")
```

---
## Specify the model

```r
decision_tree() %>%
  set_engine("rpart")
```

---
## Specify the model

* All available models: https://www.tidymodels.org

---
class: inverse

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Specify Model`

Write a pipe that creates a model that uses `lm()` to fit a linear regression using tidymodels. Save it as `lm_spec` and look at the object. What does it return?

_Hint: you'll need https://www.tidymodels.org_
---

```r
lm_spec <-
  linear_reg() %>%            # Pick linear regression
  set_engine(engine = "lm")   # set engine
lm_spec
```

```
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
```

---
## Fit the data

* You can train your model using the `fit()` function

```r
fit(lm_spec,
    mpg ~ horsepower,
    data = Auto)
```

```
## parsnip model object
##
## Fit time: 7ms
##
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
##
## Coefficients:
## (Intercept)   horsepower
##     39.9359      -0.1578
```

---
class: inverse

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Fit Model`

Fit the model:

```r
library(ISLR)
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit
```

Does this give the same results as

```r
lm(mpg ~ horsepower, data = Auto)
```
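---
## Fit the data

One way to check the question on the previous slide is to compare the coefficients from the two fits. This is a minimal sketch, not the only way to do it. It assumes the `tidymodels` and `ISLR` packages are installed and that the underlying `lm` object is stored in the `fit` element of the parsnip model; the name `lm_direct` is only for illustration.

```r
library(tidymodels)
library(ISLR)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit once through parsnip and once directly with lm()
lm_fit    <- fit(lm_spec, mpg ~ horsepower, data = Auto)
lm_direct <- lm(mpg ~ horsepower, data = Auto)  # illustrative name

# Compare the estimated coefficients from the two fits
coef(lm_fit$fit)   # assumes the engine fit lives in `$fit`
coef(lm_direct)
all.equal(coef(lm_fit$fit), coef(lm_direct))
```

If the two sets of coefficients agree, parsnip is handing the same formula and data to `lm()` and only wrapping the result.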
---
## Get predictions

```r
lm_fit %>%
  predict(new_data = Auto)
```

--

* Uses the `predict()` function

--

* ‼️ `new_data` has an underscore

--

* 😄 This automagically creates a data frame

---
## Get predictions

```r
lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)
```

```
## # A tibble: 392 x 10
##    .pred   mpg cylinders displacement horsepower weight acceleration  year
##  * <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1 19.4     18         8          307        130   3504         12      70
##  2 13.9     15         8          350        165   3693         11.5    70
##  3 16.3     18         8          318        150   3436         11      70
##  4 16.3     16         8          304        150   3433         12      70
##  5 17.8     17         8          302        140   3449         10.5    70
##  6  8.68    15         8          429        198   4341         10      70
##  7  5.21    14         8          454        220   4354          9      70
##  8  6.00    14         8          440        215   4312          8.5    70
##  9  4.42    14         8          455        225   4425         10      70
## 10  9.95    15         8          390        190   3850          8.5    70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>
```

---
class: inverse
## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Get predictions`

Edit the code below to add the original data to the predicted data.

```r
mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  ---
```

---
## Get predictions

```r
mpg_pred <- lm_fit %>%
  predict(new_data = Auto) %>%
  bind_cols(Auto)
mpg_pred
```

```
## # A tibble: 392 x 10
##    .pred   mpg cylinders displacement horsepower weight acceleration  year
##  * <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1 19.4     18         8          307        130   3504         12      70
##  2 13.9     15         8          350        165   3693         11.5    70
##  3 16.3     18         8          318        150   3436         11      70
##  4 16.3     16         8          304        150   3433         12      70
##  5 17.8     17         8          302        140   3449         10.5    70
##  6  8.68    15         8          429        198   4341         10      70
##  7  5.21    14         8          454        220   4354          9      70
##  8  6.00    14         8          440        215   4312          8.5    70
##  9  4.42    14         8          455        225   4425         10      70
## 10  9.95    15         8          390        190   3850          8.5    70
## # … with 382 more rows, and 2 more variables: origin <dbl>, name <fct>
```

---
## Calculate the error

* Root mean square error

```r
mpg_pred %>%
  rmse(truth = mpg, estimate = .pred)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.89
```

--

.question[
What is this estimate? (training error? testing error?)
]

---
## Validation set approach

```r
Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
```

```
## <Analysis/Assess/Total>
## <196/196/392>
```

--

* Extract the training and testing data

```r
training(Auto_split)
testing(Auto_split)
```

---
## Validation set approach

```r
Auto_train <- training(Auto_split)
```

```r
Auto_train
```

.small[
```
## # A tibble: 196 x 9
##      mpg cylinders displacement horsepower weight acceleration  year origin
##    <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl>
##  1    14         8          454        220   4354          9      70      1
##  2    15         8          383        170   3563         10      70      1
##  3    14         8          340        160   3609          8      70      1
##  4    14         8          455        225   3086         10      70      1
##  5    24         4          113         95   2372         15      70      3
##  6    18         6          199         97   2774         15.5    70      1
##  7    21         6          200         85   2587         16      70      1
##  8    25         4          110         87   2672         17.5    70      2
##  9    24         4          107         90   2430         14.5    70      2
## 10    25         4          104         95   2375         17.5    70      2
## # … with 186 more rows, and 1 more variable: name <fct>
```
]

---
class: inverse
## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Validation Set`

Copy the code below and fill in the blanks to fit a model on the **training** data, then calculate the **test** RMSE.

```r
set.seed(100)
Auto_split <- ________
Auto_train <- ________
Auto_test  <- ________
lm_fit     <- fit(lm_spec,
                  mpg ~ horsepower,
                  data = ________)
mpg_pred <- ________ %>%
  predict(new_data = ________) %>%
  bind_cols(________)
rmse(________, truth = ________, estimate = ________)
```

---
## A faster way!

* You can use `last_fit()` and specify the split
* This will automatically fit the model on the training data from the split
* Instead of specifying which metric to calculate (with `rmse` as before), you can just use `collect_metrics()` and it will automatically calculate the metrics on the test data from the split

```r
set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec, mpg ~ horsepower,
*                  split = Auto_split)
lm_fit %>%
* collect_metrics()
```

```
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.87
## 2 rsq     standard       0.625
```

---
## What about cross validation?

```r
Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
```

```
## # 5-fold cross-validation
## # A tibble: 5 x 2
##   splits           id
##   <list>           <chr>
## 1 <split [313/79]> Fold1
## 2 <split [313/79]> Fold2
## 3 <split [314/78]> Fold3
## 4 <split [314/78]> Fold4
## 5 <split [314/78]> Fold5
```

---
## What about cross validation?

--

```r
*fit_resamples(lm_spec, mpg ~ horsepower,
*              resamples = Auto_cv)
```

---
## What about cross validation?

```r
fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)
```

```
## # 5-fold cross-validation
## # A tibble: 5 x 4
##   splits           id    .metrics         .notes
##   <list>           <chr> <list>           <list>
## 1 <split [313/79]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
## 2 <split [313/79]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
## 3 <split [314/78]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
## 4 <split [314/78]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
## 5 <split [314/78]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
```

---
## What about cross validation?

.question[
How do we get the metrics out? With `collect_metrics()` again!
]

--

```r
results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)
results %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 rmse    standard   4.93      5  0.0779
## 2 rsq     standard   0.611     5  0.0277
```

---
class: inverse
## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `K-fold cross validation`

Edit the code below to get the 5-fold cross validation error rate for the following model:

`\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2 + \epsilon\)`

```r
Auto_cv <- vfold_cv(Auto, v = 5)
results <- fit_resamples(lm_spec,
                         ----,
                         resamples = ---)
results %>%
  collect_metrics()
```

* What do you think `rsq` is?

---
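## What about cross validation?

To see where the `mean` and `std_err` columns of `collect_metrics()` come from, it can help to look at the per-fold metrics. The sketch below uses the simpler `mpg ~ horsepower` model from earlier. It assumes the installed version of tune accepts `collect_metrics(summarize = FALSE)`, and it assumes `std_err` is the standard deviation of the fold estimates divided by the square root of the number of folds; `fold_metrics` and `by_hand` are illustrative names.

```r
library(tidymodels)
library(ISLR)

set.seed(100)
lm_spec <- linear_reg() %>%
  set_engine("lm")

Auto_cv <- vfold_cv(Auto, v = 5)
results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)

# One row per fold and metric, instead of the averaged summary
fold_metrics <- results %>%
  collect_metrics(summarize = FALSE)

# Recompute the summary by hand (std_err formula is an assumption)
by_hand <- fold_metrics %>%
  group_by(.metric) %>%
  summarise(mean    = mean(.estimate),
            n       = n(),
            std_err = sd(.estimate) / sqrt(n))
by_hand
```

Comparing `by_hand` with `results %>% collect_metrics()` shows whether the summary is just an average over the five folds.

---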