Due: Thursday 2020-09-22 at 5pm
02-lab-cv
In this lab we will work with three packages: ISLR, which is a package that accompanies your textbook; tidyverse, which is a collection of packages for doing data analysis in a “tidy” way; and tidymodels, a collection of packages for statistical modeling.
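You can load all three at the top of your R Markdown file; a minimal sketch:

```r
# Load the packages used in this lab
library(ISLR)        # contains the Auto data set that accompanies the textbook
library(tidyverse)   # "tidy" data manipulation and plotting
library(tidymodels)  # tidy statistical modeling (rsample, parsnip, tune, ...)
```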
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human-friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
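For example, a YAML header for this lab might look like the following (the title shown here is a placeholder; yours may differ):

```yaml
---
title: "Lab 02 - Cross Validation"   # placeholder title
author: "Your Name"                  # change this to your name
output: html_document
---
```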
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
For this lab, we are using the Auto data from the ISLR package.
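For a first look at the data, you might run something like:

```r
# Preview the variables in Auto and open its documentation
glimpse(Auto)
?Auto
```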
Explain how k-fold cross-validation is implemented.

What are the advantages / disadvantages of k-fold cross-validation compared to the validation set approach? What are the advantages / disadvantages of k-fold cross-validation compared to leave-one-out cross-validation?
We are trying to decide between three models of varying flexibility:
Using the Auto data, split the data into two groups: a training data set, saved as Auto_train, and a testing data set, saved as Auto_test. Be sure to set a seed to ensure that you get the same result each time you knit your document.

You can use the poly() function to fit a model with a polynomial term. For example, to fit the model \(y = \beta_0 + \beta_1 \texttt{x} + \beta_2 \texttt{x}^2 + \beta_3 \texttt{x}^3 + \epsilon\), you would run fit(lm_spec, y ~ poly(x, 3), data = data).
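A minimal sketch of the split using rsample (loaded with tidymodels), assuming the default 3/4 training proportion; the seed value is an arbitrary choice:

```r
set.seed(42)  # any fixed seed works; 42 is arbitrary

# Split Auto into training and testing sets (3/4 training by default)
auto_split <- initial_split(Auto)
Auto_train <- training(auto_split)
Auto_test  <- testing(auto_split)
```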
Fit the three models outlined above on the training data. Using the models created on the training data, predict mpg in the test data set for each model. What is the test RMSE for each of the three models? Which model would you choose?
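As a sketch for one of the models, assuming the models are polynomials in horsepower (lm_spec here is a parsnip linear regression specification):

```r
# Specify an ordinary least squares linear model
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit a degree-1 polynomial on the training data
fit_1 <- fit(lm_spec, mpg ~ poly(horsepower, 1), data = Auto_train)

# augment() adds a .pred column; rmse() compares it to the observed mpg
augment(fit_1, new_data = Auto_test) %>%
  rmse(truth = mpg, estimate = .pred)
```

Repeat the fit and RMSE calculation for the other two models, changing only the polynomial degree.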
Fit the same three models, but instead of the validation set approach, perform 5-fold cross-validation. Make sure to set a seed so you get the same answer each time you run the analysis. Calculate the overall 5-fold cross-validation error for each of the three models. Which model would you choose?
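One way to set this up with tidymodels, sketched for a single model (the seed and object names are arbitrary choices):

```r
set.seed(100)  # arbitrary seed so the folds are reproducible

# Create five folds from the full Auto data
auto_folds <- vfold_cv(Auto, v = 5)

# Bundle the model specification and formula into a workflow,
# then fit it on each fold and average the metrics
wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(mpg ~ poly(horsepower, 2))

fit_resamples(wf, resamples = auto_folds) %>%
  collect_metrics()
```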
The tidymodels package allows you to do this faster! Instead of having to fit 3 (or more!) different models to determine the best flexibility, you can (1) create a recipe to specify how you would like to fit a model and then (2) tune this model to determine the best output. Copy the code below. What do you think the line step_poly(horsepower, degree = tune()) does? Hint: you can run ?step_poly in the Console to learn more about this function.
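A sketch of a recipe and workflow along these lines, assuming the names auto_rec and auto_wf, that mpg is modeled on horsepower, and that lm_spec is the linear regression specification from earlier:

```r
# Recipe: model mpg on horsepower, with the polynomial degree left
# as a tuning parameter via tune()
auto_rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_poly(horsepower, degree = tune())

# Bundle the recipe and model specification into a workflow
auto_wf <- workflow() %>%
  add_recipe(auto_rec) %>%
  add_model(lm_spec)
```

Note that fit_resamples() cannot fit a workflow that still contains tune() placeholders, which is why the next exercise switches to tune_grid().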
Replace fit_resamples with tune_grid. The pseudo-code to do this is below; you may need to update some names to match what you have named objects in your document. Add the code to tune your model based on the code below.
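A sketch of such pseudo-code, assuming the workflow and folds are named auto_wf and auto_folds and trying degrees 1 through 10 (an arbitrary range):

```r
# Fit the workflow on every fold for each candidate degree
auto_tune <- tune_grid(
  auto_wf,
  resamples = auto_folds,
  grid = tibble(degree = 1:10)  # one row per candidate polynomial degree
)
```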
Using the collect_metrics function, look at the RMSE for auto_tune. Which degree is preferable?
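Assuming auto_tune is the tune_grid() result from the previous exercise, the call is simply:

```r
# One row per degree/metric combination, averaged across the five folds
collect_metrics(auto_tune)
```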
You can plot the output from Exercise 8 to make the best degree a bit easier to determine. First, save your output from Exercise 8 as auto_metrics. Then filter this data frame to only include rows where .metric == "rmse". Save this filtered data frame as auto_rmse. Edit the code below to plot the degree on the x-axis and mean on the y-axis. Describe what this plot shows.
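A sketch of the plotting steps, assuming auto_tune from the earlier tuning exercise:

```r
auto_metrics <- collect_metrics(auto_tune)

# Keep only the RMSE rows (the other rows hold R-squared by default)
auto_rmse <- auto_metrics %>%
  filter(.metric == "rmse")

# Cross-validated RMSE (mean across folds) against polynomial degree
ggplot(auto_rmse, aes(x = degree, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Polynomial degree", y = "Cross-validated RMSE")
```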