Due: Tuesday 2020-10-20 at 5:00pm
In this lab we are going to practice cross validation. A few reminders:
In this lab we will work with two packages: tidyverse
which is a collection of packages for doing data analysis in a “tidy” way and tidymodels
for statistical modeling.
Now that the necessary packages are installed, you should be able to Knit your document and see the results.
If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.
Note that the packages are also loaded with the same commands in your R Markdown document.
For this lab, we are using a data frame currently in music.csv
. This data frame includes 72 predictors that are components of audio files and one outcome, lat
, the latitude of where the music originated. We are trying to predict the location of the music’s origin using audio components of the music.
You can download this dataset from Canvas, then save it in the same directory as your .Rmd file and R project.
7
. Split the music
data into a training and test set with 50% of the data in the training and 50% in the testing. Call your training set music_train
and your testing set music_test
. Describe these data sets (how many observations, how many variables).We are interested in predicting the latitude (lat
) of the music’s origin from all other variables. Fit a linear model using least squares on the training set. Report the test root mean squared error obtained.
Fit a ridge regression model on the training set with \(\lambda\) chosen using 10-fold cross validation. Report the \(\lambda\) chosen and explain why. Report the test root mean squared error obtained using testing portion of the initially split data frame.
Fit a lasso model on the training set with \(\lambda\) chosen using cross validation. Report the \(\lambda\) chosen and explain why. Report the test root mean squared error obtained using testing portion of the initially split data frame.
Fit an elastic net model on the training set with \(\lambda\) and \(\alpha\) chosen using cross validation. Report the \(\lambda\) chosen and explain why. Report the test root mean squared error obtained using testing portion of the initially split data frame.
Comment on the results obtained. How well can we predict the latitude of where the music originated? Is there much difference among the test errors resulting from these four approaches?