class: center, middle, inverse, title-slide

# Ridge Regression

### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## Linear Regression Review

.question[
In linear regression, what are we minimizing? How can I write this in matrix form?
]

--

* RSS!

`$$(\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)$$`

--

.question[
What is the solution ( `\(\hat\beta\)` ) to this?
]

--

`$$(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$`

---

## Linear Regression Review

.question[
What is `\(\mathbf{X}\)`?
]

--

* the design matrix!

---

## Linear Regression Review

.question[
How did we get `\((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)`?
]

$$
`\begin{align}
-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\
2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\
\mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}`
$$

---

## Linear Regression Review

.alert[
Let's try to find an `\(\mathbf{X}\)` for which it would be impossible to calculate `\(\hat\beta\)`
]

---

## Calculating in R

y | x
---|---
4 | 1
3 | 2
1 | 5
3 | 1
5 | 5

---

## Creating a vector in R

.pull-left[
y | x
---|---
4 | 1
3 | 2
1 | 5
3 | 1
5 | 5
]

.pull-right[
```r
y <- c(4, 3, 1, 3, 5)
```
]

---

## Creating a design matrix in R

.pull-left[
y | x
---|---
4 | 1
3 | 2
1 | 5
3 | 1
5 | 5
]

.pull-right[
.small[
```r
(X <- matrix(c(rep(1, 5), c(1, 2, 5, 1, 5)), ncol = 2))
```

```
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    2
## [3,]    1    5
## [4,]    1    1
## [5,]    1    5
```
]
]

---

## Taking a transpose in R

```r
t(X)
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    2    5    1    5
```

---

## Taking an inverse in R

```r
XTX <- t(X) %*% X
solve(XTX)
```

```
##            [,1]        [,2]
## [1,]  0.6666667 -0.16666667
## [2,] -0.1666667  0.05952381
```

---

## Put it all together

```r
solve(t(X) %*% X) %*% t(X) %*% y
```

```
##            [,1]
## [1,]  3.5000000
## [2,] -0.1071429
```

---
class: inverse

## <i class="fas fa-laptop"></i> `Application Exercise`

* In R, find a design matrix `X` where it is not possible to calculate `\(\hat\beta\)`

```r
solve(t(X) %*% X) %*% t(X) %*% y
```
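---

## Double-checking with `lm()`

As a quick sanity check (a sketch reusing the small `x` and `y` from the previous slides), the matrix arithmetic should reproduce the coefficients that `lm()` estimates:

```r
x <- c(1, 2, 5, 1, 5)
y <- c(4, 3, 1, 3, 5)
X <- cbind(1, x)  # design matrix: intercept column plus x

solve(t(X) %*% X) %*% t(X) %*% y  # closed-form solution
coef(lm(y ~ x))                   # lm() agrees: 3.5000000, -0.1071429
```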
---

## Estimating `\(\hat\beta\)`

`\(\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)`

.question[
Under what circumstances is this equation not estimable?
]

--

* when we can't invert `\(\mathbf{X^TX}\)`

--

* `\(p > n\)`
* multicollinearity

--

.alert[
A guaranteed way to check whether a square matrix is not invertible is to check whether the **determinant** is equal to zero
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}$$`

.question[
What is `\(n\)` here? What is `\(p\)`?
]

--

.question[
Is `\(\mathbf{X^TX}\)` going to be invertible?
]

--

```r
X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
```

```
## [1] 0
```

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

--

.question[
Is `\(\mathbf{X^TX}\)` going to be invertible?
]

--

.small[
```r
X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
```

```
## [1] 0
```

```r
cor(X[, 2], X[, 3])
```

```
## [1] 1
```
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

.question[
What was the problem this time?
]

.small[
```r
X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
```

```
## [1] 0
```

```r
cor(X[, 2], X[, 3])
```

```
## [1] 1
```
]

---

## Estimating `\(\hat\beta\)`

.question[
What is a sure-fire way to tell whether `\(\mathbf{X^TX}\)` will be invertible?
]

--

* Take the determinant!

--

`\(|\mathbf{A}|\)` means the determinant of matrix `\(\mathbf{A}\)`

--

* For a 2x2 matrix:

`$$\mathbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}$$`

`$$|\mathbf{A}| = ad - bc$$`

---

## Estimating `\(\hat\beta\)`

.question[
What is a sure-fire way to tell whether `\(\mathbf{X^TX}\)` will be invertible?
]

* Take the determinant!

`\(|\mathbf{A}|\)` means the determinant of matrix `\(\mathbf{A}\)`

* For a 3x3 matrix:

`$$\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}$$`

`$$|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)$$`

---

## Determinants

_It looks funky, but it follows a nice pattern!_

`$$\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}$$`

`$$|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)$$`

--

* (1) multiply `\(a\)` by the determinant of the portion of the matrix that is **not** in `\(a\)`'s row or column
* do the same for `\(b\)` (2) and `\(c\)` (3)
* put it together as **plus** (1) **minus** (2) **plus** (3)

--

`$$|\mathbf{A}| = a \left|\begin{matrix}e&f\\h&i\end{matrix}\right|-b\left|\begin{matrix}d&f\\g&i\end{matrix}\right|+c\left|\begin{matrix}d&e\\g&h\end{matrix}\right|$$`

---
class: inverse

## <i class="fas fa-laptop"></i> `Application Exercise`

* Calculate the determinant of the following matrices in R using the `det()` function:

`$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$$`

`$$\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7\end{bmatrix}$$`

* Are these both invertible?
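---

## Determinants in R

Here is the 3x3 pattern checked against `det()` (a sketch; `M` is a matrix made up for illustration, and negative indexing like `M[-1, -2]` drops a row and a column):

```r
M <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)

# expand along the first row: a(ei - fh) - b(di - fg) + c(dh - eg)
by_hand <- M[1, 1] * det(M[-1, -1]) -
           M[1, 2] * det(M[-1, -2]) +
           M[1, 3] * det(M[-1, -3])

by_hand  # 8
det(M)   # 8, the same answer
```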
---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

--

.question[
Is `\(\mathbf{X^TX}\)` going to be invertible?
]

--

.small[
```r
X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
```

```
## [1] 0.0056
```

```r
cor(X[, 2], X[, 3])
```

```
## [1] 0.999993
```
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

.question[
Is `\(\mathbf{X^TX}\)` going to be invertible?
]

.small[
```r
y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
```

```
##             [,1]
## [1,]    1.285714
## [2,] -114.285714
## [3,]   57.285714
```
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

.question[
Is `\(\mathbf{X^TX}\)` going to be invertible?
]

`$$\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}1.29\\-114.29\\57.29\end{bmatrix}$$`

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

.question[
What is the equation for the variance of `\(\hat\beta\)`?
]

`$$var(\hat\beta) = \hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$`

--

* `\(\hat\sigma^2 = \frac{RSS}{n-(p+1)}\)`

--

`$$var(\hat\beta) = \begin{bmatrix} \mathbf{0.91835} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$$`

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

`$$var(\hat\beta) = \begin{bmatrix} \mathbf{0.91835} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$$`

.question[
What is the variance for `\(\hat\beta_0\)`?
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

`$$var(\hat\beta) = \begin{bmatrix} \color{blue}{\mathbf{0.91835}} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$$`

.question[
What is the variance for `\(\hat\beta_0\)`?
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

`$$var(\hat\beta) = \begin{bmatrix} \mathbf{0.91835} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$$`

.question[
What is the variance for `\(\hat\beta_1\)`?
]

---

## Estimating `\(\hat\beta\)`

`$$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$$`

`$$var(\hat\beta) = \begin{bmatrix} \mathbf{0.91835} & -24.489 & 12.132\\ -24.489 & \color{blue}{\mathbf{4081.571}} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$$`

.question[
What is the variance for `\(\hat\beta_1\)`? 😱
]

---

## What's the problem?

* Sometimes we can't solve for `\(\hat\beta\)`

.question[
Why?
]

---

## What's the problem?

* Sometimes we can't solve for `\(\hat\beta\)`
* `\(\mathbf{X}^T\mathbf{X}\)` is not invertible

--

* We have more variables than observations ( `\(p > n\)` )
* The variables are linear combinations of one another

--

* Even when we **can** invert `\(\mathbf{X}^T\mathbf{X}\)`, things can go wrong

--

* The variance can blow up, like we just saw!
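---

## Checking the variance in R

A quick sketch verifying the blow-up, reusing the `X` and `y` defined on the previous slides (`var_beta` is just a name for this check):

```r
X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
rss      <- sum((y - X %*% beta_hat)^2)
sigma2   <- rss / (nrow(X) - ncol(X))   # RSS / (n - (p + 1))
var_beta <- sigma2 * solve(t(X) %*% X)  # the matrix from the previous slides
diag(var_beta)                          # approximately 0.92, 4081.57, 1018.37
```

---
class: center, middle

## What can we do about this?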
---

## Ridge Regression

* What if we add an additional _penalty_ to keep the `\(\hat\beta\)` coefficients small (this will keep the variance from blowing up!)

--

* Instead of minimizing `\(RSS\)`, like we do with linear regression, let's minimize `\(RSS\)` PLUS some penalty function

--

`$$RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}$$`

--

.question[
What happens when `\(\lambda=0\)`? What happens as `\(\lambda\rightarrow\infty\)`?
]

---

## Ridge Regression

.question[
Let's solve for the `\(\hat\beta\)` coefficients using Ridge Regression. What are we minimizing?
]

--

`$$(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta$$`

---
class: inverse

## <i class="fas fa-edit"></i> `Try it!`

* Find `\(\hat\beta\)` that minimizes this:

`$$(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta$$`
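---

## Ridge Regression

Mirroring the linear regression derivation from earlier (a sketch of the algebra): take the derivative with respect to `\(\beta\)`, set it to zero, and solve

$$
`\begin{align}
-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta+2\lambda\hat\beta &= 0\\
(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})\hat\beta &= \mathbf{X}^T\mathbf{y}\\
\hat\beta &= (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}`
$$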
--- ## Ridge Regression `$$\hat\beta_{ridge} = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$` -- * Not only does this help with the variance, it solves our problem when `\(\mathbf{X}^{T}\mathbf{X}\)` isn't invertible! --- ## Choosing `\(\lambda\)` * `\(\lambda\)` is known as a **tuning parameter** and is selected using **cross validation** * For example, choose the `\(\lambda\)` that results in the smallest estimated test error --- ## Bias-variance tradeoff .question[ How do you think ridge regression fits into the bias-variance tradeoff? ] -- * As `\(\lambda\)` ☝️, bias ☝️, variance 👇 -- * Bias( `\(\hat\beta_{ridge}\)` ) = `\(-\lambda(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\beta\)` -- .question[ What would this be if `\(\lambda\)` was 0? ] -- * Var( `\(\hat\beta_{ridge}\)` ) = `\(\sigma^2(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\)` -- .question[ Is this bigger or smaller than `\(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)`? What is this when `\(\lambda = 0\)`? As `\(\lambda\rightarrow\infty\)` does this go up or down? ] --- ## Ridge Regression * **IMPORTANT**: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation) -- .question[ Why? ]
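---

## Ridge Regression in R

A minimal sketch, reusing the nearly collinear `X` and `y` from earlier: adding `\(\lambda\mathbf{I}\)` before inverting tames the coefficients that blew up into the hundreds.

```r
X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)
lambda <- 0.1

# ridge solution: (X^T X + lambda I)^{-1} X^T y
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
```

The same line of code also works when `\(\mathbf{X}^T\mathbf{X}\)` is exactly singular, since `\(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\)` is always invertible for `\(\lambda > 0\)`.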
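---

## Variance as `\(\lambda\)` grows

A numerical check of the variance formula (a sketch; `ridge_var()` is a throwaway helper, and `\(\sigma^2\)` is set to 1 for illustration):

```r
ridge_var <- function(X, lambda) {
  XtX <- t(X) %*% X
  inv <- solve(XtX + lambda * diag(ncol(X)))
  diag(inv %*% XtX %*% inv)  # var(beta_ridge) with sigma^2 = 1
}

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)

# each coefficient's variance shrinks as lambda increases
rbind(ridge_var(X, 0.01), ridge_var(X, 0.1), ridge_var(X, 1))
```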
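---

## Choosing `\(\lambda\)` with glmnet

A sketch of cross-validation in practice, assuming the **glmnet** package is installed; `alpha = 0` requests the ridge penalty, and `glmnet` standardizes the predictors by default:

```r
library(glmnet)

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)
X  <- cbind(x1, x2)

cv_fit <- cv.glmnet(X, y, alpha = 0)  # 10-fold CV over a grid of lambda values
cv_fit$lambda.min                     # lambda with the smallest estimated test error
coef(cv_fit, s = "lambda.min")        # ridge coefficients at that lambda
```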