class: center, middle, inverse, title-slide # Linear Regression in R ### Dr. D’Agostino McGowan --- layout: true <div class="my-footer"> <span> Dr. Lucy D'Agostino McGowan </span> </div> --- ## Let's look at an example Let's look at a sample of 116 sparrows from Kent Island. We are interested in the relationship between `Weight` and `Wing Length` ![](08-linear-regression_files/figure-html/unnamed-chunk-3-1.png)<!-- --> * the **standard error** of `\(\hat{\beta_1}\)` ( `\(SE_{\hat{\beta}_1}\)` ) is how much we expect the sample slope to vary from one random sample to another. --- ## Sparrows .question[ How can we quantify how much we'd expect the slope to differ from one random sample to another? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% * tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] --- ## Sparrows .question[ How do we interpret this? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] --- ## Sparrows .question[ How do we interpret this? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] * "the sample slope is more than 13 standard errors above a slope of zero" --- ## Sparrows .question[ How do we know what values of this statistic are worth paying attention to? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] --- ## Sparrows .question[ How do we know what values of this statistic are worth paying attention to? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] * confidence intervals * p-values --- ## Sparrows .question[ How do we know what values of this statistic are worth paying attention to? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% * tidy(conf.int = TRUE) ``` ``` ## # A tibble: 2 x 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 -0.531 3.26 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 0.399 0.536 ``` ] * confidence intervals * p-values --- class: inverse ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise` 1. Fit a linear model using the `mtcars` data frame predicting miles per gallon (`mpg`) from (`wt`) 2. Pull out the coefficients and confidence intervals using the `tidy()` function demonstrated. How do you interpret these?
04
:
00
--- ## Sparrows .question[ How are these statistics distributed under the null hypothesis? ] .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] --- ## Sparrows ![](08-linear-regression_files/figure-html/unnamed-chunk-14-1.png)<!-- --> * I've generated some data under a null hypothesis where `\(n = 20\)` --- ## Sparrows ![](08-linear-regression_files/figure-html/unnamed-chunk-15-1.png)<!-- --> * this is a **t-distribution** with **n-p-1** degrees of freedom. --- ## Sparrows The distribution of test statistics we would expect given the **null hypothesis is true**, `\(\beta_1 = 0\)`, is **t-distribution** with **n-2** degrees of freedom. ![](08-linear-regression_files/figure-html/unnamed-chunk-16-1.png)<!-- --> --- ## Sparrows ![](08-linear-regression_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- ## Sparrows .question[ How can we compare this line to the distribution under the null? ] ![](08-linear-regression_files/figure-html/unnamed-chunk-19-1.png)<!-- --> -- * p-value --- class: middle # p-value The probability of getting a statistic as extreme or more extreme than the observed test statistic **given the null hypothesis is true** --- ## Sparrows ![](08-linear-regression_files/figure-html/unnamed-chunk-20-1.png)<!-- --> .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` ] --- ## Return to generated data, n = 20 ![](08-linear-regression_files/figure-html/unnamed-chunk-23-1.png)<!-- --> * Let's say we get a statistic of **1.5** in a sample --- ## Let's do it in R! The proportion of area less than 1.5 ![](08-linear-regression_files/figure-html/unnamed-chunk-24-1.png)<!-- --> ```r pt(1.5, df = 18) ``` ``` ## [1] 0.9245248 ``` --- ## Let's do it in R! The proportion of area **greater** than 1.5 ![](08-linear-regression_files/figure-html/unnamed-chunk-26-1.png)<!-- --> ```r pt(1.5, df = 18, lower.tail = FALSE) ``` ``` ## [1] 0.07547523 ``` --- ## Let's do it in R! The proportion of area **greater** than 1.5 **or** **less** than -1.5. ![](08-linear-regression_files/figure-html/unnamed-chunk-28-1.png)<!-- --> -- ```r pt(1.5, df = 18, lower.tail = FALSE) * 2 ``` ``` ## [1] 0.1509505 ``` --- class: middle # p-value The probability of getting a statistic as extreme or more extreme than the observed test statistic **given the null hypothesis is true** --- ## Hypothesis test * **null hypothesis** `\(H_0: \beta_1 = 0\)` * **alternative hypothesis** `\(H_A: \beta_1 \ne 0\)` -- * **p-value**: 0.15 -- * Often, we have an `\(\alpha\)`-level cutoff to compare this to, for example **0.05**. Since this is greater than **0.05**, we **fail to reject the null hypothesis** --- class: inverse ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise` Using the linear model you fit previously (`mpg` from `wt` using the `mtcars` data) - calculate the p-value for the coefficient for weight? Interpret this value. What is the null hypothesis? What is the alternative hypothesis? Do you reject the null?
04
:
00
--- class: middle # confidence intervals If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter ( `\(\beta_1\)` ) to fall within the interval estimates 95% of the time. --- # Confidence interval .center[ `\(\Huge \hat\beta_1 \pm t^∗ \times SE_{\hat\beta_1}\)` ] --- # Confidence interval .center[ `\(\Huge \hat\beta_1 \pm t^∗ \times SE_{\hat\beta_1}\)` ] * `\(t^*\)` is the critical value for the `\(t_{n−p-1}\)` density curve to obtain the desired confidence level -- * Often we want a **95% confidence level**. --- ## Let's do it in R! .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy(conf.int = TRUE) ``` ``` ## # A tibble: 2 x 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 -0.531 3.26 *## 2 WingLength 0.467 0.0347 13.5 2.62e-25 0.399 0.536 ``` ] * `\(t^* = t_{n-p-1} = t_{114} = 1.98\)` -- * `\(LB = 0.47 - 1.98\times 0.0347 = 0.399\)` * `\(UB = 0.47+1.98 \times 0.0347 = 0.536\)` --- class: middle # confidence intervals If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter ( `\(\beta_1\)` ) to fall within the interval estimates 95% of the time. --- ## Linear Regression Questions * ✔️ Is there a relationship between a response variable and predictors? * ✔️ How strong is the relationship? * ✔️ What is the uncertainty? * How accurately can we predict a future outcome? --- ## Sparrows .question[ Using the information here, how could I predict a new sparrow's weight if I knew the wing length was 30? ] ```r linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.37 0.957 1.43 1.56e- 1 ## 2 WingLength 0.467 0.0347 13.5 2.62e-25 ``` -- * `\(1.37 + 0.467 \times 30 = 15.38\)` --- ## Linear Regression Accuracy .question[ What is the residual sum of squares again? ] * Note: In previous classes, this may have been referred to as SSE (sum of squares error), the book uses RSS, so we will stick with that! -- `$$RSS = \sum(y_i - \hat{y}_i)^2$$` -- * The **total sum of squares** represents the variability of the outcome, it is equivalent to the variability described by the **model** plus the remaining **residual sum of squares** `$$TSS = \sum(y_i - \bar{y})^2$$` --- ## Linear Regression Accuracy * There are many ways "model fit" can be assessed. Two common ones are: * Residual Standard Error (RSE) * `\(R^2\)` - the fraction of the variance explained -- * `\(RSE = \sqrt{\frac{1}{n-p-1}RSS}\)` -- * `\(R^2 = 1 - \frac{RSS}{TSS}\)` --- ## Linear Regression Accuracy .question[ What could we use to determine whether at least one predictor is useful? ] -- `$$F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}\sim F_{p,n-p-1}$$` We can use a F-statistic! --- ## Let's do it in R! .small[ ```r *lm_fit <- linear_reg() %>% set_engine("lm") %>% fit(Weight ~ WingLength, data = Sparrows) *glance(lm_fit$fit) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.614 0.611 1.40 181. 2.62e-25 1 -203. 411. 419. ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` ] --- class: inverse ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise` Using the model previously fit (using the `mtcars` data predicting miles per gallon from weight), pull out the F-statistic and `\(R^2\)` using the `glance()` function. Interpret these values.
04
:
00
--- ## Additional Linear Regression Topics * Polynomial terms * Interactions * Outliers * Non-constant variance of error terms * High leverage points * Collinearity _Refer to Chapter 3 for more details on these topics if you need a refresher._