class: center, middle, inverse, title-slide # Review of multiple linear regression ### Prof. Maria Tackett ### 01.10.22 --- class: middle, center ##[Click for PDF of slides](02-mlr-review-pt1.pdf) --- ## Announcements - Labs start Thursday - [Install R and configure GitHub](https://github.com/sta310-sp22/computing/blob/main/README.md) - Office hours this week: - Thu 2 - 3pm & Fri 1 - 2pm online (links in Sakai) - Full office hours schedule starts Tue, Jan 19 - Fill out [All About You Survey](https://duke.qualtrics.com/jfe/form/SV_1X1ryORVK6JJwkm) --- class: middle ## Questions from last time? --- class: middle, inverse ## Linear least squares regression (LLSR) vs. ## Generalized linear models (GLM) vs. ## Multilevel models --- ## Assumptions for linear regression -- **L**inearity: Linear relationship between mean response and predictor variable(s) -- **I**ndependence: Residuals are independent. There is no connection between how far any two points lie above or below regression line. -- **N**ormality: Response follows a normal distribution at each level of the predictor (or combination of predictors) -- **E**qual variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors) **Use residual plots to check that the conditions hold before using the model for statistical inference.** --- ## Assumptions for linear regression .pull-left[ <img src="02-mlr-review-pt1_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> .small[Modified from Figure 1.1. in BMLR] ] .pull-right[ .midi[**L**inearity: Linear relationship between mean of the response `\(Y\)` and the predictor `\(X\)`] .midi[**I**ndependence: No connection between how any two points lie above or below the regression line] .midi[**N**ormality: Response, `\(Y\)`, follows a normal distribution at each level of the predictor, `\(X\)` (indicated by red curves)] .midi[**E**qual variance: Variance (or standard deviation) of the response, `\(Y\)`, is equal for all levels of the predictor, `\(X\)`] ] ] --- ## Are the assumptions violated? [Click here](https://forms.gle/N3PftSk2tEJRTxCE6) for poll.
04
:
00
--- ## Beyond linear regression - When we use linear least squares regression to draw conclusions, we do so under the assumption that L.I.N.E. are all met. - **Generalized linear models** require different assumptions and can accommodate violations in L.I.N.E. - Relationship between response and predictor(s) can be nonlinear - Response variable can be non-normal - Variance in response can differ at each level of predictor(s) **But the independence assumption must hold!** - **Multilevel models** will be used for data with correlated observations --- class: middle, inverse ## Review of multiple linear regression --- ## Data: Kentucky Derby Winners .midi[ Today's data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file [derbyplus.csv](data/derbyplus.csv) and contains information for races 1896 - 2017. ] .pull-left[ .midi[**Response variable**] - .midi[`speed`: Average speed of the winner in feet per second (ft/s)] .midi[**Additional variable**] .midi[- `winner`: Winning horse] ] .pull-right[ .midi[**Predictor variables**] - .midi[`year`: Year of the race] - .midi[`condition`: Condition of the track (good, fast, slow)] - .midi[`starters`: Number of horses who raced] ] --- ## Data ```r derby <- read_csv("data/derbyplus.csv") ``` ```r derby %>% head(5) %>% kable() ``` | year|winner |condition | speed| starters| |----:|:-------------|:---------|-----:|--------:| | 1896|Ben Brush |good | 51.66| 8| | 1897|Typhoon II |slow | 49.81| 6| | 1898|Plaudit |good | 51.16| 4| | 1899|Manuel |fast | 50.00| 5| | 1900|Lieut. Gibson |fast | 52.28| 7| --- ![](img/data-analysis-life-cycle.png) --- ## Exploratory data analysis (EDA) - Once you're ready for the statistical analysis (explore), the first step should always be **exploratory data analysis**. - The EDA will help you - begin to understand the variables and observations - identify outliers or potential data entry errors - begin to see relationships between variables - identify the appropriate model and identify a strategy - The EDA is exploratory; formal modeling and statistical inference should be used to draw conclusions. --- ## Plots for univariate EDA .panelset.sideways[ .panel[.panel-name[Plot] <img src="02-mlr-review-pt1_files/figure-html/univar-eda-plot-1.png" width="90%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r p1 <- ggplot(data = derby, aes(x = speed)) + geom_histogram(fill = colors$green, color = "black") + labs(x = "Winning speed (ft/s)", y = "Count") p2 <- ggplot(data = derby, aes(x = starters)) + geom_histogram(fill = colors$green, color = "black") + labs(x = "Starters", y = "Count") p3 <- ggplot(data = derby, aes(x = condition)) + geom_bar(fill = colors$green, color = "black", aes(x = )) p1 + (p2 / p3) + plot_annotation(title = "Univariate data analysis") ``` ] ] ] --- ## Plots for bivariate EDA .panelset.sideways[ .panel[.panel-name[Plot] <img src="02-mlr-review-pt1_files/figure-html/bivar-eda-plot-1.png" width="90%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r p4 <- ggplot(data = derby, aes(x = starters, y = speed)) + geom_point() + labs(x = "Starters", y = "Speed (ft / s)") p5 <- ggplot(data = derby, aes(x = year, y = speed)) + geom_point() + labs(x = "Year", y = "Speed (ft / s)") p6 <- ggplot(data = derby, aes(x = condition, y = speed)) + geom_boxplot(fill = colors$green, color = "black") + labs(x = "Conditions", y = "Speed (ft / s)") (p4 + p5) + p6 + plot_annotation(title = "Bivariate data analysis") ``` ] ] ] --- ## Scatterplot matrix .midi[ A **scatterplot matrix** helps quickly visualize relationships between many variable pairs. They are particularly useful to identify potentially correlated predictors. ] .panelset.sideways[ .panel[.panel-name[Plot] <img src="02-mlr-review-pt1_files/figure-html/scatterplot-matrix-plot-1.png" width="90%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r #library(GGally) ggpairs(data = derby, columns = c("condition", "year", "starters", "speed")) ``` ] ] ] --- ## Plots for multivariate EDA Plot the relationship between the response and a predictor based on levels of another predictor to assess potential interactions. .panelset.sideways[ .panel[.panel-name[Plot] <img src="02-mlr-review-pt1_files/figure-html/multivar-eda-plot-1.png" width="90%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r #library(viridis) ggplot(data = derby, aes(x = year, y = speed, color = condition, shape = condition, linetype = condition)) + geom_point() + geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) + labs(x = "Year", y = "Speed (ft/s)", color = "Condition", title = "Speed vs. year", subtitle = "by track condition") + guides(lty = FALSE, shape = FALSE) + scale_color_viridis_d(end = 0.9) ``` ] ] ] --- ## Model 1: Main effects model .panelset.sideways[ .panel[.panel-name[Output] |term | estimate| std.error| statistic| p.value| |:-------------|--------:|---------:|---------:|-------:| |(Intercept) | 8.197| 4.508| 1.818| 0.072| |starters | -0.005| 0.017| -0.299| 0.766| |year | 0.023| 0.002| 9.766| 0.000| |conditiongood | -0.443| 0.231| -1.921| 0.057| |conditionslow | -1.543| 0.161| -9.616| 0.000| ] .panel[.panel-name[Code] .small[ ```r # Fit and display model model1 <- lm(speed ~ starters + year + condition, data = derby) tidy(model1) %>% kable(digits = 3) ``` ] ] ] --- ## Interpretation `$$\widehat{speed} = 8.197 - 0.005 ~ starters + 0.023 ~ year - 0.443 ~ good - 1.543 ~ slow$$` |term | estimate| std.error| statistic| p.value| |:-------------|--------:|---------:|---------:|-------:| |(Intercept) | 8.197| 4.508| 1.818| 0.072| |starters | -0.005| 0.017| -0.299| 0.766| |year | 0.023| 0.002| 9.766| 0.000| |conditiongood | -0.443| 0.231| -1.921| 0.057| |conditionslow | -1.543| 0.161| -9.616| 0.000| -- .question[ 1. Write out the interpretations for `starters` and `conditiongood`. 2. Does the intercept have a meaningful interpretation? ] --- ## Centering **Centering**: Subtract a constant from each observation of a given variable - Do this to make interpretation of model parameters more meaningful (particularly intercept) - In STA 210, we used **mean-centering** where we subtracted the mean from each observation of given variable - How does centering change the model? --- ## Centering `year` .midi[ ```r derby <- derby %>% mutate(yearnew = year - 1896) #1896 = starting year ``` ] |term | estimate| std.error| statistic| p.value| |:-------------|--------:|---------:|---------:|-------:| |(Intercept) | 52.175| 0.194| 269.079| 0.000| |starters | -0.005| 0.017| -0.299| 0.766| |yearnew | 0.023| 0.002| 9.766| 0.000| |conditiongood | -0.443| 0.231| -1.921| 0.057| |conditionslow | -1.543| 0.161| -9.616| 0.000| `$$\widehat{speed} = 52.175 - 0.005 ~ starters + 0.023 ~ yearnew - 0.443 ~ good - 1.543 ~ slow$$` --- ## Model 1: Check model assumptions .panelset.sideways[ .panel[.panel-name[Plots] ```r #library(ggfortify) autoplot(model1Cent) ``` <img src="02-mlr-review-pt1_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Poll] [Click here](https://forms.gle/iHfWR1igqpsM7Gr48) for poll. ] ] --- class: middle, inverse ## Model 2 --- ## Add quadratic effect for year? .panelset.sideways[ .panel[.panel-name[Plot] <img src="02-mlr-review-pt1_files/figure-html/year-quad-plot-1.png" width="90%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r ggplot(data = derby, aes(x = yearnew, y = speed)) + geom_point(alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, color = "blue") + geom_smooth(se = FALSE, color = "red", linetype = 2) + labs(x = "Years since 1896", y = "Speed (ft/s)", title = "Speed vs. Years since 1896") ``` ] ] ] --- ## Model 2: Add `\(yearnew^2\)` .panelset.sideways[ .panel[.panel-name[Output] |term | estimate| std.error| statistic| p.value| |:-------------|--------:|---------:|---------:|-------:| |(Intercept) | 51.4130| 0.1826| 281.5645| 0.0000| |starters | -0.0253| 0.0136| -1.8588| 0.0656| |yearnew | 0.0700| 0.0061| 11.4239| 0.0000| |I(yearnew^2) | -0.0004| 0.0000| -8.0411| 0.0000| |conditiongood | -0.4770| 0.1857| -2.5689| 0.0115| |conditionslow | -1.3927| 0.1305| -10.6701| 0.0000| ] .panel[.panel-name[Code] .small[ ```r model2 <- lm(speed ~ starters + yearnew + I(yearnew^2) + condition, data = derby) tidy(model2) %>% kable(digits = 4) ``` ] ] ] --- ## Interpreting quadratic effects `$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x_1 + \hat{\beta}_2 ~ x_2 + \hat{\beta}_3 ~ x_2^2$$` **General interpretation**: When `\(x_2\)` increases from a to b, `\(y\)` is expected to change by `\(\hat{\beta}_2(b - a) + \hat{\beta}_3(b^2 - a^2)\)`, holding `\(x_1\)` constant. <br> -- .question[ `$$\begin{aligned}\widehat{speed} = &51.413 - 0.025 ~ starters + 0.070 ~ yearnew \\ & - 0.0004 ~ yearnew^2 - 0.477 ~ good - 1.393 ~ slow\end{aligned}$$` Interpret the effect of year for the 5 most recent years (2013 - 2017). ] <!-- Ended here in lecture--> --- ## Model 2: Check model assumptions <img src="02-mlr-review-pt1_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" /> --- class: middle, inverse ## Model 3 --- ## Include interaction term? Recall from the EDA... <img src="02-mlr-review-pt1_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Model 3: Add interaction term `$$\begin{aligned}\widehat{speed} = & 52.387 - 0.003 ~ starters + 0.020 ~ yearnew - 1.070 ~ good - 2.183 ~ slow \\ &+0.012 ~ yearnew \times good + 0.012 ~ yearnew \times slow \end{aligned}$$` .panelset.sideways[ .panel[.panel-name[Output] |term | estimate| std.error| statistic| p.value| |:---------------------|--------:|---------:|---------:|-------:| |(Intercept) | 52.387| 0.200| 262.350| 0.000| |starters | -0.003| 0.016| -0.189| 0.850| |yearnew | 0.020| 0.003| 7.576| 0.000| |conditiongood | -1.070| 0.423| -2.527| 0.013| |conditionslow | -2.183| 0.270| -8.097| 0.000| |yearnew:conditiongood | 0.012| 0.008| 1.598| 0.113| |yearnew:conditionslow | 0.012| 0.004| 2.866| 0.005| ] .panel[.panel-name[Code] ```r model3 <- lm(speed ~ starters + yearnew + condition + yearnew * condition, data = derby) tidy(model3) %>% kable(digits = 4) ``` ] .panel[.panel-name[Assumptions] <img src="02-mlr-review-pt1_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" /> ] ] --- ## Interpreting interaction effects |term | estimate| std.error| statistic| p.value| |:---------------------|--------:|---------:|---------:|-------:| |(Intercept) | 52.387| 0.200| 262.350| 0.000| |starters | -0.003| 0.016| -0.189| 0.850| |yearnew | 0.020| 0.003| 7.576| 0.000| |conditiongood | -1.070| 0.423| -2.527| 0.013| |conditionslow | -2.183| 0.270| -8.097| 0.000| |yearnew:conditiongood | 0.012| 0.008| 1.598| 0.113| |yearnew:conditionslow | 0.012| 0.004| 2.866| 0.005| [Click here](https://forms.gle/BrufiFdhQAi4WBWeA) for poll
04
:
00
--- ## Which model would you choose? .panelset.sideways[ .panel[.panel-name[Output] .midi[**Model 1: Main effects**] | r.squared| adj.r.squared| AIC| BIC| |---------:|-------------:|-------:|-------:| | 0.73| 0.721| 259.478| 276.302| .midi[**Model 2: Main effects + `\(year^2\)`**] | r.squared| adj.r.squared| AIC| BIC| |---------:|-------------:|-------:|-------:| | 0.827| 0.819| 207.429| 227.057| .midi[**Model 3: Main effects + interaction]** | r.squared| adj.r.squared| AIC| BIC| |---------:|-------------:|-------:|-------:| | 0.751| 0.738| 253.584| 276.016| ] .panel[.panel-name[Code] ```r # Model 1 glance(model1Cent) %>% select(r.squared, adj.r.squared, AIC, BIC) %>% kable(digits = 3) # Model2 glance(model2) %>% select(r.squared, adj.r.squared, AIC, BIC) %>% kable(digits = 3) # Model 3 glance(model3) %>% select(r.squared, adj.r.squared, AIC, BIC) %>% kable(digits = 3) ``` ] ] --- ## Characteristics of a "good" final model - Model can be used to answer primary research questions - Predictor variables control for important covariates - Potential interactions have been investigated - Variables are centered, as needed, for more meaningful interpretations - unnecessary terms are removed - Assumptions are met and influential points have been addressed - model tells a "persuasive story parsimoniously" <br> .small[List from Section 1.6.7 of [BMLR](https://bookdown.org/roback/bookdown-BeyondMLR/)] --- ## Acknowledgements These slides are based on content in [BMLR: Chapter 1 - Review of Multiple Linear Regression](https://bookdown.org/roback/bookdown-BeyondMLR/ch-MLRreview.html)