class: center, middle, inverse, title-slide # Multilevel Generalized Linear Models ### 04.11.22 --- class: middle, center ## [Click here for PDF of slides](25-multilevel-glms.pdf) --- ## Announcements - Quiz 04 open **Tue, Apr 12 - Thu, Apr 14** - Final project - optional draft due - **Fri, Apr 15** - final report due **Wed, Apr 27** - Please fill out course evaluations! - [Click here](https://sta310-sp22.github.io/slides/supplemental-notes/multilevel-model-questions.html) for answers to questions about multilevel models submitted on Quiz 03. Thanks to Jose for putting this document together! --- ## Learning goals - Exploratory data analysis for multilevel data with non-normal response variable - Use a two-stage modeling approach to understand conclusions at each level - Write Level One, Level Two and composite models for mutlilevel GLM - Fit and interpret multilevel GLM --- ## Data: College Basketball referees The dataset [`basketball0910.csv`](data/basketball0910.csv) contains data on 4972 fouls in 340 NCAA basketball games from the Big Ten, ACC, and Big East conferences during the 2009-2010 season. The goal is to determine whether the data from this season support a conclusion from [Anderson and Pierce (2009)](https://www.tandfonline.com/doi/abs/10.1080/02640410902729733) that referees tend to "even out" foul calls in a game. The variables we'll focus on are - **`foul.home`**: foul was called on home team (1: yes, 0: no) - **`foul.diff`**: difference in fouls before current foul was called (home - visitor) - **`game`**: Unique game ID number - **`visitor`**: visiting team abbreviation - **`home`**: home team abbreviation See [BMLR: Section 11.3.1](https://bookdown.org/roback/bookdown-BeyondMLR/ch-GLMM.html#explore-glmm) for full codebook. --- ## Data: College basketball referees <table> <thead> <tr> <th style="text-align:right;"> game </th> <th style="text-align:left;"> visitor </th> <th style="text-align:left;"> hometeam </th> <th style="text-align:right;"> foul.num </th> <th style="text-align:right;"> foul.home </th> <th style="text-align:right;"> foul.vis </th> <th style="text-align:right;"> foul.diff </th> <th style="text-align:left;"> foul.type </th> <th style="text-align:right;"> time </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Personal </td> <td style="text-align:right;"> 14.167 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> Personal </td> <td style="text-align:right;"> 11.433 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Personal </td> <td style="text-align:right;"> 10.233 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Personal </td> <td style="text-align:right;"> 9.733 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Shooting </td> <td style="text-align:right;"> 7.767 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> Shooting </td> <td style="text-align:right;"> 5.567 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:left;"> Shooting </td> <td style="text-align:right;"> 2.433 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IA </td> <td style="text-align:left;"> MN </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> Offensive </td> <td style="text-align:right;"> 1.000 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> MI </td> <td style="text-align:left;"> MIST </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Shooting </td> <td style="text-align:right;"> 18.983 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> MI </td> <td style="text-align:left;"> MIST </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> Personal </td> <td style="text-align:right;"> 17.200 </td> </tr> </tbody> </table> --- ## Modeling approach - `foul.home` is a binary response variable (1: foul called on home team, 0: foul called on visiting team) ✅ Use a generalized linear model, specifically one with the logit link function - Data has a multilevel structure - Covariates at individual foul level and game level ✅ Use a multilevel model We will combine these and fit a **multilevel generalized linear model** with a logit link function --- class: middle, inverse ## Exploratory data analysis --- ## Exploratory data analysis **Univariate** - Visualizations and summary statistics for Level One and Level Two variables **Bivariate** - Segmented bar plots or mosaic plots for response vs. categorical predictors - Conditional density plot for response vs. quantitative predictors - Empirical logit plot for quantitative predictors - Is relationship reasonably linear? --- class: middle ## Complete *Univariate EDA* in `lecture-25.Rmd`
05
:
00
--- class: middle ## Complete *Bivariate EDA* in `lecture-25.Rmd`
04
:
00
--- ## Logistic regression Start with a logistic regression model treating all observations as independent Quick analysis to get initial intuition about research question sand important variables, not be used for final conclusions <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.103 </td> <td style="text-align:right;"> 0.101 </td> <td style="text-align:right;"> -1.023 </td> <td style="text-align:right;"> 0.306 </td> </tr> <tr> <td style="text-align:left;"> foul.diff </td> <td style="text-align:right;"> -0.077 </td> <td style="text-align:right;"> 0.025 </td> <td style="text-align:right;"> -3.078 </td> <td style="text-align:right;"> 0.002 </td> </tr> <tr> <td style="text-align:left;"> score.diff </td> <td style="text-align:right;"> 0.020 </td> <td style="text-align:right;"> 0.007 </td> <td style="text-align:right;"> 3.012 </td> <td style="text-align:right;"> 0.003 </td> </tr> <tr> <td style="text-align:left;"> lead.home </td> <td style="text-align:right;"> -0.093 </td> <td style="text-align:right;"> 0.160 </td> <td style="text-align:right;"> -0.582 </td> <td style="text-align:right;"> 0.560 </td> </tr> <tr> <td style="text-align:left;"> time </td> <td style="text-align:right;"> -0.013 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> -1.653 </td> <td style="text-align:right;"> 0.098 </td> </tr> <tr> <td style="text-align:left;"> foul.diff:time </td> <td style="text-align:right;"> -0.007 </td> <td style="text-align:right;"> 0.003 </td> <td style="text-align:right;"> -2.485 </td> <td style="text-align:right;"> 0.013 </td> </tr> <tr> <td style="text-align:left;"> lead.home:time </td> <td style="text-align:right;"> 0.022 </td> <td style="text-align:right;"> 0.011 </td> <td style="text-align:right;"> 1.957 </td> <td style="text-align:right;"> 0.050 </td> </tr> </tbody> </table> --- class: middle, inverse ## Two-stage modeling --- ## Two-stage modeling For now, let's fit a model focusing on the question: **Do the data provide evidence that referees tend to "even out" fouls?** To explore this, we will fit a model with the Level One predictor `foul.diff` using a two-stage modeling approach. -- **Two-stage modeling approach** - Fit a separate model of the association between `foul.diff` and `foul.home` for each game (Level One models) - Fit models to explain game-to-game variability in the estimated slopes and intercepts (Level Two models) --- ## Model set up - `\(Y_{ij}\)`: 1 if `\(j^{th}\)` foul in Game `\(i\)` was called on home team; 0 otherwise. - `\(p_{ij}\)`: True probability the `\(j^{th}\)` foul in Game `\(i\)` was called on home team `$$Y_{ij} \sim \text{Bernoulli}(p_{ij})$$` --- ## Model Set Up **Level One models** `$$\log\Big(\frac{p_{ij}}{1 - p_{ij}}\Big) = a_i + b_i ~ \text{foul.diff}_{ij}$$` **Level Two models** `$$a_i = \alpha_0 + u_i\hspace{20mm} b_i = \beta_0 + v_i$$` `$$\left[ \begin{array}{c} u_i \\ v_i \end{array} \right] \sim N \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right], \left[ \begin{array}{cc} \sigma_u^{2} & \sigma_{uv} \\ \sigma_{uv} & \sigma_{v}^{2} \end{array} \right] \right)$$` --- ## Level One Models .pull-left[ <img src="data:image/png;base64,#25-multilevel-glms_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" /><table> <thead> <tr> <th style="text-align:right;"> alpha0 </th> <th style="text-align:right;"> sigma2_u </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -0.301 </td> <td style="text-align:right;"> 128.858 </td> </tr> </tbody> </table> ] .pull-right[ <img src="data:image/png;base64,#25-multilevel-glms_files/figure-html/unnamed-chunk-8-1.png" width="70%" style="display: block; margin: auto;" /><table> <thead> <tr> <th style="text-align:right;"> beta0 </th> <th style="text-align:right;"> sigma2_v </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.328 </td> <td style="text-align:right;"> 60.049 </td> </tr> </tbody> </table> ] --- ## Number of fouls per game <img src="data:image/png;base64,#25-multilevel-glms_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" /> -- ⚠️ In the two-stage approach, games with 3 fouls are treated with equal weight as games with 20 + fouls --- class: middle, inverse ## Unified multilevel model --- ## Composite model **Composite model** .eq[ `$$\log\Big(\frac{p_{ij}}{1 - p_{ij}}\Big) = \alpha_0 + \beta_0 ~ \text{foul.diff}_{ij} + [u_i + v_i ~ \text{foul.diff}_{ij}]$$` `$$\left[ \begin{array}{c} u_i \\ v_i \end{array} \right] \sim N \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right], \left[ \begin{array}{cc} \sigma_u^{2} & \sigma_{uv} \\ \sigma_{uv} & \sigma_{v}^{2} \end{array} \right] \right)$$` ] --- ## Fit the model in R Use the `glmer` function in the **lme4** package to fit multilevel GLMs. ```r model1 <- glmer(foul.home ~ foul.diff + (foul.diff|game), data = basketball, family = binomial) ``` ``` ## boundary (singular) fit: see help('isSingular') ``` <table> <thead> <tr> <th style="text-align:left;"> effect </th> <th style="text-align:left;"> group </th> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> fixed </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.157 </td> <td style="text-align:right;"> 0.046 </td> <td style="text-align:right;"> -3.382 </td> <td style="text-align:right;"> 0.001 </td> </tr> <tr> <td style="text-align:left;"> fixed </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> foul.diff </td> <td style="text-align:right;"> -0.285 </td> <td style="text-align:right;"> 0.038 </td> <td style="text-align:right;"> -7.440 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> sd__(Intercept) </td> <td style="text-align:right;"> 0.542 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> cor__(Intercept).foul.diff </td> <td style="text-align:right;"> -1.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> sd__foul.diff </td> <td style="text-align:right;"> 0.035 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> --- ## Boundary constraints - The estimates of the parameters `\(\alpha_0, \beta_0, \sigma_u, \sigma_v, \rho_{uv}\)` are those that maximize the likelihood of observing the data - The fixed effects, e.g., `\(\alpha_0\)` and `\(\beta_0\)`, can take any values, but the terms associated with the error terms are constrained to a set of "allowable" values `$$\sigma_u \geq 0 \hspace{10mm} \sigma_v \geq 0 \hspace{10mm} -1 \leq \rho_{uv} \leq 1$$` -- - Because of these boundaries, a "constrained" search is used to find the MLEs. -- - The error message `"## boundary (singular) fit"`, means the estimate of one or more terms was set at the maximum (or minimum) allowable value, not the value it would've been if an unconstrained search were allowable --- ## Illustrating boundary constraints .pull-left[ Contour plots from a hypothetical likelihood `\(L(\beta_0, \sigma^2)\)` <br> <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#25-multilevel-glms_files/figure-html/boundary-1.png" alt="From BMLR Figure 10.14" width="100%" /> <p class="caption">From BMLR Figure 10.14</p> </div> ] -- .pull-right[ - In the unconstrained search, the likelihood `\(L(\beta_0, \sigma^2)\)` is maximized at `\(\hat{\beta}_0 = 4, \hat{\sigma}^2 = -2\)` - In reality `\(\sigma^2\)` must be non-negative, so the search for the MLE is restricted to the region such that `\(\sigma^2 \geq 0\)`. - The constrained likelihood is maximized at `\(\hat{\beta}_0 = 4, \hat{\sigma}^2 = 0\)` ] --- ## Addressing boundary constraints Address boundary constraints by reparameterizing the model - Remove variance and/or correlation terms estimated at the boundary - Reduce the number of parameters to be estimated by fixing the values of some parameters - Apply transformation to covariates, such as centering variables, standardizing variables, or changing units. Extreme values, outliers, or highly correlated covariates can sometimes cause issues with MLEs. --- ## Is it OK to use model that faces boundary constraints? Best to try to deal with boundary constraints, but you can sometimes leave the model as is if... - You're not interested in estimating parameters with boundary issues - Removing the parameters does not impact conclusions about the variables of interest --- **Original model** <table> <thead> <tr> <th style="text-align:left;"> effect </th> <th style="text-align:left;"> group </th> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> fixed </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.157 </td> <td style="text-align:right;"> 0.046 </td> <td style="text-align:right;"> -3.382 </td> <td style="text-align:right;"> 0.001 </td> </tr> <tr> <td style="text-align:left;"> fixed </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> foul.diff </td> <td style="text-align:right;"> -0.285 </td> <td style="text-align:right;"> 0.038 </td> <td style="text-align:right;"> -7.440 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> sd__(Intercept) </td> <td style="text-align:right;"> 0.542 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> cor__(Intercept).foul.diff </td> <td style="text-align:right;"> -1.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> ran_pars </td> <td style="text-align:left;"> game </td> <td style="text-align:left;"> sd__foul.diff </td> <td style="text-align:right;"> 0.035 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> .question[ 1. Which term(s) should we remove to try to address boundary constraint? 2. Refit the model with these terms removed. 3. Which model do you choose? Explain. ]
04
:
00
--- ## Interpret model coefficients Interpret the following coefficients in the selected model (if applicable): - `\(\hat{\alpha}_0\)` - `\(\hat{\beta}_0\)` - `\(\hat{\sigma}_u\)` - `\(\hat{\sigma}_v\)` - `\(\hat{\rho}_{uv}\)` --- ## Looking ahead So far we've only considered random effects within a nested structure, but sometimes we may want to consider random effects that aren't nested. - For example, we want to consider a random effect for the home team, but home team is not nested within game (since teams can be the home team for multiple games) - We will deal with this using **crossed random effects** --- ## Acknowledgements - [BMLR: Section 10.5 - Encountering boundary constraints](https://bookdown.org/roback/bookdown-BeyondMLR/ch-3level.html#sec:boundary) - [Chapter 11: Multilevel generalized linear models structure](https://bookdown.org/roback/bookdown-BeyondMLR/ch-GLMM.html) - Sections 11.1 - 11.4 - Anderson, Kyle J., and David A. Pierce. 2009. “Officiating Bias: The Effect of Foul Differential on Foul Calls in NCAA Basketball.” Journal of Sports Sciences 27 (7): 687–94. https://doi.org/10.1080/02640410902729733.