Logistic regression

# Logistic regression
### 02.17.22

---

## [Click here for PDF of slides](12-logistic.pdf)

---

## Announcements

- Fill out [mini-project 01 team evaluation](https://forms.office.com/r/MuLAg4y7dY) by **TODAY at 11:59pm**

- Quiz 02 on Gradescope **due Fri, Feb 18 at 11:59pm**

---

## Learning goals

- Identify Bernoulli and binomial random variables

- Write GLM for binomial response variable

- Interpret the coefficients for a logistic regression model

- Visualizations for logistic regression

---

## Basics of logistic regression

---

## Bernoulli + Binomial random variables

Logistic regression is used to analyze data with two types of responses:

- **Binary**: These responses take on two values success `$(Y = 1)$` or failure `$(Y = 0)$`, yes `$(Y = 1)$` or no `$(Y = 0)$`, etc.

`$$P(Y = y) = p^y(1-p)^{1-y} \hspace{10mm} y = 0, 1$$`

- **Binomial**: Number of successes in a Bernoulli process, `$n$` independent trials with a constant probability of success `$p$`.

`$$P(Y = y) = {n \choose y}p^{y}(1-p)^{n - y} \hspace{10mm} y = 0, 1, \ldots, n$$`

In both instances, the goal is to model `$p$` the probability of success.

---

## Binary vs. Binomial data

1. .midi[Use median age and unemployment rate in a county to predict the percent of Obama votes in the county in the 2008 presidential election.]

2. .midi[Use GPA and MCAT scores to estimate the probability a student is accepted into medical school.]

3. .midi[Use sex, age, and smoking history to estimate the probability an individual has lung cancer.]

4. .midi[Use offensive and defensive statistics from the 2017-2018 NBA season to predict a team's winning percentage.]

[Click here](https://forms.gle/URGd5ycAxwXhJKQZ9) to submit your responses.

]

---

## Logistic regression model

- The response variable, `$\log\Big(\frac{p}{1-p}\Big)$`, is the log(odds) of success, i.e. the logit

- Use the model to calculate the probability of success `$$\hat{p} = \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p}}{1 + e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p}}$$`

- When the response is a Bernoulli random variable, the probabilities can be used to classify each observation as a success or failure

---

## Logistic vs linear regression model

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#12-logistic_files/figure-html/OLSlogistic-1.png" alt="Graph from BMLR Chapter 6" width="60%" />
<p class="caption">Graph from BMLR Chapter 6</p>
</div>

---

## Logit link

Bernoulli and Binomial random variables can be written in one-parameter exponential family form, `$f(y;\theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}$`

**Bernoulli**

`$$f(y;p) = e^{y\log(\frac{p}{1-p}) + \log(1-p)}$$`

**Binomial**

`$$f(y;n,p) = e^{y\log(\frac{p}{1-p}) + n\log(1-p) + \log{n \choose y}}$$`

They have the same canonical link `$b(p) = \log\big(\frac{p}{1-p}\big)$`

---

## Assumptions for logistic regression

The following assumptions need to be satisfied to use logistic regression to make inferences

1️⃣ `$\hspace{0.5mm}$` **Binary response**: The response is dichotomous (has two possible outcomes) or is the sum of dichotomous responses

2️⃣ `$\hspace{0.5mm}$` **Independence**: The observations must be independent of one another

3️⃣ `$\hspace{0.5mm}$` **Variance structure**: Variance of a binomial random variable is `$np(1-p)$` `$(n = 1 \text{ for Bernoulli})$` , so the variability  is highest when `$p = 0.5$`

4️⃣ `$\hspace{0.5mm}$`  **Linearity**: The log of the odds ratio, `$\log\big(\frac{p}{1-p}\big)$`, must be a linear function of the predictors `$x_1, \ldots, x_p$`

---

## COVID-19 infection prevention practices at food establishments

Researchers at Wollo Univeristy in Ethiopia conducted a study in July and August 2020 to understand factors associated with good COVID-19 infection prevention practices at food establishments. Their study is published in [Andualem et al. (2022)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259851#ack)

They were particularly interested in the understanding implementation of prevention practices at food establishments, given the workers' increased risk due to daily contact with customers.

.footnote[.small[Andualem, A., Tegegne, B., Ademe, S., Natnael, T., Berihun, G., Abebe, M., ... & Adane, M. (2022). COVID-19 infection prevention practices among a sample of food handlers of food and drink establishments in Ethiopia. PloS one, 17(1), e0259851.]]

---

## The data

*"An institution-based cross-sectional study was conducted among <b>422 food handlers in Dessie City and Kombolcha Town food and drink establishments in July and August 2020.</b> The study participants were selected using a <b>simple random sampling</b> technique. Data were collected by trained data collectors using a pretested structured questionnaire and an on-the-spot observational checklist."*

<br>

---

## Response variable

*"The outcome variable of this study was the <b>good or poor practices of COVID-19 infection prevention among food handlers</b>. Nine yes/no questions, one observational checklist and five multiple choice infection prevention practices questions were asked with a minimum score of 1 and maximum score of 25. Good infection prevention practice (the variable of interest) was determined for food handlers who scored 75% or above, whereas poor infection prevention practices refers to those food handlers who scored below 75% on the practice questions."*

<br>

---

## Results

![](data:image/png;base64,#img/logistic-output.PNG)

---

## Interpreting the results

- Is the response a Bernoulli or Binomial?

- What is the strongest predictor of having good COVID-19 infection prevention practices?
  - It's often unreliable to look answer this question just based on the model output. Why are we able to answer this question based on the model output in this case? 
  
- Describe the effect (coefficient interpretation and inference) of having COVID-19 infection prevention policies available at the food establishment.

- The intercept describes what group of food handlers?

---

## Visualizations for logistic regression

---

## Access to personal protective equipment

We will use the data from [Andualem et al. (2022)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259851#ack) to explore the association between age, sex, years of service, and whether someone works at a food establishment with access to personal protective equipment (PPE) as of August 2020. We will use access to PPE as a proxy for wearing PPE.

| age|sex    | years| ppe_access|
|---:|:------|-----:|----------:|
|  34|Male   |     2|          1|
|  32|Female |     3|          1|
|  32|Female |     1|          1|
|  40|Male   |     4|          1|
|  32|Male   |    10|          1|

---

## Exploratory data analysis for binary response

```r
library(Stat2Data)
par(mfrow = c(1, 2))
emplogitplot1(ppe_access ~ age, data = covid_df, ngroups = 10)
emplogitplot1(ppe_access ~ years, data = covid_df, ngroups = 5)
```

---

## Exploratory data analysis for binary response

```r
library(viridis)
ggplot(data = covid_df, aes(x = sex, fill = factor(ppe_access))) + 
  geom_bar(position = "fill")  +
  labs(x = "Sex", 
       fill = "PPE Access", 
       title = "PPE Access by Sex") + 
  scale_fill_viridis_d()
```

<img src="data:image/png;base64,#12-logistic_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />
]

---

## Model results

```r
ppe_model <- glm(factor(ppe_access) ~ age + sex + years, data = covid_df, 
                 family = binomial)
tidy(ppe_model, conf.int = TRUE) %>%
  kable(digits = 3)
```

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |   -2.127|     0.458|    -4.641|   0.000|   -3.058|    -1.257|
|age         |    0.056|     0.017|     3.210|   0.001|    0.023|     0.091|
|sexMale     |    0.341|     0.224|     1.524|   0.128|   -0.098|     0.780|
|years       |    0.264|     0.066|     4.010|   0.000|    0.143|     0.401|

---

## Visualizing coefficient estimates

```r
model_coef <- tidy(ppe_model, exponentiate = TRUE, conf.int = TRUE)
```

```r
ggplot(data = model_coef, aes(x = term, y = estimate)) +
  geom_point() +
  geom_hline(yintercept = 1, lty = 2) + 
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
  labs(title = "Exponentiated model coefficients") + 
  coord_flip()
```

<img src="data:image/png;base64,#12-logistic_files/figure-html/unnamed-chunk-9-1.png" width="55%" style="display: block; margin: auto;" />
]

---

## Acknowledgements

These slides are based on content in

- [BMLR: Chapter 6 - Logistic Regression](https://bookdown.org/roback/bookdown-BeyondMLR/ch-glms.html)