Using Likelihoods

# Using Likelihoods
### Prof. Maria Tackett
### 01.19.22

---

##[Click for PDF of slides](04-likelihoods.pdf)

---

## Announcements

- [Homework 01](https://sta310-sp22.github.io/assignments/hw/hw-01.html) due Wednesday at 11:59pm

- Week 03 reading: [BMLR: Chapter 3 - Distribution Theory](https://bookdown.org/roback/bookdown-BeyondMLR/ch-distthry.html)

- See [syllabus](https://sta310-sp22.netlify.app/syllabus/#teaching-team--office-hours) for office hours schedule 
  - Office hours online this week
  
- Team lab tomorrow - introducing Mini-Project 01
  - Find a paper using a GLM in their analysis
  - Evaluate the analysis in the paper 
  - "Replicate" the analysis the same or similar data 
  - Present results in a presentation and short write up
  - More details in lab!
---

## In-person learning 😷

Attendance in lectures and labs is expected as long as you're healthy and not in quarantine

**Lectures**

- If you're unable to attend, you can watch the recording of the lecture on Panopto  (link in Sakai) 
- Ask questions on GitHub Discussions or in office hours

**Labs**

- Labs are not recorded
- On weeks with teamwork: If you are unable to attend lab but are able to participate remotely, work with your teammates to set up a Zoom call

---

## Class Q&A Forum: GitHub Discussions

- Class Q&A forum on [GitHub Discussions](https://github.com/sta310-sp22/discussion/discussions)
  - Place for questions about course content, assignments, etc. 
  - Only use email for personal questions (e.g., grades, illness, etc.)
  - Let Prof. Tackett know if you do not have access to the forum

**Demo**

---

## Homework 01

- Notes on [variable transformations](https://sta210-fa21.netlify.app/slides/14-transformations.html#1)

- Exercise 5

> *The quality of the model selection process, including the exploratory data analysis. A high quality model selection process is accurate, comprehensive, and strategic (e.g., trying all possible interaction terms will not receive full credit).*

> *The quality of the summary. A high quality summary is accurate, comprehensive, answers the primary analysis question, and tells a cohesive story (e.g., a list of interpretations will not receive full credit).*
]
---

## Using Likelihoods

---

## Learning goals

- Describe the concept of a likelihood

- Construct the likelihood for a simple model

- Define the Maximum Likelihood Estimate (MLE) and use it to answer an analysis question

- Identify three ways to calculate or approximate the MLE and apply these methods to find the MLE for a simple model

- Use likelihoods to compare models (next week)

---

## What is the likelihood?

A **likelihood** is a function that tells us how likely we are to observe our data for a given parameter value (or values).

- Unlike Ordinary Least Squares (OLS), they do not require the responses be independent, identically distributed, and normal (iidN)

- They are <u>not</u> the same as probability functions

- **Probability function:** Fixed parameter value(s) + input possible outcomes `$\Rightarrow$` probability of seeing the different outcomes given the parameter value(s)
  
--
  
  - **Likelihood:** Fixed data + input possible parameter values `$\Rightarrow$` probability of seeing the fixed data for each parameter value

---

## Fouls in college basketball games

The data set [`04-refs.csv`](data/04-refs.csv) includes 30 randomly selected NCAA men's basketball games played in the 2009 - 2010 season.

We will focus on the variables `foul1`, `foul2`, and `foul3`, which indicate which team had a foul called them for the 1st, 2nd, and 3rd fouls, respectively. 
  - `H`: Foul was called on the home team 
  - `V`: Foul was called on the visiting team
  
--

We are focusing on the first three fouls for this analysis, but this could easily be extended to include all fouls in a game.

.footnote[.small[The dataset was derived from `basektball0910.csv` used in [BMLR Section 11.2](https://bookdown.org/roback/bookdown-BeyondMLR/ch-GLMM.html#cs:refs)]]

---

## Fouls in college basketball games

```r
refs <- read_csv("data/04-refs.csv")
refs %>% slice(1:5) %>% kable()
```

| game|     date|visitor |hometeam |foul1 |foul2 |foul3 |
|----:|--------:|:-------|:--------|:-----|:-----|:-----|
|  166| 20100126|CLEM    |BC       |V     |V     |V     |
|  224| 20100224|DEPAUL  |CIN      |H     |H     |V     |
|  317| 20100109|MARQET  |NOVA     |H     |H     |H     |
|  214| 20100228|MARQET  |SETON    |V     |V     |H     |
|  278| 20100128|SETON   |SFL      |H     |V     |V     |

We will treat the games as independent in this analysis.

---

## Different likelihood models

**Model 1 (Unconditional Model)**: What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?

**Model 2 (Conditional Model)**: 
  - Is there a tendency for the referees to call more fouls on the visiting team or home team? 
  - Is there a tendency for referees to call a foul on the team that already has more fouls? 
 
--

Ultimately we want to decide which model is better.

---

## Exploratory data analysis

```r
refs %>%
count(foul1, foul2, foul3) %>% kable()
```

|foul1 |foul2 |foul3 |  n|
|:-----|:-----|:-----|--:|
|H     |H     |H     |  3|
|H     |H     |V     |  2|
|H     |V     |H     |  3|
|H     |V     |V     |  7|
|V     |H     |H     |  7|
|V     |H     |V     |  1|
|V     |V     |H     |  5|
|V     |V     |V     |  2|
]
]

---

## Model 1: Unconditional model

### What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?

---
## Likelihood

Let `$p_H$` be the probability the referees call a foul on the home team.

**The likelihood for a single observation**

`$$Lik(p_H) = p_H^{y_i}(1 - p_H)^{n_i - y_i}$$`

Where `$y_i$` is the number of fouls called on the home team.

(In this example, we know `$n_i = 3$` for all observations.)

**Example**

For a single game where the first three fouls are `$H, H, V$`, then

`$$Lik(p_H) = p_H^{2}(1 - p_H)^{3 - 2} = p_H^{2}(1 - p_H)$$`

---

## Model 1: Likelihood contribution

.midi[
| Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
|-------|-------|-------|---|:-----------------------:|
| H     | H     | H     | 3 |       `$p_H^3$`               |
| H     | H     | V     | 2 |        `$p_H^2(1 - p_H)$`                 |
| H     | V     | H     | 3 |        `$p_H^2(1 - p_H)$`                 |
| H     | V     | V     | 7 |       A                |
| V     | H     | H     | 7 |        B                 |
| V     | H     | V     | 1 |          `$p_H(1 - p_H)^2$`               |
| V     | V     | H     | 5 |         `$p_H(1 - p_H)^2$`                |
| V     | V     | V     | 2 |         `$(1 - p_H)^3$`                |
]

Fill in **A** and **B**.

---

## Model 1: Likelihood function

Because the observations (the games) are independent, the **likelihood** is

`$$Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}$$`
--

We will use this function to find the **maximum likelihood estimate (MLE)**. The MLE is the value between 0 and 1 where we are most likely to see the observed data.

---

## Visualizing the likelihood

.panelset.sideways[
.panel[.panel-name[Plot]
<img src="04-likelihoods_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />
]

```r
p <- seq(0,1, length.out = 100) #sequence of 100 values between 0 and 100
lik <- p^46 *(1 -p)^44

x <- tibble(p = p, lik = lik)
ggplot(data = x, aes(x = p, y = lik)) + 
  geom_point() + 
  geom_line() +
  labs(y = "Likelihood",
       title = "Likelihood of p_H")
```
]
]

---

A. 0.489

B. 0.500

C. 0.511

D. 0.556

[Click here](https://forms.gle/jeTNo7SwuE2vZur59) to submit your response.
]

---

## Finding the maximum likelihood estimate

There are three primary ways to find the MLE

✅ Approximate using a graph

✅ Numerical approximation

✅ Using calculus

---

## Approximate MLE from a graph

---

## Find the MLE using numerical approximation

Specify a finite set of possible values the for `$p_H$` and calculate the likelihood for each value

```r
# write an R function for the likelihood
ref_lik <- function(ph) {
  ph^46 *(1 - ph)^44
}
```

```r
# use the optimize function to find the MLE
optimize(ref_lik, interval = c(0,1), maximum = TRUE)
```

```
## $maximum
## [1] 0.5111132
## 
## $objective
## [1] 8.25947e-28
```

---

## Find MLE using calculus

- Find the MLE by taking the first derivative of the likelihood function.

- This can be tricky because of the Product Rule, so we can maximize the **log(Likelihood)** instead. The same value maximizes the likelihood and log(Likelihood)

---

## Find MLE using calculus

`$$Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}$$`

`$$\begin{aligned}\log(Lik(p_H)) &= \sum_{i=1}^{n}y_i\log(p_H) + (3 - y_i)\log(1 - p_H)\\[10pt] &= 46\log(p_H) + 44\log(1 - p_H)\end{aligned}$$`
---

## Find MLE using calculus

`$$\frac{d}{d p_H} \log(Lik(p_H)) = \frac{46}{p_H} - \frac{44}{1-p_H} = 0$$`

`$$\Rightarrow \frac{46}{p_H} = \frac{44}{1-p_H}$$`

`$$\Rightarrow 46(1-p_H) = 44p_H$$`

`$$\Rightarrow 46 = 90p_H$$`

`$$\hat{p}_H = \frac{46}{90} = 0.511$$`
--

---

## Model 2: Conditional model

### Is there a tendency for the referees to call more fouls on the visiting team or home team?

### Is there a tendency for referees to call a foul on the team that already has more fouls?

---

## Model 2: Likelihood contributions

- Now let's assume fouls are <u>not</u> independent within each game. We will specify this dependence using conditional probabilities. 
  - **Conditional probability**: `$P(A|B) =$` Probability of `$A$` given `$B$` has occurred

Define new parameters:

- `$p_{H|N}$`: Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams

- `$p_{H|H Bias}$`: Probability referees call foul on home team given there are more prior fouls on the home team

- `$p_{H|V Bias}$`: Probability referees call foul on home team given there are more prior fouls on the visiting team

---

## Model 2: Likelihood contributions

.midi[
| Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
|-------|-------|-------|---|:-----------------------:|
| H     | H     | H     | 3 | `$(p_{H\vert N})(p_{H\vert H Bias})(p_{H\vert H Bias}) = (p_{H\vert N})(p_{H\vert H Bias})^2$` |
| H     | H     | V     | 2 |        `$(p_{H\vert N})(p_{H\vert H Bias})(1 - p_{H\vert H Bias})$`                 |
| H     | V     | H     | 3 |        `$(p_{H\vert N})(1 - p_{H\vert H Bias})(p_{H\vert N}) = (p_{H\vert N})^2(1 - p_{H\vert H Bias})$`                 |
| H     | V     | V     | 7 |       A                |
| V     | H     | H     | 7 |        B                 |
| V     | H     | V     | 1 |          `$(1 - p_{H\vert N})(p_{H\vert V Bias})(1 - p_{H\vert N}) = (1 - p_{H\vert N})^2(p_{H\vert V Bias})$`               |
| V     | V     | H     | 5 |         `$(1 - p_{H\vert N})(1-p_{H\vert V Bias})(p_{H\vert V Bias})$`                |
| V     | V     | V     | 2 |         `$\begin{aligned}&(1 - p_{H\vert N})(1-p_{H\vert V Bias})(1-p_{H\vert V Bias})\\ &=(1 - p_{H\vert N})(1-p_{H\vert V Bias})^2\end{aligned}$`                |
]

Fill in **A** and **B**

---

## Likelihood function

`$$\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}$$`

**(Note: The exponents sum to 90, the total number of fouls in the data)**

<br>

`$$\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}$$`

---

.question[
If fouls within a game are independent, how would you expect `$\hat{p}_H$`, `$\hat{p}_{H\vert H Bias}$` and `$\hat{p}_{H\vert V Bias}$` to compare?

a. `$\hat{p}_H$` is greater than `$\hat{p}_{H\vert H Bias}$` and `$\hat{p}_{H \vert V Bias}$`

b. `$\hat{p}_{H\vert H Bias}$` is greater than `$\hat{p}_H$` and `$\hat{p}_{H \vert V Bias}$`

c. `$\hat{p}_{H\vert V Bias}$` is greater than `$\hat{p}_H$` and `$\hat{p}_{H \vert V Bias}$`

d. They are all approximately equal.

]

---

.question[
If there is a tendency for referees to call a foul on the team that already has more fouls, how would you expect `$\hat{p}_H$` and  `$\hat{p}_{H\vert H Bias}$` to compare?

a. `$\hat{p}_H$` is greater than  `$\hat{p}_{H\vert H Bias}$`

b. `$\hat{p}_{H\vert H Bias}$` is greater than `$\hat{p}_H$`

c. They are approximately equal.

[Click here](https://forms.gle/Wy4YmJz3zgVezUdZA) to submit your response.
]

---

## Next time

- Using likelihoods to compare models

- Chapter 3: Distribution theory

---

## Acknowledgements

These slides are based on content in [BMLR Chapter 2 - Beyond Least Squares: Using Likelihoods](https://bookdown.org/roback/bookdown-BeyondMLR/ch-beyondmost.html)