Instructions

Write all narrative using full sentences. Write all interpretations and conclusions in the context of the problem.
Display all analysis code in the PDF.
If you are fitting a model, display the model output in a neatly formatted table. (The tidy and kable functions can help!)
If you are creating a plot, use clear labels for all axes, titles, etc.
Commit and push your work to GitHub regularly, at least after each exercise. Write short and informative commit messages.
When you’re done, we should be able to knit the final version of the R Markdown document in your GitHub repo to get a copy of the PDF submitted to Gradescope.

Exercises

Researchers are interested in the number of dates respondents arranged online and whether the rates differ by age group. Questions which elicit responses similar to this can be found in the Pew Survey concerning dating online and relationships (Smith and Duggan 2013). Each survey respondent was asked how many dates they have arranged online in the past 3 months as well as the typical amount of time, \(t\), in hours, they spend online weekly. Some rows of data appear in the table below.11 from Sec. 4.11.1, Ex 15

Age	Time Online	Number of Dates Arranged Online
19	35	3
29	20	5
38	15	0
55	10	0

a. Identify the response, predictor, and offset in this context. Does using an offset make sense? Briefly explain.

b. Write out a model for this data. As part of your model description, define the parameter, \(\lambda\).

c. Consider a zero-inflated Poisson model for this data. Describe what the “true zeros” would be in this setting.

Use the following scenario for Exercises 2 - 3.

Brockmann (1996) carried out a study of nesting female horseshoe crabs. Female horseshoe crabs often have male crabs attached to a female’s nest known as satellites. One objective of the study was to determine which characteristics of the female were associated with the number of satellites. Of particular interest is the relationship between the width of the female carapace and satellites.22 Questions adapted from Sec. 4.11.2, Ex. 2

The data can be found in crab.csv. It includes:

Satellite = number of satellites
Width = carapace width (cm)
Weight = weight (kg)
Spine = spine condition (1 = both good, 2 = one worn or broken, 3 = both worn or broken)
Color = color (1 = light medium, 2 = medium, 3 = dark medium, 4 = dark)

Make sure to convert Spine and Color to the appropriate data types in R before doing the analysis.

a. Create a histogram of Satellite. Is there preliminary evidence the number of satellites could be modeled as a Poisson response? Briefly explain.

b. Fit a Poisson regression model including Width, Weight, and Spine as predictors. Display the model with the 95% confidence interval for each coefficient.

c. Interpret the coefficient of Weight and its 95% confidence interval in terms of the mean number of satellites.

d. Describe the effect of Spine in terms of the mean number of satellites.

e. Should Color be added to the model? Conduct the appropriate test to investigate this question. State the hypotheses, display the relevant output, and state your conclusion in the context of the data.
We would like to fit a quasi-Poisson regression model for this data.

a. Briefly explain why we may want to consider fitting a quasi-Poisson regression model for this data.

b. Fit a quasi-Poisson regression model that corresponds with the model chosen the previous exercise. Display the model.

c. What is the estimated dispersion parameter? Show how this value is calculated.

d. How do the estimated coefficients change compared to the model chosen in the previous exercise? How do the standard errors change?

a. Use the R function rpois() to generate 10,000 \(x_i\) from a regular Poisson random variable, \(X \sim \textrm{Poisson}(\lambda=1.5)\). Plot a histogram of this distribution and note its mean and variance. Next, let \(Y \sim \textrm{Gamma}(r = 3, \lambda = 2)\) and use rgamma() to generate 10,000 random \(y_i\) from this distribution. Now, consider 10,000 different Poisson distributions where \(\lambda_i = y_i\). Randomly generate one \(z_i\) from each Poisson distribution. Plot a histogram of these \(z_i\) and compare it to your original histogram of \(X\) (where \(X \sim \textrm{Poisson}(1.5)\)). How do the means and variances compare?33 from Sec 3.7.2, Ex 2

Hint: Remember to set a seed, so your simulations are reproducible!

b. A negative binomial distribution can actually be expressed as a gamma-Poisson mixture. In Part a, you looked at a gamma-Poisson mixture \(Z \sim \textrm{Poisson}(\lambda)\) where \(\lambda \sim \textrm{Gamma}(r = 3, \lambda' = 2)\).

Find the parameters of a negative binomial distribution \(X \sim \textrm{NegBinom}(r, p)\) such that \(X\) is equivalent to \(Z\). As a hint, the means of both distributions must be the same, so \(r(1-p)/p = 3/2\). Show through histograms and summary statistics that your negative binomial distribution is equivalent to the gamma-Poisson mixture. You can use rnbinom() in R.

Argue that if you want a \(\textrm{NegBinom}(r, p)\) random variable, you can instead sample from a Poisson distribution, where the \(\lambda\) values are themselves sampled from a gamma distribution with parameters \(r\) and \(\lambda' = \frac{p}{1-p}\).44 from Sec. 3.7.2, Ex 3

Hint: Remember to set a seed, so your simulations are reproducible!
Awad, Lebo, and Linden (2017) scraped 40628 Airbnb listings from New York City in March 2017 and put together the data set NYCairbnb.csv. The codebook is in the data folder of the hw-02 repo.

Perform the EDA and build a model, considering offset and accounting for overdispersion, if needed. Then, use the model to describe the characteristics of Airbnbs that are expected to have a high number of reviews..55 adapted from Sec 4.11.3, Ex 1

Submission

Before you wrap up the assignment, make sure all documents are updated in your GitHub repo.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 310 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

The PDF must be submitted to Gradescope by the deadline to be considered on time.

Grading

Total	50
Ex 1	5
Ex 2	8
Ex 3	9
Ex 4	10
Ex 5	15
Workflow & formatting	3

The “Workflow & formatting” grade is based on the organization of the assignment write up along with the reproducible workflow. This includes having an organized write up with neat and readable headers, code, and narrative, including properly rendered mathematical notation. It also includes having a reproducible R Markdown document that can be knitted to reproduce the submitted PDF and implementing version control using multiple commits with informative commit messages.

Acknowledgements

Exercises are pulled or adapted from Beyond Multiple Linear Regression.

References

Awad, Annika, Evan Lebo, and Anna Linden. 2017. “Intercontinental Comparative Analysis of Airbnb Booking Factors.”

Brockmann, H. Jane. 1996. “Satellite Male Groups in Horseshoe Crabs, Limulus Polyphemus.” Ethology 102 (1): 1–21. http://dx.doi.org/doi:10.1111/j.1439-0310.1996.tb01099.x.

Smith, Aaron, and Maeve Duggan. 2013. “Online Dating and Relationships.” Pew Research Center. https://www.issuelab.org/resources/15934/15934.pdf.

Homework 02

due Mon, Feb 07 at 11:59pm