Homework 01

due Wednesday, January 19 at 11:59pm EST

Instructions

Exercises

Conceptual exercises

  1. Researchers record the number of cricket chirps per minute and temperature during that time to investigate whether the number of chirps varies with the temperature.1 from Sec. 1.8.1, Ex 1

    a. Identify the response and explanatory variable(s).

    b. Write the assumptions for linear regression in the context of the problem.

  2. For each of over 27,000 overnight permits for the Boundary Waters Canoe area, the zip code for the group leader has been translated to the distance traveled and socioeconomic data. Thus, for each zip code we can model the number of trips made as a function of distance traveled and various socioeconomic measures.2 from Sec. 1.8.1, Ex 2

    a. Identify the response and explanatory variable(s).

    b. Identify which model assumption(s) are violated. Briefly explain your choice.

  3. In a 2011 article in The Sport Journal, Farrar, Bruggink, et al. (2011) attempt to show that Major League Baseball general managers did not immediately embrace the findings of Michael Lewis’s book Moneyball (Lewis 2004). They contend that players’ on-base percentage remained relatively undercompensated compared to slugging percentage three years after the book came out. Two regression models are described: Team Run Production Model and Player Salary Model.3 from Sec. 1.8.1, Ex 4

    Click here for a PDF of the article.

    a. In Table 3: Coefficients for the Team Run Production Model (pg. 9-10), the authors contend that Model 1 is better than Model 3. Could you argue that Model 3 is actually better? Briefly explain.

    b. Explain how you could run a formal hypothesis test comparing Model 1 to Model 3 to determine which is better.

    c. In Table 4 (pg. 10), Model 1 has a higher \(Adj. R^2\) than Model 2, yet the extra term in Model 1 (an indicator variable for the National League) is not significant at the 5% significance level. Explain how this is possible.

Guided exercises

  1. The dataset kingCountyHouses.csv contains data on over 20,000 houses sold in King County, Washington (Kaggle (2018)).4 from Sec. 1.8.2, Ex 3

    We will use the following variables:

    • price = selling price of the house
    • sqft = interior square footage

    See textbook for full list of variables.

    a. Fit a simple linear regression model with price as the response variable and sqrt as the explanatory variable (Model 1). Interpret the slope coefficient in terms of the expected change in price when sqft increases by 100.

    b. Create a new variable, logprice, the natural log of price. Fit Model 2, where logprice is now the response variable and sqft is still the explanatory variable. Write out the regression equation. How is the logprice expected to change when sqft increases by 100?

    c. Based on Model 2, how is the price expected to change when sqft increases by 100?

    d. Fit Model 3, where price and logsqft (the natural log of sqft) are the response and explanatory variables, respectively. How does the predicted price change when sqft increases by 5%? As a hint, this is the same as multiplying sqft by 1.05.

Open-ended exercises

  1. A student collected data from a restaurant where she was a waitress (Dahlquist and Dong 2011). The student was interested in learning under what conditions a waitress can expect the largest tips relative to the size of the bill — for example (but not limited to): At dinner time or late at night? From younger or older patrons? From patrons receiving free meals? From patrons drinking alcohol? From patrons tipping with cash or credit? Data can be found in TipData.csv.5 from Sec. 1.8.3 Ex 3

    The codebook is available in the README of the data folder.

    Build a model that the student can use to determine under which conditions she can expect the largest tips relative to the size of the bill. The response variable for the analysis will be Tip.Percentage. Justify the selection of your final model and write a one 1 -2 paragraph summary of your conclusions based on the final model. (There is more than one correct final model!)

    The analysis should include the following:

    • Exploratory data analysis
    • Details of the process used to select the final model
    • Assessment of the assumptions for the final model
    • 1 -2 paragraph summary of your conclusions

    This question will be graded based on

    • The quality of the model selection process, including the exploratory data analysis. A high quality model selection process is accurate, comprehensive, and strategic (e.g., trying all possible interaction terms will not receive full credit).
    • The quality of the summary. A high quality summary is accurate, comprehensive, answers the primary analysis question, and tells a cohesive story (e.g., a list of interpretations will not receive full credit).

Submission

Before you wrap up the assignment, make sure all documents are updated in your GitHub repo.

To submit your assignment:

The PDF must be submitted to Gradescope by the deadline to be considered on time.

Grading

Total 50
Ex 1 5
Ex 2 3
Ex 3 9
Ex 4 10
Ex 5 20
Workflow & formatting 3

The “Workflow & formatting” grade is to based on the organization of the assignment write up along with the reproducible workflow. This includes having an organized write up with neat and readable headers, code, and narrative, including properly rendered mathematical notation. It also includes having a reproducible R Markdown document that can be knitted to replicate the submitted PDF and implementing version control using multiple commits with informative commit messages.

Acknowledgements

Exercises are pulled or adapted from Beyond Multiple Linear Regression.

References

Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.”
Farrar, Anthony, Thomas H Bruggink, et al. 2011. “A New Test of the Moneyball Hypothesis.” The Sport Journal 14 (1).
Kaggle. 2018. “House Sales in King County, USA.” https://www.kaggle.com/harlfoxem/housesalesprediction/home.
Lewis, Michael. 2004. Moneyball: The Art of Winning an Unfair Game. WW Norton & Company.