`tidy` and `kable` functions can help!)

Researchers record the number of cricket chirps per minute and temperature during that time to investigate whether the number of chirps varies with the temperature. (From Sec. 1.8.1, Ex 1.)
a. Identify the response and explanatory variable(s).
b. Write the assumptions for linear regression in the context of the problem.
For each of over 27,000 overnight permits for the Boundary Waters Canoe Area, the zip code for the group leader has been translated to the distance traveled and socioeconomic data. Thus, for each zip code we can model the number of trips made as a function of distance traveled and various socioeconomic measures. (From Sec. 1.8.1, Ex 2.)
a. Identify the response and explanatory variable(s).
b. Identify which model assumption(s) are violated. Briefly explain your choice.
In a 2011 article in The Sport Journal, Farrar, Bruggink, et al. (2011) attempt to show that Major League Baseball general managers did not immediately embrace the findings of Michael Lewis’s book Moneyball (Lewis 2004). They contend that players’ on-base percentage remained relatively undercompensated compared to slugging percentage three years after the book came out. Two regression models are described: the Team Run Production Model and the Player Salary Model. (From Sec. 1.8.1, Ex 4.)
Click here for a PDF of the article.
a. In Table 3: Coefficients for the Team Run Production Model (pg. 9-10), the authors contend that Model 1 is better than Model 3. Could you argue that Model 3 is actually better? Briefly explain.
b. Explain how you could run a formal hypothesis test comparing Model 1 to Model 3 to determine which is better.
c. In Table 4 (pg. 10), Model 1 has a higher adjusted \(R^2\) than Model 2, yet the extra term in Model 1 (an indicator variable for the National League) is not significant at the 5% significance level. Explain how this is possible.
The dataset `kingCountyHouses.csv` contains data on over 20,000 houses sold in King County, Washington (Kaggle 2018). (From Sec. 1.8.2, Ex 3.)

We will use the following variables:

- `price` = selling price of the house
- `sqft` = interior square footage

See textbook for full list of variables.
a. Fit a simple linear regression model with `price` as the response variable and `sqft` as the explanatory variable (Model 1). Interpret the slope coefficient in terms of the expected change in price when `sqft` increases by 100.
b. Create a new variable, `logprice`, the natural log of price. Fit Model 2, where `logprice` is now the response variable and `sqft` is still the explanatory variable. Write out the regression equation. How is `logprice` expected to change when `sqft` increases by 100?
c. Based on Model 2, how is the price expected to change when sqft increases by 100?
d. Fit Model 3, where `price` and `logsqft` (the natural log of sqft) are the response and explanatory variables, respectively. How does the predicted price change when sqft increases by 5%? As a hint, this is the same as multiplying sqft by 1.05.
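As a starting point for the model-fitting steps above, here is a minimal sketch of fitting a simple linear regression in R and displaying its coefficients with `tidy` and `kable`. It runs on simulated toy data so that it is self-contained; the data values, the object names `houses`, `model_1`, and `coef_table`, and the commented-out file path are assumptions for illustration, not part of the assignment.

```r
library(broom)  # tidy(): model coefficients as a data frame
library(knitr)  # kable(): formatted table for the knitted PDF

# Toy data for illustration only (hypothetical values).
# For the exercise, read the real data instead, e.g.
# houses <- read.csv("data/kingCountyHouses.csv")  # path is an assumption
set.seed(310)
houses <- data.frame(sqft = round(runif(50, min = 800, max = 4000)))
houses$price <- 50000 + 250 * houses$sqft + rnorm(50, sd = 40000)

# Fit a simple linear regression with price as the response
model_1 <- lm(price ~ sqft, data = houses)

# One row per coefficient: term, estimate, std.error, statistic, p.value
coef_table <- tidy(model_1)

# Print a neatly formatted coefficient table
kable(coef_table, digits = 3)
```

The same pattern should carry over to the transformed models: create the new variable first (for example with `log()`), then change the formula passed to `lm()`.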
A student collected data from a restaurant where she was a waitress (Dahlquist and Dong 2011). The student was interested in learning under what conditions a waitress can expect the largest tips relative to the size of the bill, for example (but not limited to): At dinner time or late at night? From younger or older patrons? From patrons receiving free meals? From patrons drinking alcohol? From patrons tipping with cash or credit? Data can be found in `TipData.csv`. (From Sec. 1.8.3, Ex 3.)

The codebook is available in the README of the `data` folder.
Build a model that the student can use to determine under which conditions she can expect the largest tips relative to the size of the bill. The response variable for the analysis will be `Tip.Percentage`. Justify the selection of your final model and write a 1-2 paragraph summary of your conclusions based on the final model. (There is more than one correct final model!)
The analysis should include the following:
This question will be graded based on
Before you wrap up the assignment, make sure all documents are updated in your GitHub repo.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 310 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
The PDF must be submitted to Gradescope by the deadline to be considered on time.
| Component | Points |
|---|---|
| Ex 1 | 5 |
| Ex 2 | 3 |
| Ex 3 | 9 |
| Ex 4 | 10 |
| Ex 5 | 20 |
| Workflow & formatting | 3 |
| Total | 50 |
The “Workflow & formatting” grade is based on the organization of the assignment write-up and on the reproducible workflow. This includes having an organized write-up with neat and readable headers, code, and narrative, including properly rendered mathematical notation. It also includes having a reproducible R Markdown document that can be knitted to replicate the submitted PDF, and implementing version control using multiple commits with informative commit messages.
Exercises are pulled or adapted from Beyond Multiple Linear Regression.