The goal of the final project is for you to use statistical methods from this course to analyze a data set of your own choosing. The data set may already exist or you may collect your own data by scraping the web or combining multiple data sets.
There are two options for the analysis:
1️⃣ Use multilevel modeling to analyze data with a multilevel structure.
2️⃣ Use one or more generalized linear models we haven’t covered in class to analyze data with independent observations.[1]
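As a rough illustration of the first option, the sketch below fits a random-intercept model to repeated measurements nested within subjects. It assumes the nlme package that ships with standard R installations; the course may well use a different package (for example lme4), and the built-in Orthodont data appears here only as a stand-in, not as a suggested project data set.

```r
# Illustration only: Orthodont ships with the nlme package and is used here
# purely as a stand-in for data with a two-level structure.
library(nlme)

# Random-intercept model: measurements (level one) nested within subjects (level two)
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

fixef(fit)    # fixed-effect estimates
VarCorr(fit)  # between-subject and residual variance components
```

A sizable between-subject variance relative to the residual variance is one sign that the grouping structure matters for the analysis.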
You may not use data that has been used for lectures, in-class activities, or assignments.
You may discuss your project with members of the teaching team if you are unsure whether your data and modeling approach are appropriate for the project. All analyses must be done in RStudio, and all components of the project must be reproducible.
All work for the project will be submitted on GitHub.
Round 1 submission (optional): Friday, April 15 at 11:59pm
Final submission: Wednesday, April 27 at 11:59pm
The Round 1 submission is an opportunity to receive feedback on your analysis and written report. The feedback will only be on the content that is submitted, so more “complete” drafts will receive more detailed feedback. At this stage, you will also be notified of the grade you would receive at that point. You will have the option to keep the grade (and thus you don’t need to turn in an updated report) or resubmit the written report by the final submission deadline for grading.
To submit the draft, push written-report.Rmd and written-report.pdf to your GitHub repo. You must complete both steps to submit the draft.
The draft must be submitted by Friday, April 15 at 11:59pm to receive preliminary feedback. Reports submitted after that date will not receive preliminary feedback.
Note that this is optional, so there is no penalty for turning in nothing for the Round 1 submission.
The final submission is due by Wednesday, April 27 at 11:59pm. You will submit the final written report by pushing the R Markdown and knitted PDF documents to your GitHub repo.
Given the short grading timelines, there will be minimal feedback on the final submissions. You can submit a draft in the Round 1 submission if you wish to receive more detailed feedback.
In addition to the written report, the GitHub repo should include a README and a data folder.
Your written report must be completed in the written-report.Rmd file and must be reproducible.
Before you finalize your write-up, make sure the printing of code chunks, warnings, and messages is turned off with the options echo = FALSE, warning = FALSE, message = FALSE.
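One way to apply these settings to every chunk at once is a setup chunk near the top of written-report.Rmd. The sketch below assumes the knitr package, which R Markdown uses for knitting, and shows only the R call that would go inside such a chunk. Note that knitr spells the option message, in the singular.

```r
# Intended to sit inside a setup chunk at the top of written-report.Rmd,
# for example a chunk opened with {r setup, include = FALSE}.
# It sets the defaults for all later chunks in the document.
knitr::opts_chunk$set(
  echo = FALSE,     # do not print the code itself
  warning = FALSE,  # do not print warnings
  message = FALSE   # do not print messages (e.g. package start-up notes)
)
```

Individual chunks can still override these defaults in their own chunk headers if needed.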
The report, including visualizations and output, must be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the aspects mentioned below.
Please be selective in what you include in your final write-up. The goal is to write a cohesive narrative rather than explain every step of the analysis. If you have additional work you wish to include that doesn’t fit in the 10-page limit, you may include it in a neatly organized appendix. Note that the appendix is only for supplemental material; the main body of the report must be comprehensive and include all relevant details.
The report will include the sections outlined below.
Introduction: This section includes an introduction to the project motivation, data, and research question. It also includes any background information relevant for understanding the analysis and relevant previous work.
Data: This section describes the data and defines the key variables. It should also include some exploratory data analysis (EDA): visualizations and appropriate summary statistics. Not all of the EDA will fit in the paper, so focus on the EDA for the response variable, other key variables, and multivariate relationships.
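A minimal sketch of EDA focused on the response variable might look like the following. The data are simulated and every name (project_data, response, predictor, group) is a hypothetical placeholder; only base R functions are used.

```r
# Simulated stand-in data; all names and values here are hypothetical.
set.seed(310)
n <- 200
project_data <- data.frame(
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  predictor = rnorm(n, mean = 50, sd = 10)
)
project_data$response <- rpois(n, lambda = exp(0.5 + 0.02 * project_data$predictor))

# Summary statistics for the response variable
response_summary <- summary(project_data$response)
print(response_summary)

# Mean response within each level of a key grouping variable
group_means <- aggregate(response ~ group, data = project_data, FUN = mean)
print(group_means)

# Distribution of the response and its relationship with a key predictor
hist(project_data$response, main = "Distribution of response", xlab = "Response")
plot(response ~ predictor, data = project_data,
     main = "Response vs. predictor", xlab = "Predictor", ylab = "Response")
```

In the report itself, a small number of such summaries and plots, each tied to the research question, is usually enough.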
Methods: This section includes a description of the modeling process. Explain the reasoning for the type of model you’re fitting, the predictor variables considered for the model, and any interactions. Additionally, discuss how you arrived at the final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process, including model assumptions and diagnostics as needed. This section will also include the equation for the final statistical model written in mathematical notation.
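As an illustration of what a model equation in mathematical notation can look like, here is a generic two-level random-intercept model. It is not tied to any particular data set, and the symbols are assumptions made for this example: Y_ij is the response for observation j in group i, X_ij is a level-one predictor, u_i is the group-level random effect, and epsilon_ij is the level-one error.

```latex
\begin{aligned}
\text{Level one:} \quad & Y_{ij} = a_i + \beta_1 X_{ij} + \epsilon_{ij},
  \qquad \epsilon_{ij} \sim N(0, \sigma^2) \\
\text{Level two:} \quad & a_i = \alpha_0 + u_i,
  \qquad u_i \sim N(0, \sigma_u^2) \\
\text{Composite:} \quad & Y_{ij} = \alpha_0 + \beta_1 X_{ij} + u_i + \epsilon_{ij}
\end{aligned}
```

Your own equation should use the variables and distributional assumptions of your final model.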
Results: This is where you will output the final model and explain key results. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
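One possible way to display final model output with a reasonable number of digits is sketched below. The data are simulated, the Gamma model with a log link is only a stand-in for whatever final model you choose, and the variable names are hypothetical. The sketch assumes the knitr package for table formatting.

```r
# Simulated stand-in data; variable names and values are hypothetical.
set.seed(310)
n <- 300
x <- runif(n, 0, 2)
# Gamma response whose mean is exp(0.5 + 0.3 * x)
y <- rgamma(n, shape = 2, rate = 2 / exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = Gamma(link = "log"))

# Coefficient table as a matrix: estimate, standard error, test statistic, p-value
coef_table <- coef(summary(fit))

# Formatted table rounded to 3 digits for the report
knitr::kable(coef_table, digits = 3, caption = "Final model coefficients")

# With a log link, exponentiating a coefficient gives the multiplicative
# change in the mean response per one-unit increase in the predictor
mean_ratio <- exp(coef(fit)["x"])
print(round(mean_ratio, 3))
```

Interpretations in the text can then refer directly to the quantities shown in the table.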
Discussion & Conclusion: In this section you’ll include a summary of the conclusions about the research question with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis. Issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Include ideas for future work.
Formatting: This is an assessment of the overall presentation and formatting of the written report. This includes having clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All code, warnings, and messages are suppressed. Overall, the document would be presentable in a business or research setting.
The analysis and written report should be done in a reproducible way: we should be able to reproduce both starting from the raw data. Any data cleaning, combining of data sets, creation of new variables, etc. should therefore also be done in a reproducible way.
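A minimal sketch of reproducible data cleaning is shown below: every step from the raw file to the analysis data happens in code rather than by hand. The function name, file, and column names (score, income) are hypothetical placeholders, and a tiny temporary file stands in for the raw data so the example runs on its own.

```r
# Hypothetical cleaning function: the column names are placeholders.
clean_raw_data <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  cleaned <- raw[!is.na(raw$score), ]        # drop rows with a missing response
  cleaned$log_income <- log(cleaned$income)  # derived variable created in code
  cleaned
}

# Demonstration with a tiny stand-in file written to a temporary location.
# In the project, the path would instead point to the raw file in the data folder.
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(score = c(10, NA, 7), income = c(30000, 45000, 52000)),
  demo_path,
  row.names = FALSE
)
cleaned <- clean_raw_data(demo_path)
print(cleaned)
```

Keeping the raw file untouched and doing all changes in code means anyone can rerun the analysis from the same starting point.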
You should have the following files and folders in the project repo. The repo and brief summary in the README should be updated by
README.md: 3 - 5 sentence summary of the project
/data/: The data set
/data/*: File containing raw data set
/data/README.md: Codebook for data set. Include citations for the data source(s).
written-report.Rmd: R Markdown file for written report
written-report.pdf: Knitted PDF of written report
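Before submitting, a quick check like the one below can confirm that the named files are present. The helper function is hypothetical and is not part of the course materials; it mirrors the files listed above and uses only base R.

```r
# Returns the required files that are missing from the repo at `repo_root`.
# The helper name is hypothetical; the file list mirrors the files named above.
missing_project_files <- function(repo_root = ".") {
  required <- c(
    "README.md",
    file.path("data", "README.md"),
    "written-report.Rmd",
    "written-report.pdf"
  )
  required[!file.exists(file.path(repo_root, required))]
}

# Run from the top level of the project repo; an empty result means
# all four files were found.
missing_project_files()
```

This check does not verify that the data folder also contains the raw data file, so confirm that separately.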
Component | Points |
---|---|
Written report | 50 pts |
Reproducibility | 5 pts |
GitHub repo organization | 5 pts |
Each section - Introduction, Data, Methods, Results, Discussion & Conclusion, and formatting will receive one of the following:
A letter grade (A, A-, B+, B, B-, etc.) will be assigned based on a holistic assessment of the report. The letter grade will be converted to points out of 50.
The GitHub repo organization and reproducibility will be assessed out of 5 points each based on the criteria stated above.
[1] This means the analysis should primarily focus on a model that is not linear regression, logistic regression, Poisson regression, or negative binomial regression.