Final Project

The culmination of the course is a group research project. You will

create your own research question
download your own data
visualize the data
interpret your findings

You can answer any question about inequality (broadly defined) using the ideas we learned in this course.

What you will submit

There are two submitted items, both due 5pm on Wednesday April 24

Writeup. This should be a .qmd document compiled into a PDF. It should include all code that produces your results. For undergraduate groups, this must be no more than 1,000 words and will contain 1 or more visualizations. For graduate groups, this must be no more than 2,000 words and contain 3 or more visualizations.

Slides. Upload slides in PDF format for your presentation in discussion. We encourage you to keep text to a minimum in the slides. You should plan to have all group members speak during the presentation. You should plan to address the key points 1–6 above. We realize that 10 minutes is short; being concise is a virtue!

Group structure

We will share a Google form in which you can either

tell us your group. We anticipate 4–6 students, but we can be flexible
tell us your interests, and we will team you up with others!

If you want a smaller group or to work individually, come talk to us

Key components of the project

Define your target population. Motivate this choice: why is this population interesting to study?
Describe how your sample was chosen from that population
- this may be a probability sample, such as those available via IPUMS. If so, tell us a little bit about the sampling design
- this may be a convenience sample. If so, why does it speak to the population and what are the limitations?
- this may be data on the entire population, as in our baseball example
Choose an outcome variable, which is defined for each unit in the population
- example: annual wage and salary income
Choose one or more variables on which to create population subgroups
- example: subgroups defined by sex (male and female)
Choose a summary statistic, which aggregates the outcome distribution to one summary per subgroup
- examples: proportion, mean, median, 90th percentile
Visualize your findings in a ggplot

Your goal should be to tell us a story using the data. What do we learn by studying this outcome, aggregated this way, in these subgroups from this population?

Considerations to bear in mind

Weights. If your sample is drawn from the population with unequal probabilities, you should use sampling weights
Models. If your question involves many subgroups (e.g., ages) with few observations in each subgroup, you can (but are not required to) use a statistical model to estimate your summary statistic in the subgroup by a predicted value. For example, you could use OLS to predict the proportion mean income at each age. If you do this, you should report the predicted value of the summary statistic, not the coefficients of the model.
Aggregation. Your data must begin with units (e.g., people) who you aggregate into subgroups (e.g., age groups). Your data might come pre-aggregated, such as data where each row contains data for all students in a particular college or university. Then you would need to aggregate further, such as to produce summaries for private versus public universities.
Dropped cases. As you move from raw data to the data that produce your graph, you might drop cases on the way. For example, some cases may have missing values on key predictors. Report how many are dropped, and why. Our goal here is transparent, open science.
Avoiding causal language. Beware of saying that one variable causes, shapes, influences, or determines another. These causal claims are important, but hard to support! You should take STSCI 3900 first. A heuristic to recognize causal claims is the sentence structure “X [verb] Y”, such as “going to college increases earnings.” This claim suggests a college graduate would have earned less if they had not gone to college—a counterfactual outcome we did not see. For our class, we suggest you focus on non-causal claim, such as “There is a difference in earnings among those who did and did not go to college.” A heuristic to recognize a non-causal claim is that it can be phrased it in an “among” statement: “Among subgroup A, we find ___. Among subgroup B, we find ___.” Or “There is a disparity in Y across subgroups defined by X.”
But I wanted to ask a causal question! While we encourage descriptive claims, we will allow causal claims if they are supported by transparent mathematical assumptions, using potential outcomes or Directed Acyclic Graphs. This might be an appropriate choice for students who previously took STSCI 3900, or who have other background in causal inference. If you take this road, come talk to us in office hours to make sure we are on the same page about the assumptions required.

Support from an assigned TA

Each group will be assigned one TA. That TA will be your first point of contact for support, and we expect they will get to know the project along the way. While your assigned TA will be specifically aware of your project, you are also welcome at all of our office hours.

Have fun

As a teaching team, the project is our favorite part of the course. Preparing you to succeed in the project has been (in some sense) the entire goal of all that precedes the project in the course. We hope you will find joy in answering questions with data, as we do.