things to consider for Sp23
Jon Marshall edited this page May 1, 2023
- Lab 19 should maybe also talk about logistic regression for classification. We discuss logit in Meeting 11.
- Lab 2 (problems notebook) has a rogue cell referring to an "incidents" dataframe, apparently left over from the first version of the lab; this cell should be deleted.
- Lab 3, question 1 on scatterplots -- the solution says that both response sets cluster toward zero, but it means that respondents do strongly agree, so take out the "not".
- Lab 5 Central Limit Theorem -- we are not talking about liberal respondents; we are talking about all respondents' feeling-thermometer scores for "liberals".
- Lab 5 Central Limit Theorem -- the jury example from Data 8 just does not work. Alternatives: use a z-test to talk about proportions; do what Wilson does last and ask how likely it would be to draw, at random, the number of Black jurors actually on the panel given the known population proportion; or use a chi-squared test. The last part of the lab goes ahead and does hypothesis testing with Student's t statistic in any case, so we could omit the jury-proportions part, although it does one useful thing: it demonstrates how to create a synthetic dataset by bootstrapping and then test the hypothesis against it.
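The "draw the observed number of Black jurors at random" alternative could be sketched as a simple simulation. The panel size, observed count, and population proportion below are illustrative assumptions, not numbers from the lab:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers (assumptions, not taken from the lab):
# a panel of 100 jurors, 8 of whom are Black, drawn from a
# population assumed to be 26% Black.
panel_size = 100
observed_black = 8
population_prop = 0.26

# Simulate many random panels drawn from the known population
# proportion and count the Black jurors on each.
draws = rng.binomial(panel_size, population_prop, size=100_000)

# Empirical p-value: how often does a random panel include as few
# Black jurors as the panel actually observed?
p_value = np.mean(draws <= observed_black)
print(p_value)
```

With numbers like these the empirical p-value comes out near zero, which is the point the jury section is trying to make without needing the broken Data 8 framing.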
- Lab 8 Step 1 documentation link is broken.
- Lab 9 last cell crediting Adithya should be a markdown cell in the student notebook.
- Lab 9 solutions and student notebook -- the heatmap with the random distribution, centered at [48, 5] in the old lab, is now centered on the continental US centroid; fix the comment in that cell to reflect the change.
- Lab 20 solutions originally tried to separate types of campaign communication, rather than candidates, by location on the principal-components axes; I think types are easier to separate than candidates.
- Lab 21: fix broken image links; remove the link to Colab, since the lab can now run on Datahub.
- Labs need to:
  a) introduce Pandas intensively prior to the start of the semester
  b) introduce classification earlier, probably with logistic classifiers right after the introduction of OLS regression
  c) talk more about data splitting (the independence assumption does not hold for training and validation sets the way we construct them)
  d) give students more practice with data cleaning (perhaps using a traffic-stop dataset, which could also be used to try classification)
  e) introduce EDA earlier, perhaps with summary stats, then null proportions, then visualization
- We may want to change out datasets: instead of Old Bailey Online we could use the CaseLaw dataset (but it is very different and not nearly as well labeled); instead of SFPD incidents we might try another city; we could try once again for the California trial-court disposition dataset (if they are willing to share) instead of New York; how about a law & literature dataset?
- We should refresh the problem sets to go along with the new data, which could be a pretty radical change if we go with CaseLaw or trial dispositions.