
Lab problems Spring 2024

Jon Marshall edited this page Apr 8, 2024 · 21 revisions
  • Update estimated times to give a range (e.g. 45 min to 3 hr)

  • All Nashville labs -- remove outlying age data (suspect age == 100)

  • Lab 6 -- Add a section that uses all the variables in their actual units, rather than scaled to mean zero and unit variance, then runs an OLS regression of bike rentals on some combination of dummy variables and interval and ratio variables, and reports the results using something like Statsmodels. Note that the bike rental data already has its variables scaled to range from 0 to 1, but it encodes categorical variables as integers, and the Modules Team just left them that way as regressors. The lab covers three important things (data splitting, scaling for ML problems, and overfitting), so we should keep those in Lab 6 and maybe just start with the causal interpretation of OLS.

  • Lab 7 -- Fix outdated references to buildings ('Barrows Hall'). Fix the solution for "states you have visited" to use a list of visited states and the in operator. Give notice about tileset behavior, e.g. 1) there is no default tileset when you are using Datahub from off campus (at least not for me), so you have to supply a custom tileset and attribution; 2) Stamen Toner tiles will load in web applications on registered servers (e.g. Datahub), but if you download your Python notebook as HTML and open the file in a web browser, the Stamen Toner tile requests fail with a 404 error; 3) Datahub sometimes has problems rendering map layers, so you may need to restart the kernel and rerun the whole notebook to get the map to display (and it can sometimes disappear again)

  • Lab 8 -- Fix the last map to use the Choropleth class rather than GeoJson; make sure the prompts are clear and have updated links to Folium

  • Lab 12 -- Where students add the color codes for the race categories in the traffic stop data, use something more Pythonic than looping through a list and a dataframe; Series.map or Series.apply would probably be best (.map accepts a dictionary as its argument)

  • Labs 13 & 14 -- Geopandas and rtree packages are added to Datahub image. You can do pip install instead of !pip install going forward

  • Lab 14 -- The text underneath the Titanic decision tree mistakenly says that the 8-year-old boy would be expected to survive; in fact, his chances of survival are only 5%.

  • Lab 17 -- several typos that are legacies of the old version of the lab (e.g., trials['hits'][:100])

  • Lab 19 -- include a markdown cell on Naive Bayes classifiers; specify Linear Support Vector Classifier; typos: "retraining" should read "retaining"

  • Lab 20 -- Naming mismatch in the PCA section right before tf-idf vectorization: 'clean_text' is my own coinage, while the lab prompt says 'lower_text'

  • Lab 22 -- extra solution boxes in lab notebook
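
For the Lab 6 item above, one way to treat the integer-coded categoricals as dummy variables before an OLS fit might look like the sketch below. The column names and toy values are hypothetical, not taken from the bike rental data, and the fit uses NumPy least squares as a stand-in; in the lab, the same design matrix could be handed to Statsmodels to get a full regression report.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (not the real bike rental file).
# season is an integer-coded category; temp_c is in actual units (deg C).
bikes = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2, 3, 4],
    "temp_c": [5.0, 15.0, 25.0, 12.0, 3.0, 18.0, 28.0, 10.0],
    "rentals": [150.0, 300.0, 430.0, 240.0, 130.0, 330.0, 460.0, 220.0],
})

# Expand season into 0/1 dummies, dropping season 1 as the baseline level,
# so the model does not treat the integer codes as a ratio-scale quantity.
X = pd.get_dummies(bikes[["temp_c", "season"]], columns=["season"],
                   drop_first=True, dtype=float)
X.insert(0, "const", 1.0)  # intercept column

# Ordinary least squares: choose beta to minimize ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X.to_numpy(), bikes["rentals"].to_numpy(), rcond=None)
coefs = pd.Series(beta, index=X.columns)
print(coefs.round(2))
```

Each season coefficient is then read as a shift in rentals relative to the baseline season, holding temperature fixed, which is the interpretation the integer-coded version cannot give.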

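For the Lab 12 item above, a minimal sketch of the Series.map approach, assuming a hypothetical column name, category labels, and color codes (the actual traffic stop data may use different ones):

```python
import pandas as pd

# Hypothetical category labels and hex colors, for illustration only.
race_colors = {
    "Black": "#1f77b4",
    "White": "#ff7f0e",
    "Hispanic": "#2ca02c",
}
default_color = "#7f7f7f"  # fallback for categories missing from the dict

# Toy stand-in for the traffic stop dataframe.
stops = pd.DataFrame(
    {"subject_race": ["White", "Black", "Hispanic", "Black", "Other"]}
)

# Series.map looks up each value in the dictionary; values without a key
# come back as NaN, so fill those with the default color.
stops["color"] = stops["subject_race"].map(race_colors).fillna(default_color)
print(stops)
```

This replaces the nested loop with a single vectorized lookup and keeps the category-to-color mapping in one place.
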