Update help menus, fix typos (in documentation and variable names), add more explanation of day offsets #16

alisoncallahan · 2021-05-14T19:30:07Z

--clmbr_create_info

rename --extract_dir to --input_data_dir (or something similar), and explain what it is in the help menu - right now the description in help is the name of the command ('Extract dir') which isn't very informative

--clmbr_train_model

complete the help menu descriptions, i.e. add English language descriptions of all the options
make the required options actually required (the help menu indicates that they are all optional arguments, but then the notebook says there are two required arguments)
the description of --model_dir is kind of confusing - 'override' implies that there's a default output location, but then the notebook says this option is required, and that it can't already exist (what if I want to overwrite the content of an existing folder? I have to delete the folder first? sure, the benefit is that it is a stopgap to prevent accidental overwriting, but also one extra step)
also, it seems like the logic for the directory overwriting is just in the next cell, not actually built into the clmbr_train_model function, which isn't consistent with the documentation of the notebook
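
To make the two requests above concrete (required options shown as required in the help menu, and the overwrite check living inside the command rather than in a separate notebook cell), here is a minimal sketch. It is not the actual clmbr_train_model code: the program name `train_model_sketch`, the `--overwrite` flag, and the `prepare_model_dir` helper are all hypothetical, and only `--model_dir` comes from the existing command.

```python
import argparse
import os
import shutil


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose help output lists required options separately."""
    parser = argparse.ArgumentParser(
        prog="train_model_sketch",  # hypothetical name, not the real command
        description="Illustrative parser only; not the actual clmbr_train_model CLI.",
    )
    # A dedicated group makes --help show these under "required arguments"
    # instead of mixing them in with the optional ones.
    required = parser.add_argument_group("required arguments")
    required.add_argument(
        "--model_dir",
        required=True,
        help="Directory where the trained model is written.",
    )
    parser.add_argument(
        "--overwrite",  # hypothetical flag, assumed for this sketch
        action="store_true",
        help="Replace model_dir if it already exists instead of raising an error.",
    )
    return parser


def prepare_model_dir(model_dir: str, overwrite: bool = False) -> None:
    """Create model_dir, refusing to clobber an existing directory by default."""
    if os.path.exists(model_dir):
        if not overwrite:
            raise FileExistsError(
                f"{model_dir} already exists; pass --overwrite to replace it"
            )
        # Assumes model_dir is a directory; rmtree raises if it is a plain file.
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
```

With something like this, accidental overwriting is still blocked by default, but replacing an existing folder is one flag rather than a manual delete, and the notebook text and the command behave the same way.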

"Set up the labeler for the downstream task we're interested in"
- the comment for the DiabetesLabeler class could use a little more detail - the description below this cell makes it sound like the labeler is labeling patient days based on whether there is a diabetes code present on that day or not. But the comment talks about a prediction task with a time horizon (also, how does one specify the time point that the horizon is relative to?).
- why do timelines have dictionaries? need documentation in timeline.pyi
- "this randomly selects on label per patient " should be "this randomly selects one label per patient"
- for output of ehr_ml.clmbr.featurize_patients_w_labels (features, labels, patient_ids, day_offsets) -- what is day_offsets relative to? DOB of patient?

"Using the trained model"
- in the first cell, I think patient_indices should be day_offsets (based on featurize_patients_w_labels in https://github.com/som-shahlab/ehr_ml/blob/245dd3436a5dcddada41222611e8129be96cd85b/ehr_ml/clmbr/__init__.py)

The explanation of day offsets could include more information for truly new users, e.g. is this date the date that a prediction should be made? Is this date somehow used as a cutoff for generating features? The notebook seems to assume a lot of knowledge/intuition about the day offsets that users may not have.


woffett · 2021-05-21T19:51:15Z

Update to reflect changes in the newest PR:

This first point is still open and is to be addressed.
I think this comment refers to the line in the labeler class diabetes_code = timelines.get_dictionary().map("ICD10CM/E11.9"). I'm not too sure myself, @lalaland do you have any clarification on this?
My understanding was that day_offsets indexes into an array of all days where a patient observation is recorded in the dataset. TODO: the documentation still needs to be updated to reflect this, once it's updated we can link to it in the notebook for clarity.
the variable should be renamed in the next PR
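
To illustrate the day_offsets interpretation described above (an index into the list of days on which a patient has a recorded observation), here is a small sketch. It only pictures that understanding and is not confirmed ehr_ml behavior; `resolve_day_offset` and `example_days` are hypothetical names made up for the example.

```python
import datetime
from typing import Sequence


def resolve_day_offset(
    observation_days: Sequence[datetime.date], day_offset: int
) -> datetime.date:
    """Return the calendar date that a day offset points to.

    Assumes (per the interpretation above, not verified against ehr_ml) that
    day_offset is a zero-based index into the patient's observed days.
    """
    if not 0 <= day_offset < len(observation_days):
        raise IndexError(
            f"day_offset {day_offset} is out of range for "
            f"{len(observation_days)} observed days"
        )
    return observation_days[day_offset]


# Hypothetical patient with observations recorded on three separate days.
example_days = [
    datetime.date(2015, 3, 2),
    datetime.date(2015, 3, 9),
    datetime.date(2016, 1, 20),
]
```

Under this reading, an offset of 1 for the hypothetical patient would refer to their second observed day, not to one day after their date of birth, which is the distinction the notebook should spell out.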

That's a good point. TODO: add explanation in 3b. about what convert_patient_data is doing
