Task definition matters in terms of memory usage #141

rvandewater · 2024-10-22T14:34:26Z

As suggested by @Oufattole, this kind of config seems to use excessive amounts of memory (more than 400 GB on MIMIC-IV in my case)

Using a regex to match the codes mitigates this:

https://github.com/Oufattole/meds-torch/blob/main/MIMICIV_INDUCTIVE_EXPERIMENTS/configs/tasks/mortality/in_icu/first_24h.yaml

I'm not sure what the reason could be, but perhaps it's something to alert users to. Tested with es-aces 0.5.1 and the command: `aces-cli --multirun hydra/launcher=joblib data=sharded data.standard=meds data.root="$MIMICIV_MEDS_DIR/data" "data.shard=$(expand_shards $MIMICIV_MEDS_DIR/data)" cohort_dir="$cohort_dir" cohort_name="$TASK_NAME"`


justin13601 · 2024-10-22T15:17:20Z

Yep, this is a limitation of how the predicates are created: memory peaks during that step (also brought up in #89, which was closed after #90 was merged with the ability to match using regex).

Basically, each time a predicate is defined in a configuration file, a column is created that corresponds to that predicate. So, in the first example, defining the different types of admission predicates ultimately creates 10 columns, whereas the regex option creates only 1 column that matches everything. And when your dataframe is relatively large, adding these columns really blows up the memory required.
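To make the column-count difference concrete, here is a minimal sketch in plain Python. It is not how ACES builds predicates internally; the event codes, predicate names, and helper functions are all hypothetical and chosen only for illustration. It shows that one predicate per admission type yields one boolean column each, while a single regex predicate yields one column carrying the same "any admission" information.

```python
import re

# Hypothetical event codes; a real dataset would have many millions of rows.
codes = [
    "ADMISSION//EMERGENCY",
    "ADMISSION//ELECTIVE",
    "LAB//50912",
    "ADMISSION//URGENT",
]


def exact_predicate_columns(codes, predicates):
    """Build one boolean column per named predicate (exact code match)."""
    return {
        name: [code == target for code in codes]
        for name, target in predicates.items()
    }


def regex_predicate_column(codes, name, pattern):
    """Build a single boolean column for every code matching the pattern."""
    compiled = re.compile(pattern)
    return {name: [compiled.search(code) is not None for code in codes]}


def any_row(columns):
    """Row-wise OR across all columns."""
    return [any(values) for values in zip(*columns.values())]


# One predicate per admission type: one column each.
per_type = exact_predicate_columns(
    codes,
    {
        "emergency": "ADMISSION//EMERGENCY",
        "elective": "ADMISSION//ELECTIVE",
        "urgent": "ADMISSION//URGENT",
    },
)

# One regex predicate: a single column covering all admission types.
combined = regex_predicate_column(codes, "admission", r"^ADMISSION//")

print(len(per_type), len(combined))
print(any_row(per_type) == combined["admission"])
```

Each extra column costs memory proportional to the number of rows, so the per-type approach scales with the number of predicates while the regex approach stays at one column.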

I'll add a note about this in the documentation as well as the repo README - thanks!

* #141 note about memory in README
* #141 warning about memory in the docs
* #142 add warning messages if labels are all the same
* Add error message when predicates are specified using only strings (includes ??? case)

Closes #141, #142, and #146

rvandewater changed the title ~~Task defnition matters in memory usage~~ Task definition matters in terms of memory usage Oct 22, 2024

justin13601 self-assigned this Oct 22, 2024

justin13601 added the labels Documentation (Improvements or additions to documentation), priority:high (Things that are high priority, but do not warrant an immediate hotfix), and Computational Performance (Relates to the computational efficiency of the cohort extraction) Oct 22, 2024

justin13601 added a commit that referenced this issue Oct 25, 2024

#141 note about memory in README

fb6f499

justin13601 added a commit that referenced this issue Oct 25, 2024

#141 warning about memory in the docs

1c0ee69
