Task definition matters in terms of memory usage #141

Open
rvandewater opened this issue Oct 22, 2024 · 1 comment
Assignees
Labels
Computational Performance Relates to the computational efficiency of the cohort extraction Documentation Improvements or additions to documentation priority:high Things that are high priority, but do not warrant an immediate hotfix

Comments


rvandewater commented Oct 22, 2024

As suggested by @Oufattole, this kind of config seems to use an excessive amount of memory (more than 400 GB on MIMIC-IV in my case):
[image: screenshot of the task config]
Using regex mitigates this:
[image: screenshot of the regex-based config]

https://github.com/Oufattole/meds-torch/blob/main/MIMICIV_INDUCTIVE_EXPERIMENTS/configs/tasks/mortality/in_icu/first_24h.yaml

I'm not sure what the reason could be, but perhaps it's something to alert users to. Tested with es-aces 0.5.1 and the command: `aces-cli --multirun hydra/launcher=joblib data=sharded data.standard=meds data.root="$MIMICIV_MEDS_DIR/data" "data.shard=$(expand_shards $MIMICIV_MEDS_DIR/data)" cohort_dir="$cohort_dir" cohort_name="$TASK_NAME"`

@rvandewater changed the title from "Task defnition matters in memory usage" to "Task definition matters in terms of memory usage" Oct 22, 2024
justin13601 (Owner) commented

Yep, this is a limitation of how the predicates are created: memory peaks during this process (also brought up in #89, which was closed after #90 was merged, adding the ability to match using regex).

Basically, each time a predicate is defined in a configuration file, a column corresponding to that predicate is created. So, in the first example, defining the different types of admission predicates ultimately creates 10 columns, whereas the regex option creates only 1 column that matches everything. And when your dataframe is relatively large, adding these columns really blows up the memory required.
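
To make the column-count argument concrete, here is a rough, stdlib-only sketch. It is not how ACES builds predicates internally; the event codes, predicate names, and helper functions are all hypothetical and only illustrate why one regex predicate materializes fewer cells than several exact-match predicates.

```python
import re

# Hypothetical event codes; a real MEDS dataset would have many millions of rows.
codes = [
    "HOSPITAL_ADMISSION//EMERGENCY",
    "HOSPITAL_ADMISSION//ELECTIVE",
    "HOSPITAL_ADMISSION//URGENT",
    "LAB//50912",
    "ICU_ADMISSION//MICU",
]


def plain_predicate_columns(codes, predicates):
    """Build one boolean column per named predicate (exact code match)."""
    return {
        name: [code == target for code in codes]
        for name, target in predicates.items()
    }


def regex_predicate_column(codes, name, pattern):
    """Build a single boolean column for every code matching the pattern."""
    compiled = re.compile(pattern)
    return {name: [compiled.search(code) is not None for code in codes]}


def cell_count(columns):
    """Total number of values stored across all predicate columns."""
    return sum(len(column) for column in columns.values())


# One predicate per admission type: one column each.
separate = plain_predicate_columns(
    codes,
    {
        "adm_emergency": "HOSPITAL_ADMISSION//EMERGENCY",
        "adm_elective": "HOSPITAL_ADMISSION//ELECTIVE",
        "adm_urgent": "HOSPITAL_ADMISSION//URGENT",
    },
)

# One regex predicate covering all admission types: a single column.
combined = regex_predicate_column(codes, "admission", r"^HOSPITAL_ADMISSION//")

# OR-ing the separate columns row by row gives the same answer as the regex column.
any_admission = [any(row) for row in zip(*separate.values())]

print(cell_count(separate), cell_count(combined))
print(any_admission == combined["admission"])
```

In this toy setup the number of stored cells grows with rows × predicates, so collapsing related predicates into one regex keeps it at rows × 1 while still flagging the same events.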

I'll add a note about this in the documentation as well as the repo README - thanks!

@justin13601 justin13601 self-assigned this Oct 22, 2024
@justin13601 justin13601 added Documentation Improvements or additions to documentation priority:high Things that are high priority, but do not warrant an immediate hotfix Computational Performance Relates to the computational efficiency of the cohort extraction labels Oct 22, 2024
justin13601 added a commit that referenced this issue Oct 25, 2024
* #141 note about memory in README

* #141 warning about memory in the docs

* #142 add warning messages if labels are all the same

* Add error message when predicates are specified using only strings (includes ??? case)

Closes #141, #142, and #146