Information about override predicates using predicates-only files

justin13601 · Sep 24, 2024 · eb7bb05
1 parent 87de1ce
commit eb7bb05

Showing 3 changed files with 32 additions and 20 deletions.
diff --git a/docs/source/usage.md b/docs/source/usage.md
@@ -149,43 +149,47 @@ Hydra configuration files are leveraged for cohort extraction runs. All fields c
 
 #### Data Configuration
 
-To set a data standard:
+**To set a data standard**:
 
-`data.standard`: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
+***`data.standard`***: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
 
-To query from a single MEDS shard:
+**To query from a single MEDS shard**:
 
-`data.path`: Path to the `.parquet`shard file
+***`data.path`***: Path to the `.parquet` shard file
 
-To query from multiple MEDS shards, you must set `data=sharded`. Additionally:
+**To query from multiple MEDS shards**, you must set `data=sharded`. Additionally:
 
-`data.root`: Root directory of MEDS dataset containing shard directories
+***`data.root`***: Root directory of MEDS dataset containing shard directories
 
-`data.shard`: Expression specifying MEDS shards (`$(expand_shards <str>/<int>)`)
+***`data.shard`***: Expression specifying MEDS shards using [expand_shards](https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py) (`$(expand_shards <str>/<int>)`)
 
-To query from an ESGPT dataset:
+**To query from an ESGPT dataset**:
 
-`data.path`: Directory of the full ESGPT dataset
+***`data.path`***: Directory of the full ESGPT dataset
 
-To query from a direct predicates dataframe:
+**To query from a direct predicates dataframe**:
 
-`data.path` Path to the `.csv` or `.parquet` file containing the predicates dataframe
+***`data.path`***: Path to the `.csv` or `.parquet` file containing the predicates dataframe
 
-`data.ts_format`: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
+***`data.ts_format`***: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
 
 #### Task Configuration
 
-`cohort_dir`: Directory of your task configuration file
+***`cohort_dir`***: Directory of your task configuration file
+
+***`cohort_name`***: Name of the task configuration file
+
+The two fields above are used to derive the defaults below for loading task configurations, saving results, and logging:
 
-`cohort_name`: Name of the task configuration file
+***`config_path`***: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
 
-The above two fields are used for automatically loading task configurations, saving results, and logging:
+***`output_filepath`***: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
 
-`config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
+***`log_dir`***: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
 
-`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
+Additionally, predicates may be specified in a separate predicates configuration file and loaded for overrides:
 
-`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+***`predicates_path`***: Path to the [separate predicates configuration file](<>). Defaults to null
 
 #### Tab Completion
 
@@ -257,6 +261,8 @@ For example, to query an in-hospital mortality task on the sample data (both the
 >>> query.query(cfg=cfg, predicates_df=predicates_df)
 ```
 
+### Separate Predicates-Only File
+
 For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
 be created and passed into `TaskExtractorConfig.load()`. Only referenced predicates will have a predicate
 column computed and evaluated, so one could create a dataset-specific deposit file with many predicates and
@@ -266,4 +272,8 @@ reference as needed to ensure the cleanliness of the dataset-agnostic task crite
 >>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
 ```
 
+If the same predicate is defined in both the task configuration file and the predicates-only file, the
+predicates-only definition takes precedence and overrides the task configuration's definition. As such, one may
+create a predicates-only "database" file for a particular dataset and override predicates as needed for various tasks.
+
 ______________________________________________________________________
diff --git a/src/aces/configs/__init__.py b/src/aces/configs/__init__.py
@@ -41,12 +41,13 @@
         - shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
 
         Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+        https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
         https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
 
     cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
     cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
     config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
-    predicates_path (optional): path to a separate predicates configuration file for overriding
+    predicates_path (optional): path to a separate predicates-only configuration file for overriding
     output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
 
     ---------------- Default Config ----------------
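
The defaults described in this help text and in `usage.md` can be illustrated with a small sketch. This is not ACES code: `default_paths` is a hypothetical helper, and the real resolution is done by Hydra interpolation of `${cohort_dir}`, `${cohort_name}`, and `${data.shard}`. It only shows which paths the documented defaults would produce.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def default_paths(cohort_dir: str, cohort_name: str, shard: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper mirroring the documented defaults (not part of ACES)."""
    root = PurePosixPath(cohort_dir)
    base = root / cohort_name
    if shard is not None:
        # MEDS with multiple shards: one output file per shard under the cohort folder.
        output = base / f"{shard}.parquet"
    else:
        # Single shard, ESGPT, or direct predicates: one output file next to the config.
        output = root / f"{cohort_name}.parquet"
    return {
        "config_path": str(root / f"{cohort_name}.yaml"),
        "output_filepath": str(output),
        "log_dir": str(base / ".logs"),
    }


# Example values are made up for illustration.
print(default_paths("cohorts", "mortality"))
print(default_paths("cohorts", "mortality", shard="train/0"))
```

Any of these three fields can still be set explicitly, in which case the corresponding default is not used.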

diff --git a/src/aces/configs/_aces.yaml b/src/aces/configs/_aces.yaml
@@ -58,12 +58,13 @@ hydra:
           - shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
 
           Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+          https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
           https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
 
       cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
       cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
       config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
-      predicates_path (optional): path to a separate predicates configuration file for overriding
+      predicates_path (optional): path to a separate predicates-only configuration file for overriding
       output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
 
       ---------------- Default Config ----------------
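
The override behaviour documented in this commit can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not the logic inside `TaskExtractorConfig.load()`: `resolve_predicates` is a hypothetical function, and the predicate names and definitions are invented placeholders.

```python
from typing import Any, Dict, Set


def resolve_predicates(
    task_predicates: Dict[str, Any],
    predicates_only: Dict[str, Any],
    referenced: Set[str],
) -> Dict[str, Any]:
    """Hypothetical sketch of the documented semantics (not the ACES implementation)."""
    # Later entries win, so definitions from the predicates-only file
    # replace same-named definitions from the task configuration file.
    merged = {**task_predicates, **predicates_only}
    # Only predicates that the task actually references are kept for evaluation.
    return {name: definition for name, definition in merged.items() if name in referenced}


# Placeholder definitions, assumed for illustration only.
task_file = {"admission": {"code": "ADMISSION"}, "death": {"code": "DEATH"}}
predicates_file = {"death": {"code": "MEDS_DEATH"}, "lab": {"code": "LAB"}}

resolved = resolve_predicates(task_file, predicates_file, referenced={"admission", "death"})
print(resolved)
```

In this sketch, `death` takes its definition from the predicates-only file, and `lab` is dropped because the task never references it.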