Information about override predicates using predicates-only files
justin13601 committed Sep 24, 2024
1 parent 87de1ce commit eb7bb05
Showing 3 changed files with 32 additions and 20 deletions.
46 changes: 28 additions & 18 deletions docs/source/usage.md
@@ -149,43 +149,47 @@ Hydra configuration files are leveraged for cohort extraction runs. All fields c

#### Data Configuration

To set a data standard:
**To set a data standard**:

`data.standard`: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
***`data.standard`***: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'

To query from a single MEDS shard:
**To query from a single MEDS shard**:

`data.path`: Path to the `.parquet`shard file
***`data.path`***: Path to the `.parquet` shard file

To query from multiple MEDS shards, you must set `data=sharded`. Additionally:
**To query from multiple MEDS shards**, you must set `data=sharded`. Additionally:

`data.root`: Root directory of MEDS dataset containing shard directories
***`data.root`***: Root directory of MEDS dataset containing shard directories

`data.shard`: Expression specifying MEDS shards (`$(expand_shards <str>/<int>)`)
***`data.shard`***: Expression specifying MEDS shards using [expand_shards](https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py) (`$(expand_shards <str>/<int>)`)

To query from an ESGPT dataset:
**To query from an ESGPT dataset**:

`data.path`: Directory of the full ESGPT dataset
***`data.path`***: Directory of the full ESGPT dataset

To query from a direct predicates dataframe:
**To query from a direct predicates dataframe**:

`data.path` Path to the `.csv` or `.parquet` file containing the predicates dataframe
***`data.path`***: Path to the `.csv` or `.parquet` file containing the predicates dataframe

`data.ts_format`: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
***`data.ts_format`***: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
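
As a rough illustration of what the default `data.ts_format` value matches, the sketch below parses a timestamp with Python's standard `datetime.strptime`. It assumes the format string uses the usual strptime-style directives; the helper name `parse_timestamp` and the sample timestamp are hypothetical and are not part of ACES.

```python
from datetime import datetime

# Default timestamp format quoted in the documentation above.
DEFAULT_TS_FORMAT = "%m/%d/%Y %H:%M"


def parse_timestamp(value: str, ts_format: str = DEFAULT_TS_FORMAT) -> datetime:
    """Parse a timestamp string with a strptime-style format (hypothetical helper)."""
    return datetime.strptime(value, ts_format)


# A made-up timestamp written as month/day/year hour:minute.
parsed = parse_timestamp("01/31/2024 13:45")
print(parsed.isoformat())
```

A predicates file whose timestamps are written differently (for example `2024-01-31 13:45`) would need a matching `data.ts_format`, such as `%Y-%m-%d %H:%M`.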

#### Task Configuration

`cohort_dir`: Directory of your task configuration file
***`cohort_dir`***: Directory of your task configuration file

***`cohort_name`***: Name of the task configuration file

The above two fields are used below for automatically loading task configurations, saving results, and logging:

`cohort_name`: Name of the task configuration file
***`config_path`***: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`

The above two fields are used for automatically loading task configurations, saving results, and logging:
***`output_filepath`***: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise

`config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
***`log_dir`***: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`

`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
Additionally, predicates may be specified in a separate predicates configuration file and loaded for overrides:

`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
***`predicates_path`***: Path to the [separate predicates configuration file](<>). Defaults to null
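
The defaults described above for `config_path`, `output_filepath`, and `log_dir` can be sketched in plain Python. This is only an illustration of the documented interpolation patterns, not how ACES or Hydra resolves them internally; the function `resolve_defaults` and the example directory and task names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_defaults(cohort_dir: str, cohort_name: str, shard: Optional[str] = None) -> dict:
    """Mirror the documented default paths (hypothetical helper, not ACES code)."""
    root = PurePosixPath(cohort_dir)
    task_dir = root / cohort_name
    if shard is not None:
        # MEDS with multiple shards: ${cohort_dir}/${cohort_name}/${data.shard}.parquet
        output = task_dir / f"{shard}.parquet"
    else:
        # Otherwise: ${cohort_dir}/${cohort_name}.parquet
        output = root / f"{cohort_name}.parquet"
    return {
        "config_path": str(root / f"{cohort_name}.yaml"),
        "output_filepath": str(output),
        "log_dir": str(task_dir / ".logs"),
    }


single = resolve_defaults("tasks", "mortality")
sharded = resolve_defaults("tasks", "mortality", shard="train/0")
print(single["output_filepath"])
print(sharded["output_filepath"])
```

Setting any of the three fields explicitly would bypass the corresponding default.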

#### Tab Completion

@@ -257,6 +261,8 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```

### Separate Predicates-Only File

For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
be created and passed into `TaskExtractorConfig.load()`. Only referenced predicates will have a predicate
column computed and evaluated, so one could create a dataset-specific deposit file with many predicates and
@@ -266,4 +272,8 @@ reference as needed to ensure the cleanliness of the dataset-agnostic task crite
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```

If the same predicates are defined in both the task configuration file and the predicates-only file, the
predicates-only definition takes precedence and will be used to override previous definitions. As such, one may
create a predicates-only "database" file for a particular dataset, and override accordingly for various tasks.
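
The precedence rule above can be pictured as a dictionary merge in which the predicates-only definitions are applied last. The sketch below uses plain Python dictionaries rather than ACES objects; `merge_predicates`, the predicate names, and the code strings are all hypothetical and only illustrate the ordering.

```python
def merge_predicates(task_predicates: dict, predicates_only: dict) -> dict:
    """Combine two predicate mappings; predicates-only entries win on name clashes."""
    merged = dict(task_predicates)   # start from the task configuration file
    merged.update(predicates_only)   # predicates-only file overrides duplicates
    return merged


# Hypothetical definitions from a task configuration file.
task_predicates = {
    "death": {"code": "DEATH"},
    "admission": {"code": "ADMISSION"},
}

# Hypothetical dataset-specific definitions from a predicates-only file.
predicates_only = {
    "death": {"code": "event_type//DEATH"},
    "discharge": {"code": "event_type//DISCHARGE"},
}

merged = merge_predicates(task_predicates, predicates_only)
print(merged["death"])
```

Here `death` ends up with the predicates-only definition, while `admission` keeps the definition from the task configuration file.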

______________________________________________________________________
3 changes: 2 additions & 1 deletion src/aces/configs/__init__.py
@@ -41,12 +41,13 @@
- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
Note: data.shard can be expanded using the `expand_shards` function. Please refer to
https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
predicates_path (optional): path to a separate predicates configuration file for overriding
predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
3 changes: 2 additions & 1 deletion src/aces/configs/_aces.yaml
@@ -58,12 +58,13 @@ hydra:
- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
Note: data.shard can be expanded using the `expand_shards` function. Please refer to
https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
predicates_path (optional): path to a separate predicates configuration file for overriding
predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
