diff --git a/README.md b/README.md
index 93c1fbc..4fc29d3 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,15 @@
 # Automatic Cohort Extraction System for Event-Streams
 
+**Updates**
+
+- **\[2024-09-01\]** Predicates can now be defined in a configuration file separate from task criteria files.
+- **\[2024-08-29\]** MEDS v0.3.3 is now supported.
+- **\[2024-08-22\]** Polars v1.5.\* is now supported.
+- **\[2024-08-10\]** Expanded predicates configuration language to support regular expressions, multi-column constraints, and multi-value constraints.
+- **\[2024-07-30\]** Added ability to place constraints on static variables, such as patient demographics.
+- **\[2024-06-28\]** Paper posted at [arXiv:2406.19653](https://arxiv.org/abs/2406.19653).
+
 Automatic Cohort Extraction System (ACES) is a library that streamlines the extraction of task-specific cohorts from time series datasets formatted as event-streams, such as Electronic Health Records (EHR). ACES is designed to query these EHR datasets for valid subjects, guided by various constraints and requirements defined in a YAML task configuration file. This offers a powerful and user-friendly solution to researchers and developers. The use of a human-readable YAML configuration file also eliminates the need for users to be proficient in complex dataframe querying, making the extraction process accessible to a broader audience.
 
 There are diverse applications in healthcare and beyond. For instance, researchers can effortlessly define subsets of EHR datasets for training of foundation models. Retrospective analyses can also become more accessible to clinicians as it enables the extraction of tailored cohorts for studying specific medical conditions or population demographics. A new era of benchmarking over tasks instead of data may also be realized ([MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main)).
@@ -95,31 +104,29 @@ df_result = query.query(cfg=cfg, predicates_df=predicates_df)
 4. **Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort:
 
 ```log
-aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="esgpt" data.path="MIMIC_ESD_new_schema_08-31-23-1/"
-2024-06-05 02:06:57.362 | INFO | aces.__main__:main:40 - Loading config from 'sample_configs/inhospital_mortality.yaml'
-2024-06-05 02:06:57.369 | INFO | aces.config:load:832 - Parsing predicates...
-2024-06-05 02:06:57.369 | INFO | aces.config:load:838 - Parsing trigger event...
-2024-06-05 02:06:57.369 | INFO | aces.config:load:841 - Parsing windows...
-2024-06-05 02:06:57.380 | INFO | aces.__main__:main:43 - Attempting to get predicates dataframe given:
-standard: esgpt
+aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="meds" data.path="MEDS_DATA"
+2024-09-24 02:06:57.362 | INFO | aces.__main__:main:153 - Loading config from 'sample_configs/inhospital_mortality.yaml'
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1258 - Parsing windows...
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1267 - Parsing trigger event...
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1282 - Parsing predicates...
+2024-09-24 02:06:57.380 | INFO | aces.__main__:main:156 - Attempting to get predicates dataframe given:
+standard: meds
 ts_format: '%m/%d/%Y %H:%M'
-path: MIMIC_ESD_new_schema_08-31-23-1/
+path: MEDS_DATA/
 _prefix: ''
 
-Updating config.save_dir from /n/data1/hms/dbmi/zaklab/RAMMS/data/MIMIC_IV/ESD_new_schema_08-31-23-1 to MIMIC_ESD_new_schema_08-31-23-1
-Loading events from MIMIC_ESD_new_schema_08-31-23-1/events_df.parquet...
-Loading dynamic_measurements from MIMIC_ESD_new_schema_08-31-23-1/dynamic_measurements_df.parquet...
-2024-06-05 02:07:01.405 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:241 - Generating plain predicate columns...
-2024-06-05 02:07:01.579 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'admission'.
-2024-06-05 02:07:01.770 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'discharge'.
-2024-06-05 02:07:01.925 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'death'.
-2024-06-05 02:07:07.155 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:273 - Cleaning up predicates dataframe...
-2024-06-05 02:07:07.156 | INFO | aces.predicates:get_predicates_df:401 - Loaded plain predicates. Generating derived predicate columns...
-2024-06-05 02:07:07.167 | INFO | aces.predicates:get_predicates_df:404 - Added predicate column 'discharge_or_death'.
-2024-06-05 02:07:07.772 | INFO | aces.predicates:get_predicates_df:413 - Generating special predicate columns...
-2024-06-05 02:07:07.841 | INFO | aces.predicates:get_predicates_df:434 - Added predicate column '_ANY_EVENT'.
-2024-06-05 02:07:07.841 | INFO | aces.query:query:32 - Checking if '(subject_id, timestamp)' columns are unique...
-2024-06-05 02:07:08.221 | INFO | aces.utils:log_tree:59 -
+2024-09-24 02:07:58.176 | INFO | aces.predicates:generate_plain_predicates_from_meds:268 - Loading MEDS data...
+2024-09-24 02:07:01.405 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:272 - Generating plain predicate columns...
+2024-09-24 02:07:01.579 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:276 - Added predicate column 'admission'.
+2024-09-24 02:07:01.770 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:276 - Added predicate column 'discharge'.
+2024-09-24 02:07:01.925 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:276 - Added predicate column 'death'.
+2024-09-24 02:07:07.155 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:279 - Cleaning up predicates dataframe...
+2024-09-24 02:07:07.156 | INFO | aces.predicates:get_predicates_df:642 - Loaded plain predicates. Generating derived predicate columns...
+2024-09-24 02:07:07.167 | INFO | aces.predicates:get_predicates_df:645 - Added predicate column 'discharge_or_death'.
+2024-09-24 02:07:07.772 | INFO | aces.predicates:get_predicates_df:654 - Generating special predicate columns...
+2024-09-24 02:07:07.841 | INFO | aces.predicates:get_predicates_df:681 - Added predicate column '_ANY_EVENT'.
+2024-09-24 02:07:07.841 | INFO | aces.query:query:76 - Checking if '(subject_id, timestamp)' columns are unique...
+2024-09-24 02:07:08.221 | INFO | aces.utils:log_tree:57 -
 
 trigger
 ┣━━ input.end
@@ -127,21 +134,22 @@ trigger
 ┗━━ gap.end
 ┗━━ target.end
 
-2024-06-05 02:07:08.221 | INFO | aces.query:query:43 - Beginning query...
-2024-06-05 02:07:08.221 | INFO | aces.query:query:44 - Identifying possible trigger nodes based on the specified trigger event...
-2024-06-05 02:07:08.233 | INFO | aces.constraints:check_constraints:93 - Excluding 14,623,763 rows as they failed to satisfy '1 <= admission <= None'.
-2024-06-05 02:07:08.249 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.end'...
-2024-06-05 02:07:13.259 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.start'...
-2024-06-05 02:07:26.011 | INFO | aces.constraints:check_constraints:93 - Excluding 12,212 rows as they failed to satisfy '5 <= _ANY_EVENT <= None'.
-2024-06-05 02:07:26.052 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'gap.end'...
-2024-06-05 02:07:30.223 | INFO | aces.constraints:check_constraints:93 - Excluding 631 rows as they failed to satisfy 'None <= admission <= 0'.
-2024-06-05 02:07:30.224 | INFO | aces.constraints:check_constraints:93 - Excluding 18,165 rows as they failed to satisfy 'None <= discharge <= 0'.
-2024-06-05 02:07:30.224 | INFO | aces.constraints:check_constraints:93 - Excluding 221 rows as they failed to satisfy 'None <= death <= 0'.
-2024-06-05 02:07:30.226 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'target.end'...
-2024-06-05 02:07:41.512 | INFO | aces.query:query:60 - Done. 44,318 valid rows returned corresponding to 11,606 subjects.
-2024-06-05 02:07:41.513 | INFO | aces.query:query:72 - Extracting label 'death' from window 'target'...
-2024-06-05 02:07:41.514 | INFO | aces.query:query:86 - Setting index timestamp as 'end' of window 'input'...
-2024-06-05 02:07:41.606 | INFO | aces.__main__:main:52 - Completed in 0:00:44.243514. Results saved to 'sample_configs/inhospital_mortality.parquet'.
+2024-09-24 02:07:08.221 | INFO | aces.query:query:85 - Beginning query...
+2024-09-24 02:07:08.221 | INFO | aces.query:query:89 - Static variable criteria specified, filtering patient demographics...
+2024-09-24 02:07:08.221 | INFO | aces.query:query:99 - Identifying possible trigger nodes based on the specified trigger event...
+2024-09-24 02:07:08.233 | INFO | aces.constraints:check_constraints:110 - Excluding 14,623,763 rows as they failed to satisfy '1 <= admission <= None'.
+2024-09-24 02:07:08.249 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.end'...
+2024-09-24 02:07:13.259 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.start'...
+2024-09-24 02:07:26.011 | INFO | aces.constraints:check_constraints:176 - Excluding 12,212 rows as they failed to satisfy '5 <= _ANY_EVENT <= None'.
+2024-09-24 02:07:26.052 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'gap.end'...
+2024-09-24 02:07:30.223 | INFO | aces.constraints:check_constraints:176 - Excluding 631 rows as they failed to satisfy 'None <= admission <= 0'.
+2024-09-24 02:07:30.224 | INFO | aces.constraints:check_constraints:176 - Excluding 18,165 rows as they failed to satisfy 'None <= discharge <= 0'.
+2024-09-24 02:07:30.224 | INFO | aces.constraints:check_constraints:176 - Excluding 221 rows as they failed to satisfy 'None <= death <= 0'.
+2024-09-24 02:07:30.226 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'target.end'...
+2024-09-24 02:07:41.512 | INFO | aces.query:query:113 - Done. 44,318 valid rows returned corresponding to 11,606 subjects.
+2024-09-24 02:07:41.513 | INFO | aces.query:query:129 - Extracting label 'death' from window 'target'...
+2024-09-24 02:07:41.514 | INFO | aces.query:query:142 - Setting index timestamp as 'end' of window 'input'...
+2024-09-24 02:07:41.606 | INFO | aces.__main__:main:188 - Completed in 0:00:44.243514. Results saved to 'sample_configs/inhospital_mortality.parquet'.
 ```
 
 ## Task Configuration File
@@ -170,11 +178,11 @@ windows:
   ...
 ```
 
-Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` model on `sample_data/sample_data.csv` as this predicates dataframe was designed specifically to capture predicates needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent-mortality.yaml` would be able to be extracted on `sample_data/esgpt_sample` and `sample_data/meds_sample` due to a lack of required predicates.
+Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` mode on `sample_data/sample_data.csv` as this predicates dataframe was designed specifically to capture concepts needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent-mortality.yaml` can be extracted from `sample_data/esgpt_sample` and `sample_data/meds_sample` due to a lack of required concepts in the datasets.
 
 ### Predicates
 
-Predicates describe the event at a timestamp and are used to create predicate columns that contain predicate counts for each row of your dataset. If the MEDS or ESGPT data standard is used, ACES automatically computes the predicates dataframe needed for the query from the `predicates` fields in your task configuration file. However, you may also choose to construct your own predicates dataframe should you not wish to use the MEDS or ESGPT data standard.
+Predicates describe the event at a timestamp. Predicate columns are created to contain predicate counts for each row of your dataset. If the MEDS or ESGPT data standard is used, ACES automatically computes the predicates dataframe needed for the query from the `predicates` fields in your task configuration file. However, you may also choose to construct your own predicates dataframe should you not wish to use the MEDS or ESGPT data standard.
 
 Example predicates dataframe `.csv`:
@@ -203,19 +211,24 @@ normal_spo2:
   code: SpO2 # required
   value_min: 60 # optional
   value_max: 120 # optional
   value_min_inclusive: true # optional
   value_max_inclusive: true # optional
+  other_cols: {} # optional
 ```
 
 Fields for a "plain" predicate:
 
-- `code` (required): Must be a string with `//` sequence separating the column name and column value.
+- `code` (required): Must be one of the following:
+  - a string with a `//` sequence separating the column name and column value.
+  - a list of strings as above in the form of {any: \[???, ???, ...\]}, which will match any of the listed codes.
+  - a regex in the form of {regex: "???"}, which will match any code that matches that regular expression.
 - `value_min` (optional): Must be float or integer specifying the minimum value of the predicate, if the variable is presented as numerical values.
 - `value_max` (optional): Must be float or integer specifying the maximum value of the predicate, if the variable is presented as numerical values.
 - `value_min_inclusive` (optional): Must be a boolean specifying whether `value_min` is inclusive or not.
 - `value_max_inclusive` (optional): Must be a boolean specifying whether `value_max` is inclusive or not.
+- `other_cols` (optional): Must be a 1-to-1 dictionary of column name and column value, which places additional constraints on further columns.
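+
+As a purely illustrative sketch, a plain predicate combining these options might look like the following (the code strings, numeric bounds, and the `unit` column are hypothetical placeholders rather than values from the sample data):
+
+```yaml
+high_glucose_lab:
+  code: {regex: "LAB//GLUCOSE.*"} # match any code satisfying this regular expression
+  value_min: 180 # optional numeric bound on the recorded value
+  value_min_inclusive: false
+  other_cols: {unit: "mg/dL"} # additional 1-to-1 constraint on another column
+
+any_admission:
+  code: {any: [ADMISSION//EMERGENCY, ADMISSION//ELECTIVE]} # match any of the listed codes
+```
+
+Predicates defined this way can then be referenced in triggers, windows, and derived predicates exactly like predicates whose `code` is a single string.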
 
 #### Derived Predicates
 
-"Derived" predicates combine existing "plain" predicates using `and` or `or` keywords and have exactly 1 required `expr` field: For instance, the following defines a predicate representing either death or discharge (by combining "plain" predicates of `death` and `discharge`):
+"Derived" predicates combine existing "plain" predicates using `and` / `or` keywords and have exactly 1 required `expr` field. For instance, the following defines a predicate representing either death or discharge (by combining "plain" predicates of `death` and `discharge`):
 
 ```yaml
 # plain predicates
@@ -231,9 +244,9 @@ discharge_or_death:
 Field for a "derived" predicate:
 
-- `expr`: Must be a string with the 'and()' or 'or()' key sequences, with "plain" predicates as its constituents.
+- `expr`: Must be a string with the 'and()' / 'or()' key sequences, with "plain" predicates as its constituents.
 
-A special predicate `_ANY_EVENT` is always defined, which simply represents any event, as the name suggests. This predicate can be used like any other predicate manually defined (ie., setting a constraint on its occurrence or using it as a trigger, more information below).
+A special predicate `_ANY_EVENT` is always defined, which simply represents any event, as the name suggests. This predicate can be used like any other manually defined predicate (ie., setting a constraint on its occurrence or using it as a trigger; more information below).
 
 #### Special Predicates
@@ -247,7 +260,7 @@ There are also a few special predicates that you can use. These *do not* need to
 ### Trigger Event
 
-The trigger event is a simple field with a value of a predicate name. For each trigger event, a predication by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications). You can also simply filter to a cohort of one event (ie., just a trigger event) should you not have any further criteria in your task.
+The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (ie., samples remaining after extraction according to other task specifications are considered valid). You can also simply filter to a cohort of one event (ie., just a trigger event) should you not have any further criteria in your task.
 
 ```yaml
 predicates:
@@ -275,28 +288,28 @@ input:
 In this example, the window `input` begins at `NULL` (ie., the first event or the start of the time series record), and ends at 24 hours after the `trigger` event, which is specified to be a hospital admission. The window is inclusive on both ends (ie., both the first event and the event at 24 hours after the admission, if any, is included in this window). Finally, a constraint of 5 events of any kind is placed so any valid window would include sufficient data.
 
-Two fields (`start` and `end`) are required to define the size of a window. Both fields must be a string referencing a predicate name, or a string referencing the `start` or `end` field of another window name. In addition, it may express a temporal relationship by including a positive or negative time period expressed as a string (ie., `+ 2 days`, `- 365 days`, `+ 12h`, `- 30 minutes`, `+ 60s`). It may also express an event relationship by including a sequence with a directional arrow and a predicate name (ie., `-> predicate_1` or `<- predicate_1`). Finally, it may also contain `NULL`, indicating the first/last event for the `start`/`end` field, respectively.
+Two fields (`start` and `end`) are required to define the size of a window. Each field must be a string referencing a predicate name, or a string referencing the `start` or `end` field of another window. In addition, it may express a temporal relationship by including a positive or negative time period expressed as a string (ie., `+ 2 days`, `- 365 days`, `+ 12h`, `- 30 minutes`, `+ 60s`). It may also express an event relationship by including a sequence with a directional arrow and a predicate name (ie., `-> predicate_1` indicating the period until the next occurrence of the predicate, or `<- predicate_1` indicating the period following the previous occurrence of the predicate). Finally, it may also contain `NULL`, indicating the first/last event for the `start`/`end` field, respectively.
 
-`start_inclusive` and `end_inclusive` are required booleans specifying whether the events, if any, at the `start` and `end` points of the window are included in the window.
+`start_inclusive` and `end_inclusive` are required booleans specifying whether the events, if present, at the `start` and `end` points of the window are included in the window.
 
 The `has` field specifies constraints relating to predicates within the window. For each predicate defined previously, a constraint for occurrences can be set using a string in the format of `(, )`. Unbounded conditions can be specified by using `None` or leaving it empty (ie., `(5, None)`, `(8,)`, `(None, 32)`, `(,10)`).
 
-`label` is an optional field and can only exist in ONE window in the task configuration file if defined. It must be a string matching a defined predicate name, and is used to extract the label for the task.
+`label` is an optional field and can only exist in ONE window in the task configuration file if defined (an error is thrown otherwise). It must be a string matching a defined predicate name, and is used to extract the label for the task.
 
-`index_timestamp` is an optional field and can only exist in ONE window in the task configuration file if defined. It must be either `start` or `end`, and is used to create an index column used to easily manipulate the results output. Usually, one would set it to be the time at which the prediction would be made (ie., set to `end` in your window containing input data). Please ensure that you are validating your interpretation of `index_timestamp` for your task. For instance, if `index_timestamp` is set to the `end` of a particular window, the timestamp would be the event at the window boundary. However, in some cases, your task may want to exclude this boundary event, so ensure you are correctly interpreting the timestamp during extraction.
+`index_timestamp` is an optional field and can only exist in ONE window in the task configuration file if defined (an error is thrown otherwise). It must be either `start` or `end`, and is used to create an index column for easily manipulating the results output. Usually, one would set it to be the time at which the prediction would be made (ie., set to `end` in your window containing input data). Please ensure that you are validating your interpretation of `index_timestamp` for your task. For instance, if `index_timestamp` is set to the `end` of a particular window, the timestamp would be the event at the window boundary. However, in some cases, your task may want to exclude this boundary event, so ensure you are correctly interpreting the timestamp during extraction.
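+
+Putting these window fields together, a hypothetical `windows` block for the in-hospital mortality example above might look roughly like this (boundary expressions, durations, and inclusivity flags are illustrative rather than copied from the shipped sample configuration):
+
+```yaml
+windows:
+  input:
+    start: NULL # from the first event in the record
+    end: trigger + 24h # until 24 hours after the admission trigger
+    start_inclusive: true
+    end_inclusive: true
+    has:
+      _ANY_EVENT: (5, None) # require at least 5 events of any kind
+    index_timestamp: end # predictions are indexed at the end of the input window
+  gap:
+    start: input.end
+    end: start + 24h # illustrative gap duration
+    start_inclusive: false
+    end_inclusive: true
+    has:
+      admission: (None, 0) # no new admission, discharge, or death during the gap
+      discharge: (None, 0)
+      death: (None, 0)
+  target:
+    start: gap.end
+    end: start -> discharge_or_death # until the next discharge or death event
+    start_inclusive: false
+    end_inclusive: true
+    label: death # the extracted label is whether a death occurred in this window
+```
+
+The chain of boundaries here (trigger to input.end, then gap.end, then target.end) is what produces the task tree printed in the sample logs above.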
 
 ## FAQs
 
 ### Static Data
 
-Static data is now supported. In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, it is feasible to express static variables as a predicate and apply the associated criteria normally using the `patient_demographics` heading of a configuration file. Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
+In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, it is feasible to express static variables as a predicate and apply the associated criteria normally using the `patient_demographics` heading of a configuration file. Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
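+
+As a rough sketch of that idea (the exact syntax is covered in the linked examples, and the demographic code below is a placeholder that depends on your dataset's vocabulary), a static criterion might be written as a plain predicate under the `patient_demographics` heading:
+
+```yaml
+patient_demographics:
+  male:
+    code: SEX//male # placeholder code; matched against static (null-timestamp) rows in MEDS
+```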
 
 ### Complementary Tools
 
 ACES is an integral part of the MEDS ecosystem. To fully leverage its capabilities, you can utilize it alongside other complementary MEDS tools, such as:
 
-- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some command data models, into the MEDS format.
-- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks).
+- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some common data models, into the MEDS format.
+- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks).
 - [MEDS-Polars](https://github.com/Medical-Event-Data-Standard/meds_etl), which contains polars-based ETL scripts.
 
 ### Alternative Tools
@@ -305,22 +318,19 @@ There are existing alternatives for cohort extraction that focus on specific com
 ACES serves as a middle ground between PIC-SURE and ATLAS. While it may offer less capability than PIC-SURE, it compensates with greater ease of use and improved communication value. Compared to ATLAS, ACES provides greater capability, though with slightly lower ease of use, yet it still maintains a higher communication value.
 
-Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions, and can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs.
+Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions. It can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs.
 
 ## Future Roadmap
 
 ### Usability
 
 - Extract indexing information for easier setup of downstream tasks ([#37](https://github.com/justin13601/ACES/issues/37))
-- Allow separate predicates-only files and criteria-only files ([#42](https://github.com/justin13601/ACES/issues/42))
 
 ### Coverage
 
 - Directly support nested configuration files ([#43](https://github.com/justin13601/ACES/issues/43))
 - Support timestamp binning for use in predicates or as qualifiers ([#44](https://github.com/justin13601/ACES/issues/44))
 - Support additional label types ([#45](https://github.com/justin13601/ACES/issues/45))
-- Support additional predicate types ([#47](https://github.com/justin13601/ACES/issues/47))
-- Better handle criteria for static variables ([#48](https://github.com/justin13601/ACES/issues/48))
 - Allow chaining of multiple task configurations ([#49](https://github.com/justin13601/ACES/issues/49))
 
 ### Generalizability
@@ -330,7 +340,6 @@ Finally, ACES is not tied to a particular common data model. Built on a flexible
 ### Causal Usage
 
 - Directly support case-control matching ([#51](https://github.com/justin13601/ACES/issues/51))
-- Directly support profiling of excluded populations ([#52](https://github.com/justin13601/ACES/issues/52))
 
 ### Additional Tasks
diff --git a/sample_data/meds_sample/held_out/0.parquet b/sample_data/meds_sample/held_out/0.parquet
new file mode 100644
index 0000000..5c71d98
Binary files /dev/null and b/sample_data/meds_sample/held_out/0.parquet differ
diff --git a/sample_data/meds_sample/sample_shard.parquet b/sample_data/meds_sample/sample_shard.parquet
index 931a1f4..5c71d98 100644
Binary files a/sample_data/meds_sample/sample_shard.parquet and b/sample_data/meds_sample/sample_shard.parquet differ
diff --git a/sample_data/meds_sample/test/0.parquet b/sample_data/meds_sample/test/0.parquet
deleted file mode 100644
index 931a1f4..0000000
Binary files a/sample_data/meds_sample/test/0.parquet and /dev/null differ
diff --git a/sample_data/meds_sample/train/0.parquet b/sample_data/meds_sample/train/0.parquet
index 4fe06c7..2f90ac3 100644
Binary files a/sample_data/meds_sample/train/0.parquet and b/sample_data/meds_sample/train/0.parquet differ
diff --git a/sample_data/meds_sample/train/1.parquet b/sample_data/meds_sample/train/1.parquet
index 87af74e..98e4ee7 100644
Binary files a/sample_data/meds_sample/train/1.parquet and b/sample_data/meds_sample/train/1.parquet differ