Working MEDS tutorial
justin13601 committed Sep 26, 2024
1 parent da00904 commit 43b031e
Showing 1 changed file with 13 additions and 22 deletions: docs/source/notebooks/tutorial_meds.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Code Example with Synthetic MEDS Data"
+ "# Code Tutorial with Synthetic MEDS Data"
]
},
{
@@ -20,7 +20,7 @@
"source": [
"### Imports\n",
"\n",
- "First, let's import ACES! Three modules - `config`, `predicates`, and `query` - are required to execute an end-to-end cohort extraction. `omegaconf` is also required to express our data config parameters in order to load our `EventStream` dataset. Other imports are only needed visualization!"
+ "First, let's import ACES! Three modules - `config`, `predicates`, and `query` - are required to execute an end-to-end cohort extraction. `omegaconf` is also required to express our data config parameters in order to load our `MEDS` dataset. Other imports are only needed for visualization!"
]
},
{
@@ -32,9 +32,9 @@
"import json\n",
"from pathlib import Path\n",
"\n",
+ "import pandas as pd\n",
"import yaml\n",
"from bigtree import print_tree\n",
- "from EventStream.data.dataset_polars import Dataset\n",
"from IPython.display import display\n",
"from omegaconf import DictConfig\n",
"\n",
@@ -47,7 +47,7 @@
"source": [
"### Directories\n",
"\n",
- "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively."
+ "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the MEDS synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively."
]
},
{
@@ -57,7 +57,7 @@
"outputs": [],
"source": [
"config_path = \"../../../sample_configs/inhospital_mortality.yaml\"\n",
- "data_path = \"../../../sample_data/esgpt_sample\""
+ "data_path = \"../../../sample_data/meds_sample/\""
]
},
{
@@ -71,10 +71,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The task configuration file is the core configuration language that ACES uses to extract cohorts. Details about this configuration language is available in [Configuration Language](https://eventstreamaces.readthedocs.io/en/latest/configuration.html). In brief, the configuration file contains `predicates`, `trigger`, and `windows` sections. \n",
+ "The task configuration file is the core configuration language that ACES uses to extract cohorts. Details about this configuration language are available in [Configuration Language](https://eventstreamaces.readthedocs.io/en/latest/configuration.html). In brief, the configuration file contains `predicates`, `patient_demographics`, `trigger`, and `windows` sections.\n",
"\n",
"The `predicates` section is used to define dataset-specific concepts that are needed for the task. In our case of binary mortality prediction, we are interested in extracting a cohort of patients that have been admitted into the hospital and who were subsequently discharged or died. As such, `admission`, `discharge`, `death`, and `discharge_or_death` would be handy predicates.\n",
"\n",
+ "The `patient_demographics` section is used to define static concepts that remain constant for subjects over time. For instance, sex is a common static variable. Should we want to filter our cohort to patients of a specific sex, we can do so here in the same way as defining predicates. For more information on predicates, please refer to this [guide](https://eventstreamaces.readthedocs.io/en/latest/technical.html#predicates-plainpredicateconfig-and-derivedpredicateconfig). In this example, let's say we are only interested in male patients.\n",
"\n",
"We'd also like to make a prediction of mortality for each admission. Hence, a reasonable `trigger` event would be an `admission` predicate.\n",
"\n",
"Suppose in our task, we'd like to set a constraint that the admission must have been more than 48 hours long. Additionally, for our prediction inputs, we'd like to use all information in the patient record up until 24 hours after admission, which must contain at least 5 event records (as we'd want to ensure there is sufficient input data). These clauses are captured in the `windows` section where each window is defined relative to another."
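The clauses above can be sketched as a task configuration. To be clear, this is a hypothetical illustration of the structure described in this cell, not the shipped `sample_configs/inhospital_mortality.yaml`; the predicate codes, window names, and field spellings are assumptions that should be checked against the ACES configuration docs:

```yaml
# Hypothetical sketch of an ACES task configuration (illustrative only).
predicates:
  admission:
    code: ADMISSION              # assumed dataset-specific code
  discharge:
    code: DISCHARGE
  death:
    code: DEATH
  discharge_or_death:
    expr: or(discharge, death)   # derived predicate

patient_demographics:
  male:
    code: SEX//male              # assumed static code for male patients

trigger: admission               # one extraction row per hospital admission

windows:
  input:                         # all history up to 24h after admission
    start: null
    end: trigger + 24h
    start_inclusive: true
    end_inclusive: true
    has:
      _ANY_EVENT: (5, None)      # at least 5 event records
  gap:                           # admission must last more than 48h
    start: trigger
    end: start + 48h
    start_inclusive: false
    end_inclusive: true
    has:
      discharge: (None, 0)       # no discharge or death within 48h
      death: (None, 0)
  target:                        # outcome window ends at discharge or death
    start: gap.end
    end: start -> discharge_or_death
    start_inclusive: false
    end_inclusive: true
    label: death
```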
@@ -144,13 +146,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This tutorial uses synthetic data of 100 patients stored in the ESGPT standard. For more information about this data, please refer to the [ESGPT Documentation](https://eventstreamml.readthedocs.io/en/latest/_collections/local_tutorial_notebook.html).\n",
- "\n",
- "We first load the dataset by passing the path (`Path` object) to the directory containing the ESGPT dataset into `EventStream`. This configures a `ESD` object, allowing us to access the relevant dataframes. While ESGPT contains a wealth of other functionality, we are particularly interested in the loading of `events_df` and the `dynamic_measurements_df`.\n",
- "\n",
- "`events_df` consists of unique (`subject_id`, `timestamp`) pairs mapped to an unique `event_id`. For each `event_id`, the `event_type` column contains `&` delimited sequences, such as `ADMISSION`, `DEATH`, `LAB`, and `VITALS`, etc., specifying the type of event(s) that occurred that `event_id`.\n",
- "\n",
- "`dynamic_measurements_df` consists of various values in the electronic health record, and can be linked to the `events_df` table via the `event_id`."
+ "This tutorial uses synthetic data of 100 patients stored in the MEDS standard. This data was generated with ESGPT and separately converted to MEDS; for more information on its generation, please refer to the [ESGPT Documentation](https://eventstreamml.readthedocs.io/en/latest/_collections/local_tutorial_notebook.html). Here is what the data looks like:"
]
},
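Before reading the real files, it may help to picture the shape of the data: a MEDS dataset is a long, sparse table with one row per coded measurement. Below is a hand-built toy frame in roughly that shape; the column names (`patient_id`, `time`, `code`, `numeric_value`) vary across MEDS schema versions, so treat them as illustrative assumptions rather than the authoritative schema:

```python
import pandas as pd

# Toy frame in the rough shape of a MEDS dataset: one row per coded
# measurement, with a numeric value where one applies.
meds_like = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2],
        "time": pd.to_datetime(
            [
                "2024-01-01 08:00",
                "2024-01-01 08:00",
                "2024-01-03 10:00",
                "2024-02-05 12:00",
            ]
        ),
        "code": ["ADMISSION", "HR", "DISCHARGE", "ADMISSION"],
        "numeric_value": [None, 88.0, None, None],  # NaN where no value applies
    }
)
print(meds_like)
```

Note how two rows can share a (`patient_id`, `time`) pair: the admission event and a heart-rate measurement recorded at the same moment each get their own row.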
{
@@ -159,12 +155,7 @@
"metadata": {},
"outputs": [],
"source": [
- "ESD = Dataset.load(Path(data_path))\n",
- "events_df = ESD.events_df\n",
- "dynamic_measurements_df = ESD.dynamic_measurements_df\n",
- "\n",
- "display(events_df)\n",
- "display(dynamic_measurements_df)"
+ "pd.read_parquet(f\"{data_path}/train/0.parquet\").head()"
]
},
{
@@ -177,7 +168,7 @@
"\n",
"A predicate column is simply a column containing numerical counts (often just `0`'s and `1`'s), representing the number of times a given predicate (concept) occurs at a given timestamp for a given patient.\n",
"\n",
- "In the case of ESGPT, ACES support the automatic generation of these predicate columns from the configuration file. However, some fields need to be provided via a `DictConfig` object. These include the path to the directory of the ESGPT dataset (`str`) and the data standard (which is `esgpt` in this case).\n",
+ "In the case of MEDS (and ESGPT), ACES supports the automatic generation of these predicate columns from the configuration file. However, some fields need to be provided via a `DictConfig` object. These include the path to the directory of the MEDS dataset (`str`) and the data standard (which is `meds` in this case).\n",
"\n",
"Given this data configuration, we then call `predicates.get_predicates_df()` to generate the relevant predicate columns for our task. Due to the nature of the specified predicates, the resulting dataframe simply contains the unique (`subject_id`, `timestamp`) pairs and binary columns for each predicate. An additional predicate `_ANY_EVENT` is also generated - this will be used to enforce our constraint of the number of events in the `input` window. "
]
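As a toy illustration of what such a predicate dataframe looks like, the snippet below builds one by hand. This is not ACES output; the rows and values are invented, and only the general shape (one row per unique (`subject_id`, `timestamp`), one count column per predicate) reflects the description above:

```python
import pandas as pd

# Hand-built toy predicate dataframe (illustrative, not ACES output):
# one row per unique (subject_id, timestamp), one count column per predicate.
predicates_df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2024-01-01 08:00", "2024-01-03 10:00", "2024-02-05 12:00"]
        ),
        "admission": [1, 0, 1],
        "discharge": [0, 1, 0],
        "death": [0, 0, 0],
        "_ANY_EVENT": [1, 1, 1],  # fires at every timestamp with any event
    }
)

# A derived predicate like discharge_or_death is just an OR over the
# plain predicate columns.
predicates_df["discharge_or_death"] = (
    (predicates_df["discharge"] + predicates_df["death"]) > 0
).astype(int)

print(predicates_df)
```

The `_ANY_EVENT` column is what lets ACES enforce the "at least 5 event records" constraint on the `input` window: summing it over a window counts the window's events.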
@@ -188,7 +179,7 @@
"metadata": {},
"outputs": [],
"source": [
- "data_config = DictConfig({\"path\": data_path, \"standard\": \"esgpt\"})\n",
+ "data_config = DictConfig({\"path\": data_path, \"standard\": \"meds\"})\n",
"\n",
"predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)\n",
"display(predicates_df)"
@@ -231,7 +222,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "... and that's a wrap! We have used ACES to perform an end-to-end extraction on a ESGPT dataset for a cohort that can be used to predict in-hospital mortality. Similar pipelines can be made for other tasks, as well as using the MEDS data standard. You may also pre-compute predicate columns and use the `direct` flag when loading in `.csv` or `.parquet` data files. More information about this is available in [Predicates DataFrame](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html).\n",
+ "... and that's a wrap! We have used ACES to perform an end-to-end extraction on a MEDS dataset for a cohort that can be used to predict in-hospital mortality. Similar pipelines can be built for other tasks, as well as for data in the ESGPT standard. You may also pre-compute predicate columns and use the `direct` flag when loading `.csv` or `.parquet` data files. More information is available in [Predicates DataFrame](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html).\n",
"\n",
"As always, please don't hesitate to reach out should you have any questions about ACES!\n"
]
