updating filepattern and vaex package

PolusAI · Jan 25, 2024 · 275d950
1 parent c2b84e2

Showing 10 changed files with 721 additions and 465 deletions.
diff --git a/clustering/feature-subsetting-plugin/README.md b/clustering/feature-subsetting-plugin/README.md
@@ -1,24 +1,25 @@
-# Feature Data Subset
+# Feature Data Subset (v0.2.0-dev)
 
-This WIPP plugin subsets data based on a given feature. It works in conjunction with the `polus-feature-extraction-plugin`, where the feature extraction plugin can be used to extract the features such as the mean intensity of every image in the input image collection. 
+This WIPP plugin subsets data based on a given feature. It works in conjunction with the `polus-feature-extraction-plugin`, where the feature extraction plugin can be used to extract the features such as the mean intensity of every image in the input image collection.
 
 # Usage
-The details and usage of the plugin inputs is provided in the section below. In addition to the subsetted data, the output directory also consists of a `summary.txt` file which has information as to what images were kept and their new filename if they were renamed.  
-
-### Explanation of inputs 
-Some of the inputs are pretty straighforward and are used commonly across most WIPP plugins. This section is used to provide some details and examples of the inputs that may be a little complicated. The image collection with the following pattern will be used as an example : `r{r+}_t{t+}_p{p+}_z{z+}_c{c+}.ome.tif`, where r,t,p,z,c stand for replicate, timepoint, positon,z-positon, and channel respectively. Consider we have 5 replicates, 3 timepoints, 50 positions, 10 z-planes and 4 channels. 
-
-1. `inpDir` - This contains the path to the input image collection to subset data from. 
-2. `filePattern` - Filepattern of the input images
-3. `groupVar` - This is a mandatory input across which to subset data. This can take either 1 or 2 variables as input and if 2 variables are provided then the second variable will be treated as the minor grouping variable. In our example, if the `z` is provided as input, then within a subcollection, the mean of the feature value will be taken for all images with the same z. Then the z positions will be filtered out based on the input of `percentile` and `removeDirection` variables. Now if `z,c` are provided as input, then 'c' will be treated as the minor grouping variable which means that the mean will be taken for all images with the same z for each channel. Also, the plugin will ensures that the same values of z positions are filtered out across c. 
-4. `csvDir` - This contains the path to the csv collection containing the feature values for each image. This can be the output of the feature extraction plugin.
-5. `feature` - The column name from the csv file that will be used to filter images
-6. `percentile` and `removeDirection` - These two variables denote the critieria with which images are filtered. For example, if percentile is `0.1` and removeDirection is set to `Below` then images with feature value below the 10th percentile will be removed. On the other hand, if removeDirection is set to above then all images with feature value greater than the 10th pecentile will be removed. This enables data subsetting from both `brighfield` and `darkfield` microscopy images.  
-
- **Optional Arguments**   
-
+The details and usage of the plugin inputs are provided in the section below. In addition to the subsetted data, the output directory also contains a `summary.txt` file that records which images were kept and their new filenames if they were renamed.
+
+### Explanation of inputs
+Some of the inputs are straightforward and are common to most WIPP plugins. This section provides details and examples for the inputs that may be more complicated. An image collection with the following pattern is used as an example: `r{r+}_t{t+}_p{p+}_z{z+}_c{c+}.ome.tif`, where r, t, p, z, and c stand for replicate, timepoint, position, z-position, and channel, respectively. Suppose we have 5 replicates, 3 timepoints, 50 positions, 10 z-planes, and 4 channels.
+
+1. `inpDir` - This contains the path to the input image collection to subset data from.
+2. `tabularDir` - This contains the path to the tabular files (`.csv`, `.arrow`, or `.parquet`) containing the feature values for each image. This can be the output of the feature extraction or nyxus plugin.
+3. `filePattern` - Filepattern of the input images
+4. `imageFeature` - Column in the tabular data that contains the image filenames
+5. `tabularFeature` - Column in the tabular data whose values are used to filter images
+6. `groupVar` - This is a mandatory input specifying the variable(s) across which to subset data. It takes either 1 or 2 variables; if 2 variables are provided, the second is treated as the minor grouping variable. In our example, if `z` is provided as input, then within a subcollection the mean of the feature value is taken over all images with the same z. The z positions are then filtered based on the `percentile` and `removeDirection` inputs. If `z,c` is provided as input, then `c` is treated as the minor grouping variable, which means that the mean is taken over all images with the same z for each channel. The plugin also ensures that the same z positions are filtered out across c.
+7. `percentile` and `removeDirection` - These two variables define the criteria by which images are filtered. For example, if percentile is `0.1` and removeDirection is set to `Below`, then images with a feature value below the 10th percentile are removed. If removeDirection is set to `Above`, then all images with a feature value greater than the 10th percentile are removed. This enables data subsetting from both `brightfield` and `darkfield` microscopy images.
+
+ **Optional Arguments**
+
 8. `sectionVar` - This is an optional input to segregate the input image collection into sub-collections. The analysis is done separately for each sub-collection. In our example, if the user enters `r,t` as the sectionVar, then we will have 15 sub-collections (5*3), 1 for each combination of timepoint and replicate. If the user enters `r` as sectionVar, then we will have 5 sub-collections, 1 for each replicate. If the user wants to consider the whole image collection as a single section, then no input is required. NOTE: As a post-processing step, the same number of images will be subsetted across different sections.
-9. `padding` - This is an optional variable with default value of 0. A delay of 3 means that 3 additional planes will captured on either side of the subsetted data. This can be used as a sanity check to ensure that the subsetted data captures the images we want.  For example, in our examples if the following z values were filtered out intitially - 5,6,7 ; then a delay of 3 means that the output dataset will have z positions 2,3,4,5,6,7,8,9,10 if all them exist. 
+9. `padding` - This is an optional variable with a default value of 0. A padding of 3 means that 3 additional planes will be captured on either side of the subsetted data. This can be used as a sanity check to ensure that the subsetted data captures the images we want. For example, if the z values 5, 6, 7 were initially selected by the filter, then a padding of 3 means that the output dataset will have z positions 2,3,4,5,6,7,8,9,10 if all of them exist.
 10. `writeOutput` - This is an optional argument with default value `True`. If it is set to true, then both the output image collection and the `summary.txt` file will be created. If it is set to false, then the output directory will only contain `summary.txt`. This option enables the user to tune hyperparameters such as percentile, removeDirection, and feature without actually creating the output image collection.
 
 
@@ -38,19 +39,20 @@ If WIPP is running, navigate to the plugins page and add a new plugin. Paste the
 
 ## Options
 
-This plugin takes one input argument and one output argument:
+This plugin takes twelve input arguments and one output argument:
 
 | Name                | Description                                           | I/O    | Type          |
 | ------------------- | ----------------------------------------------------- | ------ | ------------- |
-| `--csvDir`          | CSV collection containing features                    | Input  | csvCollection |
-| `--padding`         | Number of images to capture outside the cutoff        | Input  | int           |
-| `--feature`         | Feature to use to subset data                         | Input  | string        |
+| `--inpDir`          | Input image collection to be processed by this plugin | Input  | collection    |
+| `--tabularDir`      | Path to tabular data                                  | Input  | genericData   |
 | `--filePattern`     | Filename pattern used to separate data                | Input  | string        |
+| `--imageFeature`    | Feature in tabular data with image filenames          | Input  | string        |
+| `--tabularFeature`  | Tabular feature to filter image files                 | Input  | string        |
+| `--padding`         | Number of images to capture outside the cutoff        | Input  | integer       |
 | `--groupVar`        | variables to group by in a section                    | Input  | string        |
-| `--inpDir`          | Input image collection to be processed by this plugin | Input  | collection    |
-| `--percentile`      | Percentile to remove                                  | Input  | int           |
+| `--percentile`      | Percentile to remove                                  | Input  | float         |
 | `--removeDirection` | remove direction above or below percentile            | Input  | string        |
 | `--sectionVar`      | variables to divide larger sections                   | Input  | string        |
 | `--writeOutput`     | write output image collection or not                  | Input  | boolean       |
-| `--outDir`          | Output collection                                     | Output | collection    |
-
+| `--outDir`          | Output collection                                     | Output | genericData   |
+| `--preview`         | Generate a JSON file with outputs                     | Input  | boolean       |
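
The `percentile`, `removeDirection`, and `padding` behaviour described in the README hunk above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: the function names `percentile_cutoff` and `select_groups` are hypothetical, the cutoff uses plain linear interpolation, and groups whose mean equals the cutoff are kept.

```python
from statistics import mean


def percentile_cutoff(values, q):
    """Return the linearly interpolated q-quantile (0 <= q <= 1) of values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_groups(features, percentile, remove_direction="Below", padding=0):
    """Pick the group values (e.g. z positions) that survive the filter.

    features: hypothetical mapping {group value: [feature values of its images]}.
    """
    # Mean of the feature per group, as described for `groupVar`.
    means = {group: mean(vals) for group, vals in features.items()}
    cutoff = percentile_cutoff(list(means.values()), percentile)

    if remove_direction == "Below":
        kept = {g for g, m in means.items() if m >= cutoff}
    else:  # "Above"
        kept = {g for g, m in means.items() if m <= cutoff}

    # Padding: also keep up to `padding` neighbours on either side,
    # but only those that actually exist in the collection.
    padded = {
        n
        for g in kept
        for n in range(g - padding, g + padding + 1)
        if n in means
    }
    return sorted(kept | padded)


# Ten z-planes; planes 5, 6 and 7 are much brighter than the rest.
example = {z: [9.0] if z in (5, 6, 7) else [1.0] for z in range(1, 11)}
print(select_groups(example, 0.7, "Below"))
print(select_groups(example, 0.7, "Below", padding=3))
```

With a padding of 3, the second call extends the selection by three planes on each side, matching the z = 2 to 10 example given for `padding`.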
diff --git a/clustering/feature-subsetting-plugin/example/summary.txt b/clustering/feature-subsetting-plugin/example/summary.txt
@@ -0,0 +1,14 @@
+------------------------------------------------
+
+ Files :
+
+x00_y01_p03_c1.ome.tif -----> x00_y01_p01_c1.ome.tif
+x00_y01_p03_c2.ome.tif -----> x00_y01_p01_c2.ome.tif
+x00_y01_p03_c3.ome.tif -----> x00_y01_p01_c3.ome.tif
+x00_y01_p03_c4.ome.tif -----> x00_y01_p01_c4.ome.tif
+x00_y01_p03_c5.ome.tif -----> x00_y01_p01_c5.ome.tif
+x00_y01_p04_c1.ome.tif -----> x00_y01_p02_c1.ome.tif
+x00_y01_p04_c2.ome.tif -----> x00_y01_p02_c2.ome.tif
+x00_y01_p04_c3.ome.tif -----> x00_y01_p02_c3.ome.tif
+x00_y01_p04_c4.ome.tif -----> x00_y01_p02_c4.ome.tif
+x00_y01_p04_c5.ome.tif -----> x00_y01_p02_c5.ome.tif
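
For downstream scripts, the `old -----> new` lines in the example `summary.txt` above can be read back into a mapping. The helper below is a small sketch; `parse_summary` is a hypothetical name, and it assumes every rename line uses the exact `----->` separator shown in the example file.

```python
def parse_summary(text):
    """Map each original filename to its renamed filename."""
    mapping = {}
    for line in text.splitlines():
        # Skip the header, blank lines, and the dashed rule.
        if "----->" not in line:
            continue
        old, new = (part.strip() for part in line.split("----->", 1))
        mapping[old] = new
    return mapping


sample = """------------------------------------------------

 Files :

x00_y01_p03_c1.ome.tif -----> x00_y01_p01_c1.ome.tif
x00_y01_p04_c1.ome.tif -----> x00_y01_p02_c1.ome.tif
"""
renames = parse_summary(sample)
print(renames["x00_y01_p03_c1.ome.tif"])
```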
diff --git a/clustering/feature-subsetting-plugin/plugin.json b/clustering/feature-subsetting-plugin/plugin.json
@@ -2,10 +2,10 @@
   "name": "Feature Subsetting",
   "version": "0.2.0-dev",
   "title": "Feature Subsetting",
-  "description": "Subset data using a given feature",
+  "description": "Subset data using a given feature.",
   "author": "Gauhar Bains ([email protected]) and Hamdah Shafqat Abbasi ([email protected])",
   "institution": "National Center for Advancing Translational Sciences, National Institutes of Health",
-  "repository": "https://github.com/PolusAI/polus-plugin",
+  "repository": "https://github.com/PolusAI/polus-plugins",
   "website": "https://ncats.nih.gov/preclinical/core/informatics",
   "citation": "",
   "containerId": "polusai/feature-subsetting-plugin:0.2.0-dev",
@@ -14,131 +14,160 @@
     "-m",
     "polus.plugins.clustering.feature_subsetting"
   ],
-  "inputs": [
-    {
-      "name": "inpDir",
+  "inputs": {
+    "inpDir": {
       "type": "collection",
-      "description": "Input image collection to be processed by this plugin",
-      "required": true
+      "title": "Input image directory",
+      "description": "Input image directory.",
+      "required": "True"
     },
-    {
-      "name": "filePattern",
+    "tabularDir": {
+      "type": "genericData",
+      "title": "Input tabular directory",
+      "description": "Path to directory containing tabular data.",
+      "required": "True"
+    },
+    "filePattern": {
+      "type": "string",
+      "title": "Filename pattern",
+      "description": "Filename pattern used to separate data.",
+      "required": "True"
+    },
+    "imageFeature": {
       "type": "string",
-      "description": "Filename pattern used to separate data",
-      "required": true
+      "title": "imageFeature",
+      "description": "Feature in tabular data containing image filenames.",
+      "required": "True"
     },
-    {
-      "name": "groupVar",
+    "tabularFeature": {
       "type": "string",
-      "description": "variables to group by in a section",
-      "required": true
+      "title": "tabularFeature",
+      "description": "Feature in tabular data to subset image data.",
+      "required": "True"
     },
-    {
-      "name": "sectionVar",
+    "padding": {
+      "type": "integer",
+      "title": "padding",
+      "description": "Number of images to capture outside the cutoff.",
+      "required": "False"
+    },
+    "groupVar": {
       "type": "string",
-      "description": "variables to divide larger sections",
-      "required": false
+      "title": "groupVar",
+      "description": "variables to group by in a section.",
+      "required": "True"
     },
-    {
-      "name": "csvDir",
-      "type": "csvCollection",
-      "description": "CSV collection containing features",
-      "required": true
+    "percentile": {
+      "type": "float",
+      "title": "percentile",
+      "description": "Percentile to remove.",
+      "required": "True"
     },
-    {
-      "name": "feature",
+    "removeDirection": {
       "type": "string",
-      "description": "Feature to use to subset data",
-      "required": true
-    },
-    {
-      "name": "percentile",
-      "type": "number",
-      "description": "Percentile to remove",
-      "required": true
-    },
-    {
-      "name": "removeDirection",
-      "type": "enum",
-      "options": {
-        "values": [
-          "Below",
-          "Above"
-        ]
-      },
-      "description": "remove direction above or below percentile",
-      "required": true
-    },
-    {
-      "name": "padding",
+      "title": "removeDirection",
+      "description": "Remove direction above or below percentile.",
+      "required": "False",
+      "default": "Below"
+    },
+    "sectionVar": {
       "type": "string",
-      "description": "Number of images to capture outside the cutoff",
-      "required": false
+      "title": "sectionVar",
+      "description": "Variables to divide larger sections.",
+      "required": "False"
+    },
+    "writeOutput": {
+      "type": "boolean",
+      "title": "writeOutput",
+      "description": "Write output image collection or not.",
+      "required": "False"
     },
-    {
-      "name": "writeOutput",
+    "preview": {
       "type": "boolean",
-      "description": "write output image collection or not",
-      "required": false
+      "title": "Preview",
+      "description": "Generate an output preview.",
+      "required": "False"
     }
-  ],
-  "outputs": [
-    {
-      "name": "outDir",
-      "type": "collection",
-      "description": "Output collection"
+  },
+  "outputs": {
+    "outDir": {
+      "type": "genericData",
+      "description": "Output collection."
     }
-  ],
-  "ui": [
-    {
-      "key": "inputs.inpDir",
-      "title": "Input collection",
-      "description": "Input image collection to be processed by this plugin"
-    },
-    {
-      "key": "inputs.csvDir",
-      "title": "CSV collection",
-      "description": "CSV collection containing features"
-    },
-    {
-      "key": "inputs.feature",
-      "title": "Feature",
-      "description": "Feature to use to subset data"
-    },
-    {
-      "key": "inputs.percentile",
-      "title": "Percentile",
-      "description": "Percentile to remove"
-    },
-    {
-      "key": "inputs.removeDirection",
-      "title": "Remove Direction",
-      "description": "remove direction above or below percentile"
-    },
-    {
-      "key": "inputs.filePattern",
+  },
+  "ui": {
+    "inpDir": {
+      "type": "collection",
+      "title": "Input image directory",
+      "description": "Input image directory.",
+      "required": "True"
+    },
+    "tabularDir": {
+      "type": "genericData",
+      "title": "Input tabular directory",
+      "description": "Path to directory containing tabular data.",
+      "required": "True"
+    },
+    "filePattern": {
+      "type": "string",
       "title": "Filename pattern",
-      "description": "Filename pattern used to separate data"
-    },
-    {
-      "key": "inputs.groupVar",
-      "title": "Grouping Variables",
-      "description": "variables to group by in a section"
-    },
-    {
-      "key": "inputs.sectionVar",
-      "title": "Section Variables",
-      "description": "variables to divide larger sections"
-    },
-    {
-      "key": "inputs.padding",
-      "title": "Padding",
-      "description": "Number of images to capture outside the cutoff"
-    },
-    {
-      "key": "inputs.writeOutput",
-      "title": "Write Output",
-      "description": "write output image collection or not"
+      "description": "Filename pattern used to separate data.",
+      "required": "True"
+    },
+    "imageFeature": {
+      "type": "string",
+      "title": "imageFeature",
+      "description": "Feature in tabular data containing image filenames.",
+      "required": "True"
+    },
+    "tabularFeature": {
+      "type": "string",
+      "title": "tabularFeature",
+      "description": "Feature in tabular data to subset image data.",
+      "required": "True"
+    },
+    "padding": {
+      "type": "integer",
+      "title": "padding",
+      "description": "Number of images to capture outside the cutoff.",
+      "required": "False"
+    },
+    "groupVar": {
+      "type": "string",
+      "title": "groupVar",
+      "description": "variables to group by in a section.",
+      "required": "True"
+    },
+    "percentile": {
+      "type": "float",
+      "title": "percentile",
+      "description": "Percentile to remove.",
+      "required": "True"
+    },
+    "removeDirection": {
+      "type": "string",
+      "title": "removeDirection",
+      "description": "Remove direction above or below percentile.",
+      "required": "False",
+      "default": "Below"
+    },
+    "sectionVar": {
+      "type": "string",
+      "title": "sectionVar",
+      "description": "Variables to divide larger sections.",
+      "required": "False"
+    },
+    "writeOutput": {
+      "type": "boolean",
+      "title": "writeOutput",
+      "description": "Write output image collection or not.",
+      "required": "False"
+    },
+    "preview": {
+      "type": "boolean",
+      "title": "Preview",
+      "description": "Generate an output preview.",
+      "required": "False"
     }
-  ]
-}
+  }
+}
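
The `plugin.json` hunk above moves `inputs` from a list of objects with a `name` key to a mapping keyed by input name, and stores `required` as the strings `"True"`/`"False"`. A rough sketch of that transformation is shown below. `inputs_list_to_mapping` is a hypothetical helper written for illustration; it is not part of the repository, and it does not add the `title` fields introduced in this commit.

```python
def inputs_list_to_mapping(inputs):
    """Convert [{"name": ..., ...}, ...] into {name: {...}, ...}."""
    mapping = {}
    for entry in inputs:
        entry = dict(entry)          # copy so the caller's data is untouched
        name = entry.pop("name")     # the name becomes the mapping key
        if isinstance(entry.get("required"), bool):
            # Mirror the string form used in the new manifest.
            entry["required"] = str(entry["required"])
        mapping[name] = entry
    return mapping


old_inputs = [
    {"name": "inpDir", "type": "collection", "required": True},
    {"name": "sectionVar", "type": "string", "required": False},
]
new_inputs = inputs_list_to_mapping(old_inputs)
print(new_inputs)
```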