convert from loom (fixes #15); expanded filtering; new PBMC dataset + demo movie; better tests (fixes #21)
bihealth · Oct 18, 2019 · 0befa91
1 parent 1d03aa3

Showing 12 changed files with 304 additions and 121 deletions.
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -2,6 +2,15 @@
 History
 =======
 
+------
+v0.7.0
+------
+
+- added conversion from .loom files
+- cell filtering also supports downsampling
+- added PBMC dataset hosted on figshare
+- added demo movie
+
 ------
 v0.6.0
 ------

diff --git a/README.rst b/README.rst
@@ -23,7 +23,11 @@ SCelVis: Easy Single-Cell Visualization
 .. image:: https://zenodo.org/badge/185944510.svg
     :target: https://zenodo.org/badge/latestdoi/185944510
 
-You can find the URL for the demo linked to on the top right of the Github repository page.
+|
+
+.. image:: scelvis/assets/movie.gif
+    :height: 400px
+    :align: center
 
 ------------
 Installation
@@ -52,12 +56,13 @@ A Docker container is also available via `Quay.io/Biocontainers <https://quay.io
 Tutorial
 --------
 
-explore a simulated dummy dataset or 1000 cells from a 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells (10X v3 chemistry) 
+explore 1000 cells from a 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells (10X v3 chemistry) or a published dataset of ~14000 IFN-beta treated and control PBMCs from 8 donors (`GSE96583 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583>`_; see `Kang et al. <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583>`_)
 
 .. code-block:: shell
 
-    $ scelvis run --data-source /path/to/scelvis/examples/dummy.h5ad
     $ scelvis run --data-source /path/to/scelvis/examples/hgmm_1k.h5ad
+    $ scelvis run --data-source https://files.figshare.com/18037739/pbmc.h5ad
+
 
 and then point your browser to http://0.0.0.0:8050/.
 
@@ -70,12 +75,14 @@ Data sets are provided as HDF5 files (`anndata <https://anndata.readthedocs.io/e
 
 For the input you can either specify one HDF5 file or a directory containing multiple such files.
 
-You can use ``scanpy`` to create this HDF5 file directly or use the ``scelvis convert`` command for converting your single-cell pipeline output.
+You can use `scanpy <http://scanpy.rtfd.io>`_ to create this HDF5 file directly or use the ``scelvis convert`` command for converting your single-cell pipeline output.
 
 HDF5 Input
 ----------
 
-for HDF5 input, you can do your analysis with `scanpy <http://scanpy.rtfd.io>`_ to create an anndata object ``ad``. SCelVis will use embedding coordinates from ``ad.obsm``, cell annotation from ``ad.obs`` and expression data directly from ``ad.X`` (this should contain normalized and log-transformed expression values for all genes). Information about the dataset will be extracted from strings stored in ``ad.uns['about_title']``, ``ad.uns['about_short_title']`` and ``ad.uns['about_readme']`` (assumed to be Markdown). Information about marker genes will be taken from entries starting with ``marker_`` in ``ad.uns``: entries called ``marker_gene`` (required!), ``marker_cluster``, ``marker_padj``, ``marker_LFC`` will create a table with the columns ``gene``, ``cluster``, ``padj``, and ``LFC``.
+for HDF5 input, you can do your analysis with `scanpy <http://scanpy.rtfd.io>`_ to create an anndata object ``ad``. SCelVis will use embedding coordinates from ``ad.obsm``, cell annotation from ``ad.obs`` and expression data directly from ``ad.X`` (this should contain normalized and log-transformed expression values for all genes). If present, information about the dataset will be extracted from strings stored in ``ad.uns['about_title']``, ``ad.uns['about_short_title']`` and ``ad.uns['about_readme']`` (assumed to be Markdown). Information about marker genes will be taken either from the ``rank_genes_groups`` slot in ``ad.uns`` or from entries starting with ``marker_`` in ``ad.uns``: entries called ``marker_gene`` (required!), ``marker_cluster``, ``marker_padj``, ``marker_LFC`` will create a table with the columns ``gene``, ``cluster``, ``padj``, and ``LFC``.
+
+If you prepared your data with ``Seurat`` (v2), you can use ``Convert(from = sobj, to = "anndata", filename = "data.h5ad")`` to get an HDF5 file.
 
 Text Input
 ----------
@@ -122,7 +129,18 @@ For "raw" text input, you need to prepare at least three files in the input dire
 
     $ scelvis convert --input-dir text_input --output data/text_input.h5ad --about-md text_input.md
 
-in ``examples/dummy_raw.zip`` and ``examples/dummy_about.md`` we provide raw data for the dummy dataset.
+in ``examples/dummy_raw.zip`` and ``examples/dummy_about.md`` we provide raw data for a simulated dummy dataset.
+
+Loom Input
+----------
+
+for `loompy <http://loompy.org>`_ or `loomR <https://github.com/mojaveazure/loomR>`_ input, you can convert your data like this:
+
+.. code-block:: shell
+
+    $ scelvis convert --i input.loom -m markers.tsv -a about.md -o loom_input.h5ad 
+
+if you prepared your data with ``Seurat`` (v3), you can use ``as.loom(sobj, filename="output.loom")`` to get a ``.loom`` file and then convert to ``.h5ad`` with the above command.
 
 CellRanger Input
 ----------------
@@ -142,7 +160,7 @@ Alternatively, the output directory of ``CellRanger`` can be used. This is the d
     EOF
     $ scelvis convert --input-dir cellranger-out --output data/cellranger_input.h5ad --about-md cellranger.md
 
-In ``examples/hgmm_1k_raw.zip`` we provide ``CellRanger`` output for the 1k 1:1 human mouse mix. Specifically, from the `outs` folder we selected
+In ``examples/hgmm_1k_raw`` we provide ``CellRanger`` output for the 1k 1:1 human mouse mix. Specifically, from the ``outs`` folder we selected
 
 - ``filtered_feature_bc_matrix.h5``
 - tSNE and PCA projections from ``analysis/tsne`` and ``analysis/pca``
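
Note on the ``marker_`` convention described in the HDF5 Input section above: the sketch below shows one way the ``marker_*`` entries could be collected into a table whose column names are the keys with the prefix stripped. It is an illustration only, not SCelVis code. The plain dict ``uns`` stands in for ``ad.uns``, the function name ``marker_table`` is hypothetical, and raising an error when ``marker_gene`` is missing is an assumption based on the README calling that entry required.

```python
# Plain dict standing in for ``ad.uns`` (assumption: no AnnData object needed
# to show the naming convention). Values are made-up example data.
uns = {
    "about_title": "demo",
    "marker_gene": ["CD3E", "MS4A1", "LYZ"],
    "marker_cluster": ["T", "B", "Mono"],
    "marker_padj": [1e-10, 3e-8, 2e-5],
}

PREFIX = "marker_"


def marker_table(uns):
    """Collect ``marker_*`` entries into {column: values}, prefix stripped."""
    if PREFIX + "gene" not in uns:
        # The README marks ``marker_gene`` as required; failing loudly here
        # is a choice made for this sketch.
        raise KeyError("marker_gene is required")
    table = {
        key[len(PREFIX):]: list(values)
        for key, values in uns.items()
        if key.startswith(PREFIX)
    }
    # All columns must describe the same rows.
    n_rows = len(table["gene"])
    for column, values in table.items():
        if len(values) != n_rows:
            raise ValueError("column %r has %d rows, expected %d" % (column, len(values), n_rows))
    return table


table = marker_table(uns)
print(sorted(table))
```

Entries that do not start with ``marker_`` (such as ``about_title``) are ignored, so the same dict can carry both the dataset description and the marker information.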

diff --git a/requirements/base.txt b/requirements/base.txt
@@ -14,6 +14,7 @@ numpy
 pandas
 anndata
 scanpy
+loompy
 
 # Caching functionality for Flask.
 flask-caching

diff --git a/scelvis/app.py b/scelvis/app.py
@@ -145,10 +145,16 @@ def find(name, path):
             logger.info("Looking for %s file", cellranger_needle)
             needle_path = find(cellranger_needle, tmpdir)
             if needle_path is None:
-                raw_needle = "coords.tsv"
-                logger.info("Looking for %s file", raw_needle)
-                needle_path = find(raw_needle, tmpdir)
-                format_ = "text"
+                text_needle = "coords.tsv"
+                logger.info("Looking for %s file", text_needle)
+                needle_path = find(text_needle, tmpdir)
+                if needle_path is None:
+                    loom_needle = "data.loom"
+                    logger.info("Looking for %s file", loom_needle)
+                    needle_path = find(loom_needle, tmpdir)
+                    format_ = "loom"
+                else:
+                    format_ = "text"
             else:
                 format_ = "cell-ranger"
             input_dir = os.path.dirname(needle_path)
@@ -183,7 +189,8 @@ def find(name, path):
         return """
             <!doctype html>
             <title>Convert File</title>
-            <h1>Upload ZIP or TAR.GZ of CellRanger Output</h1>
+            <h1>Upload ZIP or TAR.GZ of your data</h1>
+            <p>either containing CellRanger output, raw text files or a data.loom file</p>
             <p>
                 The server will return a <tt>.h5ad</tt> file that you can upload into the SCelVis visualization.
             </p>
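
Note on the format detection in the first ``app.py`` hunk: the nested ``if needle_path is None`` chain tries one "needle" file per input format in a fixed order. A flat, table-driven version of the same idea might look like the sketch below. The names ``NEEDLES`` and ``detect_format`` are hypothetical, the body of ``find`` is a guess at what the real helper does, and the CellRanger needle name is an assumption because the diff does not show it. Only ``coords.tsv`` and ``data.loom`` are taken from the diff.

```python
import os
import tempfile

# (format, needle file) pairs in priority order. The CellRanger file name is
# an assumption; "coords.tsv" and "data.loom" appear in the diff.
NEEDLES = [
    ("cell-ranger", "filtered_feature_bc_matrix.h5"),
    ("text", "coords.tsv"),
    ("loom", "data.loom"),
]


def find(name, path):
    """Return the first path below ``path`` whose file name is ``name``, else None."""
    for root, _dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    return None


def detect_format(tmpdir):
    """Return (format, input_dir) for the first needle found below ``tmpdir``."""
    for format_, needle in NEEDLES:
        needle_path = find(needle, tmpdir)
        if needle_path is not None:
            return format_, os.path.dirname(needle_path)
    # Unlike the nested version, report clearly when nothing matched.
    raise FileNotFoundError("no known input file found in %s" % tmpdir)


with tempfile.TemporaryDirectory() as tmpdir:
    upload_dir = os.path.join(tmpdir, "upload")
    os.makedirs(upload_dir)
    open(os.path.join(upload_dir, "data.loom"), "w").close()
    detected = detect_format(tmpdir)
    expected_dir = upload_dir
    print(detected[0])
```

With a lookup table, supporting another format means adding one entry, and the case where no needle is found is handled explicitly instead of passing ``None`` on to ``os.path.dirname``.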

diff --git a/scelvis/assets/cells.png b/scelvis/assets/cells.png
diff --git a/scelvis/assets/movie.gif b/scelvis/assets/movie.gif
diff --git a/scelvis/callbacks.py b/scelvis/callbacks.py
@@ -477,6 +477,12 @@ def toggle_filter_cells_controls(n, is_open):
 def register_update_filter_cells_controls(app, token):
     @app.callback(
         [
+            Output("%s_filter_cells_ncells_div" % token, "style"),
+            Output("%s_filter_cells_ncells" % token, "marks"),
+            Output("%s_filter_cells_ncells" % token, "min"),
+            Output("%s_filter_cells_ncells" % token, "max"),
+            Output("%s_filter_cells_ncells" % token, "value"),
+            Output("%s_filter_cells_ncells" % token, "step"),
             Output("%s_filter_cells_choice_div" % token, "style"),
             Output("%s_filter_cells_choice" % token, "options"),
             Output("%s_filter_cells_choice" % token, "value"),
@@ -493,56 +499,73 @@ def register_update_filter_cells_controls(app, token):
     def update_filter_cells_controls(pathname, attribute, filters_json):
         _, kwargs = get_route(pathname)
         data = store.load_data(kwargs.get("dataset"))
+        hidden_slider = ({"display": "none"}, {0: "0", 1: "1"}, 0, 1, 1, 0)
+        hidden_checklist = ({"display": "none"}, [], None)
+        hidden_rangeslider = ({"display": "none"}, {0: "0", 1: "1"}, 0, 1, [0, 1], 0)
+
         if attribute is None or attribute == "None":
-            return (
-                {"display": "none"},
-                [],
-                None,
-                {"display": "none"},
-                {0: "0", 1: "1"},
-                0,
-                1,
-                [0, 1],
-                0,
-            )
+            return hidden_slider + hidden_checklist + hidden_rangeslider
         filters = json.loads(filters_json)
-        values = data.ad.obs_vector(attribute)
-        if not pd.api.types.is_numeric_dtype(values):
-            categories = list(data.ad.obs[attribute].cat.categories)
-            return (
-                {"display": "block"},
-                [{"label": v, "value": v} for v in categories],
-                filters[attribute] if attribute in filters else categories,
-                {"display": "none"},
-                {0: "0", 1: "1"},
-                0,
-                1,
-                [0, 1],
-                0,
-            )
-        else:
-            range_min = values.min()
-            range_max = values.max()
+        if attribute == "ncells":
+            ncells_tot = data.ad.obs.shape[0]
             if attribute in filters:
-                val_min = filters[attribute][0]
-                val_max = filters[attribute][1]
+                ncells_selected = filters[attribute]
             else:
-                val_min = range_min
-                val_max = range_max
+                ncells_selected = ncells_tot
             return (
-                {"display": "none"},
-                [],
-                None,
-                {"display": "block"},
-                dict(
-                    (int(t) if t % 1 == 0 else t, "{0:g}".format(t))
-                    for t in ui.common.auto_tick([range_min, range_max], max_tick=4, tf_inside=True)
-                ),
-                range_min,
-                range_max,
-                [val_min, val_max],
-                (range_max - range_min) / 1000,
+                (
+                    {"display": "block"},
+                    dict(
+                        (int(t) if t % 1 == 0 else t, "{0:g}".format(t))
+                        for t in ui.common.auto_tick([0, ncells_tot], max_tick=4, tf_inside=True)
+                    ),
+                    0,
+                    ncells_tot,
+                    ncells_selected,
+                    ncells_tot / 1000,
+                )
+                + hidden_checklist
+                + hidden_rangeslider
             )
+        else:
+            values = data.ad.obs_vector(attribute)
+            if not pd.api.types.is_numeric_dtype(values):
+                categories = list(data.ad.obs[attribute].cat.categories)
+                return (
+                    hidden_slider
+                    + (
+                        {"display": "block"},
+                        [{"label": v, "value": v} for v in categories],
+                        filters[attribute] if attribute in filters else categories,
+                    )
+                    + hidden_rangeslider
+                )
+            else:
+                range_min = values.min()
+                range_max = values.max()
+                if attribute in filters:
+                    val_min = filters[attribute][0]
+                    val_max = filters[attribute][1]
+                else:
+                    val_min = range_min
+                    val_max = range_max
+                return (
+                    hidden_slider
+                    + hidden_checklist
+                    + (
+                        {"display": "block"},
+                        dict(
+                            (int(t) if t % 1 == 0 else t, "{0:g}".format(t))
+                            for t in ui.common.auto_tick(
+                                [range_min, range_max], max_tick=4, tf_inside=True
+                            )
+                        ),
+                        range_min,
+                        range_max,
+                        [val_min, val_max],
+                        (range_max - range_min) / 1000,
+                    )
+                )
 
 
 def register_update_filter_cells_filters(app):
@@ -554,9 +577,11 @@ def register_update_filter_cells_filters(app):
         ],
         [
             Input("url", "pathname"),
+            Input("meta_filter_cells_ncells", "value"),
             Input("meta_filter_cells_choice", "value"),
             Input("meta_filter_cells_range", "value"),
             Input("meta_filter_cells_reset", "n_clicks"),
+            Input("expression_filter_cells_ncells", "value"),
             Input("expression_filter_cells_choice", "value"),
             Input("expression_filter_cells_range", "value"),
             Input("expression_filter_cells_reset", "n_clicks"),
@@ -569,9 +594,11 @@ def register_update_filter_cells_filters(app):
     )
     def update_filter_cells_filters(
         pathname,
+        meta_ncells_value,
         meta_cat_value,
         meta_range_value,
         meta_reset_n,
+        expression_ncells_value,
         expression_cat_value,
         expression_range_value,
         expression_reset_n,
@@ -584,26 +611,43 @@ def update_filter_cells_filters(
         ctx = dash.callback_context
 
         filters = json.loads(filters_json)
+        active_filters = set()
+        # if reset button was hit, remove entries in filters_json
+        attributes = list(filters.keys())
         status = "active filters: "
-        # if reset button was hit, check all boxes using stored values in filters_json
-        attributes = filters.keys()
         if ctx.triggered and "reset" in ctx.triggered[0]["prop_id"]:
-            for attribute in list(attributes):
+            for attribute in attributes:
                 del filters[attribute]
             return (json.dumps(filters), status, status)
 
-        for cat_value, range_value, attribute in [
-            (meta_cat_value, meta_range_value, meta_attribute),
-            (expression_cat_value, expression_range_value, expression_attribute),
+        # else update filters_json depending on inputs
+        for ncells_value, cat_value, range_value, attribute in [
+            (meta_ncells_value, meta_cat_value, meta_range_value, meta_attribute),
+            (
+                expression_ncells_value,
+                expression_cat_value,
+                expression_range_value,
+                expression_attribute,
+            ),
         ]:
             if attribute is not None and attribute != "None":
-                values = data.ad.obs_vector(attribute)
-                if not pd.api.types.is_numeric_dtype(values):
-                    filters[attribute] = sorted(cat_value)
+                if attribute == "ncells":
+                    filters[attribute] = ncells_value
+                    ncells_tot = data.ad.obs.shape[0]
+                    if ncells_value < ncells_tot:
+                        active_filters.add(attribute)
                 else:
-                    filters[attribute] = range_value
-
-        status += ", ".join(attributes)
+                    values = data.ad.obs_vector(attribute)
+                    if not pd.api.types.is_numeric_dtype(values):
+                        filters[attribute] = cat_value
+                        if cat_value is not None and set(cat_value) != set(values):
+                            active_filters.add(attribute)
+                    else:
+                        filters[attribute] = range_value
+                        if range_value[0] > values.min() or range_value[1] < values.max():
+                            active_filters.add(attribute)
+
+        status += ", ".join(active_filters)
         return (json.dumps(filters), status, status)
 
 
@@ -623,19 +667,23 @@ def activate_filter_cells_reset(pathname, filters_json):
         else:
             filters = {}
         disabled = True
-        attributes = filters.keys()
-        for attribute in attributes:
-            values = data.ad.obs_vector(attribute)
-            if not pd.api.types.is_numeric_dtype(values):
-                if filters[attribute] != list(data.ad.obs[attribute].cat.categories):
+        for attribute, selected in filters.items():
+            if attribute == "ncells":
+                ncells_tot = data.ad.obs.shape[0]
+                if selected < ncells_tot:
                     disabled = False
             else:
-                range_min = values.min()
-                range_max = values.max()
-                val_min = filters[attribute][0]
-                val_max = filters[attribute][1]
-                if val_min > range_min or val_max < range_max:
-                    disabled = False
+                values = data.ad.obs_vector(attribute)
+                if not pd.api.types.is_numeric_dtype(values):
+                    if sorted(selected) != sorted(data.ad.obs[attribute].cat.categories):
+                        disabled = False
+                else:
+                    range_min = values.min()
+                    range_max = values.max()
+                    val_min = selected[0]
+                    val_max = selected[1]
+                    if val_min > range_min or val_max < range_max:
+                        disabled = False
 
         return (disabled, disabled)
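
Note on the "active filters" logic in ``update_filter_cells_filters`` and ``activate_filter_cells_reset``: both callbacks decide whether a stored filter actually restricts the data, using the same three cases (``ncells`` downsampling, categorical choice, numeric range). The sketch below isolates that decision in plain Python. It is not the code from this commit: ``active_filters`` is a hypothetical helper, and ``obs`` is a dict of lists standing in for ``data.ad.obs``, so the numeric check uses ``isinstance`` rather than ``pd.api.types.is_numeric_dtype``.

```python
def active_filters(filters, ncells_tot, obs):
    """Return the sorted names of filters that actually restrict the cells.

    filters    -- {attribute: selection}, as stored in the filters JSON
    ncells_tot -- total number of cells in the dataset
    obs        -- {column: list of values}; stand-in for ``data.ad.obs``
    """
    active = []
    for attribute, selected in filters.items():
        if attribute == "ncells":
            # Downsampling is active only if fewer than all cells are kept.
            if selected < ncells_tot:
                active.append(attribute)
            continue
        values = obs[attribute]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric range: active if it cuts off either end of the data.
            if selected[0] > min(values) or selected[1] < max(values):
                active.append(attribute)
        elif selected is not None and set(selected) != set(values):
            # Categorical: active if some category is deselected.
            active.append(attribute)
    return sorted(active)


obs = {"cluster": ["a", "b", "a", "c"], "n_genes": [200, 950, 400, 1200]}
filters = {"ncells": 4, "cluster": ["a", "b"], "n_genes": [200, 1200]}
result = active_filters(filters, 4, obs)
print("active filters: " + ", ".join(result))
```

Returning a sorted list keeps the status string stable between calls, whereas joining a ``set`` directly gives no guaranteed order.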