Skip to content

Commit

Permalink
AnnData Conversion Notebook (#1079)
Browse files Browse the repository at this point in the history
* test notebook

* testing conversion workflow

* tests finalized

* notebook updated

* added dask, torchdata deps, updated notebook

* convert settings.CELL_SIZE to 'area' for AnnData

* updated docs/conf.py with new deps

* replaced | with Union / Optinal[<type>]

* replaced | with Union / Optional[<type>] in test_utils

* added zarr as dependency

* small notebook fixes

* made requested changes

* fixed docs

* do i look i know what a 'qhull v Qbb Qz Qc' is?

* replaced svg with png

* fixed image in nb, specified x,y, and the numpy coordinate system in docs

* undid formatting of 'test_utils'

* pycodestyle

* Update src/ark/utils/data_utils.py

te the

Co-authored-by: Noah F. Greenwald <[email protected]>

* updated data_types

---------

Co-authored-by: Noah F. Greenwald <[email protected]>
  • Loading branch information
srivarra and ngreenwald authored Dec 19, 2023
1 parent e0906eb commit f3391e5
Show file tree
Hide file tree
Showing 14 changed files with 1,186 additions and 125 deletions.
Binary file added docs/_images/anndata_schema.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
97 changes: 94 additions & 3 deletions docs/_rtd/data_types.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,16 +40,17 @@ For each cell, the following morphology features calculated from `skimage.measur
* `perimeter`: perimeter of object which approximates the contour as a line through the centers of border pixels using a 4-connectivity.
* `convex_area`: the area of the convex hull.
* `equivalent_diameter`: the diameter of the circle with the same area as the cell.
* `centroid-0`: the $x$-coordinate of the centroid.
* `centroid-1`: the $y$-coordinate of the centroid.
* Centroids: Note that all the arrays are NumPy arrays, therefore the origin $(0,0)$ is in the "top-left corner" of the image / array.
* `centroid-0`: the $y$-coordinate of the centroid.
* `centroid-1`: the $x$-coordinate of the centroid.
* `fov`: The FOV from which the cell originates from.

The base `regionprops` metric often don't provide enough morphological information about each cell on their own. We add the following derived metrics to provide more complete information about the segmented cells:
* `major_minor_axis_ratio`: the major axis length divided by the minor axis length.
* `perim_square_over_area`: the square of the perimeter divided by the area.
* `major_axis_equiv_diam_ratio`: the major axis length divided by the equivalent diameter.
* `convex_hull_resid`: the difference between the convex area and the area divided by the convex area.
* `centroid_dif`: the normalized euclidian distance between the cell centroid and the corresponding convex hull centroid.
* `centroid_dif`: the normalized euclidean distance between the cell centroid and the corresponding convex hull centroid.
* `num_concavities`: the number of concavities of the region.
* `nc_ratio`: for nuclear segmentation only. The nuclear area divided by the total area.

Expand Down Expand Up @@ -90,3 +91,93 @@ The CSV should contain the following columns
* `fov`: name of the FOV the cell comes from
* `label`: the name of the segmentation label
* A set of expression columns defining the properties of each cell desired for clustering

---

Name: AnnData
Type: anndata.AnnData
Created by: [ConvertToAnnData](https://ark-analysis.readthedocs.io/en/latest/_markdown/ark.utils.html#ark.utils.data_utils.ConvertToAnnData)
Used by: [anndata_conversion.ipynb](https://github.com/angelolab/ark-analysis/blob/main/templates/anndata_conversion.ipynb)

<p align="center">
<img width="50%" src="../_images/anndata_schema.png" alt="AnnData Schema"/>
</p>


`AnnData` is a data structure consisting of matrices, annotated by DataFrames and Indexes. The goal is to transition over to `AnnData` from the Cell Table as the primary tabular data structure for storing, and interacting with multiplexed spatial single cell data.
This section will illustrate the components of the `AnnData` object, and provide brief examples of which cell table columns map to which `AnnData` components.

A `AnnData` object is composed of the following components:

- **X**
- **var**
- **obs**
- **obsm**
- **obsp**
- **varm**
- **varp**

There will be one `AnnData` object per FOV. Each of these components have specific use cases and will be described below:

### 1. X, var, obs

<p align="center">
<img width="50%" alt="image" src="https://github.com/angelolab/ark-analysis/assets/8909315/a5011077-d350-4aab-b8f8-609b11087bba">
</p>

- `X` is a matrix of shape `(n_obs, n_vars)` where `n_obs` is the number of observations (currently cell segmentations) and `n_vars` is the number of variables (currently number of channels / markers).

For example the following columns from the Cell Table are mapped to the `X` component of the `AnnData` object:

| CD14 | CD163 | CD20 | CD3 | CD31 | CD4 | CD45 | $\cdots$ | SMA | Vim |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0.1 | 0.3 | 0.4 | 0.1 | 0.3 | 0.1 | 0.1 | $\cdots$ | 0.4 | 0.8 |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| 0.1 | 0.1 | 0.3 | 0.7 | 0.8 | 0.8 | 0.8 | $\cdots$ | 0.3 | 0.6 |

- `var`: A `DataFrame` of shape `(..., n_vars)`, where the index is `var_names`. This `DataFrame` contains attributes of each variable. Currently, this goes unused as there is not a Cell Table analogue of this compartment, but in the future this may change.
- `var_names` is a `Pandas` Index where each value is a unique identifier for each variable. These are the names of the channels and should be unique.
- `n_vars` is the number of variables, and in this case it is the number of channels. Each channel is a variable, and each observation has a value for each channel.
- `obs` A `DataFrame` of shape `(n_obs, ...)`, where the index is `obs_names`. This `DataFrame` contains information about each observation, such as numeric metrics from `regionprops` or categorical data such as cell phenotype, or patient-level information.
- `obs_names` is a `Pandas` Index where each value is a unique identifier for each observation. These are the names of the segmented regions, and should be unique.
- `n_obs` is the number of segmented regions or objects of interest. In this case, it is the number of segmented cells.

For example, the following columns from the Cell Table are mapped to the `obs` component of the `AnnData` object:

| label | area | eccentricity | $\cdots$ | centroid_dif | num_concavities | fov |
|----------|----------|--------------|----------|--------------|-----------------|----------|
| 1 | 345 | 0.2 | $\cdots$ | 0.01 | 0 | fov1 |
| 2 | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| 1112 | 460 | 0.11 | $\cdots$ | 0.1 | 12 | fov1 |



### 2. obsm, varm

<p align="center">
<img width="50%" alt="image" src="https://github.com/angelolab/ark-analysis/assets/8909315/8ea1c794-f80b-49a3-b814-3357d2718f7b">
</p>

- `obsm` is a key-value store where the values are matrices of shape `(n_obs, a)`, where `a` is an integer. This contains observation level matrices, and we use a mapping `str -> NDArray` to store them. For example, `"X_umap"` would store the UMAP embedding of the sparse matrix `X`, and `"X_pca"` would store the PCA embedding of `X`.
- Currently, from the Cell Table we store the `y` and `x` centroids in the `"spatial"` slot of the `obsm` component.
- `varm` is a key-value store where the values are matrices of shape `(n_vars, b)`, where `b` is an integer. This contains variable level matrices, and we use a mapping `str -> NDArray` to store them. For example, `"Marker_umap"` would store the UMAP embedding of the matrix `var`.


### 3. obsp, varp

<p align="center">
<img width="50%" alt="image" src="https://github.com/angelolab/ark-analysis/assets/8909315/2201f36d-3a7c-4154-8ab6-e875e9811eb4">
</p>

- `obsp` is a square matrix of shape `(n_obs, n_obs)`, and its purpose is to store pairwise computations between observations.
- For example neighborhood information.
- `varp` is a square matrix of shape `(n_vars, n_vars)`, and its purpose is to store pairwise computations between variables.
### 4. **uns**

<p align="center">
<img width="303" alt="image" src="https://github.com/angelolab/ark-analysis/assets/8909315/881a2c63-3ea4-4874-b6d6-b0bc2532f283">
</p>

- `uns` is a free slot for storing *almost* anything. It's a mapping from a string label to anything.
186 changes: 186 additions & 0 deletions docs/_rtd/development.md
Original file line number Diff line number Diff line change
Expand Up @@ -144,3 +144,189 @@ Finally, to save an `xarray` to a file, use:
You can load the `xarray` back in using:
`arr = xr.load_dataarray(path)`
### Working with `AnnData`
We can load a single `AnnData` object using the function `anndata.read_zarr`, and several `AnnData` objects using the function `load_anndatas` from `ark.utils.data_utils`.
```python
from anndata import read_zarr
from ark.utils.data_utils import load_anndatas
```
```python
fov0 = read_zarr("data/example_dataset/fov0.zarr")
```
The channel intensities for each observation in the `AnnData` object with the `.to_df()` method, and get the channel names with `.var_names`.
```python
fov0.var_names
fov0.to_df()
```
The observations and their properties with the `obs` property of the `AnnData` object. The data here consists of measurements such as `area`, `perimeter`, and categorical information like `cell_meta_cluster` for each cell.
```python
fov0.obs
```
The $x$ and $y$ centroids of each cell can be accessed with the `obsm` attribute and the key `"spatial"`.
```python
fov0.obsm["spatial"]
```
We can load all the `AnnData` objects in a directory lazily with `load_anndatas`. We get a view of the `AnnData` objects in the directory.
```python
fovs_ac = load_anndatas(anndata_dir = "data/example_dataset/fov0.zarr")
```
We can utilize `AnnData` objects or `AnnCollections` in a similar way to a Pandas DataFrame. For example, we can filter the `AnnCollection` to only include cells that have a `cell_meta_cluster` label of `"CD4T"`.
```python
fovs_ac_cd4t = fovs_ac[fovs_ac.obs["cell_meta_cluster"] == "CD4T"]
print(type(fovs_ac_cd4t))
fovs_ac_cd4t.obs.df
```
The type of `fovs_ac_cd4t` is not an `AnnData` object, but instead an `AnnCollectionView`.
This is a `view` of the subset of the `AnnCollection`. This object can *only* access `.obs`, `.obsm`, `.layers` and `.X`.
We can subset a `AnnCollectionView` to only include the first $n$ observations objects with the following code. The slice based indexing behaves like a `numpy` array.
```python
n = 100
fovs_ac_cdt4_100 = fovs_ac_cd4t[:n]
fovs_ac_cd4t_100.obs.df
```
Often we will want to subset the `AnnCollection` to only include observations contained within a specific FOV.
```python
fov1_adata = fovs_ac[fovs_ac.obs["fov"] == "fov1"]
fov1_adata.obs.df
```
We can loop over all FOVs in a `AnnCollection` with the following code (there is alternative method in ):
```python
all_fovs = fovs_ac.obs["fov"].unique()
for fov in all_fovs:
fov_adata = fovs_ac[fovs_ac.obs["fov"] == fov]
# do something with fov_adata
```
Functions which take in `AnnData` objects can often be applied to `AnnCollections`.
The following works as expected:
```python
def dist(adata):
x = adata.obsm["spatial"]["centroid_x"]
y = adata.obsm["spatial"]["centroid_y"]
return np.sqrt(x**2 + y**2)
dist(fovs_ac)
```
While the example below does not:
```python
from squidpy import gr
gr.spatial_neighbors(adata=fovs_ac, spatial_key="spatial")
```
This is due to a `AnnCollection` object not having a `uns` property.
#### Utilizing `DataLoaders`
While a `AnnCollection` can sometimes be used to apply a function over all FOVs, in some instances we either cannot do that, or perhaps we want to apply functions to each FOV independently.
We can access the underlying `AnnData` objects with `.adatas`.
```python
fovs_ac.adatas
```
In these instances we can construct data pipelines with [`torchdata`](https://pytorch.org/data/beta/index.html).
As an example, let's create a multi-stage `DataLoader` which does the following:
- Only extracts the observations with an area greater than `300`.
- Only extracts the observations which have a `cell_meta_cluster` label of `"CD4T"`.
- Compute the Spatial Neighbors graph for those observations (using [`squidpy.gr.spatial_neighbors`](https://squidpy.readthedocs.io/en/stable/api/squidpy.gr.spatial_neighbors.html#squidpy.gr.spatial_neighbors)).


In order to construct a `torchdata` [`DataLoader2`](https://pytorch.org/data/beta/dataloader2.html) iterator we first need to create a `torchdata` [`IterDataPipe`](https://pytorch.org/data/beta/torchdata.datapipes.iter.html). This implements the `__iter__()` protocol, and represents an iterable over data samples.
We can convert the `AnnCollection` to a `torchdata` `IterDataPipe` with `ark.utils.data_utils.AnnDataIterDataPipe`.
```python
from ark.utils.data_utils import AnnDataIterDataPipe

fovs_ip = AnnDataIterDataPipe(fovs=fovs_ac)
```
The following two functions are used to filter the observations in the `AnnData` objects
to only include cells with an area greater than `min_area` and cells with a `cell_meta_cluster` label of `cluster`.
```python
from anndata import AnnData

def filter_cells_by_cluster(adata: AnnData, cluster_label: str) -> AnnData:
return adata[adata.obs["cell_meta_cluster"] == cluster_label]

def filter_cells_by_area(adata: AnnData, min_area: int) -> AnnData:
return adata[adata.obs["area"] > min_area]
```
The following function is used to filter out `AnnData` objects which have no observations.
```python
def filter_empty_adata(fov: AnnData) -> bool:
return len(fov) > 0
```
We can apply these functions to the `IterDataPipe` with the `map` and the `filter` method.
Because those methods return a new `IterDataPipe` object, we can chain them together.
```python
from functools import partial

cd4t_obs_filter = partial(filter_cells_by_cluster, cluster_label="CD4T")
area_obs_filter = partial(filter_cells_by_area, min_area=300)

fovs_subset = fovs_ip.map(cd4t_obs_filter).map(area_obs_filter).filter(filter_empty_adata)
```
The data pipeline can be visualized with `to_graph` function.
```python
from torchdata.datapipes.utils import to_graph

to_graph(fovs_subset)
```
The `DataLoader` can now be constructed.
```python
from torchdata.dataloader2.dataloader2 import DataLoader2

fovs_subset_dl = DataLoader2(fovs_subset)
```
We can now loop over the `DataLoader` and compute the Spatial Neighbors graph per FOV with the filtered observations.
```python
for fov in fovs_subset_dl:
gr.spatial_neighbors(adata=fov, radius=350, spatial_key="spatial", coord_type="generic")
```
#### Further Reading
- [Official AnnData Documentation](https://anndata.readthedocs.io/en/latest/)
- [Getting Started Tutorial](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html)
- [Converting from Single Cell Experiment and Seurat Objects](https://scanpy.readthedocs.io/en/stable/tutorials.html#conversion-anndata-singlecellexperiment-and-seurat-objects)
- [MuData - Multimodal AnnData](https://mudata.readthedocs.io/en/latest/index.html)
7 changes: 6 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,10 @@
'feather',
'google',
'h5py',
'dask',
'distributed',
'anndata',
'torchdata',
'ipywidgets',
'natsort',
'numba',
Expand All @@ -98,7 +102,8 @@
'mpl_toolkits',
'tqdm',
'ark.utils._bootstrapping',
'xmltodict']
'xmltodict',
'zarr',]

# prefix each section label with the name of the document it is in, followed by a colon
# autosection_label_prefix_document = True
Expand Down
6 changes: 6 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,9 @@ build-backend = "setuptools.build_meta"
[project]
dependencies = [
"alpineer==0.1.10",
"anndata",
"Cython>=0.29,<1",
"dask[distributed]",
"datasets>=2.6,<3.0",
"dill>=0.3.5,<0.4",
"feather-format>=0.4.1,<1",
Expand All @@ -29,16 +31,20 @@ dependencies = [
"requests>=2.20,<3",
"scikit-image<=0.19.3",
"scikit-learn>=1.1,<2",
"graphviz",
"scipy>=1.7,<2",
"seaborn>=0.12,<1",
"spatial-lda>=0.1.3,<1",
"statsmodels>=0.13.2,<1",
"squidpy",
"tifffile>=2022",
"torchdata",
"tqdm>=4,<5",
"umap-learn>=0.5,<1.0",
"xarray>=2022",
"xmltodict>=0.13.0,<1",
"zstandard>=0.19.0,<1",
"zarr",
"ark-analysis[colors]",
]
name = "ark-analysis"
Expand Down
Loading

0 comments on commit f3391e5

Please sign in to comment.