title | title_short | tags | authors | affiliations | date | cito-bibliography | event | biohackathon_name | biohackathon_url | biohackathon_location | group | git_url | authors_short |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VIB Hackathon on spatial omics tools and methods |
VIB Hackathon on spatial omics |
|
|
|
12 June 2024 |
paper.bib |
VIBHackathonJune2024 |
VIB Hackathon on spatial omics |
Ghent, Belgium, 2024 |
Code repository |
VIB Hackathon participants |
During a three-day hackathon, work was performed on various topics within the field of spatial omics data analysis. The topics were organized in five workgroups: benchmarking and pipelines, spatial transcriptomics, spatial proteomics, spatial multi-omics and cell-cell communication. Most tools and methods were considered in the context of the Python ecosystem for spatial [@marconato_spatialdata_2024] and single-cell [@virshup_scverse_2023] data analysis.
Results were summarized in a final slide deck. A project board collected all task items and GitHub Issues. Here we give a brief overview for each of the five workgroups.
During this hackathon, we completed the template update for nf-core/molkart, an nf-core pipeline for processing Molecular Cartography data. This paves the way for the next expansion, which will include spot-based segmentation options. Additionally, we added Spotiflow, a spot-detection tool, to the nf-core framework.
IsoQuant is a tool for the reconstruction and quantification of transcripts from single-cell long-read RNA data (e.g. from PacBio and Oxford Nanopore). Currently, IsoQuant is not optimized for spatial data and is limited to reconstructing and quantifying transcripts from a few thousand barcodes at most. While this is often sufficient for single-cell long-read RNA data, spatial data can scale to many more barcodes.
We identified the current bottlenecks in IsoQuant and started implementing a fix to circumvent them. Initial testing shows that we can now efficiently perform reconstruction and quantification of transcripts on millions of barcodes. We are currently performing further testing to ensure that results and downstream analyses are unaffected before submitting the fix as a pull request.
We merged support for incremental IO (partial read/write) in SpatialData (PR). We also identified an issue with multiscale images and discussed support for an apply function on raster data in SpatialData (draft PR).
- Specific issues:
  - Improve performance of IsoQuant for large spatial omics datasets
- Build a computational benchmark for spatial omics data:
  - Identify datasets
  - Identify first benchmarks
- Accessing remote datasets:
  - Upload spatial omics datasets to S3
  - Support for private remote object storage in SpatialData
Napari is a scalable interactive viewer for multi-dimensional data. It works natively in Python. Within this hackathon, we worked on adding functionality to napari-spatialdata, a SpatialData plugin for napari. Firstly, we worked on reusing colors previously defined in the SpatialData object. Secondly, progress was made on visualizing only subsets of the cells. This would allow plotting one cell type colored by the expression of gene X and another cell type colored by the expression of a different gene.
Thirdly, the annotation widget was further developed and checked.
Lastly, widgets can now communicate with one another. An example screenshot of the annotation widget is available.
We discussed user stories for a workflow that entails drawing annotations interactively with Napari and using the annotations in downstream analysis steps. To this end, we identified the following tasks that would enable such a workflow:
- A napari-spatialdata widget that would enable:
  - Drawing annotations on a specific image or coordinate system.
  - Renaming the annotations and attaching metadata to them, such as the identity of the annotator and labels for the annotations.
  - Saving the annotations back to the SpatialData object and to disk.
- Masked spatial graphs based on annotations: the annotations define specific areas of interest of the tissue. The analyst may wish to analyze the spatial structure enclosed in the annotations, or to use the annotation as a "negative mask" in order to remove graph edges going across void regions of the tissue.
- Calculating and plotting gene expression trends at increasing distance to the annotation of interest (or within the boundaries of the annotations of interest). This is similar to the squidpy function `sq.tl.var_by_distance`, but computing distances to polygon boundaries and not simply to the centroid of the polygon.
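To make the difference between boundary and centroid distances concrete, here is a minimal sketch using shapely and NumPy. It is not the squidpy implementation; the annotation polygon, cell coordinates, expression values, bin edges and helper names are all hypothetical.

```python
import numpy as np
from shapely.geometry import Point, Polygon

# Hypothetical annotation: a 10 x 4 rectangle drawn by an annotator.
annotation = Polygon([(0, 0), (10, 0), (10, 4), (0, 4)])

# Hypothetical cell centroids (x, y) and expression values of one gene.
cells = np.array([[12.0, 2.0], [5.0, 7.0], [5.0, 2.0], [20.0, 2.0]])
expression = np.array([3.0, 1.0, 5.0, 0.5])


def distances_to_annotation(coords, polygon):
    """Return (distance to boundary, distance to centroid) for each cell."""
    points = [Point(xy) for xy in coords]
    # polygon.boundary is the outline, so cells inside the polygon also get
    # a positive distance (polygon.distance would return 0 for them).
    to_boundary = np.array([polygon.boundary.distance(p) for p in points])
    to_centroid = np.array([polygon.centroid.distance(p) for p in points])
    return to_boundary, to_centroid


def mean_expression_by_distance(values, distances, bin_edges):
    """Average expression of the cells falling in each distance bin."""
    bin_index = np.digitize(distances, bin_edges) - 1
    means = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = bin_index == b
        if in_bin.any():
            means[b] = values[in_bin].mean()
    return means


to_boundary, to_centroid = distances_to_annotation(cells, annotation)
bin_edges = np.array([0.0, 2.5, 5.0, 15.0])
trend = mean_expression_by_distance(expression, to_boundary, bin_edges)
```

For an elongated annotation such as this rectangle, the two distance definitions rank cells differently, which is why boundary distances are the more natural choice for hand-drawn regions.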
As mentioned in this SpatialData issue, Visium HD bins can't be rasterized in memory (i.e., converted to an image) as a single full-genome image. Indeed, the smallest bins are 2-micron-wide squares with full-genome sequencing. Still, rasterization is needed for visualization and analysis purposes. Therefore, we opened a new PR for bin rasterization, which supports two modes:
- rasterization of one or multiple channels (in-memory). It uses the indices of the sparse table in CSC format for efficiency.
- lazy rasterization of the full data with Dask (in particular, using map_blocks). The data is therefore rasterized when needed, for instance to display one or a few channels in napari-spatialdata.
Remaining steps include (i) adding tests and (ii) adding some notebook examples.
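The in-memory mode can be illustrated with a small sketch that reads one gene directly from the CSC index arrays. This is not the code of the PR: the function name, the toy 2 x 3 grid, and the assumption that every bin maps to exactly one pixel are illustrative choices. The lazy Dask mode is not shown.

```python
import numpy as np
from scipy.sparse import csc_matrix


def rasterize_channel(counts, bin_rows, bin_cols, shape, gene_index):
    """Rasterize one gene of a (bins x genes) CSC matrix into a 2D image."""
    # In CSC format, the non-zero entries of column j are stored contiguously
    # between indptr[j] and indptr[j + 1]; `indices` holds the bin (row) ids.
    start, end = counts.indptr[gene_index], counts.indptr[gene_index + 1]
    bins = counts.indices[start:end]
    values = counts.data[start:end]

    image = np.zeros(shape, dtype=counts.dtype)
    # Assumes each bin occupies exactly one pixel of the output grid.
    image[bin_rows[bins], bin_cols[bins]] = values
    return image


# Toy example: 6 bins laid out on a 2 x 3 grid, 2 genes.
dense = np.array([[0, 1], [2, 0], [0, 0], [0, 3], [4, 0], [0, 5]])
counts = csc_matrix(dense)
bin_rows = np.array([0, 0, 0, 1, 1, 1])
bin_cols = np.array([0, 1, 2, 0, 1, 2])

image_gene0 = rasterize_channel(counts, bin_rows, bin_cols, (2, 3), gene_index=0)
image_gene1 = rasterize_channel(counts, bin_rows, bin_cols, (2, 3), gene_index=1)
```

Only the non-zero entries of the selected column are touched, so the cost scales with the number of bins expressing the gene rather than with the full table.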
Recently a dataset was published [@oliveira_characterization_2024] that applied multimodal spatial transcriptomics techniques to consecutive sections of the same colorectal cancer samples. Namely, Visium HD, Xenium, Visium v2 and scRNA-seq were performed. Our goal was to compare the high-resolution sequencing-based data from Visium HD with the imaging-based Xenium data to show whether they can be used to validate each other. To achieve this, we first converted the data of both modalities to SpatialData objects, then cropped and aligned the H&E image of the Visium HD assay to match the corresponding area of the Xenium chip using the alignment functions of spatialdata. With the aligned dataset we were able to show that a marker gene for epithelial cells (CEACAM6) and a marker gene for crypt base columnar cells (OLFM4) are expressed in the same tissue regions in both modalities. Finally, we looked into further methods to analyze these datasets:
- Label transfer from scRNA-seq data to Visium HD (RCTD speed-up version) and Xenium (SingleR)
- Investigate the impact of different normalization methods on SVG detection, using Visium, Visium HD, and Xenium replicates.
- Merging spatialdata objects of Xenium and Visium HD
- Microenvironment detection using Banksy [@singhal_banksy_2022].
Multiple unsupervised metrics have been added in this Squidpy PR to evaluate niche detection methods. Notably:
- a niche continuity metric (F1-score of cross-niche edges)
- a cross-slide homogeneity metric (Jensen-Shannon divergence of niche distributions across slides)
- DE tests to compare max gene expression across niches
- ARI, NMI and Fowlkes-Mallows Index for niche result comparison (agreement)
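As an illustration of the cross-slide homogeneity idea, the sketch below computes the Jensen-Shannon divergence between the niche proportions of two slides with NumPy. The exact definition used in the Squidpy PR may differ; the function names and toy labels are assumptions made for this example.

```python
import numpy as np


def niche_proportions(labels, n_niches):
    """Fraction of cells assigned to each niche on one slide."""
    counts = np.bincount(labels, minlength=n_niches).astype(float)
    return counts / counts.sum()


def jensen_shannon_divergence(p, q):
    """JS divergence (base 2) between two discrete distributions, in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        nonzero = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[nonzero] * np.log2(a[nonzero] / b[nonzero]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical niche labels (one integer per cell) for three slides.
slide_a = np.array([0, 0, 1, 1, 2, 2])
slide_b = np.array([2, 1, 0, 0, 1, 2])  # same niche composition as slide_a
slide_c = np.array([0, 0, 0, 0, 0, 0])  # a single niche only

p_a, p_b, p_c = (niche_proportions(s, 3) for s in (slide_a, slide_b, slide_c))
same = jensen_shannon_divergence(p_a, p_b)
different = jensen_shannon_divergence(p_a, p_c)
```

A value near 0 means the niches are represented in similar proportions on both slides, while values approaching 1 indicate slide-specific niches.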
Group members had most experience with analysis of Miltenyi MACSima, Akoya Phenocycler, Lunaphore COMET and MIBI data. After some discussion, four work items were selected.
Some common issues in spatial proteomics analysis were discussed. Readers for the SpatialData format are still lacking for some platforms. Some useful metadata is also not always included, such as physical pixel size, autofluorescence subtraction, imaging cycles and exposure time. Some datasets require detecting misalignment and co-registering the channel images, either all of them or specific ones. For segmentation, applying CLAHE and using cellpose was found to be sufficient for most cells. Exceptional cell shapes in tissues such as the heart and brain pose additional difficulty and require fine-tuning the segmentation model with enough training data. The manual labeling this requires is time-consuming and difficult to reproduce. There was no consensus on the available normalization and batch effect correction techniques or on their usefulness.
Support for exporting cells in SpatialData and interactively annotating them using a classifier with Ilastik software [@berg_ilastik_2019].
Normalization facilitates the integration and comparison of data from different experiments, which is essential for large-scale studies and meta-analyses of spatial omics data. An overview of normalization methods for downstream analysis of spatial proteomics datasets, together with a comparison between them, is therefore crucial.
While evaluation and benchmarking would require a gold standard cell type dataset, which is beyond the scope of this hackathon, a new repository was created that contains a summary of 9 methods adapted from published literature. The code for each method is also available. A visualization of results obtained from these different methods on a MIBI dataset (not publicly available) is provided as well. Among the different methods, a visual qualitative comparison suggests that a combined method (Shaban et al. + Greenbaum et al.) may yield more promising results. We plan to extend the work from this hackathon with a quantitative comparison in the future.
An alternative to the `spatialdata.to_polygons()` label vectorization function, which features improved performance, resolution of invalid geometries, and `shapely.MultiPolygon` filtering based on area.
Polygonal representation of cells is crucial for characterizing cellular morphologies and establishing spatial relationships between cells. This method is applicable when cells are located on different planes within tissue, as well as for calculating distances between various objects. However, there is a notable lack of tools that can take a `TIFF` file with cell labels and output a `GeoDataFrame` or `GeoJSON`. The development branch of the `SpatialData` framework includes a `to_polygons()` vectorization function, but it lacks functionality for resolving invalid geometries and filtering multipolygons.
The following illustrates a practical example: when analyzing thick imaging samples without a z-stack, we observe different cell types located in different z-planes relative to each other. This is usually not an issue when masks come from mutually exclusive intensity channels. However, with more general markers, we may encounter incorrect and overlapping segmentation masks. Resolving these spatially overlapping segmentation masks through geometrical subtraction often results in fragmented multipolygons with small polygons and lines, affecting downstream applications.
We aim to address the problems of invalid geometries and multipolygon filtering and provide an easy-to-use function compatible with standard `NumPy` arrays (unlike `SpatialData`, which requires a `SpatialImage` instance to perform vectorization). Additionally, our approach improves performance (~2x speed-up) by avoiding chunking of the input array.
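The two clean-up steps, repairing invalid geometries and filtering multipolygon parts by area, can be sketched with shapely as follows. This is not the function developed during the hackathon; `polygon_parts`, `clean_geometry` and the `min_area` threshold are hypothetical names chosen for illustration, and the vectorization of the label array itself is not shown.

```python
from shapely.geometry import MultiPolygon, Polygon, box
from shapely.validation import make_valid


def polygon_parts(geom):
    """Recursively collect the Polygon parts of any geometry."""
    if isinstance(geom, Polygon):
        return [geom]
    if hasattr(geom, "geoms"):  # MultiPolygon or GeometryCollection
        return [part for sub in geom.geoms for part in polygon_parts(sub)]
    return []  # lines and points are discarded


def clean_geometry(geom, min_area):
    """Repair an invalid geometry and drop polygon parts below `min_area`."""
    repaired = geom if geom.is_valid else make_valid(geom)
    kept = [p for p in polygon_parts(repaired) if p.area >= min_area]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else MultiPolygon(kept)


# A self-intersecting "bowtie" outline, as can arise from vectorized labels.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
fixed = clean_geometry(bowtie, min_area=0.5)

# A cell mask with a tiny fragment left over after geometrical subtraction.
fragmented = MultiPolygon([box(0, 0, 10, 10), box(20, 20, 20.1, 20.1)])
cleaned = clean_geometry(fragmented, min_area=1.0)
```

Returning `None` when nothing survives the area filter is one possible design choice; dropping the cell from the output table would be the natural follow-up.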
The features of the new reader for MACSima datasets in spatialdata-io, with support for lazy loading, physical pixel size and imaging cycles, are described in this GitHub Issue. The draft PR is available here.
Spatial multi-omics are an emerging class of technologies that record two or more data modalities from biological samples in a spatial context. Modalities can among others include RNA, protein, epigenetic features like chromatin accessibility and pathohistological stains. To get a better overview of the field we collected available datasets and methods. In addition, we tried to generate reliable in silico spatial multi-omics data.
True multi-omic datasets that record multiple modalities of the same cells are rare, which motivates our subproject on multi-slice alignment via image registration and integration algorithms.
Cell morphology, which is revealed by classical staining methods, is a potentially very rich source of information that complements spatial transcriptomic assays like Visium and Xenium. Recently developed vision models allow unsupervised extraction of morphological features, which can then be used for clustering and data integration tasks. During the hackathon, general-purpose models trained on ImageNet and UNI [@chen_towards_2024], a model specifically tuned on histopathology, were evaluated.
Multi-modal measurements are usually performed on consecutive slides, which do not align in most cases. In order to perform multi-modal analyses, a correspondence between the measurements is needed. Rigid and affine transforms can help align images between modalities but in real-world cases, the alignment obtained is poor.
We planned on using a publicly available multi-modal dataset to test different alignment strategies. We tried performing simple affine transformations (e.g., scaling and rotation) but found the alignment to be poor. Other non-affine methods are available in the literature (e.g., SLAT, ELD, CAST), but we encountered several issues related to installation and data availability. Despite the great promise of these methods, the lack of a standard multi-modal spatial object representation ultimately hinders the applicability and downstream analyses of aligned datasets.
Another promising avenue is the use of landmarks to perform alignment in a supervised manner. Spatially resolved technologies such as Xenium offer a single-cell resolution unavailable in previous technologies. However, unlike in Visium and Visium HD, the classic H&E image is not necessarily part of the output and is usually acquired afterward. It is therefore necessary to align the Xenium assay with the H&E slide. This is done by annotating landmarks in both the Xenium and H&E images, computing the alignment, and using Napari to visualize the result. The spatialdata package allows for the recovery of the spatial coordinates and the resizing of the H&E slide.
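The landmark-based step can be illustrated with a plain NumPy least-squares fit of a 2D affine transform from paired landmarks. This is a generic sketch rather than the spatialdata implementation; the function names and landmark coordinates are hypothetical.

```python
import numpy as np


def fit_affine(source, target):
    """Least-squares 2D affine transform mapping source landmarks to target.

    Returns a 3 x 3 homogeneous matrix A such that [x', y', 1] = A @ [x, y, 1].
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    if len(source) < 3:
        raise ValueError("at least three non-collinear landmark pairs are needed")
    homogeneous = np.hstack([source, np.ones((len(source), 1))])  # (n, 3)
    # Solve homogeneous @ params = target for the 3 x 2 parameter matrix.
    params, *_ = np.linalg.lstsq(homogeneous, target, rcond=None)
    matrix = np.eye(3)
    matrix[:2, :] = params.T
    return matrix


def apply_affine(matrix, points):
    """Apply a 3 x 3 homogeneous affine matrix to an (n, 2) array of points."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ matrix.T)[:, :2]


# Hypothetical landmarks clicked on the H&E image (source) ...
he_landmarks = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# ... and their counterparts in Xenium coordinates: here generated from a
# known transform (scale 2, rotate 90 degrees, translate by (5, -3)).
true_matrix = np.array([[0.0, -2.0, 5.0], [2.0, 0.0, -3.0], [0.0, 0.0, 1.0]])
xenium_landmarks = apply_affine(true_matrix, he_landmarks)

estimated = fit_affine(he_landmarks, xenium_landmarks)
mapped = apply_affine(estimated, [[2.0, 3.0]])
```

With more than three landmark pairs the fit averages out annotation noise, but an affine model still cannot correct the local tissue deformations discussed above.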
Currently a very limited number of solutions are available for multi-omics integration. Newly developed tools are not widely used, lack proper benchmarking and suffer from a limited number of datasets to perform thorough testing. Here we attempted to collect information on publicly available spatial multi-omics datasets. We also list state-of-the-art computational solutions for horizontal, vertical and diagonal data integration with key details paying special attention to the diagonal unmatched integration. An overview of the collected datasets and methods is provided in the supplement.
Integration challenges:
- number of detected features (e.g. RNA-seq VS proteomics)
- different feature counts, statistical distributions
- differences in resolution (imaging-based)
- image alignment/overlay (imaging-based)
- batch effect
- technical (heavy data)
Horizontal integration merges the same omic across different datasets. Reasons:
- 3D maps
- technical replicates, integrating batches
- integrating across different technologies
In fact, this is not a true multi-omics integration.
Examples:
- STAGATE (spatial transcriptomics, consecutive sections, adaptive graph attention auto-encoder)
- STAligner (spatial transcriptomics datasets, batch effect-corrected embeddings, 3D reconstruction)
- SpaGCN (spatial transcriptomics, graph convolutional network approach that integrates gene expression, spatial location and histology)
- PASTE (align and integrate ST data from multiple adjacent tissue sections)
- SpaceFlow (embedding is continuous both in space and time, Deep Graph Infomax (DGI) framework with spatial regularization)
Vertical integration merges data from different omics within the same set of samples (matched integration), using the cell as an anchor. Examples:
- ArchR
- MaxFuse (fuzzy smoothed embedding for weakly linked modalities, proteomics, transcriptomics and epigenomics at single-cell resolution on the same tissue section)
- MultiMAP (nonlinear manifold learning algorithm that recovers a single manifold on which several datasets reside and then projects the data into a single low-dimensional space so as to preserve the manifold structure)
- Seurat v5
Some examples of studies with unmatched (diagonal) integration:
- SpatialGlue
  - graph neural network with dual-attention mechanism
  - 2 separate graphs to encode data into common embedding space: a spatial proximity graph and a feature graph
- MEFISTO
  - factor analysis + flexible non-parametric framework of Gaussian processes
  - spatio-temporally informed dimensionality reduction, interpolation, and separation of smooth from non-smooth patterns of variation
  - different omics, multiple sets of samples (different experimental conditions, species or individuals)
  - each sample is characterized by "view", "group", and by a continuous covariate such as a one-dimensional temporal or two-dimensional spatial coordinate
- SLAT
  - aligning heterogeneous spatial data across distinct technologies and modalities
  - graph adversarial matching
- Cross-modality mapping using image varifolds
Additional details on these methods are summarized in supplementary Table 1. A general issue is that most tools are gene-based, which poses challenges for proteomics (and even more so for metabolomics). Direct comparison of these tools is not possible due to their different tasks and working principles.
Due to the limited number of available spatial datasets and their complexity, tools for in silico generation of artificial spatial datasets are becoming more popular. Such tools may be useful for experimental design planning, for selecting a sampling strategy that yields reliable statistics, and for benchmarking new tools. Unfortunately, some of the current solutions cause serious technical issues during installation and running. Here we list 3 existing tools for dataset generation:
- Power analysis for spatial omics
  - tissue scaffold: random-circle-packing algorithm to generate a planar graph
  - attributes on nodes represent cell type assignments
  - the labeling is based on two data-driven parameters (prior knowledge) for a tissue type: the proportions of the k unique cell types, and the pairwise probabilities of each possible cell type pair being adjacent (a k × k matrix)
  - by changing these 2 parameters one should be able to obtain simulations for different tissues and technologies
  - we faced some technical issues while using this tool
- scDesign3
- SRTsim (transcriptomics only)
Spatial landmark detection and tissue registration with deep learning: paper and code.
Multiple vision models were evaluated for feature extraction from hematoxylin and eosin (H&E) stains. These include general-purpose vision models available in torchvision, such as ResNet50 and Inception v3, as well as UNI, a dedicated pathology model. PCA was performed on the extracted feature vectors, followed by k-means clustering of the first 10 principal components. Our results show that multiple models successfully extract region-specific features from the images. UNI in particular performed strongly, with clusters closely matching pathologist annotation. It remains to be seen how these features can best be integrated with RNA information for clustering and spatial domain identification.
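The PCA plus k-means step can be sketched in NumPy as below. The feature vectors are synthetic stand-ins for the model embeddings (no vision model is run here), and the farthest-point initialisation of k-means is an illustrative choice made to keep the example deterministic, not necessarily what was used during the hackathon.

```python
import numpy as np


def pca_scores(features, n_components):
    """Project feature vectors onto their first principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic "embeddings": 3 tissue regions, 30 image tiles each, 64 features.
rng = np.random.default_rng(0)
region_means = rng.normal(scale=10.0, size=(3, 64))
features = np.vstack([m + rng.normal(size=(30, 64)) for m in region_means])

scores = pca_scores(features, n_components=10)
labels = kmeans(scores, k=3)
```

In practice `features` would hold one embedding per image tile, and the resulting labels can be mapped back to tile positions for comparison with pathologist annotations.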
The goal of the group was to run multiple spatial cell-cell communication (CCC) methods and compare their evaluations, visualizations and results. We selected the methods from @armingol_diversification_2024. A more detailed table can be found in the separate GitHub repository of this workgroup.
Methods were implemented and tested on a subset of the MERFISH whole mouse brain data (slice 80) from the Allen Brain Institute.
We obtained results for CCC for the following methods: COMMOT, SpatialDM, MEBOCOST, CellPhoneDB. SpatialDM and CellPhoneDB were run with LIANA+. We also ran SpaTalk but found no LR pairs, as the tool requires that the entire ligand-receptor-tf-target pathway is expressed for a LR pair to be considered, and this was likely not the case in a dataset with 1122 genes. For the other three tools we selected specific LR pairs to compare the results.
Do the tools identify the same sender and receiver cells that participate in communication?
LIANA+ (CellPhoneDB) and COMMOT find common ligand-receptor pairs; however, among the few cell-type source-target pairs we investigated, there was no consensus. The comparison was qualitative rather than quantitative because the tools differ in output format and evaluation metrics. Because MEBOCOST calculates metabolic communication scores rather than using ligand-receptor interactions like the other tools, we could not compare its results directly.
Where do the tools predict the communication to take place in tissue space? Do spatial methods benefit from the additional modality?
COMMOT and SpatialDM both make use of spatial information to predict the communication events. We investigated the LR pair Nts-Ntsr2; the cells seem to interact in the same brain region (hypothalamus).
- Comparison of results is difficult because i) there is no ground truth regarding CCC, ii) output formats of methods vary, for example SpatialDM returns an $N \times LR$ matrix with a score for each cell indicating the potential strength of a ligand or receptor and COMMOT returns an $N \times N$ matrix for each $LR$ interaction, iii) the methods use different score metrics.
- A more systematic comparison should be carried out over all cell types and all ligand-receptor pairs.
- The input databases on which the communication analysis is based differ (metabolic vs. ligand-receptor), and even among LR-based methods, a tool might use either the CellPhoneDB or the CellChat database.
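One possible way to bring cell-level outputs to a common, comparable format is to average them over cell-type pairs. The sketch below does this for a hypothetical $N \times N$ sender-by-receiver matrix of a single LR pair. It was not part of the hackathon results, does not use the COMMOT or SpatialDM APIs, and the function name and toy scores are assumptions.

```python
import numpy as np


def aggregate_by_cell_type(cell_scores, cell_types):
    """Average an N x N sender-by-receiver matrix over cell-type pairs."""
    types, codes = np.unique(cell_types, return_inverse=True)
    one_hot = np.eye(len(types))[codes]          # (N, n_types) membership
    sums = one_hot.T @ cell_scores @ one_hot     # summed score per type pair
    counts = one_hot.sum(axis=0)
    pair_counts = np.outer(counts, counts)       # number of cell pairs summed
    return types, sums / pair_counts


# Hypothetical scores for one LR pair: entry [i, j] is the signal sent from
# cell i to cell j. Self-pairs (the diagonal) are included in the averages.
cell_types = np.array(["A", "A", "B"])
cell_scores = np.array([
    [0.0, 1.0, 2.0],
    [3.0, 0.0, 4.0],
    [5.0, 6.0, 0.0],
])

types, type_scores = aggregate_by_cell_type(cell_scores, cell_types)
```

The resulting type-by-type matrix has the same shape as the source-target summaries reported by cell-type-level tools, although differences in score metrics remain.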
This hackathon was attended by 37 participants from many institutes across Europe. It provided a useful venue for the exchange of ideas and the development of new tools and methods for spatial omics data analysis. Status updates and results were summarized in a slide deck. A project board collected all task items and a Zulip stream was used for communication. Code to use the provided computational resources and some of the hackathon results are available in this code repository.
The hackathon was organized by the Saeys Lab and supported by Data Intuitive, the VIB Spatial Catalyst and the VIB Center for AI and Computational Biology.
The computational resources and services used in this work were provided by the VIB Data Core and the VSC (Flemish Supercomputer Center), funded by the Research Foundation – Flanders (FWO) and the Flemish Government. B.R., R.S. and Y.S. are supported by the Flanders AI Research Program.