title title_short tags authors affiliations date cito-bibliography event biohackathon_name biohackathon_url biohackathon_location group git_url authors_short
VIB Hackathon on spatial omics tools and methods
VIB Hackathon on spatial omics
spatial omics
spatial transcriptomics
spatial proteomics
cell-cell communication
bioinformatics pipelines
name orcid affiliation
Benjamin Rombaut
0000-0002-4022-715X
123
name orcid affiliation
Lotte Pollaris
0000-0002-0262-0540
123
name affiliation
Chananchida Sang-aram
123
name affiliation
Michiel Ver Cruysse
13
name orcid affiliation
Robrecht Cannoodt
0000-0003-3641-729X
512
name affiliation
Frank Vernaillen
4
name affiliation
Arne Defauw
4
name affiliation
Julien Mortier
4
name orcid affiliation
Mayar Ali
0000-0002-0398-5699
1819
name orcid affiliation
Kresimir Bestak
0009-0009-8245-9846
6
name orcid affiliation
Quentin Blampey
0000-0002-3836-2889
16
name orcid affiliation
Michele Bortolomeazzi
0000-0001-5805-5774
10
name orcid affiliation
Paula V M Cauhy
0000-0003-1004-3656
26
name orcid affiliation
Miray Cetin
0009-0001-7711-0211
14
name orcid affiliation
Daniel Dimitrov
0000-0002-5197-2112
6
name orcid affiliation
Francesca Drummer
18, 20
name orcid affiliation
Lorenzo Giordani
0000-0002-3417-2965
24
name orcid affiliation
Aroj Hada
0000-0002-0691-1214
67
name orcid affiliation
Luuk Harbers
0000-0003-3910-6497
8
name orcid affiliation
Miguel A. Ibarra-Arellano
0000-0001-8411-4854
6
name orcid affiliation
Paul Kiessling
0000-0002-9794-9532
11
name orcid affiliation
Laurens Lehner
0000-0001-7690-7168
18
name orcid affiliation
Susmita Mandal
0000-0003-2248-7860
23
name orcid affiliation
Benedetta Manzato
0009-0008-8369-2327
21
name orcid affiliation
Luca Marconato
0000-0003-3198-1326
13
name orcid affiliation
Claudio Novella-Rausell
0000-0002-7383-6090
21
name orcid affiliation
Anastasiia Okhtienko
0009-0003-5886-811X
17
name orcid affiliation
Giovanni Palla
0000-0002-8004-4462
18
name orcid affiliation
Daryna Pikulska
0009-0005-1638-0268
11
name orcid affiliation
Carlos Ariel Pulido-Vicuna
0000-0001-5049-6997
228
name orcid affiliation
Guillaume Sacchetti
0000-0002-8779-352X
228
name orcid affiliation
Alexander Sudy
0000-0002-7338-4119
12
name orcid affiliation
Lotte Van de Vreken
0009-0000-9283-4720
15
name orcid affiliation
Wouter-Michiel Vierdag
0000-0003-1666-5421
13
name orcid affiliation
Vladislav Vlasov
0009-0005-9514-4860
9
name orcid affiliation
Sai Nirmayi Yasa
0009-0003-6319-9803
5
name orcid affiliation
Estella Yixing Dong
0009-0003-5115-5686
25
name orcid affiliation
Ruth Seurinck
0000-0002-6636-7572
123
name orcid affiliation
Yvan Saeys
0000-0002-0415-1506
123
name index
Data Mining and Modelling for Biomedicine, VIB-UGent Center for Inflammation Research, Ghent, Belgium
1
name index
Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
2
name index
VIB Center for AI and Computational Biology, Ghent, Belgium
3
name index
VIB Spatial Catalyst
4
name index
Data Intuitive, Lebbeke, Belgium
5
name index
Institute for Computational Biomedicine, Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
6
name index
AI-Health Innovation Cluster, Heidelberg, Germany
7
name index
VIB-KU Leuven Center for Cancer Biology, Leuven, Belgium
8
name index
Brain and Systems Immunology Lab, Brussels Center for Immunology, Vrije Universiteit Brussel
9
name index
ScOpen Lab, German Cancer Research Center (DKFZ), Heidelberg, Germany
10
name index
RWTH Aachen, University Hospital
11
name index
Center of Digital Health, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany
12
name index
European Molecular Biology Laboratory, Heidelberg, Germany
13
name index
Systems Immunology and Single-Cell Biology, German Cancer Research Center (DKFZ), Heidelberg, Germany
14
name index
VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
15
name index
MICS Laboratory, CentraleSupélec, Paris-Saclay University, Paris, France
16
name index
Institute of Virology, Technical University of Munich, Munich, Germany
17
name index
Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
18
name index
Institute for Tissue Engineering and Regenerative Medicine, Helmholtz Munich, Neuherberg, Germany
19
name index
Institute for Stroke and Dementia Research, Klinikum Der Universität München, Ludwig-Maximilians-Universität, Munich, Germany
20
name index
Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
21
name index
Laboratory for Molecular Cancer Biology, Center for Cancer Biology, VIB, Leuven, Belgium; Department of Oncology, KU Leuven, Leuven, Belgium.
22
name index
Institute of Pathology at Charité – Universitätsmedizin Berlin, Germany
23
name index
Sorbonne Université, INSERM UMRS 974, Association Institut de Myologie, Centre de Recherche en Myologie, 75013 Paris, France.
24
name index
Biomedical Data Science Center, Lausanne University Hospital; University of Lausanne, Lausanne, Switzerland.
25
name index
UK Dementia Research Institute, University College London, WC1E 6BT, London, UK
26
12 June 2024
paper.bib
VIBHackathonJune2024
VIB Hackathon on spatial omics
Ghent, Belgium, 2024
Code repository
VIB Hackathon participants

Introduction

During a three-day hackathon, participants worked on various topics within the field of spatial omics data analysis. The topics were organized in five workgroups and included benchmarking, pipelines, spatial transcriptomics, spatial proteomics, spatial multi-omics and cell-cell communication. Most tools and methods were considered in the context of the Python ecosystem for spatial [@marconato_spatialdata_2024] and single-cell [@virshup_scverse_2023] data analysis.

Results

Results were summarized in a final slide deck. A project board collected all task items and GitHub Issues. Here we give a brief overview for each of the five workgroups.

Workgroup pipelines

Nextflow

During this hackathon, we completed the template update for nf-core/molkart, an nf-core pipeline for processing Molecular Cartography data. This clears the way for the next expansion, which will include spot-based segmentation options. Additionally, we added Spotiflow, a spot-detection tool, to the nf-core framework.

Isoquant

Isoquant is a tool for the reconstruction and quantification of single-cell long-read RNA data (e.g. from PacBio and Oxford Nanopore). Currently, Isoquant is not optimized for spatial data and is limited to reconstructing and quantifying transcripts from a few thousand barcodes at most. While this is often sufficient for single-cell long-read RNA data, spatial data can scale to many more barcodes.

We identified the current bottlenecks in Isoquant and started implementing a fix to circumvent them. In initial testing, we can now efficiently reconstruct and quantify transcripts on millions of barcodes. We are currently performing further testing to ensure that results and downstream analyses are unaffected before submitting the fix as a pull request.

Infrastructure for pipelines

We merged support for incremental IO (partial read/write) in SpatialData (PR). We also identified an issue with multiscale images and discussed support for an apply function on raster data in SpatialData (draft PR).

  • Specific issues:
    • Improve performance of Isoquant for large spatial omics datasets
    • Build a computational benchmark for spatial omics data
      • identify datasets
      • identify first benchmarks
  • Accessing remote datasets:
    • Upload spatial omics datasets to S3
    • Support for private remote object storage in SpatialData

Workgroup spatial transcriptomics

Napari plugin

Napari is a scalable interactive viewer for multi-dimensional data that works natively in Python. During this hackathon, we worked on adding functionality to napari-spatialdata, a SpatialData plugin for napari. Firstly, we worked on reusing colors previously defined in the SpatialData object. Secondly, progress was made on visualizing only subsets of the cells. This would make it possible to plot one cell type colored by the expression of gene X and another cell type colored by the expression of a different gene.
Thirdly, the annotation widget was further developed and checked. Lastly, widgets can now communicate with one another. An example screenshot of the annotation widget is available.

Annotation workflows

We discussed user stories for a workflow that entails drawing annotations interactively with Napari and using the annotations in downstream analysis steps. To this end, we identified the following tasks that would enable such a workflow:

  • napari-spatialdata widget that would enable:
    1. Drawing annotations on a specific image or coordinate system.
    2. Renaming the annotations and attaching metadata to them, such as the identity of the annotator, labels for the annotations and others.
    3. Saving the annotations back to the spatialdata object and to disk.
  • Masked spatial graphs based on annotations: the annotations define specific areas of interest of the tissue. The analyst may wish to analyze the spatial structure enclosed in the annotations, or use the annotation as a "negative mask" in order to remove graph edges going across void regions of the tissue.
  • Calculating and plotting gene expression trends at increasing distance to the annotation of interest (or within the boundaries of the annotations of interest). This is similar to the squidpy function sq.tl.var_by_distance but computing distances to polygon boundaries and not simply to the centroid of the polygon.
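
The difference between boundary-based and centroid-based distances in the last task can be illustrated with a minimal sketch. It uses only the Python standard library; the function names and the toy rectangle are hypothetical and are not part of squidpy or napari-spatialdata.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: both ends coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(p, polygon):
    """Minimum distance from p to any edge of a closed polygon (list of vertices)."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )

def distance_to_centroid(p, polygon):
    """Distance from p to the vertex mean, used here as a simple centroid stand-in."""
    cx = sum(x for x, _ in polygon) / len(polygon)
    cy = sum(y for _, y in polygon) / len(polygon)
    return math.hypot(p[0] - cx, p[1] - cy)

# Hypothetical elongated annotation (10 x 2 rectangle) and a cell just outside it.
annotation = [(0.0, 0.0), (10.0, 0.0), (10.0, 2.0), (0.0, 2.0)]
cell = (12.0, 1.0)
d_boundary = distance_to_boundary(cell, annotation)
d_centroid = distance_to_centroid(cell, annotation)
```

For elongated annotations the two distances diverge strongly: the cell above sits right next to the annotation but far from its centroid. Handling cells inside the annotation would additionally need a point-in-polygon test, which this sketch leaves out.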

Visium HD on-the-fly rasterization

As mentioned in this SpatialData issue, Visium HD bins can't be rasterized in memory (i.e., converted to an image) as a single full-genome image: the smallest bins are 2 µm wide squares with full-genome sequencing. Still, rasterization is needed for visualization and analysis purposes. Therefore, we opened a new PR for bin rasterization, in which we support two modes:

  • rasterization of one or multiple channels (in-memory). It uses the indices of the sparse table in CSC format for efficiency.
  • lazy rasterization of the full data with Dask (in particular, using map_blocks). The data is therefore rasterized when needed, for instance to display one or a few channels in napari-spatialdata.
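
The idea behind the first mode can be sketched as follows. This is not the code from the PR: it is a minimal standard-library illustration that assumes a bins-by-genes table stored in CSC form as three plain lists (`indptr`, `indices`, `data`) and hypothetical per-bin grid coordinates `bin_rows` and `bin_cols`.

```python
def rasterize_channel(indptr, indices, data, bin_rows, bin_cols, shape, channel):
    """Rasterize one gene (column) of a bins x genes CSC matrix into a 2D grid."""
    height, width = shape
    image = [[0.0] * width for _ in range(height)]
    # In CSC, the nonzero entries of one column are stored contiguously,
    # so only the bins that actually express the gene are visited.
    start, stop = indptr[channel], indptr[channel + 1]
    for k in range(start, stop):
        b = indices[k]  # row index in the table, i.e. the bin index
        image[bin_rows[b]][bin_cols[b]] = data[k]
    return image

# Toy table with 4 bins on a 2 x 2 grid and 2 genes:
#   bin0: [3, 0]   bin1: [0, 1]   bin2: [0, 0]   bin3: [5, 2]
indptr = [0, 2, 4]
indices = [0, 3, 1, 3]
data = [3.0, 5.0, 1.0, 2.0]
bin_rows = [0, 0, 1, 1]
bin_cols = [0, 1, 0, 1]

image_gene0 = rasterize_channel(indptr, indices, data, bin_rows, bin_cols, (2, 2), 0)
image_gene1 = rasterize_channel(indptr, indices, data, bin_rows, bin_cols, (2, 2), 1)
```

Because the cost depends on the number of nonzero entries of the selected gene rather than on the full table, a few channels can be rasterized in memory even when the full-genome image cannot.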

Remaining steps include (i) adding tests and (ii) adding some notebook examples.

Visium HD and Xenium

A recently published dataset [@oliveira_characterization_2024] applied multiple spatial transcriptomics techniques to consecutive sections of the same colorectal cancer samples, namely Visium HD, Xenium and Visium v2, alongside scRNA-seq. Our goal was to compare the high-resolution sequencing-based data from Visium HD with the imaging-based Xenium data to assess whether they can be used to validate each other. To achieve this, we first converted the data of both modalities to spatialdata objects, then cropped and aligned the H&E image of the Visium HD assay to match the corresponding area of the Xenium chip using the alignment functions of spatialdata. With the aligned dataset we were able to show that a marker gene for epithelial cells (CEACAM6) and a marker gene for crypt base columnar cells (OLFM4) are expressed in the same tissue regions in both modalities. Finally, we looked into further methods to analyze these datasets:

  • Label transfer from scRNA-seq data to Visium HD (RCTD speed-up version) and Xenium (SingleR)
  • Investigate the impact of different normalization methods on SVG detection, using Visium, Visium HD, and Xenium replicates.
  • Merging spatialdata objects of Xenium and Visium HD
  • Microenvironment detection using Banksy [@singhal_banksy_2022].

Cellular niches validation metrics

Multiple unsupervised metrics have been added in this Squidpy PR to evaluate niche detection methods. Notably:

  • a niche continuity metric (F1-score of cross-niche edges)
  • a cross-slide homogeneity metric (Jensen-Shannon divergence of niche distributions across slides)
  • DE tests to compare max gene expression across niches
  • ARI, NMI and Fowlkes-Mallows Index for niche result comparison (agreement)
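
As an illustration of the cross-slide homogeneity idea, the sketch below computes the Jensen-Shannon divergence between the niche distributions of two slides. It is a standard-library approximation under stated assumptions (two slides only, base-2 logarithm, hypothetical function names and labels); the exact definition used in the Squidpy PR may differ.

```python
import math
from collections import Counter

def niche_distribution(labels, niches):
    """Fraction of cells assigned to each niche, in the order given by `niches`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[n] / total for n in niches]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence in [0, 1] (base 2): 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

niches = ["n1", "n2", "n3"]
slide_a = ["n1", "n1", "n2", "n2"]
slide_b = ["n2", "n1", "n2", "n1"]   # same composition as slide_a
slide_c = ["n3", "n3", "n3", "n3"]   # entirely different niche

jsd_same = jensen_shannon_divergence(
    niche_distribution(slide_a, niches), niche_distribution(slide_b, niches)
)
jsd_disjoint = jensen_shannon_divergence(
    niche_distribution(slide_a, niches), niche_distribution(slide_c, niches)
)
```

Low values indicate that a method assigns niches in similar proportions across slides, while values near 1 indicate that the slides share almost no niches.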

Workgroup spatial proteomics

Group members had most experience with analysis of Miltenyi MACSima, Akoya Phenocycler, Lunaphore COMET and MIBI data. After some discussion, four work items were selected.

Some common issues in spatial proteomics analysis were discussed. Readers that load datasets into the SpatialData format are still lacking for some platforms. Some useful metadata is also not always included, such as physical pixel size, autofluorescence subtraction, imaging cycles and exposure time. Some datasets require detecting misalignment and co-registering the channel images, either all of them or specific ones. For segmentation, applying CLAHE and using cellpose was found to be sufficient for most cells. Exceptional cell shapes in tissues such as the heart and brain pose additional difficulty and require fine-tuning the segmentation model with enough training data. The manual labeling this requires is time-consuming and difficult to reproduce. There was a lack of consensus on available normalization techniques, batch effect correction and their usefulness.

Interoperability with Ilastik and SpatialData

This work item covers support for exporting cells stored in SpatialData and interactively annotating them using a classifier in the Ilastik software [@berg_ilastik_2019].

Overview of normalization methods

Normalization facilitates the integration and comparison of data from different experiments, which is essential for large-scale studies and meta-analyses of spatial omics data. An overview of normalization methods for downstream analysis of spatial proteomics datasets, together with a comparison between them, is therefore crucial.

While evaluation and benchmarking would require a gold-standard cell type dataset, which is beyond the scope of this hackathon, a new repository was created that contains a summary of 9 methods adapted from published literature. Code for each method is also available. A visualization of results obtained from these different methods on a MIBI dataset (not publicly available) is provided as well. A visual, qualitative comparison suggests that a combined method (Shaban et al. + Greenbaum et al.) may yield more promising results. We plan to extend the work from this hackathon with a quantitative comparison in the future.

Polygon vectorization

We developed an alternative to the spatialdata.to_polygons() label vectorization function that features improved performance, resolution of invalid geometries, and filtering of shapely.MultiPolygon objects based on area.

Polygonal representation of cells is crucial for characterizing cellular morphologies and establishing spatial relationships between cells. This method is applicable when cells are located on different planes within tissue, as well as for calculating distances between various objects. However, there is a notable lack of tools that can take a TIFF file with cell labels and output a GeoDataFrame or GeoJSON. The development branch of the SpatialData framework includes a to_polygons() vectorization function, but it lacks functionality for resolving invalid geometries and filtering multipolygons.

The following illustrates a practical example: when analyzing thick imaging samples without a z-stack, we observe different cell types located in different z-planes relative to each other. This is usually not an issue when masks come from mutually exclusive intensity channels. However, with more general markers, we may encounter incorrect and overlapping segmentation masks. Resolving these spatially overlapping segmentation masks through geometrical subtraction often results in fragmented multipolygons with small polygons and lines, affecting downstream applications.

We aim to address the problems of invalid geometries and multipolygon filtering and provide an easy-to-use function compatible with standard NumPy arrays (unlike SpatialData, which requires a SpatialImage instance to perform vectorization). Additionally, our approach improves performance roughly twofold by avoiding chunking of the input array.
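
The area-based filtering step can be illustrated with a minimal sketch. It is not the hackathon implementation: it uses only the Python standard library, represents each part of a multipolygon as a plain list of vertices, and the function names and the `min_area` threshold are hypothetical. In practice a geometry library such as shapely would supply the areas.

```python
def ring_area(ring):
    """Unsigned area of a simple polygon ring via the shoelace formula."""
    n = len(ring)
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

def filter_parts_by_area(parts, min_area):
    """Keep only the parts whose area reaches min_area, dropping slivers and lines."""
    return [ring for ring in parts if ring_area(ring) >= min_area]

# A fragmented cell after geometrical subtraction of an overlapping mask.
cell_parts = [
    [(0, 0), (4, 0), (4, 4), (0, 4)],    # main cell body, area 16
    [(5, 0), (5.5, 0), (5.5, 0.5)],      # small fragment, area 0.125
    [(6, 0), (7, 0), (8, 0)],            # collinear "line", area 0
]
kept = filter_parts_by_area(cell_parts, min_area=1.0)
```

Only the main cell body survives the filter, which is the behavior wanted before computing morphology features or distances on the resulting polygons.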

MACSima spatialdata-io reader

We describe the features of this new reader for MACSima datasets in spatialdata-io, with support for lazy loading, physical pixel size and imaging cycles, in this GitHub Issue. The draft PR is available here.

Workgroup spatial multi-omics

Spatial multi-omics is an emerging class of technologies that record two or more data modalities from biological samples in a spatial context. Modalities can include, among others, RNA, protein, epigenetic features such as chromatin accessibility, and histopathological stains. To get a better overview of the field, we collected available datasets and methods. In addition, we tried to generate reliable in silico spatial multi-omics data.

True multi-omic datasets that record multiple modalities of the same cells are rare, which motivates our subproject on multi-slice alignment via image registration and integration algorithms.

Cell morphology, which is revealed by classical staining methods, is a potentially very rich source of information that complements spatial transcriptomic assays like Visium and Xenium. Recently developed vision models allow unsupervised extraction of morphological features, which can then be used for clustering and data integration tasks. During the hackathon, we evaluated general-purpose models trained on ImageNet as well as UNI [@chen_towards_2024], a model specifically tuned on histopathology.

Alignment of modalities

Multi-modal measurements are usually performed on consecutive slides, which do not align in most cases. In order to perform multi-modal analyses, a correspondence between the measurements is needed. Rigid and affine transforms can help align images between modalities but in real-world cases, the alignment obtained is poor.

We planned on using a publicly available multi-modal dataset to test different alignment strategies. We tried performing simple affine transformations (e.g., scaling and rotation) but found the alignment to be poor. Other non-affine methods are available in the literature (e.g., SLAT, ELD, CAST), but we encountered several issues related to installation and data availability. Despite great promise, the lack of a standard multi-modal spatial object representation ultimately hinders the applicability and downstream analyses of aligned datasets.

Another promising avenue is the use of landmarks to perform alignment in a supervised manner. Spatially resolved technologies such as Xenium offer a single-cell resolution unavailable in previous iterations. However, the classic H&E slide is not necessarily produced as part of the assay, as it is in Visium and Visium HD, and is usually acquired afterward. It is therefore necessary to align the Xenium assay with the H&E slide. This is done by annotating landmarks in both the Xenium data and the H&E image, aligning on those landmarks, and using Napari to visualize the alignment. The spatialdata package allows for the recovery of the spatial coordinates and the resizing of the H&E slide.
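
To make the landmark-based approach concrete, the sketch below fits a least-squares similarity transform (rotation, uniform scale and translation, a subset of the affine family) from paired landmarks. It is a standard-library illustration, not the spatialdata alignment function; the function names and the toy landmarks are hypothetical, and reflections or non-rigid deformations are not modeled.

```python
import cmath

def fit_similarity(src, dst):
    """Least-squares rotation + uniform scale + translation mapping src onto dst.

    Points are encoded as complex numbers x + 1j*y, so the transform is z -> a*z + b.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    z_mean = sum(zs) / len(zs)
    w_mean = sum(ws) / len(ws)
    # Complex linear regression on the centered landmarks.
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den
    b = w_mean - a * z_mean
    return a, b

def apply_similarity(a, b, points):
    """Apply z -> a*z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        w = a * complex(x, y) + b
        out.append((w.real, w.imag))
    return out

# Hypothetical landmarks annotated in both images. The second set is the first
# one scaled by 2, rotated by 90 degrees and shifted by (10, 5).
landmarks_xenium = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
landmarks_he = [(10.0, 5.0), (10.0, 7.0), (8.0, 5.0)]

a, b = fit_similarity(landmarks_xenium, landmarks_he)
scale = abs(a)            # uniform scale factor
angle = cmath.phase(a)    # rotation angle in radians
mapped = apply_similarity(a, b, landmarks_xenium)
```

The residual between `mapped` and the target landmarks gives a quick check of how well a global transform explains the data; large residuals are consistent with the observation above that rigid and affine alignments are often poor on real consecutive sections.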

Collecting datasets and methods

Currently, only a very limited number of solutions are available for multi-omics integration. Newly developed tools are not widely used, lack proper benchmarking and suffer from a limited number of datasets for thorough testing. Here we attempted to collect information on publicly available spatial multi-omics datasets. We also list state-of-the-art computational solutions for horizontal, vertical and diagonal data integration with key details, paying special attention to diagonal (unmatched) integration. An overview of the collected datasets and methods is provided in the supplement.

Data integration

Integration challenges:

  • number of detected features (e.g. RNA-seq VS proteomics)
  • different feature counts, statistical distributions
  • differences in resolution (imaging-based)
  • image alignment/overlay (imaging-based)
  • batch effect
  • technical (heavy data)

Horizontal

Horizontal integration merges the same omic across different datasets. Reasons:

  • 3D maps
  • technical replicates, integrating batches
  • integrating across different technologies

In fact, this is not a true multi-omics integration.

Examples:

  • STAGATE (spatial transcriptomics, consecutive sections, adaptive graph attention auto-encoder)
  • STAligner (spatial transcriptomics datasets, batch effect-corrected embeddings, 3D reconstruction)
  • SpaGCN (spatial transcriptomics, graph convolutional network approach that integrates gene expression, spatial location and histology)
  • PASTE (align and integrate ST data from multiple adjacent tissue sections)
  • SpaceFlow (embedding is continuous both in space and time, Deep Graph Infomax (DGI) framework with spatial regularization)

Vertical

Vertical integration merges data from different omics within the same set of samples (matched integration), using the cell as an anchor. Examples:

  • ArchR
  • MaxFuse (fuzzy smoothed embedding for weakly linked modalities, proteomics, transcriptomics and epigenomics at single-cell resolution on the same tissue section)
  • MultiMAP (nonlinear manifold learning algorithm that recovers a single manifold on which several datasets reside and then projects the data into a single low-dimensional space so as to preserve the manifold structure)
  • Seurat5

Diagonal

Some examples of studies with unmatched integration:

  • SpatialGlue
    • graph neural network with dual-attention mechanism
    • 2 separate graphs to encode data into common embedding space: a spatial proximity graph and a feature graph
  • MEFISTO
    • factor analysis + flexible non-parametric framework of Gaussian processes
    • spatio-temporally informed dimensionality reduction, interpolation, and separation of smooth from non-smooth patterns of variation.
    • different omics, multiple sets of samples (different experimental conditions, species or individuals)
    • each sample is characterized by "view", "group", and by a continuous covariate such as a one-dimensional temporal or two-dimensional spatial coordinate
  • SLAT
    • aligning heterogeneous spatial data across distinct technologies and modalities
    • graph adversarial matching
  • Cross-modality mapping using image varifolds

Additional details on these methods are summarized in supplementary Table 1. A general issue is that these methods are gene-based, which poses challenges for proteomics (and even more for metabolomics). Direct comparison of these tools is not possible because they address different tasks and rely on different working principles.

In silico datasets generation

Due to the limited number of available spatial datasets and their complexity, tools for in silico generation of artificial spatial datasets are becoming more popular. Such tools may be useful for planning experimental designs, selecting a sampling strategy that yields reliable statistics, and benchmarking new tools. Unfortunately, some of the current solutions cause serious technical issues during installation and running. Here we list 3 existing tools for dataset generation:

  • Power analysis for spatial omics
    • tissue scaffold: random-circle-packing algorithm to generate a planar graph
    • attributes on nodes represent cell type assignments
    • the labeling is based on two data-driven parameters (prior knowledge) for a tissue type: the proportions of the k unique cell types, and the pairwise probabilities of each possible cell type pair being adjacent (a k × k matrix)
    • by changing these 2 parameters one should be able to obtain simulations for different tissues and technologies
    • we faced some technical issues while using this tool
  • scDesign3
  • SRTsim (transcriptomics only)

Image Registration

Spatial landmark detection and tissue registration with deep learning: paper and code.

Morphological Feature Extraction

Multiple vision models were evaluated for feature extraction from hematoxylin and eosin (H&E) stains. These include general-purpose vision models available in torchvision, such as ResNet50 and Inception v3, as well as UNI, a dedicated pathology model. PCA was performed on the extracted feature vectors, followed by k-means clustering of the first 10 principal components. Our results show that multiple models successfully extract region-specific features from the images. UNI in particular performed strongly, with clusters closely matching pathologist annotation. It remains to be seen how these features can best be integrated with RNA information for clustering and spatial domain identification.

Morphology clusters from feature extraction

Workgroup cell-cell communication

The goal of the group was to run multiple spatial CCC methods and compare evaluations/visualizations and results. We selected the methods from @armingol_diversification_2024. A more detailed table can be found on the separate GitHub repository of this workgroup.

Results

Methods were implemented and tested on a subset of the MERFISH whole mouse brain data (slice 80) from the Allen Brain Institute.

We obtained results for CCC for the following methods: COMMOT, SpatialDM, MEBOCOST, CellPhoneDB. SpatialDM and CellPhoneDB were run with LIANA+. We also ran SpaTalk but found no LR pairs, as the tool requires that the entire ligand-receptor-TF-target pathway is expressed for an LR pair to be considered, and this was likely not the case in a dataset with 1122 genes. For the other three tools we selected specific LR pairs to compare the results.

Comparison on cell type level

Do the tools identify the same sender and receiver cells that participate in communication?

LIANA+ (CellPhoneDB) and COMMOT find common ligand-receptor pairs. However, among the few cell-type source-target pairs we investigated, there was no consensus. The comparison was qualitative rather than quantitative because the tools differ in output format and evaluation metrics. Because MEBOCOST does not use ligand-receptor interactions like the other tools but instead calculates metabolic communication scores, we could not compare its results directly.

Comparison on spatial level

Where do the tools predict the communication to take place in tissue space? Do spatial methods benefit from the additional modality?

COMMOT and SpatialDM both make use of spatial information to predict the communication events. We investigated the LR pair Nts-Ntsr2; the cells seem to interact in the same brain region (hypothalamus).

Discussion

  • Comparison of results is difficult because i) there is no ground truth regarding CCC, ii) output formats of methods vary, for example SpatialDM returns an $N \times LR$ matrix with a score for each cell indicating the potential strength of a ligand or receptor, whereas COMMOT returns an $N \times N$ matrix for each $LR$ interaction, and iii) the score metrics differ.
  • A more systematic comparison should be carried out over all cell types and all ligand-receptor pairs.
  • The methods rely on different input databases: the communication analysis may be metabolic or ligand-receptor based, and even among LR-based methods either the CellPhoneDB or the CellChat database may be used.
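
The output-format mismatch described in the first point can be made concrete with a small sketch. It shows one possible way, not used during the hackathon, to bring an N x N matrix for a single LR pair to the per-cell level so that it can be set against one column of an N x LR matrix. Everything here is an assumption for illustration: rows are taken to be senders and columns receivers, the scores are toy values, and the top-k overlap is only one of many possible agreement measures.

```python
def received_signal(sender_receiver):
    """Column sums of an N x N sender-by-receiver matrix: signal received per cell."""
    n = len(sender_receiver)
    return [sum(sender_receiver[i][j] for i in range(n)) for j in range(n)]

def top_k(scores, k):
    """Indices of the k highest-scoring cells."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Overlap of two index sets, between 0 (disjoint) and 1 (identical)."""
    return len(a & b) / len(a | b)

# Toy N x N output for one LR pair (N = 4 cells).
nxn_scores = [
    [0, 2, 0, 0],
    [0, 0, 4, 0],
    [0, 1, 0, 0],
    [0, 0, 0.5, 0],
]
# Toy column of an N x LR output for the same LR pair.
per_cell_scores = [0.1, 0.9, 0.2, 0.8]

received = received_signal(nxn_scores)
overlap = jaccard(top_k(received, 2), top_k(per_cell_scores, 2))
```

Because the scores of different tools are on different scales, a rank-based or set-based comparison like this avoids comparing raw values directly, although it still does not replace a ground truth.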

Conclusions

This hackathon was attended by 37 participants from many institutes across Europe. It provided a useful venue for the exchange of ideas and the development of new tools and methods for spatial omics data analysis. Status updates and results were summarized in a slide deck. A project board collected all task items and a Zulip stream was used for communication. Code to use the provided computational resources and some of the hackathon results are available in this code repository.

Acknowledgements

The hackathon was organized by the Saeys Lab and supported by Data Intuitive, the VIB Spatial Catalyst and the VIB Center for AI and Computational Biology.

The computational resources and services used in this work were provided by the VIB Data Core and the VSC (Flemish Supercomputer Center), funded by the Research Foundation – Flanders (FWO) and the Flemish Government. B.R., R.S. and Y.S. are supported by the Flanders AI Research Program.

References