Clustering reveals different cell subpopulations but these differences disappear upon integration using reciprocal PCA #6640

jjacob12 · 2022-10-16T16:34:01Z

Hello,
I wonder if I can get your insights into a problem when clustering cells grown in vitro in an un-integrated vs integrated workflow.
If I run either the conventional pipeline (e.g. the Seurat Guided Clustering Tutorial) or the SCTransform variation, a neuronal subtype of interest forms just one cluster in the control/healthy condition (condition 1) and three clusters in another condition that contains both cancer cells and healthy cells (condition 2). To explain the difference, I've hypothesised that some of the cancer cells differentiate in condition 2 and start to superficially resemble the healthy cell subtype. Here are a couple of images: the top one is the control (condition 1) and the bottom is the test (condition 2). NEUROD1 is the gene used to mark the subtype of interest:

For both condition 1 and condition 2, the resolution was the same (resolution = 0.5) in the FindClusters() call.
To see whether these different cell identities could be resolved in an integrated workflow, I subsetted the Seurat object from the individual analysis of each condition to the clusters of interest and attempted to integrate these clusters by running the following. I tried this with and without regressing out the cell cycle difference between G2M and S, which was apparent in an exploratory analysis; it made little difference to the final DimPlot result:

# load the objects to be integrated
neurod1.ons76cbo <- readRDS("outputs50K/ONS76-CBO/neurod1Clusts.NoCCregress.ons76cbo.rds")

neurod1.cbo <- readRDS("outputs50K/CBO/neurod1Clust.NoCCregress.cbo.rds")

neurod1.list <- list(neurod1.cbo, neurod1.ons76cbo)

# normalize and identify variable features for each dataset independently
neurod1.list <- lapply(neurod1.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration; run PCA on each
# dataset using these features
features <- SelectIntegrationFeatures(object.list = neurod1.list)

neurod1.list <- lapply(neurod1.list, FUN = function(x){
  x <- ScaleData(x, features = features, verbose = FALSE)
  x <- RunPCA(x, features = features, verbose = FALSE)
})

# here, use reciprocal PCA (rpca) with k.anchor = 5 so that biologically different cells
# don't cluster together.
neurod1.anchors <- FindIntegrationAnchors(object.list = neurod1.list,
                                          anchor.features = features,
                                          reduction = "rpca",
                                          k.anchor = 5)   # also tried k.anchor = 20, and not much difference
neurod1.combined <- IntegrateData(anchorset = neurod1.anchors)

# switch the default assay
DefaultAssay(neurod1.combined) <- "integrated"

On the neurod1.combined object I ran the standard commands for clustering and visualisation:

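# assign each result back to the object; otherwise the output of each step is discarded
neurod1.combined <- ScaleData(neurod1.combined, verbose = FALSE)
neurod1.combined <- RunPCA(neurod1.combined, npcs = 30, verbose = FALSE)
neurod1.combined <- RunUMAP(neurod1.combined, reduction = "pca", dims = 1:30)
neurod1.combined <- FindNeighbors(neurod1.combined, reduction = "pca", dims = 1:30)
neurod1.combined <- FindClusters(neurod1.combined, resolution = 1) # I also tried res = 0.4 - no difference.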

I then got this result (red dots = condition 1; blue dots = condition 2):

Based on the clustering of the individual samples for NeuroD1 shown above, I was expecting to see all red dots and some blue dots intermingled forming a cluster (condition 1 and 2 both contain healthy cells) and a separate cluster of only blue dots (representing cancer cells from condition 2 that differentiated). Maybe that's a naive expectation!

I'm not sure if this workflow is appropriate (pulling out clusters from different conditions and trying to integrate them, with repeat normalisation, scaling, etc of the integrated object). I also ran this with CCA instead of RPCA, and as expected there was again no difference.
Incidentally, I also tried integrating the entirety of the cell populations from both condition 1 and condition 2, not just specific clusters of interest, and then sub-clustered the NeuroD1 expressing cluster in the integrated object, but this too did not reveal the partitioning of cells expressing the marker according to their condition of origin.

Hope I have explained this sufficiently well!
Thanks in advance.
John

f6v · 2022-10-17T09:19:43Z

The key question is: why do you think integration is necessary? There are sub-questions to that:

How many biological samples are there in each condition?
Were the samples sequenced and prepared on the same day?
Do the cells from different batches cluster by batch or by cell type?
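
The last question can be put in numbers by cross-tabulating cluster labels against condition of origin. The following is a minimal base R sketch on made-up metadata. With a Seurat object, the per-cell metadata would typically come from obj@meta.data, where FindClusters() writes a seurat_clusters column. The `condition` column and the helper name `condition_mix_by_cluster` are assumptions for illustration.

```r
# Fraction of each cluster's cells that comes from each condition.
# `meta` is a per-cell data frame; column names are passed as strings.
condition_mix_by_cluster <- function(meta,
                                     cluster_col = "seurat_clusters",
                                     condition_col = "condition") {
  counts <- table(cluster = meta[[cluster_col]],
                  condition = meta[[condition_col]])
  # margin = 1 normalises within each cluster (row), so rows sum to 1
  prop.table(counts, margin = 1)
}

# Hypothetical metadata for eight cells (placeholder values, not real data)
meta <- data.frame(
  seurat_clusters = c("0", "0", "0", "1", "1", "1", "2", "2"),
  condition = c("cond1", "cond2", "cond1", "cond2", "cond2", "cond2",
                "cond1", "cond2")
)

mix <- condition_mix_by_cluster(meta)
print(round(mix, 2))
```

A cluster whose row is close to 0 or 1 for one condition is dominated by that condition. Rows near the overall sample proportions suggest the cells are mixing by cell type rather than by sample.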


jjacob12 · 2022-10-17T10:47:34Z

Author

Hi,
Thanks very much for getting back to me. In response:

- Reason for integration: To decide which of the three clusters in condition 2 are healthy cells. I assumed the healthy cells in condition 2 would cluster with the healthy cells in the control condition 1 and the cancer cells in condition 2 would form a separate cluster(s). Is there another way apart from integration of telling which of the 3 clusters in condition 2 is the healthy one? I used a Venn diagram approach to check how many genes there were in common between each cluster in condition 2 versus the single cluster in condition 1. However, that seems a little unsophisticated perhaps??
- For each condition there is only 1 sample with a couple of thousand cells each approximately.
- All the samples were prepped and sequenced on the same day in the same 10X run.
- All samples were processed as part of the same, single batch as explained above. In a separate integration with another sample (3-sample integration) cells clustered by cell type.

Best,
John

Dr John Jacob
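
The Venn diagram comparison described above can be condensed into one overlap score per cluster, for example the Jaccard index (shared genes divided by all genes in either list). This is a minimal base R sketch, not the method used in the thread. The gene lists are placeholders; in practice they might be marker lists from something like FindMarkers(), which is an assumption here.

```r
# Jaccard index between two gene sets: |A intersect B| / |A union B|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)  # undefined for two empty sets
  length(intersect(a, b)) / length(u)
}

# Hypothetical marker lists (placeholder gene names, not results from the data)
control_markers <- c("NEUROD1", "GENE_A", "GENE_B", "GENE_C")
test_markers <- list(
  cluster_a = c("NEUROD1", "GENE_A", "GENE_B", "GENE_D"),
  cluster_b = c("NEUROD1", "GENE_E", "GENE_F"),
  cluster_c = c("GENE_G", "GENE_H")
)

# One score per condition 2 cluster, each compared with the control cluster
scores <- sapply(test_markers, jaccard, b = control_markers)
print(round(scores, 2))
```

The cluster with the highest score shares the most markers with the control cluster, relative to list size. The score depends on how the marker lists were thresholded, so it is a rough ranking rather than a test.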


f6v · 2022-10-19T07:02:08Z

It's a bit hard to say without experimenting with the data, so take everything with a grain of salt. I think there's a risk of erasing biological variation when using data integration. And 1 vs 1 sample is tricky: I've seen datasets where two samples from the same group end up very different from each other. My point is that this is like an underdetermined system.
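
Since both samples came from the same 10X run, one way to check whether integration is removing real differences is to cluster the two objects together after a plain merge, with no anchor-based correction. The following is a rough sketch of that idea, not a recommended pipeline. It assumes Seurat is installed and loaded, that both inputs are Seurat objects whose default assay holds raw counts, and that 30 PCs are appropriate. The function name `cluster_without_integration` and the `condition` metadata column are hypothetical.

```r
# Sketch only: joint clustering of two samples WITHOUT integration.
# Assumes library(Seurat) has been called and the default assay is e.g. "RNA".
cluster_without_integration <- function(obj1, obj2,
                                        labels = c("condition1", "condition2"),
                                        dims = 1:30, resolution = 0.5) {
  # Record condition of origin so it can be used for colouring later
  obj1$condition <- labels[1]
  obj2$condition <- labels[2]

  # Plain merge: concatenates the cells, applies no batch correction
  merged <- merge(obj1, y = obj2, add.cell.ids = labels)

  # Standard single-dataset workflow on the merged object
  merged <- Seurat::NormalizeData(merged, verbose = FALSE)
  merged <- Seurat::FindVariableFeatures(merged, selection.method = "vst",
                                         nfeatures = 2000, verbose = FALSE)
  merged <- Seurat::ScaleData(merged, verbose = FALSE)
  merged <- Seurat::RunPCA(merged, npcs = max(dims), verbose = FALSE)
  merged <- Seurat::RunUMAP(merged, reduction = "pca", dims = dims,
                            verbose = FALSE)
  merged <- Seurat::FindNeighbors(merged, reduction = "pca", dims = dims,
                                  verbose = FALSE)
  merged <- Seurat::FindClusters(merged, resolution = resolution,
                                 verbose = FALSE)
  merged
}

# Example call with the objects loaded earlier in the thread:
# merged <- cluster_without_integration(neurod1.cbo, neurod1.ons76cbo)
# DimPlot(merged, reduction = "umap", group.by = "condition")
```

If condition-specific clusters appear here but vanish after rPCA or CCA integration, that would be consistent with the integration step aligning away the difference, although with one sample per condition it cannot be separated from sample-to-sample variation.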

> Is there another way apart from integration of telling which of the 3 clusters in condition 2 is the healthy one?

This depends on the cancer type, but some people use other software to infer CNVs or other mutations, so you could distinguish cancer and healthy cells based on that. One example is inferCNV (http://www.bioconductor.org/packages/devel/bioc/vignettes/infercnv/inst/doc/inferCNV.html), but I haven't used it myself.


jjacob12 · 2022-10-20T08:40:53Z

Author

Hi @f6v,
Thanks for your explanation. I omitted to mention that I already tried inferCNV, but the results were not as clean as expected. This may be because the number of cancer cells was not that high, but that is just speculation.
All the best,
John


yuhanH · 2022-11-04T19:18:02Z

Collaborator

Hi @jjacob12
This issue is unrelated to Seurat technical issues. I will move it to the discussion section.
