Elaborate FindMarkers() and AverageExpression() for Seurat v4 #4210

tulikakakati · 2021-03-01T07:32:18Z

tulikakakati
Mar 1, 2021

Hello,

I am using Seurat v4 to integrate two disease samples and find differentially expressed genes between the two samples for one particular cell type. I am confused about how Seurat calculates log2FC. Can anyone help me understand the basic steps in the example below?

I followed the steps from the “Introduction to scRNAseq Integration” Vignette on the Seurat website to find DE genes. In particular, here are the functions that I used:

CreateSeuratObject() -> SCTransform() -> ScaleData() -> FindVariableFeatures() -> SelectIntegrationFeatures() -> FindIntegrationAnchors() -> IntegrateData() -> ScaleData() -> RunPCA() -> RunUMAP() -> FindNeighbors() -> FindClusters() -> FindConservedMarkers().

Now, after clustering and finding the cell-type markers for each celltype, I want to find marker genes that are differentially expressed between the two samples for cell type B.

I used FindMarkers() like this:
B_response <- FindMarkers(sample.list, ident.1 = id1, ident.2 = id2, verbose = FALSE)

The top 2 genes output for this cell type are:
gene   p_val     avg_log2FC   pct.1  pct.2  p_val_adj
geneA  4.32E-11  79.1474718   0.97   0.919  8.22E-07
geneB  8.98E-11  7.075509727  0.537  0.149  1.71E-06

I thought that the log2FC of 79 was very high, so I wanted to see the average expression values for these two samples in this cell type.

I used AverageExpression() like this:

avg.t.cells <- AverageExpression(t.cells,slot='counts',use.counts=TRUE,return.seurat=TRUE)

For sample#1 and the B cell type and geneA, the average expression is 2.90027283

For sample#2 and the B cell type and geneA, the average expression is 1.79175947

Usually, to calculate the log2FC from the average expression values, it would be something like this:

log2(avg_AC / avg_HC) = log2( 2.90027283 / 1.791775947) = log2 (1.61867) = 0.6948

So, I am confused as to why it is a number like 79.1474718?

Please explain how you calculate the avg_log2FC? Also, can you confirm that the steps given above for finding cell type clusters are correct?

Best,
Tulika

Answered by saketkc

Mar 11, 2021

saketkc · 2021-03-05T18:36:02Z

saketkc
Mar 5, 2021
Maintainer

We recommend that FindMarkers be run on the RNA assay and not the integrated assay (which I am assuming is the source of the discrepancy here). Can you confirm whether you are running FindMarkers after setting `DefaultAssay(obj) <- "RNA"`?

It might help to paste here the code you are using.

tulikakakati · 2021-03-11T18:34:11Z

tulikakakati
Mar 11, 2021
Author

Hi @saketkc ,

Thank you for your reply.
We set DefaultAssay(obj) <- "RNA" before running FindMarkers() to find the marker genes for each cell type.

We tested two different approaches using Seurat v4:

1. Use LogNormalize on each sample before integrating the samples. After integration, we set the default assay to "RNA" to find the marker genes for each cell type. The log2FC values are within the range of -2 to 2 for most of the top genes.
2. Create a Seurat object with the counts of the three samples, run SCTransform() on that object, then integrate the samples. After integration, we set the default assay to "RNA" to find the marker genes for each cell type. The log2FC values look very odd for most of the top genes, as shown in the post above.

We feel that there is a problem with SCTransform(). Are we doing something wrong? Should we stick with LogNormalize if we are doing differential expression for integrated samples?

Thank you,
Tulika

saketkc · 2021-03-11T19:14:09Z

saketkc
Mar 11, 2021
Maintainer

Thanks for your response.

Your second approach is correct (so is the first; also see: #4000). I am not able to reproduce the discrepancy in log2FC. Can you share a reproducible example? You can use a subset of your data or any of the public datasets available in SeuratData.

Also, the workflow you mentioned in your first comment is different from what we recommend. There is no ScaleData step in the SCT workflow and it uses PrepSCTIntegration (not clear from your original post if you are using this workflow).

tulikakakati · 2021-03-11T20:08:55Z

tulikakakati
Mar 11, 2021
Author

Hi @saketkc ,

Thank you for your prompt reply.
Before we dive into log2FC and average expression values, can you please look if I have followed the correct steps for integration of 3 samples using SCTransform?
############################################
data1 <- Read10X(data.dir = "data1/filtered_feature_bc_matrix")
data2 <- Read10X(data.dir = "data2/filtered_feature_bc_matrix")
data3 <- Read10X(data.dir = "data3/filtered_feature_bc_matrix")
colnames(data1)=paste0('disease1-', colnames(data1))
colnames(data2)=paste0('disease2-', colnames(data2))
colnames(data3)=paste0('disease3-', colnames(data3))
d1 <- CreateSeuratObject(counts = data1, project = "Data1")
d2 <- CreateSeuratObject(counts = data2, project = "Data2")
d3 <- CreateSeuratObject(counts = data3, project = "Data3")

combined_counts=cbind(d1[["RNA"]]@counts,d2[["RNA"]]@counts,d3[["RNA"]]@counts)

seurat_obj=CreateSeuratObject(counts= combined_counts, min.cells = 3, project = "d1vsd2vsd3")
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^MT-")
seurat_obj <- SCTransform(seurat_obj, method = "glmGamPoi", vars.to.regress = "percent.mt", verbose = FALSE)
seurat_obj <- ScaleData(object = seurat_obj, vars.to.regress = c("nCount_RNA", "percent.mt"), verbose = TRUE)
seurat_obj <- SplitObject(seurat_obj, split.by = "orig.ident")
for (i in 1:length(seurat_obj)) {
seurat_obj[[i]] <- FindVariableFeatures(seurat_obj[[i]], selection.method = "vst", nfeatures = 2000)
}
seurat_features <- SelectIntegrationFeatures(object.list = seurat_obj, nfeatures = 3000)
seurat_anchors <- FindIntegrationAnchors(object.list = seurat_obj, dims = 1:20, anchor.features = seurat_features, verbose = TRUE)
seurat_obj <- IntegrateData(anchorset = seurat_anchors, dims = 1:20,verbose=TRUE)
DefaultAssay(seurat_obj) <- "integrated"
seurat_obj<- ScaleData(seurat_obj, verbose = FALSE)
seurat_obj <- RunPCA(seurat_obj, npcs = 30, verbose= FALSE)
seurat_obj <- RunUMAP(seurat_obj, reduction = "pca", dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = "pca", dims = 1:20)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
DefaultAssay(seurat_obj) <- "RNA"
clusters=as.numeric(levels(Idents(seurat_obj)))
for (i in 1:length(clusters)){
id=clusters[i]
cluster1.markers <- FindConservedMarkers(seurat_obj, ident.1 = id, grouping.var = "orig.ident", verbose = TRUE,min.pct = -0.25)
}

We used DotPlot() to map the conserved marker genes to particular cell types.

seurat_obj <- RenameIdents(seurat_obj, `0` = "Naive CD4+ T", `1` = "CD8+ T", `2` = "Naive CD4+ T", `3` = "Memory CD4+", `4` = "Undefined", `5` = "CD14+ Mono", `6` = "NK",
`7` = "CD8+ T", `8` = "DC", `9` = "B", `10` = "Undefined", `11` = "Undefined", `12` = "FCGR3A+ Mono", `13` = "Platelet", `14` = "DC")
clusters=as.character(levels(Idents(seurat_obj)))

seurat_obj$celltype.orig.ident <- paste(Idents(seurat_obj), seurat_obj$orig.ident, sep = "_")
seurat_obj$celltype <- Idents(seurat_obj)
Idents(seurat_obj) <- "celltype.orig.ident"
for (i in 1:length(clusters)){
id=clusters[i]
id1=sprintf("%s_d1",clusters[i])
id2=sprintf("%s_d2",clusters[i])
cluster1.markers <- FindMarkers(seurat_obj, ident.1 = id1, ident.2 = id2, min.pct = 0.25)
write.table(cluster1.markers,paste0("d1_vs_d2_DE_marker_genes_cellcluster",id,".csv"), sep=",",col.names=NA)
}

Thanks,

Tulika

saketkc · 2021-03-11T21:00:14Z

saketkc
Mar 11, 2021
Maintainer

Hi,

There are a bunch of things happening in your code which do not look correct. If you have three objects to start off with, you can follow these steps before proceeding with integration:

data1 <- Read10X(data.dir = "data1/filtered_feature_bc_matrix")
data2 <- Read10X(data.dir = "data2/filtered_feature_bc_matrix")
data3 <- Read10X(data.dir = "data3/filtered_feature_bc_matrix")

d1 <- CreateSeuratObject(counts = data1, project = "Data1")
d1$disease <- "D1"
d2 <- CreateSeuratObject(counts = data2, project = "Data2")
d2$disease <- "D2"
d3 <- CreateSeuratObject(counts = data3, project = "Data3")
d3$disease <- "D3"
# if you have metadata, you can use `AddMetaData()` instead

dataset.combined <- merge(d1, y = c(d2, d3), add.cell.ids = c("D1", "D2", "D3"), project = "D1D2D3")

object.list <- SplitObject(dataset.combined, split.by = "disease")
object.list <- lapply(X = object.list, FUN = SCTransform)

You can then proceed with object.list analogous to ifnb.list in this vignette

4 replies

tulikakakati Mar 15, 2021
Author

Hi @saketkc,

Thank you for the detailed code. I tried these steps for SCTransform, but I am still confused by the results.
For example, using LogNormalize (approach 1), the log2FC value of one of the top genes, gene A, is 1.4923. But when I use the code for SCTransform (approach 2), the log2FC value of gene A is 79.11711.

Can you please explain why the log2FC values are higher for SCTransform than for LogNormalize? Can you also explain, with a suitable example, how the results of Seurat's AverageExpression() and FindMarkers() are calculated?

Thank you,

Best,
Tulika

saketkc Mar 15, 2021
Maintainer

It's hard to guess what is going on without looking at the code.

AverageExpression uses the "data" slot by default (which for RNA assay would store log1p(counts)). Since you did not run LogNormalize here, you can specify slot="counts" here to calculate average expression ( with assay="RNA").
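
To make the slot behaviour concrete, here is a minimal base-R sketch (no Seurat required) of the per-gene averaging. It assumes the behaviour described in the Seurat v4 documentation: with slot = "data" the values are treated as log-normalized and passed through expm1() before averaging, while with slot = "counts" a plain mean is taken. The helper avg_expr_sketch and the toy numbers are hypothetical and only for illustration.

```r
# Sketch of per-gene averaging for one group of cells.
# avg_expr_sketch is a hypothetical helper, not a Seurat function.
# Assumed Seurat v4 behaviour:
#   slot = "data"   -> mean(expm1(x)), i.e. averaging in non-log space
#   slot = "counts" -> mean(x)
avg_expr_sketch <- function(x, slot = c("data", "counts")) {
  slot <- match.arg(slot)
  if (slot == "data") mean(expm1(x)) else mean(x)
}

counts  <- c(0, 2, 5, 1)   # toy raw counts for one gene
lognorm <- log1p(counts)   # toy log1p values (no library-size scaling)

avg_counts <- avg_expr_sketch(counts, slot = "counts")
avg_data   <- avg_expr_sketch(lognorm, slot = "data")
print(c(avg_counts = avg_counts, avg_data = avg_data))
```

In this toy case the two averages agree because the "data" values are exactly log1p(counts). In a real object, NormalizeData() also rescales by library size, so the two slots will generally give different numbers.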

For FindMarkers, you could run it on the RNA assay (even though you use SCT for the rest of the steps); it uses the data slot by default. You would want to do something like this:

DefaultAssay(object = obj) <- "RNA"
obj <- NormalizeData(object = obj, normalization.method = "LogNormalize", scale.factor = 10000)
all.genes <- rownames(x = obj)
obj <- ScaleData(object = obj, features = all.genes)
obj.markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

Another option is to run FindMarkers on the Pearson residuals themselves (stored in slot = "scale.data" of assay = "SCT").
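
As a rough illustration of why the slot contents matter, the sketch below reimplements in base R what I understand to be the default avg_log2FC calculation in Seurat v4 for slot = "data": each group's values are exponentiated with expm1(), averaged, a pseudocount of 1 is added, and the difference of the log2 values is reported. The helper log2fc_sketch and the toy counts are hypothetical. One possible explanation (not confirmed in this thread) for values like 79 is that the data slot still held raw counts when FindMarkers was run.

```r
# Sketch of the default avg_log2FC for slot = "data" in Seurat v4.
# log2fc_sketch is a hypothetical helper; pseudocount = 1 and base = 2
# are assumed to match the v4 defaults.
log2fc_sketch <- function(x1, x2, pseudocount = 1, base = 2) {
  log(mean(expm1(x1)) + pseudocount, base = base) -
    log(mean(expm1(x2)) + pseudocount, base = base)
}

# Toy raw counts for one gene in two groups of cells
counts_g1 <- c(0, 3, 60)
counts_g2 <- c(0, 2, 5)

# Case 1: values are log-transformed first, as the formula expects
fc_lognorm <- log2fc_sketch(log1p(counts_g1), log1p(counts_g2))

# Case 2: raw counts are passed as if they were already log-normalized
fc_raw <- log2fc_sketch(counts_g1, counts_g2)

print(c(fc_lognorm = fc_lognorm, fc_raw = fc_raw))
```

With log-transformed input the result stays in a familiar range of a few log2 units. With raw counts, expm1() blows up the largest count, and the reported fold change lands in the tens. Note also that this is a difference of log2 averages taken in non-log space, which is not the same as log2 of the ratio of two values returned by AverageExpression() on a different slot.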

pabloatria18 Aug 24, 2021

Hello @saketkc
I've been reading this thread because I have had similar issues and questions.
I have two datasets on which I performed SCT and integration.

Now I want to run DE between both conditions, but I am unsure how to do it.
I know it has to be in the RNA slot, so I am running this:

NormalizeData(object = my.integrated, assay = "RNA")
DefaultAssay(my.integrated) <- "RNA"

a.cells <- subset(integrated, idents = "A Cells")
Idents(a.cells) <- "group"
avg.a.cells <- as.data.frame(log1p(AverageExpression(a.cells, verbose = FALSE)$RNA))
Here I get this error:

Warning message:
In PseudobulkExpression(object = object, pb.method = "average", :
Exponentiation yielded infinite values. data may not be log-normed.

Can you please advise me?

Thank you
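
A note on the warning quoted above: a minimal base-R sketch of how exponentiation can yield infinite values. It assumes, as the warning text suggests, that the averaging step applies expm1() to the data slot. In the snippet above the result of NormalizeData() is not assigned back to an object, so the data slot may still hold unnormalized values; that is a guess from the posted code, not something confirmed in the thread. The toy numbers are made up.

```r
# expm1() overflows to Inf in double precision for inputs above roughly 709,
# so a single large unnormalized value is enough to make the average infinite.
raw_counts <- c(1, 4, 800)                      # toy raw counts, not log-normalized

avg_raw     <- mean(expm1(raw_counts))          # treats raw counts as log values
avg_lognorm <- mean(expm1(log1p(raw_counts)))   # round-trips to the plain mean

print(is.infinite(avg_raw))
print(avg_lognorm)
```

If this is what happened, assigning the result of NormalizeData() back to the object before calling AverageExpression() would be the first thing to check.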

saketkc Aug 24, 2021
Maintainer

If you want to do DE on the a.cells, you should be able to do (I use the SCT data slot here which has corrected counts - no effect of library size):

DefaultAssay(a.cells) <- "SCT"
x_markers <- FindMarkers(a.cells, ident.1="x", slot="data")
