-
I have read the related discussions, e.g. #4000, #1256, #1900, #1659, and elsewhere online. I understand that the current recommendation from the Seurat authors is that differential expression (DE) analysis should NOT be performed on the integrated data, but on the original RNA data, with or without log normalization depending on the DE algorithm used. I have also read the related discussions in "Multi-Sample Single-Cell Analyses with Bioconductor" and understand the issues with using the corrected gene expression values obtained after integration.

For within-dataset (batch) DE, e.g. to identify the marker genes of the common clusters identified with data integration, we can indeed go back to the non-integrated RNA data and use either a P-value-combination-based meta-analysis approach (like in

However, for across-dataset (batch) comparisons (one example is in the Seurat "Introduction to scRNA-seq integration" vignette, where the task is to compare the control and stimulated B cells, which come from separate datasets/batches), many authors still recommend using the uncorrected expression values (e.g. Section "Identify differential expressed genes across conditions" of the Seurat vignette and Section 8.9 of OSCA). The reasons given are that "correction will inevitably introduce artificial agreement across batches", "removal of biological differences between batches in the corrected data is unavoidable", and "integration procedure inherently introduces dependencies between data points, which violates the assumptions of the statistical tests used for DE" (references here, #4000 and #1256). But as far as I understand, the uncorrected values still contain the original batch effects intrinsic to the data, and if the datasets/batches are very different, such a comparison would be meaningless (i.e. batch and biological differences are completely confounded). Am I missing anything?

Does this recommendation simply represent the best possible analysis one can attempt for such across-dataset comparison tasks, without ever being intended to work in all cases? Could there be a better workaround for across-dataset DE to extract biological differences, depending on the downstream analysis? Along these lines, I am wondering whether post-integration corrected data values could be used, but in a cautious way. Would the use of corrected values be less of a concern if one is not interested in interpreting DE P values or in an exact estimate of DE fold-changes, but only in the relative DE effect sizes across genes (e.g. to be used for downstream gene set enrichment or pathway-level analysis that is rank-based)? To pose the question the other way around: other than the so-called cell-based analyses (e.g. clustering, trajectory inference, etc.), what types of downstream analysis can the corrected values reasonably be used for? (I am specifically referring to the gene/feature-level corrected values rather than dimension-reduced embeddings.) In terms of Seurat

It would be great to hear your thoughts. Thanks.
Replies: 2 comments 1 reply
-
If batch and biological condition are completely confounded, there's not much you can do about that and using integrated values for DE is not going to help you. You can hope that the batch variation is smaller than the biological variation due to the different conditions, and validate any findings with other lines of evidence. Having replicates (multiple batches per condition) is a better design.
No, not necessarily. The OSCA book chapter you linked to gives a very good explanation and I would recommend limiting the use of integrated data values to cell-based analyses.
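To make the point about complete confounding concrete, here is a minimal sketch in plain Python (not Seurat code). It builds a toy design matrix with an intercept, a condition column, and batch indicators, then computes its rank. The sample layouts and the helper names `matrix_rank` and `design_matrix` are hypothetical and only for illustration. When each condition lives in its own batch, the condition and batch columns are identical, so no model can separate the two effects.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_matrix(samples):
    """Columns: intercept, condition == 'stim', one indicator per non-reference batch."""
    batches = sorted({batch for _, batch in samples})
    return [
        [1, int(cond == "stim")] + [int(batch == other) for other in batches[1:]]
        for cond, batch in samples
    ]

# Hypothetical layouts: (condition, batch) per sample.
confounded = [("ctrl", "A"), ("ctrl", "A"), ("stim", "B"), ("stim", "B")]
crossed = [("ctrl", "A"), ("stim", "A"), ("ctrl", "B"), ("stim", "B")]

confounded_rank = matrix_rank(design_matrix(confounded))  # 3 columns, but rank-deficient
crossed_rank = matrix_rank(design_matrix(crossed))        # 3 columns, full rank

print(confounded_rank, crossed_rank)
```

In the confounded layout the design has three columns but fewer independent ones, so the condition effect cannot be estimated separately from the batch effect, whichever expression values (integrated or not) are fed in. The crossed layout is included only as a contrast; it is an assumption of this sketch, not something described in the thread.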
-
If we have more than one sample per condition, we could perform batch effect correction at the pseudobulk level, as we would in a bulk RNA-seq analysis.
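As a rough illustration of the pseudobulk idea, the sketch below sums raw counts per sample, converts them to a simple log2 counts-per-million, and averages the within-batch stim-vs-ctrl differences. It is plain Python on made-up counts, not Seurat or a bulk RNA-seq package. It assumes that every batch contains both conditions, in which case the paired difference plays the same role as adding batch as a covariate in a balanced design. All names (`pseudobulk`, `log_cpm`, `batch_adjusted_logfc`, `GENE_A`, `GENE_B`) are hypothetical.

```python
import math
from collections import defaultdict

# Toy per-cell raw counts: (batch, condition, {gene: count}). All values are made up.
# batch2 is sequenced roughly 10x deeper than batch1 to mimic a batch effect.
cells = [
    ("batch1", "ctrl", {"GENE_A": 2, "GENE_B": 10}),
    ("batch1", "ctrl", {"GENE_A": 1, "GENE_B": 12}),
    ("batch1", "stim", {"GENE_A": 8, "GENE_B": 11}),
    ("batch1", "stim", {"GENE_A": 9, "GENE_B": 9}),
    ("batch2", "ctrl", {"GENE_A": 20, "GENE_B": 100}),
    ("batch2", "ctrl", {"GENE_A": 10, "GENE_B": 120}),
    ("batch2", "stim", {"GENE_A": 80, "GENE_B": 110}),
    ("batch2", "stim", {"GENE_A": 90, "GENE_B": 90}),
]

def pseudobulk(cells):
    """Sum raw counts over all cells belonging to the same (batch, condition) sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for batch, cond, counts in cells:
        for gene, n in counts.items():
            totals[(batch, cond)][gene] += n
    return totals

def log_cpm(counts):
    """Simple log2 counts-per-million with a small offset to avoid log(0)."""
    library_size = sum(counts.values())
    return {g: math.log2((n + 0.5) / (library_size + 1) * 1e6) for g, n in counts.items()}

def batch_adjusted_logfc(cells):
    """Mean within-batch stim-vs-ctrl difference in log-CPM for each gene."""
    bulk = {sample: log_cpm(counts) for sample, counts in pseudobulk(cells).items()}
    batches = sorted({batch for batch, _ in bulk})
    genes = sorted(next(iter(bulk.values())))
    result = {}
    for gene in genes:
        diffs = [bulk[(b, "stim")][gene] - bulk[(b, "ctrl")][gene] for b in batches]
        result[gene] = sum(diffs) / len(diffs)
    return result

bulk_counts = pseudobulk(cells)
logfc = batch_adjusted_logfc(cells)
print(logfc)
```

Because each difference is taken inside one batch, an additive per-batch shift on the log scale cancels out. With real data, one would more likely pass the pseudobulk counts to an established bulk RNA-seq framework with batch in the design, and this only works when batch and condition are not fully confounded.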