Skip to content

Latest commit

 

History

History
140 lines (107 loc) · 5.86 KB

2.QC.md

File metadata and controls

140 lines (107 loc) · 5.86 KB

Basic Pipeline for scRNAseq Data Analysis : QC

Instructors : Somi Kim, Eunseo Park, Donggon Cha 2021/07/06

Removal of low quality cells

To preprocess liver cancer scRNA-seq data, basic information of each cell of experimental design is necessary. Given sample information is loaded and organized. First, we set sample ID per each patient from its sample name to include liver cancer subtype information. We also set Diagnosis as liver cancer subtypes which are HCC or iCCA in this data.

library(DropletUtils)
library(dplyr)
library(scater)

dir = "/BiO/home/edu4/data/afterDU"
sampleInfo <- read.table(paste0(dir, "/samples.txt"), header=TRUE, sep="\t")
head(sampleInfo)
##          Sample       Cell.Barcode Type
## 1 S02_P01_LCP21 AAACCTGAGGCGTACA-1  CAF
## 2 S02_P01_LCP21 AAACGGGAGATCGATA-1  CAF
## 3 S02_P01_LCP21 AAAGCAAAGATCGGGT-1  CAF
## 4 S02_P01_LCP21 AAATGCCGTCTCAACA-1  CAF
## 5 S02_P01_LCP21 AACACGTCACGGCTAC-1  TEC
## 6 S02_P01_LCP21 AACCGCGAGACGCTTT-1  CAF
sampleInfo$ID <- sapply(sampleInfo$Sample %>% as.character(), function(x) {strsplit(x, split="_")[[1]][3]}) 
sampleInfo$ID <- gsub("LCP", "", sampleInfo$ID)

orig.ids = sampleInfo$ID %>% as.factor() %>% levels()
new.ids = c("H18", "H21", "H23", "C25", "C26", "H28", 
            "C29", "H30", "C35", "H37", "H38", "C39")

for(i in orig.ids){
  new.id = new.ids[grep(i, new.ids)]
  sampleInfo[grep(i, sampleInfo$ID),]$ID = new.id
}
sampleInfo$Diagnosis = sampleInfo$ID
for(i in c("H", "C")){
  if(i == "H"){
    sampleInfo[grep(i, sampleInfo$ID),]$Diagnosis = "HCC"
  }else{
    sampleInfo[grep(i, sampleInfo$ID),]$Diagnosis = "iCCA"
  }
}

For preprocessing of scRNA-seq data, mapped reads are loaded as a Singlecellexperiment (SCE) object by read10XCounts() of DropletUtils R package. A SCE object contains a gene-by-cell count matrix, gene data (gene annotation, etc) and cell data (sample information, experimental condition information, etc). Gene information will be stored in rowData(SCE), and cell information is stored in colData(SCE). A gene-by-cell matrix is stored as a sparse matrix in a SCE object. Ensembl gene ids is transformed into gene symbol for eaiser further analysis. In colData(sce), Sample name, cancer histologic subtypes (Diagnosis), sample ID, cell type information (from the literature) is stored.

sce <- read10xCounts(
  samples = dir,
  type="sparse",
  col.names = TRUE
)
rownames(sce) = uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)

sce$Sample = sampleInfo$Sample
sce$Diagnosis = sampleInfo$Diagnosis
sce$ID = sampleInfo$ID
sce$Type = sampleInfo$Type

sce
## class: SingleCellExperiment 
## dim: 20124 5115 
## metadata(1): Samples
## assays(1): counts
## rownames(20124): RP11-34P13.7 FO538757.2 ... AC233755.1 AC240274.1
## rowData names(2): ID Symbol
## colnames(5115): AAACCTGAGGCGTACA-1 AAACGGGAGATCGATA-1 ...
##   TTTATGCTCCTTAATC-13 TTTGTCAGTTTGGGCC-13
## colData names(5): Sample Barcode Diagnosis ID Type
## reducedDimNames(0):
## altExpNames(0):

To remove low quality cells, several values such as number of unique molecular identifiers (UMIs) per cell, number of genes detected per cell, the percentage of UMIs assigned to mitochondrial (MT) genes are calculated using addPerCellQC() of scater R package. We define poor quality cells with < 700 UMIs and > 20% of UMIs assigned to MT genes and excluded them. Criteria can be visualized as histograms as below.

library(scater)

mtgenes = rowData(sce)[grep("MT-", rowData(sce)$Symbol),]$Symbol
is.mito = rownames(sce) %in% mtgenes
table(is.mito)

sce <- addPerCellQC(
  sce,
  subsets = list(MT=mtgenes),
  percent_top = c(50, 100, 200, 500), 
  detection_limit = 5
)

sce$log10_sum = log10(sce$sum + 1)
sce$log10_detected = log10(sce$detected + 1)

umi = 700
mtpct = 20

hist(sce$sum, breaks = 100)
abline(v = umi, col="red")

hist(sce$subsets_MT_percent, breaks=100)
abline(v=mtpct, col="red")

filter_by_total_counts = sce$sum > umi
filter_by_mt_percent = sce$subsets_MT_percent < mtpct

sce <- runColDataPCA(sce, variables = list("sum", "detected", "subsets_MT_percent", "percent.top_500"))

sce$use <- (
  filter_by_total_counts &
    filter_by_mt_percent 
)

plotReducedDim(sce, dimred="PCA_coldata", colour_by="sum")
plotReducedDim(sce, dimred="PCA_coldata", colour_by="subsets_MT_percent")
plotReducedDim(sce, dimred="PCA_coldata", colour_by="use")

sce = sce[,sce$use]
## is.mito
## FALSE  TRUE 
## 20111    13

References

L. Ma, M.O. Hernandez, Y. Zhao, M. Mehta, B. Tran, M. Kelly, Z. Rae, J.M. Hernandez, J.L. Davis, S.P. Martin, D.E. Kleiner, S.M. Hewitt, K. Ylaya, B.J. Wood, T.F. Greten, X.W. Wang. Tumor cell biodiversity drives microenvironmental reprogramming in liver cancer. Canc. Cell, 36 (4): 418-430 (2019)

McCarthy, D. J., Campbell, K. R., Lun, A. T. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017)

Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).