CONICS

CONICS: COpy-Number analysis In single-Cell RNA-Sequencing

CONICS works with either full transcript (e.g. Fluidigm C1) or 5'/3' tagged (e.g. 10X Genomics) data!

The CONICS paper has been accepted for publication in Bioinformatics. Check it out here !

CONICSmat - Identifying CNVs from scRNA-seq with only a count table (R tutorial)
Identifying CNVs from scRNA-seq using aligned reads and a control dataset (advanced users)
Integrating the minor-allele frequencies of point mutations (advanced users)
Phylogenetic tree contruction
Intra-clone co-expression networks
Assessing the correlation of CNV status with single-cell expression
False discovery rate estimation: Cross validation
False discovery rate estimation: Empirical testing

CONICSmat - Identifying CNVs from scRNA-seq using a count table

CONICSmat is an R package that can be used to identify CNVs in single cell RNA-seq data from a gene expression table, without the need of an explicit normal control dataset. CONICSmat works with either full transcript (e.g. Fluidigm C1) or 5'/3' tagged (e.g. 10X Genomics) data. A tutorial on how to use CONICSmat, and a Smart-Seq2 dataset, can be found on the CONICSmat Wiki page [CLICK here].

Visualizations of scRNA-seq data from Oligodendroglioma (Tirosh et al., 2016) generated with CONICSmat.

CONICS - Identifying CNVs from scRNA-seq with alignment files

Requirements

Python and Perl
beanplot
samtools
bedtools IMPORTANT: Bedtools >2.2.5 is needed in order to correctly calculate the coverage using CONICS.
Two directories, the first containing the aligned scRNA-seq data to be classified by CNV status, and a second, containing aligned scRNA-seq data to be used as a control.
A file containing the genomic coordinates of the CNVs in BED format.

Config file

Adjust CONICS.cfg to customize the following:

Path to python/samtools/bedtools/Rscript
Thresholds for mapping-quality and read-count
FDR for CNV calling

Test data

A set of tumor cells from three glioblastoma patients and a normal brain control, as well as a file with genomic coordinate of large-scale CNVs are available here.

Running

bash run_CONICS.sh [directory for tumor] [directory for normal] [.bed file for CNV segments] [base name]

[directory for tumor]: path to directory containing aligned bam files to be tested. Example glioblastoma data, used in the manuscript, can be obtained here.
[directory for normal]: path to directory containing aligned bam files to be used as a control. Example nonmalignant brain data, used in the manuscript, can be obtained here was used as an examples for the journal
[BED file for CNV segments]: tab-delimited bed file of CNV segments to be quantified.
```
[chromosome]	[start]	[end]	[chromosome:start:end:CNV]
```
Note: the 4th column of the file must have the exact format shown here:(Amp: amplification, Del: deletion)

7   19533   157408385   7:19533:157408385:Amp
9   19116859    32405639    9:19116859:32405639:Del

[base name] : base name for output directory

Output

All output files will be located in the directory output[base name]_.

incidenceMatrix.csv: matrix of presence/absence for all CNVs, in individual cells
Read-count distribution in CNV segments. (violin plot)
Hierarchical clustering of the single cells by CNV status.

Integrating estimates of point-mutation minor-allele frequencies

Regions of copy-number alteration will show a drop in the frequency of reads quantifying the minor allele. Averaged over large regions of copy-number alteration, this provides an additional metric to increase confidence in single-cell CNV-calls.

Requirements

Python and R
bam-readcount
gplots and ggplot2
One directory containing the aligned tumor scRNA-seq data to be classified
Two variant VCF files from exome-seq of (blood) control and tumor tissue, eg generated with the GATK toolkit.
A file containing the genomic coordinates of the CNVs in BED format.

Config file

Adjust CONICS.cfg to customize the following:

Path to python/bam-readcount/Rscript
Path to genome which reads were aligned to (FASTA format)

Running

 bash run_BAf_analysis.sh [directory for tumor] [VCF file for normal exome-seq] [VCF file for tumor exome-seq] [BED file for CNV segments] [base name]

[directory for tumor]: path to directory containing aligned bam files to be tested. Example glioblastoma data, used in the manuscript, can be obtained here.
[VCF file for normal exome-seq]: Vcf file containing mutations for a control exome-seq, e.g. from blood of the patient. This file can be generated with tools like GATK toolkit.
[VCF file for tumor exome-seq]: Vcf file containing mutations detected in exome-seq of the tumor. This file can be generated with tools like GATK toolkit.
[BED file for CNV segments]: tab-delimited bed file of CNV segments to be quantified.
[base name] : base name for output directory

Output

All output files will be located in the directory output[base name]_

_germline-snvs.bed: BED file containing position and BAFs from Exome-seq, generated in step 1.
_af.txt: TAB separated table containing the counts for the A allele at each locus in each cell, generated in step 2
_bf.txt: TAB separated table containing the counts for the B allele at each locus in each cell, generated in step 2
baf_hist.pdf Hierarchical clustering of the average B allele frequency in each of the loci altered by copy number for each cell, generated in step 3

Phylogenetic tree contruction

CONICS can generate a phylogenetic tree from the CNV incidence matrix, using the Fitch-Margoliash algorithm. Other phylogenetic reconstruction algorithms can be applied, using the incidence matrix as a starting point.

Requirements

Config file

Adjust Tree.cfg to change the following.

Path to Rscript
Path to Rphylip

Running

Before running, set the path to Phylip in Tree.cfg file.

bash run_Tree.sh [CNV presence/absence matrix][number of genotypes] [base name for output file]

[CNV presence/absence matrix]: .incidenceMatrix.csv files.
[number of genotypes]: the number of genotypes to model
[base name] : base name for output directory

Output

cluster.pdf (phylogenetic trees) and cluster.txt will be generated in the output directory. Each leaf corresponds to a clusters of cells with a common genotype. Cluster assignments for each cell will be in cluster.txt.

cluster_1  D12,E10,F9,G3,A12,C8,C9,A3,A5,A6,C3,C2,C1,C7,H12,C4,D8,D9,A9,E4,E7,E3,F1,E1,B5,B7,E9,B3,D7,D1
cluster_2  E8,G7,G9,A7,G2,B6,E2
cluster_3  H3,A2,A4,H8,G11,F2,F3,H1,H7
cluster_4  A10,B2
cluster_5  C5
cluster_6  F8,B1

Intra-clone co-expression networks

CONICS can construct the local co-expression network of a given gene, based on correlations across single cells.

Requirements

Config file

Adjust CorrelationNetwork.cfg to configure the following:

Path to Rscript
ncore: Number of cores (default: 12)
cor_threshold: Starting threshold to construct the co-expression network (default: 0.9)
min_neighbours: How many direct neighbours of gene of interest should be analyzed (default: 20)
minRawReads: How many raw reads should map to a gene for it to be included (default: 100)
percentCellsExpressing: Percentage (0.15 =15%) of cells expressing a gene for it to be included (default: 0.15)
minGenesExpr: How many genes should be expressed in a cell for it to be included (default: 800)
depth: How deep should the gene analysis search. (2=only direct neighbor genes would be considered) (default: 2)

Running

bash run_CorrelationNetwork.sh [input matrix] [centered gene] [base name]

[input matrix]: tab-delimited file of read counts for each gene (rows), for each cell (columns).
[centered gene]: a target gene of which neighbor genes are analyzed.
[base name] : base name for output directory

Output

All the output files will be located in output.

[correlstion_threshold]_[gene_name].txt : co-expression network
[gene_name]corMat.rd: Rdata containing the adjusted correlation matrix
topCorrelations.pdf: bar graph of top correlations.

Assessing the correlation of CNV status with single-cell gene-expression

Requirements

zoo

Config file

Adjust CompareExomeSeq_vs_ScRNAseq.cfg to set the following:

Path to Rscript
window size for assessing CNV status

Running

bash run_compareExomeSeq_vs_ScRNAseq.sh [matrix for read counts] [base name for output file]

[matrix for read counts]: tab-delimited file of the number of mapped reads to each gene in the DNA sequencing and in scRNA-seq. Genes on each chromosome should be ordered by their chromosomal position.
```
[gene] [chromosome] [start] [#read in DNA-seq(normal)] [#read in DNA-seq(tumor)] [#read in scRNA-seq(normal)] [#read in scRNA-seq(tumor)]
```
- example


DDX11L1   1   11874   538   199   5   0
WASH7P    1   14362   4263   6541   223   45

[base name] : base name for output directory

Output

Compare[window_size].pdf_ (Box plot) will be generated in the output directory.

False discovery rate estimation: Cross validation

CONICS can estimate false discovery rate via 10-fold cross-validation, using the user-supplied control scRNA-seq dataset. For example, in the manuscript cross validation was performed using normal brain controls.

Requirements

Config file

Adjust 10X_cross_validation.cfg to set the following:

Paths to python/samtools/bedtools/Rscript
Thresholds for mapping-quality and read-count.
FDR for CNV calling

Running

bash run_10X_cross_validation.sh [directory for control scRNA-seq] [.bed file containing CNV segments] [base name]

[directory for test]: path to directory containing the aligned BAM files of the scRNA-seq control data.
[.bed file for CNV segments], [base name] : same as described in run_CONICS.sh ;

Output

Box plot of 10 FDRs resulting from each pooled sample would be generated (boxplot.pdf) in the output directory.

False discovery rate estimation: Empirical testing

FDRs can also be estimated by empirical testing. In the manuscript, the number of false positive CNV calls was calculated using a non-malignant fetal brain dataset. These data are independent from the training set

Requirements

Config file

Adjust Empirical_validation.cfg to change the following:

Paths to python/samtools/bedtools/Rscript
Thresholds for mapping-quality and read count
FDR for CNV calling

Running

bash run_empirical_validation.sh [directory for train] [directory for test] [.bed files for CNV segments] [base name]

[directory for train]: path to directory containing aligned bam files of scRNA-seq data used as a control to call CNVs
[directory for test]: path to directory containing aligned bam files of scRNA-seq data known not to have CNVs, used as a gold standard.
[BED file for CNV segments] : same as described in run_CONICS.sh

Output

Box plot of FDRs will be generated (boxplot.pdf) in the output directory.

Name		Name	Last commit message	Last commit date
Latest commit History 199 Commits
CONICSmat		CONICSmat
backend		backend
images		images
10X_cross_validation.cfg		10X_cross_validation.cfg
CONICS.cfg		CONICS.cfg
CompareExomeSeq_vs_ScRNAseq.cfg		CompareExomeSeq_vs_ScRNAseq.cfg
CorrelationNetwork.cfg		CorrelationNetwork.cfg
Empirical_validation.cfg		Empirical_validation.cfg
README.md		README.md
SF10281c.cnv.merged_gt1500000_20percent.bed		SF10281c.cnv.merged_gt1500000_20percent.bed
Tree.cfg		Tree.cfg
chromosome_arm_positions_grch38.txt		chromosome_arm_positions_grch38.txt
chromosome_full_positions_grch38.txt		chromosome_full_positions_grch38.txt
chromosome_full_positions_mm10.txt		chromosome_full_positions_mm10.txt
run_10X_cross_validation.sh		run_10X_cross_validation.sh
run_BAf_analysis.sh		run_BAf_analysis.sh
run_CONICS.sh		run_CONICS.sh
run_CorrelationNetwork.sh		run_CorrelationNetwork.sh
run_Tree.sh		run_Tree.sh
run_compareExomeSeq_vs_ScRNAseq.sh		run_compareExomeSeq_vs_ScRNAseq.sh
run_empirical_validation.sh		run_empirical_validation.sh

Neurosurgery-Brain-Tumor-Center-DiazLab/CONICS

Folders and files

Latest commit

History

Repository files navigation

CONICS

Table of contents

CONICSmat - Identifying CNVs from scRNA-seq using a count table

CONICS - Identifying CNVs from scRNA-seq with alignment files

Requirements

Config file

Test data

Running

Output

Integrating estimates of point-mutation minor-allele frequencies

Requirements

Config file

Running

Output

Phylogenetic tree contruction

Requirements

Config file

Running

Output

Intra-clone co-expression networks

Requirements

Config file

Running

Output

Assessing the correlation of CNV status with single-cell gene-expression

Requirements

Config file

Running

Output

False discovery rate estimation: Cross validation

Requirements

Config file

Running

Output

False discovery rate estimation: Empirical testing

Requirements

Config file

Running

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

Packages