Fang0828/Texomer

Texomer

Author: Fang Wang Email: [email protected] or [email protected]

Draft date: Nov. 1, 2018

Description

Texomer is a tool for cancer genomic studies that performs allele-specific, tumor-deconvoluted transcriptome-exome integration of bulk whole exome sequencing (WES) and whole transcriptome sequencing (WTS) data obtained from autologous patient tissue samples.

It reports estimates of tumor purity at both the DNA and RNA levels as well as intratumor heterogeneity at the DNA level. Moreover, Texomer yields allele-specific copy numbers and expression levels along with a differential allelic cis-regulatory effect (DACRE) score for each variant allele.

If the input only includes information at the DNA level, it only reports the estimates at the DNA level and focuses on somatic mutations.

System requirements and dependency

Texomer runs on an x86_64 Linux system. It depends on samtools and bedtools to extract reads covering variant sites from WTS data.

It also requires R (version >= 3.4) and depends on the following R packages:

bbmle, emdbook, copynumber, TitanCNA, facets, mixtools, ASCAT and Sequenza.

These R packages are already incorporated in this release. Users do not need to install them.
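
A quick preflight check can confirm that the external tools are reachable before a run. The sketch below is a convenience and not part of Texomer: the function name `missing_tools` is hypothetical, and it assumes the executables are called `samtools`, `bedtools`, and `Rscript` on your system.

```python
import shutil

# Executable names are assumptions; adjust them if your installation differs.
REQUIRED_TOOLS = ("samtools", "bedtools", "Rscript")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

This only checks that the executables exist; it does not verify their versions.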

Installation

Please download the distribution and copy it to a location of your choice. If you are cloning from GitHub, ensure that you have git-lfs installed.

For example, if the downloaded distribution is Texomer-2.0.tar.gz, type 'tar zxvf Texomer-2.0.tar.gz'.

Then, run Texomer.py in the resulting folder.

Usage

Options:
  --version             show program's version number and exit
  -h, --help            Show this help message and exit.
  -p TEXOMER, --Texomer=TEXOMER
                        the path of Texomer
  -I INPUT, --Input=INPUT
                        the form of the input: BAM, Varscan, or Defined
  -t TUMOR, --Tumor=TUMOR
                        Tumor WES bam file.
  -n NORMAL, --Normal=NORMAL
                        Normal WES bam file.
  -o OUTPATH, --outpath=OUTPATH
                        the output path.
  -r RNA, --RNA=RNA     RNA-seq data bam file.
  -g GERMLINE, --germline=GERMLINE
                        You can input your own germline mutation file together with
                        somatic mutation file (-s). The file includes 8 columns:
                        chromosome, position, RefAllele, Altallele,
                        read counts of RefAllele in normal, read counts of
                        Altallele in normal, read counts of RefAllele in tumor
                        and read counts of Altallele in tumor, with a header,
                        separated by tabs.
  -s SOMATIC, --somatic=SOMATIC
                        You can input your own somatic mutation file together with
                        germline mutation file (-g). The file includes 8 columns:
                        chromosome, position, RefAllele, Altallele, read
                        counts of RefAllele in normal, read counts of
                        Altallele in normal, read counts of RefAllele in tumor
                        and read counts of Altallele in tumor, with a header,
                        separated by tabs.
  -v VARSCAN, --varscan=VARSCAN
                        the Varscan2 output file based on somatic calling
  -e SNVEXPRESS, --snvexpress=SNVEXPRESS
                        The allelic read count of mutation from RNA-seq
                        including 7 columns: chromosome, position, ref, alt,
                        refnum, altnum and type (germline or somatic)
  -u ITER, --iter=ITER  optimization using somatic mutations: 0 turns it off
                        and 1 turns it on. The default is 1.

Input files:

DNA input files:

Three kinds of input files are allowed in Texomer:

(1) BAM files of tumor and normal tissue from WES, through -t and -n;

(2) The output generated by Varscan, which includes both germline and somatic variants, through -v;

(3) Two Defined files: one contains germline variants through -g and the other contains somatic mutations through -s.

The two Defined files share the same format: 8 tab-delimited columns with a header line:
	chromosome (starting with "chr"),
	position,
	reference allele,
	alternative allele,
	ref allelic read counts in the normal sample,
	alt allelic read counts in the normal sample,
	ref allelic read counts in the tumor sample,
	alt allelic read counts in the tumor sample.
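
The column layout above can be checked before a run. The following is a minimal sketch, assuming a header line is present and read counts are non-negative integers; the function name `check_defined_file` is hypothetical and not part of Texomer.

```python
import csv
import io

EXPECTED_COLUMNS = 8  # chromosome, position, ref, alt, and four read counts


def check_defined_file(handle):
    """Return a list of problems found in a Defined germline/somatic file."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line (assumed to be present)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != EXPECTED_COLUMNS:
            problems.append(f"line {line_no}: expected 8 columns, got {len(row)}")
            continue
        chrom, pos, _ref, _alt, *counts = row
        if not chrom.startswith("chr"):
            problems.append(f"line {line_no}: chromosome '{chrom}' lacks 'chr' prefix")
        if not pos.isdigit():
            problems.append(f"line {line_no}: position '{pos}' is not an integer")
        if not all(count.isdigit() for count in counts):
            problems.append(f"line {line_no}: read counts must be non-negative integers")
    return problems


# In-memory example: the second data row is missing the "chr" prefix.
example = io.StringIO(
    "chr\tpos\tref\talt\trefNumN\taltNumN\trefNumT\taltNumT\n"
    "chr1\t12198\tG\tC\t51\t40\t26\t25\n"
    "1\t12383\tG\tA\t31\t22\t15\t19\n"
)
for problem in check_defined_file(example):
    print(problem)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.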

RNA input files:

Two kinds of input files are allowed in Texomer:

(1) An RNA-seq BAM file.

(2) Allelic read counts of mutations from the RNA-seq BAM, which include 7 tab-delimited columns:
	chromosome (starting with "chr"),
	position,
	reference allele,
	alternative allele,
	reference allelic read count from the RNA-seq bam,
	alternative allelic read count from the RNA-seq bam,
	type of mutation (germline or somatic).    
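
The 7-column file can be loaded into typed records for inspection. This is a sketch under stated assumptions: a header line is present, and the mutation type is accepted case-insensitively, which is a choice made here rather than documented Texomer behaviour. The name `read_rna_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

RNA_COLUMNS = ("chr", "pos", "ref", "alt", "refNum", "altNum", "type")


def read_rna_counts(handle):
    """Parse the 7-column RNA allelic count file into a list of dicts."""
    reader = csv.DictReader(handle, fieldnames=RNA_COLUMNS, delimiter="\t")
    next(reader, None)  # skip the header line (assumed to be present)
    records = []
    for row in reader:
        mutation_type = (row["type"] or "").strip().lower()
        if mutation_type not in ("germline", "somatic"):
            raise ValueError(f"unexpected mutation type: {row['type']!r}")
        records.append({
            "chr": row["chr"], "pos": int(row["pos"]),
            "ref": row["ref"], "alt": row["alt"],
            "refNum": int(row["refNum"]), "altNum": int(row["altNum"]),
            "type": mutation_type,
        })
    return records


# In-memory example with two germline sites.
example = io.StringIO(
    "chr\tpos\tref\talt\trefNum\taltNum\ttype\n"
    "chr1\t12198\tG\tC\t8\t0\tgermline\n"
    "chr1\t12383\tG\tA\t2\t0\tgermline\n"
)
records = read_rna_counts(example)
print(Counter(record["type"] for record in records))
```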

Output files:

Texomer writes several output files. If the input includes an RNA file, Texomer generates output at both the DNA and the RNA levels. Otherwise, it outputs estimates at the DNA level only.

(1) output.segment.txt file:
	position of copy number segments,
	allele-specific copy number (Dmajor and Dminor),
	allele-specific expression levels (Rmajor and Rminor), and
	posterior probability of discordant expression corresponding to each segment.

(2) output.mutation.txt file:
	position of mutations,
	reference (ref) and alternative (alt) allele,
	allelic read count from RNA-seq (refNum and altNum),
	mutation type (germline or somatic),
	allele-specific copy number (altD corresponding to alternative allele and wildD corresponding to reference allele),
	allele-specific expression level (altR corresponding to alternative allele and wildR corresponding to reference allele),
	posterior probability of discordant expression and DACRE score for each mutation.

(3) output.summaryres.txt file:
	tumor purity at the DNA and/or the RNA level,
	ploidy and intratumor heterogeneity at the DNA level.

Run Texomer

The Python script for running Texomer is in the release directory. You can tune the parameters as needed.

python Texomer.py [-t <tumor bam file>] [-n <normal bam file>] [-r <RNA bam file>] [-o <output path>] [-v <Varscan output>] [-g <Defined germline mutation input file>] [-s <Defined somatic mutation input file>] [-u <optimization>] [-e <Defined expression file of mutation>] -p <Texomer path> -I <input form>

Parameters in [...] are optional. The mandatory arguments are -p and -I. The input form is one of BAM, Varscan, or Defined.
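
When Texomer is launched from another script, the command line can be assembled programmatically. The sketch below uses only the flags listed under Usage. The function name `build_texomer_command` and the per-form required inputs are assumptions inferred from the Input files section, so Texomer.py itself may enforce different rules.

```python
import shlex

# Short flags exactly as listed in the Usage section.
FLAGS = {
    "tumor": "-t", "normal": "-n", "rna": "-r", "outpath": "-o",
    "varscan": "-v", "germline": "-g", "somatic": "-s",
    "iter": "-u", "snvexpress": "-e",
}

# Assumption: inputs each form needs, inferred from the Input files section.
REQUIRED = {
    "BAM": ("tumor", "normal"),
    "Varscan": ("varscan",),
    "Defined": ("germline", "somatic"),
}


def build_texomer_command(texomer_path, input_form, **options):
    """Assemble a Texomer.py command line as an argument list."""
    if input_form not in REQUIRED:
        raise ValueError(f"input form must be one of {sorted(REQUIRED)}")
    unknown = [name for name in options if name not in FLAGS]
    if unknown:
        raise ValueError(f"unknown options: {', '.join(unknown)}")
    missing = [name for name in REQUIRED[input_form] if name not in options]
    if missing:
        raise ValueError(f"{input_form} input needs: {', '.join(missing)}")
    command = ["python", "Texomer.py", "-p", texomer_path, "-I", input_form]
    for name, value in options.items():
        command += [FLAGS[name], str(value)]
    return command


command = build_texomer_command(
    "./", "Defined",
    germline="./example/germline.input",
    somatic="./example/somatic.input.vcf",
    snvexpress="./example/RNA.SNV",
    outpath="./res1",
)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the package directory to start the run.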

About the default parameters

Texomer optimizes the estimation of tumor purity and allele-specific copy numbers by combining both germline SNPs and somatic SNVs.

By default, -u is 1, which turns on iterative optimization.

A user can set -u 0 to turn off the iterative optimization.

Example

Run Texomer from the package directory on the example datasets below.

Example 1: Input defined mutation file

python Texomer.py -p ./ -I Defined -g ./example/germline.input -s ./example/somatic.input.vcf -e ./example/RNA.SNV -o ./res1

germline.input:

chr	pos	ref	alt	refNumN	altNumN	refNumT	altNumT
chr1	12198	G	C	51	40	26	25
chr1	12383	G	A	31	22	15	19

somatic.input:

chr	pos	ref	alt	refNumN	altNumN	refNumT	altNumT
chr1	16757604	G	A	170	0	45	8
chr1	23083363	G	A	34	0	22	3

RNA.SNV:

chr	pos	ref	alt	refNum	altNum	type
chr1	12198	G	C	8	0	germline
chr1	12383	G	A	2	0	germline

Example 2: Input varscan2 output

python Texomer.py -p ./ -I Varscan -v ./example/varscan.snp -r ./example/RNA.bam -o ./res2

The example BAM files are not provided on GitHub because of file size limits.

Example 3: Input WES bam file

python Texomer.py -p ./ -I BAM -t ./example/Tumor.bam -n ./example/Normal.bam -r ./example/RNA.bam -o ./res3

The BAM files should be aligned to GRCh38. If you input BAM files, Texomer takes longer to run because it must first call mutations from them.

Texomer output:

output.summaryres.txt

DNApurity	0.272030490933697
Heterogeneity	0.362166913670839
Ploidy	2.43420373296619
RNApurity	0.333623992547959
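
The summary file can be read into a dictionary for downstream use. This is a minimal sketch, assuming each line holds a name and a numeric value separated by whitespace, as in the example above; `read_summary` is a hypothetical helper name.

```python
import io


def read_summary(handle):
    """Parse name/value lines from output.summaryres.txt into a dict."""
    summary = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        # Assumes the value is numeric; a non-numeric value raises ValueError.
        summary[fields[0]] = float(fields[1])
    return summary


example = io.StringIO(
    "DNApurity\t0.272030490933697\n"
    "Heterogeneity\t0.362166913670839\n"
    "Ploidy\t2.43420373296619\n"
    "RNApurity\t0.333623992547959\n"
)
summary = read_summary(example)
# RNApurity is only reported when RNA input was supplied, so use .get().
print(summary["DNApurity"], summary.get("RNApurity"))
```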

output.segment.txt

chr	start	end	Dmajor	Dminor	Rmajor	Rminor	RTEL	BayesP
chr1	12383	149854059	3	0	3	0	3	0.00271288011564652
chr1	149904572	151142541	11	0	11	0	11	0.00976386546342944

Columns:
	Dmajor: copy number of the major allele.
	Dminor: copy number of the minor allele.
	Rmajor: expression level of the major allele.
	Rminor: expression level of the minor allele.
	RTEL: total expression level of the two alleles.
	BayesP: posterior probability that the expression level is discordant with the copy number.
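
Segments can be filtered by BayesP to list those whose expression is likely discordant with copy number. The sketch below assumes a tab-delimited file with the header shown above and numeric BayesP values. The function name `discordant_segments` and the 0.5 default threshold are illustrative choices, not Texomer recommendations.

```python
import csv
import io


def discordant_segments(handle, threshold=0.5):
    """Return (chr, start, end) for segments with BayesP above threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["chr"], int(row["start"]), int(row["end"]))
        for row in reader
        if float(row["BayesP"]) > threshold
    ]


example_text = (
    "chr\tstart\tend\tDmajor\tDminor\tRmajor\tRminor\tRTEL\tBayesP\n"
    "chr1\t12383\t149854059\t3\t0\t3\t0\t3\t0.00271288011564652\n"
    "chr1\t149904572\t151142541\t11\t0\t11\t0\t11\t0.00976386546342944\n"
)
# Neither example segment exceeds the default threshold of 0.5.
print(discordant_segments(io.StringIO(example_text)))
# A much lower threshold keeps only the second segment.
print(discordant_segments(io.StringIO(example_text), threshold=0.005))
```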

output.mutations.txt

chr	pos	ref	alt	refNum	altNum	type	altD	altR	wildD	wildR	BayesP	eASEL	AEI	DACRE
chr1	14464	A	T	161	29	Germline	0	0.869112773207088	3	4.26647728279266	0.165069332450197	0.869112773207088	-2.29543007263941	-2.99135865987574
chr1	14677	G	A	562	275	Germline	0	0.565025922875357	3	3.23925254542283	0.0153830861293918	0.565025922875357	-2.51927198724942	-2.26822077283764

Columns:
	refNum: read counts of the reference allele from RNA-seq data.
	altNum: read counts of the alternative allele from RNA-seq data.
	type: mutation type (Germline or Somatic).
	altD: copy number of the alternative allele.
	altR: expression level of the alternative allele.
	wildD: copy number of the reference allele (wildtype allele).
	wildR: expression level of the reference allele (wildtype allele).
	BayesP: posterior probability that the expression level is discordant with the copy number.
	eASEL: difference of the alternative allele between expression and copy number (altR-altD).
	AEI: difference of expression levels between the alternative and reference allele on the log2 scale (log2(altR/wildR), as in the example rows above).
	DACRE: DACRE score.
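
As a consistency check on the example rows above, eASEL and AEI can be recomputed from altD, altR, and wildR. In those two rows, eASEL equals altR - altD and AEI equals log2(altR/wildR). The sketch below relies on that observation from the example rows only, not on Texomer's source code, and `allelic_metrics` is a hypothetical helper name. DACRE is not recomputed because its formula is not given here.

```python
import math


def allelic_metrics(alt_d, alt_r, wild_r):
    """Recompute (eASEL, AEI) as observed in the example output rows."""
    easel = alt_r - alt_d               # expression minus copy number
    aei = math.log2(alt_r / wild_r)     # requires alt_r > 0 and wild_r > 0
    return easel, aei


# First example row (chr1:14464): altD=0, altR and wildR as reported.
easel, aei = allelic_metrics(0, 0.869112773207088, 4.26647728279266)
print(easel, aei)
```

The printed values can be compared with the eASEL and AEI columns of the same row.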