diff --git a/.idea/.gitignore b/.idea/.gitignore new file mode 100644 index 0000000..13566b8 --- /dev/null +++ b/.idea/.gitignore @@ -0,0 +1,8 @@ +# Default ignored files +/shelf/ +/workspace.xml +# Editor-based HTTP Client requests +/httpRequests/ +# Datasource local storage ignored files +/dataSources/ +/dataSources.local.xml diff --git a/.idea/.name b/.idea/.name new file mode 100644 index 0000000..8da8728 --- /dev/null +++ b/.idea/.name @@ -0,0 +1 @@ +proseq2.0_mt \ No newline at end of file diff --git a/.idea/inspectionProfiles/Project_Default.xml b/.idea/inspectionProfiles/Project_Default.xml new file mode 100644 index 0000000..0bf95dd --- /dev/null +++ b/.idea/inspectionProfiles/Project_Default.xml @@ -0,0 +1,21 @@ + + + + \ No newline at end of file diff --git a/.idea/inspectionProfiles/profiles_settings.xml b/.idea/inspectionProfiles/profiles_settings.xml new file mode 100644 index 0000000..105ce2d --- /dev/null +++ b/.idea/inspectionProfiles/profiles_settings.xml @@ -0,0 +1,6 @@ + + + + \ No newline at end of file diff --git a/.idea/misc.xml b/.idea/misc.xml new file mode 100644 index 0000000..d1e22ec --- /dev/null +++ b/.idea/misc.xml @@ -0,0 +1,4 @@ + + + + \ No newline at end of file diff --git a/.idea/modules.xml b/.idea/modules.xml new file mode 100644 index 0000000..8a2ddd6 --- /dev/null +++ b/.idea/modules.xml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/.idea/proseq2.0_mt.iml b/.idea/proseq2.0_mt.iml new file mode 100644 index 0000000..d0876a7 --- /dev/null +++ b/.idea/proseq2.0_mt.iml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/.idea/vcs.xml b/.idea/vcs.xml new file mode 100644 index 0000000..94a25f7 --- /dev/null +++ b/.idea/vcs.xml @@ -0,0 +1,6 @@ + + + + + + \ No newline at end of file diff --git a/README.md b/README.md index e7468e3..13ef6f0 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,10 @@ Preprocesses and Aligns Run-On Sequencing (PRO/GRO/ChRO-seq) data from Single-Re Currently we provide two commands: proseq mapper and bigWig merge. +# MULTITHREADING: + +This is a fork of the original [proseq2.0](https://github.com/Danko-Lab/proseq2.0.git) repo. This package 1) multithreads cutadapt and 2) executes all commands in `./proseq2.0.bsh` sequentially in the main shell (no farming out commands to the background with `&`. This should fix an issue in the original proseq2.0 where the script would randomly hang on the `wait` lines. + ## Overview Our proseq2.0 pipeline will take single-end or paired-end sequencing reads in fastq.gz format as input. The pipeline will automate three routine pre-processing and alignment options, including + pre-processing reads: remove the adapter sequence and quality trim the reads (cutadapt), deduplicate the reads if UMI barcodes are used (prinseq-lite.pl) @@ -18,7 +22,7 @@ Chu, T., Wang, Z., Chou, S. P., & Danko, C. G. (2018). Discovering Transcription ## Dependencies -The pipelines depend on several common bioinformatics tools: +The pipelines depend on several common bioinformatics tools: - [ ] cutadapt (https://cutadapt.readthedocs.io/en/stable/installation.html) - [ ] fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) - [ ] seqtk (https://github.com/lh3/seqtk) @@ -28,16 +32,16 @@ The pipelines depend on several common bioinformatics tools: - [ ] bedtools v2.28.0 (http://bedtools.readthedocs.org/en/latest/) - [ ] bedGraphToBigWig (from the Kent source utilities http://hgdownload.cse.ucsc.edu/admin/exe/) -Please make sure you can call the bioinformatics tools from your current working directory. +Please make sure you can call the bioinformatics tools from your current working directory. ## Usage ``` Preprocesses and aligns PRO-seq data. -Takes PREFIX.fastq.gz (SE), PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz (PE) +Takes PREFIX.fastq.gz (SE), PREFIX_1.fastq.gz, PREFIX_2.fastq.gz (PE) or *.fastq.gz in the current working directory as input and writes BAM and bigWig files as output to the user-assigned output-dir. -The output bigWig files ending with _minus.bw or _plus.bw are raw read counts without normalization. +The output bigWig files ending with _minus.bw or _plus.bw are raw read counts without normalization. The RPM normalized outputs end with a suffix of .rpm.bw. @@ -61,8 +65,8 @@ Required options: I/O options: -I, --fastq=PREFIX Prefix for input files. Paired-end files require identical prefix - and end with _R1.fastq.gz and _R2.fastq.gz - eg: PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz. + and end with _1.fastq.gz and _2.fastq.gz + eg: PREFIX_1.fastq.gz, PREFIX_2.fastq.gz. -T, --tmp=PATH Path to a temporary storage directory. -O, --output-dir=DIR Specify a directory to store output in. @@ -115,7 +119,7 @@ When UMI1 or UMI2 are set > 0, the pipeline will perform PCR deduplicate. -4DREG Using the pre-defined parameters to get the most reads for dREG package. Please use this flag to make the bigWig - files compatible with dREG algorithm. Only available for + files compatible with dREG algorithm. Only available for Single-end sequencing.[default: off] -aln Use BWA-backtrack [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)] -mem Use BWA-MEM [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)] @@ -127,7 +131,7 @@ When UMI1 or UMI2 are set > 0, the pipeline will perform PCR deduplicate. ## Examples -The pipeline requires two parameters for genome information, including BWA index (--bwa-index) and chrom info (--chrom-info). +The pipeline requires two parameters for genome information, including BWA index (--bwa-index) and chrom info (--chrom-info). __BWA index__ should be generated using the __bwa index__ command according to BWA manual at http://bio-bwa.sourceforge.net/bwa.shtml . Please note that the program only take in the prefix when you assign the index, no ".bwt" in the end. See the BWA manual for more details. @@ -141,7 +145,7 @@ export chromInfo=PathToChromInfo ### Example 1 -PREFIX.fastq.gz were made according to GRO-seq protocol as in https://www.ncbi.nlm.nih.gov/pubmed/19056941 +PREFIX.fastq.gz were made according to GRO-seq protocol as in https://www.ncbi.nlm.nih.gov/pubmed/19056941 Give UMI1=6, the pipeline will remove PCR duplicates and trim the 6bp UMI barcode. ``` bash proseq2.0.bsh -i $bwaIndex -c $chromInfo -SE -G -T myOutput1 -O myOutput1 --UMI1=6 -I PREFIX @@ -155,16 +159,16 @@ bash proseq2.0.bsh -i $bwaIndex -c $chromInfo -SE -P -T myOutput2 -O myOutput2 - ``` ### Example 3 -__PREFIX_R1.fastq.gz__ and __PREFIX_R2.fastq.gz__ were Paired-End sequenced as in chromatin run-on and sequencing (ChRO-seq) in https://www.biorxiv.org/content/early/2017/09/07/185991 -* Please note that Paired-end files require identical PREFIX and end with _R1.fastq.gz and _R2.fastq.gz. +__PREFIX_1.fastq.gz__ and __PREFIX_2.fastq.gz__ were Paired-End sequenced as in chromatin run-on and sequencing (ChRO-seq) in https://www.biorxiv.org/content/early/2017/09/07/185991 +* Please note that Paired-end files require identical PREFIX and end with _1.fastq.gz and _2.fastq.gz. - Assign the file use __-I PREFIX__. No _R1.fastq.gz, _R2.fastq.gz, nor *fastq.gz is in the end. -* There is a 6N UMI barcode on R1. Pipeline will perform PCR deduplicat. + Assign the file use __-I PREFIX__. No _1.fastq.gz, _2.fastq.gz, nor *fastq.gz is in the end. +* There is a 6N UMI barcode on R1. Pipeline will perform PCR deduplicat. ``` bash proseq2.0.bsh -i $bwaIndex -c $chromInfo -PE --RNA3=R1_5prime -T myOutput3 -O myOutput3 -I PREFIX --UMI1=6 --ADAPT1=GATCGTCGGACTGTAGAACTCTGAAC --ADAPT2=TGGAATTCTCGGGTGCCAAGG ``` ### Example 4 -Same as in Example 3 but without UMI barcode. +Same as in Example 3 but without UMI barcode. * UMI1 and UMI2 were set to 0 by default. The pipeline will NOT remove PCR duplicates. ``` bash proseq2.0.bsh -i $bwaIndex -c $chromInfo -PE --RNA3=R1_5prime -T myOutput4 -O myOutput4 -I PREFIX --ADAPT1=GATCGTCGGACTGTAGAACTCTGAAC --ADAPT2=TGGAATTCTCGGGTGCCAAGG @@ -185,12 +189,12 @@ bash proseq2.0.bsh -i $bwaIndex -c $chromInfo -PE --UMI1=4 --UMI2=4 --ADD_B1=6 - ## Notes for **CBSUdanko** users: 1. Setup your environment to use the bioinformatics tools (e.g. prinseq-lite.pl,bedGraphToBigWig,samtools...) -``` +``` export PATH=$PATH:/programs/prinseq-lite-0.20.2:/programs:/home/zw355/lib/bin:/home/zw355/lib/ucsc ``` 2. Find the BWA index and chromosome table in the server: -``` +``` export human_genome=/local/storage/data/short_read_index/hg19/bwa.rRNA-0.7.5a-r405/hg19.rRNA export human_chinfo=/local/storage/data/hg19/hg19.chromInfo @@ -199,13 +203,13 @@ export mouse_chinfo=/local/storage/data/mm10/mm10.chromInfo export dog_genome=/local/storage/data/short_read_index/canFam3/bwa.rRNA-0.7.8-r455/canFam3.rRNA.fa export dog_chinfo=/local/storage/data/canFam3/canFam3.chromInfo -``` +``` 3. Using --UMI1=6 to replace -b6 if you have used it in the old version (proseqMapper.bsh). ## Notes for **dREG** users: -In order to make the most compatible with dREG algorithm, please use **-4DREG** flag when you process the PRO-seq and GRO-seq reads. The dREG package needs enriched reads to +In order to make the most compatible with dREG algorithm, please use **-4DREG** flag when you process the PRO-seq and GRO-seq reads. The dREG package needs enriched reads to detect the transcriptional peaks, we use the "bwa aln" to do mappping and set lower filtering score (0) to get the most reads in this pipeline. Only available for Single-end sequencing. Here is an examples to generate the bigWig for dREG. diff --git a/input_file_exmaples/.DS_Store b/input_file_exmaples/.DS_Store deleted file mode 100644 index 5008ddf..0000000 Binary files a/input_file_exmaples/.DS_Store and /dev/null differ diff --git a/input_file_exmaples/mm10.chromInfo b/input_file_exmaples/mm10.chromInfo deleted file mode 100644 index 657a3cf..0000000 --- a/input_file_exmaples/mm10.chromInfo +++ /dev/null @@ -1,66 +0,0 @@ -chr1 195471971 -chr10 130694993 -chr11 122082543 -chr12 120129022 -chr13 120421639 -chr14 124902244 -chr15 104043685 -chr16 98207768 -chr17 94987271 -chr18 90702639 -chr19 61431566 -chr1_GL456210_random 169725 -chr1_GL456211_random 241735 -chr1_GL456212_random 153618 -chr1_GL456213_random 39340 -chr1_GL456221_random 206961 -chr2 182113224 -chr3 160039680 -chr4 156508116 -chr4_GL456216_random 66673 -chr4_JH584292_random 14945 -chr4_GL456350_random 227966 -chr4_JH584293_random 207968 -chr4_JH584294_random 191905 -chr4_JH584295_random 1976 -chr5 151834684 -chr5_JH584296_random 199368 -chr5_JH584297_random 205776 -chr5_JH584298_random 184189 -chr5_GL456354_random 195993 -chr5_JH584299_random 953012 -chr6 149736546 -chr7 145441459 -chr7_GL456219_random 175968 -chr8 129401213 -chr9 124595110 -chrM 16299 -chrX 171031299 -chrX_GL456233_random 336933 -chrY 91744698 -chrY_JH584300_random 182347 -chrY_JH584301_random 259875 -chrY_JH584302_random 155838 -chrY_JH584303_random 158099 -chrUn_GL456239 40056 -chrUn_GL456367 42057 -chrUn_GL456378 31602 -chrUn_GL456381 25871 -chrUn_GL456382 23158 -chrUn_GL456383 38659 -chrUn_GL456385 35240 -chrUn_GL456390 24668 -chrUn_GL456392 23629 -chrUn_GL456393 55711 -chrUn_GL456394 24323 -chrUn_GL456359 22974 -chrUn_GL456360 31704 -chrUn_GL456396 21240 -chrUn_GL456372 28664 -chrUn_GL456387 24685 -chrUn_GL456389 28772 -chrUn_GL456370 26764 -chrUn_GL456379 72385 -chrUn_GL456366 47073 -chrUn_GL456368 20208 -chrUn_JH584304 114452 diff --git a/input_file_exmaples/test_R1.fastq.gz b/input_file_exmaples/test_R1.fastq.gz deleted file mode 100644 index 9a5c595..0000000 Binary files a/input_file_exmaples/test_R1.fastq.gz and /dev/null differ diff --git a/input_file_exmaples/test_R2.fastq.gz b/input_file_exmaples/test_R2.fastq.gz deleted file mode 100644 index 6ab9d7f..0000000 Binary files a/input_file_exmaples/test_R2.fastq.gz and /dev/null differ diff --git a/input_file_exmaples/test_SE.fastq.gz b/input_file_exmaples/test_SE.fastq.gz deleted file mode 100644 index 9a5c595..0000000 Binary files a/input_file_exmaples/test_SE.fastq.gz and /dev/null differ diff --git a/output_file_exmaples/myOutput1/test_SE.prinseq-pcrDups.gd b/output_file_exmaples/myOutput1/test_SE.prinseq-pcrDups.gd deleted file mode 100644 index 0855127..0000000 --- a/output_file_exmaples/myOutput1/test_SE.prinseq-pcrDups.gd +++ /dev/null @@ -1,10 +0,0 @@ -Input and filter stats: - Input sequences: 6,425 - Input bases: 247,838 - Input mean length: 38.57 - Good sequences: 6,425 (100.00%) - Good bases: 247,838 - Good mean length: 38.57 - Bad sequences: 0 (0.00%) - Sequences filtered by specified parameters: - none diff --git a/output_file_exmaples/myOutput1/test_SE_dedup.align.log b/output_file_exmaples/myOutput1/test_SE_dedup.align.log deleted file mode 100644 index 0338fef..0000000 --- a/output_file_exmaples/myOutput1/test_SE_dedup.align.log +++ /dev/null @@ -1,5 +0,0 @@ -test_SE_dedup -Number of mappable reads: -2183 -Number of mappable reads (excluding rRNA): -2181 diff --git a/output_file_exmaples/myOutput1/test_SE_dedup.sort.bam b/output_file_exmaples/myOutput1/test_SE_dedup.sort.bam deleted file mode 100644 index 71100e6..0000000 Binary files a/output_file_exmaples/myOutput1/test_SE_dedup.sort.bam and /dev/null differ diff --git a/output_file_exmaples/myOutput1/test_SE_dedup_minus.bw b/output_file_exmaples/myOutput1/test_SE_dedup_minus.bw deleted file mode 100644 index 79fa61f..0000000 Binary files a/output_file_exmaples/myOutput1/test_SE_dedup_minus.bw and /dev/null differ diff --git a/output_file_exmaples/myOutput1/test_SE_dedup_plus.bw b/output_file_exmaples/myOutput1/test_SE_dedup_plus.bw deleted file mode 100644 index ef2b2c8..0000000 Binary files a/output_file_exmaples/myOutput1/test_SE_dedup_plus.bw and /dev/null differ diff --git a/output_file_exmaples/myOutput2/test_SE.prinseq-pcrDups.gd b/output_file_exmaples/myOutput2/test_SE.prinseq-pcrDups.gd deleted file mode 100644 index 0855127..0000000 --- a/output_file_exmaples/myOutput2/test_SE.prinseq-pcrDups.gd +++ /dev/null @@ -1,10 +0,0 @@ -Input and filter stats: - Input sequences: 6,425 - Input bases: 247,838 - Input mean length: 38.57 - Good sequences: 6,425 (100.00%) - Good bases: 247,838 - Good mean length: 38.57 - Bad sequences: 0 (0.00%) - Sequences filtered by specified parameters: - none diff --git a/output_file_exmaples/myOutput2/test_SE_dedup.align.log b/output_file_exmaples/myOutput2/test_SE_dedup.align.log deleted file mode 100644 index 0338fef..0000000 --- a/output_file_exmaples/myOutput2/test_SE_dedup.align.log +++ /dev/null @@ -1,5 +0,0 @@ -test_SE_dedup -Number of mappable reads: -2183 -Number of mappable reads (excluding rRNA): -2181 diff --git a/output_file_exmaples/myOutput2/test_SE_dedup.sort.bam b/output_file_exmaples/myOutput2/test_SE_dedup.sort.bam deleted file mode 100644 index 5ef4ebb..0000000 Binary files a/output_file_exmaples/myOutput2/test_SE_dedup.sort.bam and /dev/null differ diff --git a/output_file_exmaples/myOutput2/test_SE_dedup_minus.bw b/output_file_exmaples/myOutput2/test_SE_dedup_minus.bw deleted file mode 100644 index c671b7b..0000000 Binary files a/output_file_exmaples/myOutput2/test_SE_dedup_minus.bw and /dev/null differ diff --git a/output_file_exmaples/myOutput2/test_SE_dedup_plus.bw b/output_file_exmaples/myOutput2/test_SE_dedup_plus.bw deleted file mode 100644 index 028b27b..0000000 Binary files a/output_file_exmaples/myOutput2/test_SE_dedup_plus.bw and /dev/null differ diff --git a/output_file_exmaples/myOutput3/test.prinseq-pcrDups.gd b/output_file_exmaples/myOutput3/test.prinseq-pcrDups.gd deleted file mode 100644 index ad33b0e..0000000 --- a/output_file_exmaples/myOutput3/test.prinseq-pcrDups.gd +++ /dev/null @@ -1,20 +0,0 @@ -Input and filter stats: - Input sequences (file 1): 6,425 - Input bases (file 1): 247,838 - Input mean length (file 1): 38.57 - Input sequences (file 2): 6,773 - Input bases (file 2): 261,248 - Input mean length (file 2): 38.57 - Good sequences (pairs): 6,319 - Good bases (pairs): 487,421 - Good mean length (pairs): 77.14 - Good sequences (singletons file 1): 106 (1.65%) - Good bases (singletons file 1): 4,108 - Good mean length (singletons file 1): 38.75 - Good sequences (singletons file 2): 454 (6.70%) - Good bases (singletons file 2): 17,557 - Good mean length (singletons file 2): 38.67 - Bad sequences (file 1): 0 (0.00%) - Bad sequences (file 2): 0 (0.00%) - Sequences filtered by specified parameters: - none diff --git a/output_file_exmaples/myOutput3/test_dedup_end.align.log b/output_file_exmaples/myOutput3/test_dedup_end.align.log deleted file mode 100644 index c643abb..0000000 --- a/output_file_exmaples/myOutput3/test_dedup_end.align.log +++ /dev/null @@ -1,5 +0,0 @@ -test_dedup_end -Number of mappable reads: -4250 -Number of mappable reads (excluding rRNA): -4245 diff --git a/output_file_exmaples/myOutput3/test_dedup_end.sort.bam b/output_file_exmaples/myOutput3/test_dedup_end.sort.bam deleted file mode 100644 index 95d928d..0000000 Binary files a/output_file_exmaples/myOutput3/test_dedup_end.sort.bam and /dev/null differ diff --git a/output_file_exmaples/myOutput3/test_dedup_end_minus.bw b/output_file_exmaples/myOutput3/test_dedup_end_minus.bw deleted file mode 100644 index 4aa60a9..0000000 Binary files a/output_file_exmaples/myOutput3/test_dedup_end_minus.bw and /dev/null differ diff --git a/output_file_exmaples/myOutput3/test_dedup_end_plus.bw b/output_file_exmaples/myOutput3/test_dedup_end_plus.bw deleted file mode 100644 index a6e43d4..0000000 Binary files a/output_file_exmaples/myOutput3/test_dedup_end_plus.bw and /dev/null differ diff --git a/proseq2.0.bsh b/proseq2.0.bsh index 4d3df9a..a1cac74 100644 --- a/proseq2.0.bsh +++ b/proseq2.0.bsh @@ -8,9 +8,9 @@ while test $# -gt 0; do echo "Preprocesses and aligns PRO-seq data." echo "" echo "Takes PREFIX.fastq.gz (SE), PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz (PE)" - echo "or *.fastq.gz in the current working directory as input and writes" + echo "or *.fastq.gz in the current working directory as input and writes" echo "BAM and bigWig files as output to the user-assigned output-dir." - echo "The output bigWig files ending with _minus.bw or _plus.bw are raw read counts without normalization." + echo "The output bigWig files ending with _minus.bw or _plus.bw are raw read counts without normalization." echo "The RPM normalized outputs end with a suffix of .rpm.bw." echo "" echo "Requirements in current working directory:" @@ -78,7 +78,7 @@ while test $# -gt 0; do echo "--Force_deduplicate=FALSE" echo " When --Force_deduplicate=TRUE, it will force the pipeline to" echo " perform PCR deduplicate even there is no UMI barcode" - echo " (i.e. UMI1=0 and UMI2=0). [default: FALSE]" + echo " (i.e. UMI1=0 and UMI2=0). [default: FALSE]" echo "--ADD_B1=0 The length of additional barcode that will be trimmed" echo " on the 5' of R1 read. [default: 0]" echo "--ADD_B2=0 The length of additional barcode that will be trimmed" @@ -87,12 +87,12 @@ while test $# -gt 0; do echo "" echo "-4DREG Using the pre-defined parameters to get the most reads" echo " for dREG package. Please use this flag to make the bigWig" - echo " files compatible with dREG algorithm. [default: off, only available to SE] " + echo " files compatible with dREG algorithm. [default: off, only available to SE] " echo "-aln Use BWA-backtrack [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)]" echo "-mem Use BWA-MEM [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)]" echo "--MAP_LENGTH Set a data-set wide length cutoff for mapping (e.g. --MAP_LENGTH=36) " echo " or map the whole reads after QC (off, --MAP_LENGTH=0). [default: off]" - + @@ -237,7 +237,7 @@ while test $# -gt 0; do --ADD_B1*) export ADD_B1=`echo $1 | sed -e 's/^[^=]*=//g'` shift - ;; + ;; --ADD_B2*) export ADD_B2=`echo $1 | sed -e 's/^[^=]*=//g'` shift @@ -314,8 +314,8 @@ fi ## INPUT & Parameters # PE -if [[ "$SEQ" == "PE" ]] ; then - if [ -z "$SE_OUTPUT" ]; then +if [[ "$SEQ" == "PE" ]] ; then + if [ -z "$SE_OUTPUT" ]; then : else echo "-G and -P can only be used with -SE." @@ -336,19 +336,19 @@ if [[ "$SEQ" == "PE" ]] ; then tmp=${FQ_INPUT} FQ_INPUT=() #array for name in ${tmp} - do + do if [ ! -f ${name}_R1.fastq.gz ]; then echo "" echo "##################################################################################" echo " File ${name}_R1.fastq.gz was not found! Skip ${name}*.fastq.gz from analysis." echo " Paired-end files require identical prefix and end with _R1.fastq.gz and _R2.fastq.gz" echo " eg: PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz" - echo " Please make sure you have the correct file suffix! " + echo " Please make sure you have the correct file suffix! " echo "##################################################################################" echo "" else FQ_INPUT+=(${name}) # append to the array - #echo ${FQ_INPUT} + #echo ${FQ_INPUT} fi done @@ -412,6 +412,7 @@ if [ ! -d $TMPDIR ]; then fi # bash generate random 32 character alphanumeric string (upper and lowercase). tmp=`cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1` +TMPR=$TMPDIR TMPDIR=$TMPDIR/$tmp if [ -z "$thread" ]; then @@ -467,8 +468,8 @@ if [[ "$SEQ" == "PE" ]]; then elif [[ "$RNA3" == "R2_5prime" ]]; then RNA5="R1_5prime" fi - - if [[ "$RNA5" == "R1_5prime" || "$RNA5" == "R2_5prime" ]] ; then + + if [[ "$RNA5" == "R1_5prime" || "$RNA5" == "R2_5prime" ]] ; then : else echo "--RNA5 and --RNA3 value can only be R1_5prime or R2_5prime." @@ -485,13 +486,13 @@ if [ -z "$MAP5" ]; then fi -if [[ "${MAP5}" == "FALSE" && "$SEQ" == "SE" ]] ; then +if [[ "${MAP5}" == "FALSE" && "$SEQ" == "SE" ]] ; then echo "For single-end (SE), can only report the 5 prime end of reads (--map5=TRUE)" exit 1 fi -if [ "${MAP5}" == "TRUE" ] ; then +if [ "${MAP5}" == "TRUE" ] ; then : elif [ "${MAP5}" == "FALSE" ] ; then : @@ -510,8 +511,8 @@ done exec > >(tee ${OUTPUT}/proseq2.0_Run_${tmp}.log) exec 2>&1 ## Print out -echo " " -echo "Processing PRO-seq data ..." +echo " " +echo "Processing PRO-seq data ..." echo "Command line parameters:" ${INPUT_LINE} echo " " @@ -545,7 +546,7 @@ echo "chromInfo $CHINFO" if [[ "$SEQ" == "SE" ]]; then i=1 for name in ${FQ_INPUT[@]} - do + do echo "input file $i ${name}.fastq.gz" let "i++" done @@ -553,7 +554,7 @@ fi if [[ "$SEQ" == "PE" ]]; then i=1 for name in ${FQ_INPUT[@]} - do + do echo "input file pair $i ${name}_R1.fastq.gz, ${name}_R2.fastq.gz" let "i++" done @@ -570,7 +571,7 @@ echo "UMI2 barcode length $UMI2" echo "ADD_B1 length $ADD_B1" echo "ADD_B2 length $ADD_B2" echo "number of threads $thread" -if [[ ${UMI2} != 0 || ${UMI1} != 0 || ${Force_deduplicate} == "TRUE" ]]; then +if [[ ${UMI2} != 0 || ${UMI1} != 0 || ${Force_deduplicate} == "TRUE" ]]; then echo "Remove PCR duplicates TRUE" else echo "Remove PCR duplicates FALSE" @@ -588,15 +589,15 @@ fi ## DOIT! mkdir ${TMPDIR} -if [[ "$SEQ" == "PE" ]] ; then +if [[ "$SEQ" == "PE" ]] ; then ############################################# ## Preprocess data. Remove adapters. Trim. echo " " echo "Preprocessing fastq files:" mkdir ${TMPDIR}/noadapt mkdir ${TMPDIR}/passQC -if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then - dedup_L=30 +if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + dedup_L=30 mkdir ${TMPDIR}/noadapt/l${dedup_L} mkdir ${TMPDIR}/noadapt/l${dedup_L}_nodups fi @@ -616,74 +617,69 @@ for name in ${FQ_INPUT[@]} ## Remove adapter, UMI barcode, additional barcode, and low quality (q=20) base from 3prime end of reads. Keep read length >=15 after trimmming # Remove adapter - cutadapt -a ${ADAPT2} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R1.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R1.fastq ${old_name}_R1.fastq.gz & - cutadapt -a ${ADAPT1} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R2.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R2.fastq ${old_name}_R2.fastq.gz & - wait + cutadapt -a ${ADAPT2} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R1.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R1.fastq ${old_name}_R1.fastq.gz -j ${thread} + cutadapt -a ${ADAPT1} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R2.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R2.fastq ${old_name}_R2.fastq.gz -j ${thread} + echo 'Removed adapters' >> ${TMPDIR}/${name}.QC.log # Read1 # remove UMI2 and ADD_B2 from the 3 prime end of R1 n2=$[UMI2+ADD_B2] - cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim_R1.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq -q 20 & - cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R1.fastq --output=${TMPDIR}/${name}_q20trim_R1.fastq -q 20 & - wait - cat ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq | paste - - - - |LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq & + cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim_R1.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq -q 20 -j ${thread} + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R1.fastq --output=${TMPDIR}/${name}_q20trim_R1.fastq -q 20 -j ${thread} + cat ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq | paste - - - - | LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=${thread} -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq + echo 'Removed UMI2 and ADD_B2 from the 3 prime end of R1' >> ${TMPDIR}/${name}.QC.log # Read2 # remove UMI1 and ADD_B1 from the 3 prime end of R2 n1=$[UMI1+ADD_B1] - cutadapt --cut -${n1} --minimum-length=10 ${TMPDIR}/${name}_trim_R2.fastq --output=${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq -q 20 & - cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R2.fastq --output=${TMPDIR}/${name}_q20trim_R2.fastq -q 20 & - wait - cat ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq | paste - - - - | LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq & + cutadapt --cut -${n1} --minimum-length=10 ${TMPDIR}/${name}_trim_R2.fastq --output=${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq -q 20 -j ${thread} + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R2.fastq --output=${TMPDIR}/${name}_q20trim_R2.fastq -q 20 -j ${thread} + cat ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq | paste - - - - | LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=${thread} -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq + echo 'Removed UMI1 and ADD_B1 from the 3 prime end of R2' >> ${TMPDIR}/${name}.QC.log - wait echo 'Number of reads after adapter removal and QC:' >> ${TMPDIR}/${name}.QC.log echo R1 >> ${TMPDIR}/${name}.QC.log cat ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log echo R2 >> ${TMPDIR}/${name}.QC.log cat ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log - rm ${TMPDIR}/${name}_trim_R1.fastq ${TMPDIR}/${name}_untrim_R1.fastq ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq & - rm ${TMPDIR}/${name}_trim_R2.fastq ${TMPDIR}/${name}_untrim_R2.fastq ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq & - + rm ${TMPDIR}/${name}_trim_R1.fastq ${TMPDIR}/${name}_untrim_R1.fastq ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq + rm ${TMPDIR}/${name}_trim_R2.fastq ${TMPDIR}/${name}_untrim_R2.fastq ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq + ## Collapse reads using prinseq-lite.pl. if there are UMI barcodes or $Force_deduplicate=TRUE if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then # if there is UMI barcode, Will perform deduplicates with first ${dedup_L} bp # Remove PCR duplciates. - + # fastq with the first dedup_L nt - cat ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq & - cat ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq & - wait - + cat ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq + cat ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq + # deduplicate using the first dedup_L nt prinseq-lite.pl -fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq -fastq2 ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq -derep 1 -out_format 3 -out_bad null -out_good ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup -min_len 15 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd - rm ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq & + rm ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq # make a list of name from deduplicated fastq - cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt & - cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt & - cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt & - wait + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons.fastq # generate fastq from the list of name - seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq & - seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq & - seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_1_singletons.fastq & - seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_2_singletons.fastq & - wait - rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup*l${dedup_L}.txt & - + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_1_singletons.fastq + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_2_singletons.fastq + rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup*l${dedup_L}.txt + # trim the UMI and additional barcode after dereplicate - prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd & - prinseq-lite.pl -trim_left ${n2} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd & - wait + prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + prinseq-lite.pl -trim_left ${n2} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd rm ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq # min_len 15 prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq -fastq2 ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_QC_end 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd - rm ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq & + rm ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq echo 'Number of paired reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log - cat ${TMPDIR}/passQC/${name}_dedup_QC_end_1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log & + cat ${TMPDIR}/passQC/${name}_dedup_QC_end_1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log elif [[ ${Force_deduplicate} == "TRUE" ]]; then # if there is NO UMI barcode, Will perform deduplicates with whole legnth of reads # Remove PCR duplciates. @@ -712,31 +708,28 @@ for name in ${FQ_INPUT[@]} done -wait # if there is a data-set wide length cutoff map_L if [[ ${map_L} != 0 ]]; then mkdir ${TMPDIR}/passQC_length_${map_L} for f in ${TMPDIR}/passQC/*_QC_end_1.fastq do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 4-|rev` - cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_1.fastq & + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_1.fastq done for f in ${TMPDIR}/passQC/*_QC_end_2.fastq do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 4-|rev` - cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_2.fastq & + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_2.fastq done - wait -fi +fi -wait ## Cleanup. -rm -r ${TMPDIR}/noadapt -gzip ${TMPDIR}/passQC*/*.fastq +rm -r ${TMPDIR}/noadapt +gzip ${TMPDIR}/passQC*/*.fastq ############################################# ## Align reads. -if [[ ${map_L} != 0 ]]; then +if [[ ${map_L} != 0 ]]; then ToMapDir=${TMPDIR}/passQC_length_${map_L} else ToMapDir=${TMPDIR}/passQC @@ -767,8 +760,8 @@ if [[ ! -z "$PREMAP_BWAIDX" ]]; then bwa sampe -n 1 -f ${ToMapDir}/${name}_end.sam ${PREMAP_BWAIDX} ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz samtools view -bF 0x2 ${ToMapDir}/${name}_end.sam| samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam rm ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_end.sam - bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq - gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam done else #defaul use mem @@ -776,8 +769,8 @@ if [[ ! -z "$PREMAP_BWAIDX" ]]; then do bwa mem -k 19 -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz |\ samtools view -bF 0x2 | samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam - bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq - gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam done fi @@ -811,12 +804,10 @@ else #defaul use mem done fi - for name in ${QC_INPUT} do - cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} done -wait @@ -837,13 +828,13 @@ for f in ${TMPDIR}/*.sort.bam echo $j > ${OUTPUT}/${j}.align.log if [ "${RNA5}" == "R1_5prime" ] ; then if [ "${OPP}" == "FALSE" ] ; then - if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. Danko lab leChRO-Seq protocol is on the 5' of _R1 readl, same strand of R1 ($9) + if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. Danko lab leChRO-Seq protocol is on the 5' of _R1 readl, same strand of R1 ($9) bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$9}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$9}' | gzip > ${TMPDIR}/$j.bed.gz else ## report The 3' end of the RNA. Danko lab leChRO-Seq protocol is on the 5 prime of _R2 read, opposite strand of R2 (R2 strand $10, R1 strand $9) bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "-") {print $1,$6-1,$6,$7,$8,$9}; ($10 == "+") {print $1,$5,$5+1,$7,$8,$9}' | gzip > ${TMPDIR}/$j.bed.gz fi elif [ "${OPP}" == "TRUE" ] ; then - if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. + if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$10}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$10}' | gzip > ${TMPDIR}/$j.bed.gz else ## report The 3' end of the RNA. bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "-") {print $1,$6-1,$6,$7,$8,$10}; ($10 == "+") {print $1,$5,$5+1,$7,$8,$10}' | gzip > ${TMPDIR}/$j.bed.gz @@ -852,45 +843,43 @@ for f in ${TMPDIR}/*.sort.bam elif [ "${RNA5}" == "R2_5prime" ] ; then if [ "${OPP}" == "FALSE" ] ; then if [ "${MAP5}" == "TRUE" ] ; then #report the 5 prime end of RNA, in Engreitz data is 5 prime end of R2, same strand - bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$10}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$10}'|gzip > ${TMPDIR}/${j}.bed.gz + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$10}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$10}'|gzip > ${TMPDIR}/${j}.bed.gz else ## report the 3-prime end of the RNA, in Engreitz data is the 5' end of R1 read, but opposite strand bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$10}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$10}' |gzip > ${TMPDIR}/${j}.bed.gz fi elif [ "${OPP}" == "TRUE" ] ; then if [ "${MAP5}" == "TRUE" ] ; then #report the 5 prime end of RNA, in Engreitz data is 5 prime end of R2, same strand - bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$9}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$9}'|gzip > ${TMPDIR}/${j}.bed.gz + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$9}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$9}'|gzip > ${TMPDIR}/${j}.bed.gz else ## report the 3-prime end of the RNA, in Engreitz data is the 5' end of R1 read, but opposite strand bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$9}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$9}' |gzip > ${TMPDIR}/${j}.bed.gz fi fi fi - + echo 'Number of mappable reads:' >> ${OUTPUT}/${j}.align.log readCount=`zcat ${TMPDIR}/$j.bed.gz | grep "" -c` echo ${readCount} >> ${OUTPUT}/${j}.align.log - + ## Remove rRNA and reverse the strand (PRO-seq). zcat ${TMPDIR}/$j.bed.gz | grep "rRNA\|chrM" -v | grep "_" -v | sort-bed - | gzip > ${TMPDIR}/$j.nr.rs.bed.gz echo 'Number of mappable reads (excluding rRNA):' >> ${OUTPUT}/${j}.align.log echo `zcat ${TMPDIR}/$j.nr.rs.bed.gz | grep "" -c` >> ${OUTPUT}/${j}.align.log - + ## Convert to bedGraph ... Can't gzip these, unfortunately. bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand + > ${TMPDIR}/$j\_plus.bedGraph bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand - > ${TMPDIR}/$j\_minus.noinv.bedGraph - + ## Invert minus strand. cat ${TMPDIR}/$j\_minus.noinv.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,-1*$4}' > ${TMPDIR}/$j\_minus.bedGraph ## Invert read counts on the minus strand. - + ## normalized by RPM - cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph & - cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph & - wait + cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph + cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph ## Then to bigWig (nomalized and non-nomrmalized ones) - bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw - bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw - bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw - bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw -wait + bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw rm ${TMPDIR}/$j.nr.rs.bed.gz ${TMPDIR}/$j.bed.gz ${TMPDIR}/$j*.bedGraph done @@ -898,15 +887,15 @@ wait fi #*/ -if [[ "$SEQ" == "SE" ]] ; then +if [[ "$SEQ" == "SE" ]] ; then ############################################# ## Preprocess data. Remove adapters. Trim. echo " " echo "Preprocessing fastq files:" mkdir ${TMPDIR}/noadapt mkdir ${TMPDIR}/passQC -if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then - dedup_L=30 +if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + dedup_L=30 mkdir ${TMPDIR}/noadapt/l${dedup_L} mkdir ${TMPDIR}/noadapt/l${dedup_L}_nodups fi @@ -922,29 +911,27 @@ for name in ${FQ_INPUT[@]} ## Remove adapter, UMI barcode, additional barcode, and low quality (q=20) base from 3prime end of reads. Keep read length >=15 after trimmming # Remove adapter - cutadapt -a ${ADAPT_SE} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim.fastq --untrimmed-output=${TMPDIR}/${name}_untrim.fastq ${old_name}.fastq.gz + cutadapt -a ${ADAPT_SE} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim.fastq --untrimmed-output=${TMPDIR}/${name}_untrim.fastq ${old_name}.fastq.gz -j ${thread} # Read1 # remove UMI2 and ADD_B2 from the 3 prime end of R1 n2=$[UMI2+ADD_B2] n1=$[UMI1+ADD_B1] - cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved.fastq -q 20 & - cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim.fastq --output=${TMPDIR}/${name}_q20trim.fastq -q 20 & - wait - cat ${TMPDIR}/${name}_q20trim.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved.fastq | paste - - - - |LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt.fastq & + cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved.fastq -q 20 -j ${thread} + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim.fastq --output=${TMPDIR}/${name}_q20trim.fastq -q 20 -j ${thread} + cat ${TMPDIR}/${name}_q20trim.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved.fastq | paste - - - - |LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt.fastq - wait echo 'Number of reads after adapter removal and QC:' >> ${TMPDIR}/${name}.QC.log cat ${TMPDIR}/noadapt/${name}_noadapt.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log rm ${TMPDIR}/${name}_trim.fastq ${TMPDIR}/${name}_untrim.fastq ${TMPDIR}/${name}_q20trim.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved.fastq - + ## Collapse reads using prinseq-lite.pl. if there are UMI barcodes - if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then # Remove PCR duplciates. # fastq with the first dedup_L nt - cat ${TMPDIR}/noadapt/${name}_noadapt.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq + cat ${TMPDIR}/noadapt/${name}_noadapt.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq # deduplicate using the first dedup_L nt prinseq-lite.pl -fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq -derep 1 -out_format 3 -out_bad null -out_good ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup -min_len 15 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd rm ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq @@ -954,7 +941,7 @@ for name in ${FQ_INPUT[@]} rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup.fastq # generate fastq from the list of name - seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt # trim the UMI and additional barcode after dereplicate @@ -964,8 +951,8 @@ for name in ${FQ_INPUT[@]} rm ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved.fastq echo 'Number of reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log cat ${TMPDIR}/passQC/${name}_dedup_QC_end.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log - - elif [[ ${Force_deduplicate} == "TRUE" ]]; then + + elif [[ ${Force_deduplicate} == "TRUE" ]]; then # Remove PCR duplciates. prinseq-lite.pl -derep 1 -fastq ${TMPDIR}/noadapt/${name}_noadapt.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_withBarcode 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd # trim the UMI and additional barcode after dereplicate @@ -986,7 +973,6 @@ for name in ${FQ_INPUT[@]} fi cat ${OUTPUT}/${name}.prinseq-pcrDups.gd - wait done # if there is a data-set wide length cutoff map_L @@ -994,23 +980,20 @@ if [[ ${map_L} != 0 ]]; then mkdir ${TMPDIR}/passQC_length_${map_L} for f in ${TMPDIR}/passQC/*_QC_end.fastq do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 3-|rev` - cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end.fastq & + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end.fastq done - wait -fi - +fi -wait ## Cleanup. -rm -r ${TMPDIR}/noadapt -gzip ${TMPDIR}/passQC*/*.fastq +rm -r ${TMPDIR}/noadapt +gzip ${TMPDIR}/passQC*/*.fastq ############################################# ## Align reads. -if [[ ${map_L} != 0 ]]; then +if [[ ${map_L} != 0 ]]; then ToMapDir=${TMPDIR}/passQC_length_${map_L} else ToMapDir=${TMPDIR}/passQC @@ -1036,7 +1019,7 @@ if [[ ! -z "$PREMAP_BWAIDX" ]]; then do bwa mem -k 19 -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_end.fastq.gz |\ samtools view -bf 0x4 | samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam - bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq.gz + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq.gz gzip ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq.gz rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam done @@ -1050,7 +1033,7 @@ if [[ ! -z "$PREMAP_BWAIDX" ]]; then bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${PREMAP_BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz samtools view -bf 0x4 ${ToMapDir}/${name}_end.sam| samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam rm ${ToMapDir}/${name}_end.sam - bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq gzip ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam done @@ -1070,14 +1053,12 @@ if [[ -z "$DREG_MODEL" ]]; then do ## Align using BWA. bwa mem -k 19 -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ - samtools view -b -q 20 - | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + samtools view -b -q 20 - | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam done - wait for name in ${QC_INPUT} do - cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} ## Saves the sorted BAM in the output file. done - wait else #default use aln for name in ${QC_INPUT} do @@ -1086,49 +1067,40 @@ echo "bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz " bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ - bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz & + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz done - wait for name in ${QC_INPUT} do echo "samtools view -b -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam" - samtools view -b -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + samtools view -b -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam done - wait for name in ${QC_INPUT} do rm ${ToMapDir}/${name}_end.sam - cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} ## Saves the sorted BAM in the output file. done - wait fi else #dREG - echo "Aligning reads using the same parameters with dREG" + echo "Aligning reads using the same parameters with dREG" for name in ${QC_INPUT} do ## Align using BWA aln , same as the original dreg model. bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ - bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz & + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz done - - wait for name in ${QC_INPUT} do - samtools view -b -q 0 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + samtools view -b -q 0 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam done - - wait + for name in ${QC_INPUT} do rm ${ToMapDir}/${name}_end.sam - cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} ## Saves the sorted BAM in the output file. done - wait fi -wait - ## Cleanup find ${TMPDIR} -name "*.sort.bam" -size -1024k -delete @@ -1161,29 +1133,27 @@ for f in ${TMPDIR}/*.sort.bam echo 'Number of mappable reads:' >> ${OUTPUT}/${j}.align.log readCount=`zcat ${TMPDIR}/$j.bed.gz | grep "" -c` echo ${readCount} >> ${OUTPUT}/${j}.align.log - + ## Remove rRNA and reverse the strand (PRO-seq). zcat ${TMPDIR}/$j.bed.gz | grep "rRNA\|chrM" -v | grep "_" -v | sort-bed - | gzip > ${TMPDIR}/$j.nr.rs.bed.gz echo 'Number of mappable reads (excluding rRNA):' >> ${OUTPUT}/${j}.align.log echo `zcat ${TMPDIR}/$j.nr.rs.bed.gz | grep "" -c` >> ${OUTPUT}/${j}.align.log - + ## Convert to bedGraph ... Cannot gzip these, unfortunately. bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand + > ${TMPDIR}/$j\_plus.bedGraph bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand - > ${TMPDIR}/$j\_minus.noinv.bedGraph - + ## Invert minus strand. cat ${TMPDIR}/$j\_minus.noinv.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,-1*$4}' > ${TMPDIR}/$j\_minus.bedGraph ## Invert read counts on the minus strand. - + ## normalized by RPM - cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph & - cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph & - wait + cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph + cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph ## Then to bigWig (nomalized and non-nomrmalized ones) - bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw - bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw - wait - bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw - bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw + bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw # rm ${TMPDIR}/$j.nr.rs.bed.gz ${TMPDIR}/$j.bed.gz ${TMPDIR}/$j*.bedGraph done @@ -1208,5 +1178,4 @@ for f in ${TMPDIR}/*.sort.bam ############################################# ## CLEANUP! -#rm -Rf ${TMPDIR} - +rm -rf ${TMPR} diff --git a/proseq2.0_old.bsh b/proseq2.0_old.bsh new file mode 100644 index 0000000..c2be1fa --- /dev/null +++ b/proseq2.0_old.bsh @@ -0,0 +1,1211 @@ +#!/usr/bin/bash +# +INPUT_LINE=`echo $@` +while test $# -gt 0; do + case "$1" in + -h|--help) + echo "" + echo "Preprocesses and aligns PRO-seq data." + echo "" + echo "Takes PREFIX.fastq.gz (SE), PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz (PE)" + echo "or *.fastq.gz in the current working directory as input and writes" + echo "BAM and bigWig files as output to the user-assigned output-dir." + echo "The output bigWig files ending with _minus.bw or _plus.bw are raw read counts without normalization." + echo "The RPM normalized outputs end with a suffix of .rpm.bw." + echo "" + echo "Requirements in current working directory:" + echo "cutadapt 1.8.3, fastx_trimmer, seqtk, prinseq-lite.pl 0.20.2, bwa, samtools, bedtools, and bedGraphToBigWig." + echo "" + echo "bash proseq2.0.bsh [options]" + echo "" + echo "options:" + echo "" + echo "To get help:" + echo "-h, --help Show this brief help menu." + echo "" + echo "Required options:" + echo "-SE, --SEQ=SE Single-end sequencing." + echo "-PE, --SEQ=PE Paired-end sequencing." + echo "-i, --bwa-index=PATH Path to the BWA index of the target genome" + echo " (i.e., bwa index)." + echo "-c, --chrom-info=PATH Location of the chromInfo table." + echo "" + echo "I/O options:" + echo "-I, --fastq=PREFIX Prefix for input files." + echo " Paired-end files require identical prefix" + echo " and end with _R1.fastq.gz and _R2.fastq.gz" + echo " eg: PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz." + echo "-T, --tmp=PATH Path to a temporary storage directory." + echo "-O, --output-dir=DIR Specify a directory to store output in." + echo "" + echo "Required options for SE" + echo "-G, --SE_READ=RNA_5prime Single-end sequencing from 5' end of" + echo " nascent RNA, like GRO-seq." + echo "-P, --SE_READ=RNA_3prime Single-end sequencing from 3' end of" + echo " nascent RNA, like PRO-seq." + echo "" + echo "Options for PE" + echo "--RNA5=R1_5prime Specify the location of the 5' end of RNA" + echo " [default: R1_5prime]." + echo "--RNA3=R2_5prime Specify the location of the 3' end of RNA" + echo " [default: R2_5prime]." + echo " Available options: R1_5prime: the 5' end of R1 reads" + echo " R2_5prime: the 5' end of R2 reads" + echo "-5, --map5=TRUE Report the 5' end of RNA [default on, --map5=TRUE]." + echo "-3, --map5=FALSE Report the 3' end of RNA," + echo " only available for PE [default off, --map5=TRUE]." + echo "-s, --opposite-strand=TRUE" + echo " Enable this option if the RNA are at the different strand" + echo " as the reads set at RNA5 [default: disable]." + echo "" + echo "Optional operations:" + echo "--ADAPT_SE=TGGAATTCTCGGGTGCCAAGG" + echo " 3' adapter to be removed from the 3' end of SE reads." + echo " [default:TGGAATTCTCGGGTGCCAAGG]" + echo "--ADAPT1=GATCGTCGGACTGTAGAACTCTGAACG" + echo " 3' adapter to be removed from the 3' end of R2." + echo " [default:GATCGTCGGACTGTAGAACTCTGAACG]" + echo "--ADAPT2=AGATCGGAAGAGCACACGTCTGAACTC" + echo " 3' adapter to be removed from the 3' end of R1." + echo " [default:AGATCGGAAGAGCACACGTCTGAACTC]" + echo "" + echo "--UMI1=0 The length of UMI barcode on the 5' of R1 read. " + echo " [default: 0]" + echo "--UMI2=0 The length of UMI barcode on the 5' of R2 read. " + echo " [default: 0]" + echo "When UMI1 or UMI2 are set > 0, the pipeline will perform PCR deduplicate." + echo "" + echo "--Force_deduplicate=FALSE" + echo " When --Force_deduplicate=TRUE, it will force the pipeline to" + echo " perform PCR deduplicate even there is no UMI barcode" + echo " (i.e. UMI1=0 and UMI2=0). [default: FALSE]" + echo "--ADD_B1=0 The length of additional barcode that will be trimmed" + echo " on the 5' of R1 read. [default: 0]" + echo "--ADD_B2=0 The length of additional barcode that will be trimmed" + echo " on the 5' of R2 read. [default: 0]" + echo "--thread=1 Number of threads can be used [default: 1]" + echo "" + echo "-4DREG Using the pre-defined parameters to get the most reads" + echo " for dREG package. Please use this flag to make the bigWig" + echo " files compatible with dREG algorithm. [default: off, only available to SE] " + echo "-aln Use BWA-backtrack [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)]" + echo "-mem Use BWA-MEM [default: SE uses BWA-backtrack (aln), PE uses BWA-MEM (mem)]" + echo "--MAP_LENGTH Set a data-set wide length cutoff for mapping (e.g. --MAP_LENGTH=36) " + echo " or map the whole reads after QC (off, --MAP_LENGTH=0). [default: off]" + + + + + exit 0 + ;; + -SE) + export SEQ="SE" + shift + ;; + -PE) + export SEQ="PE" + shift + ;; + -i) + shift + if test $# -gt 0; then + export BWAIDX=$1 + else + echo "no BWA index specified" + exit 1 + fi + shift + ;; + --bwa-index*) + export BWAIDX=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -c) + shift + if test $# -gt 0; then + export CHINFO=$1 + else + echo "no chromInfo specified" + exit 1 + fi + shift + ;; + --chrom-info*) + export CHINFO=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -I) + shift + if test $# -gt 0; then + export FQ_INPUT=$1 + else + echo "no input prefix specified." + exit 1 + fi + shift + ;; + --fastq*) + export FQ_INPUT=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -T) + shift + if test $# -gt 0; then + export TMPDIR=$1 + else + echo "no temp folder specified." + exit 1 + fi + shift + ;; + --tmp*) + export TMPDIR=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -O) + shift + if test $# -gt 0; then + export OUTPUT=$1 + else + echo "no output dir specified." + exit 1 + fi + shift + ;; + --output-dir*) + export OUTPUT=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --thread*) + export thread=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --RNA5*) # report location of the 5 prime end of RNA + # acce + export RNA5=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --RNA3*) # report location of the 5 prime end of RNA + # acce + export RNA3=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -5) + export MAP5="TRUE" + shift + ;; + --map5*) + export MAP5=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -3) + export MAP5="FALSE" + shift + ;; + -s) + export OPP="TRUE" + shift + ;; + --opposite-strand*) + export OPP=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --ADAPT_SE*) + export ADAPT_SE=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --ADAPT1*) + export ADAPT1=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --ADAPT2*) + export ADAPT2=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --UMI1*) + export UMI1=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --UMI2*) + export UMI2=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --Force_deduplicate*) + export Force_deduplicate=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --ADD_B1*) + export ADD_B1=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --ADD_B2*) + export ADD_B2=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -G) + export SE_OUTPUT="G" + export RNA5="R1_5prime" + export OPP="FALSE" + #export MAP5="TRUE" + export SE_READ="RNA_5prime" + shift + ;; + -P) + export SE_OUTPUT="P" + export RNA3="R1_5prime" + export OPP="TRUE" + #export MAP5="TRUE" + export SE_READ="RNA_3prime" + shift + ;; + --SE_READ*) + export SE_READ=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + -4DREG*) + export DREG_MODEL=1 + shift + ;; + -mem*) + export mem=1 + shift + ;; + -aln*) + export aln=1 + shift + ;; + --MAP_LENGTH*) + export map_L=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + --PREMAP_BWAIDX*) + export PREMAP_BWAIDX=`echo $1 | sed -e 's/^[^=]*=//g'` + shift + ;; + *) + break + ;; + esac +done + + +## CHECK ARGUMENTS. +if [ -z "$BWAIDX" ]; then + echo "--bwa-index is required." + echo " use bash proseq2.0.bsh --help." + exit 1 +fi +if [ -z "$CHINFO" ]; then + echo "--chrom-info is required." + echo " use bash proseq2.0.bsh --help." + exit 1 +fi +if [ -z "$SEQ" ]; then + echo "Please specify the input data is SE or PE." + echo " use bash proseq2.0.bsh --help." + exit 1 +fi + +if [[ "$mem" == 1 && "$aln" == 1 ]]; then + echo "Please choose between -mem or -aln" + echo " use bash proseq2.0.bsh --help." + exit 1 +fi + +## INPUT & Parameters +# PE +if [[ "$SEQ" == "PE" ]] ; then + if [ -z "$SE_OUTPUT" ]; then + : + else + echo "-G and -P can only be used with -SE." + echo " use bash proseq2.0.bsh --help." + exit 1 + fi + + + if [[ "$FQ_INPUT" == "*.fastq.gz" ]]; then + FQ_INPUT=`ls *.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3-| cut -d _ -f 2- |rev | sort | uniq` + fi + if [ -z "$FQ_INPUT" ]; then + echo "No input files specified. Using *.fastq.gz" + FQ_INPUT=`ls *.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3-| cut -d _ -f 2- |rev | sort | uniq` + fi + + ## Check if input file end with _R1.fastq.gz or _R2.fastq.gz + tmp=${FQ_INPUT} + FQ_INPUT=() #array + for name in ${tmp} + do + if [ ! -f ${name}_R1.fastq.gz ]; then + echo "" + echo "##################################################################################" + echo " File ${name}_R1.fastq.gz was not found! Skip ${name}*.fastq.gz from analysis." + echo " Paired-end files require identical prefix and end with _R1.fastq.gz and _R2.fastq.gz" + echo " eg: PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz" + echo " Please make sure you have the correct file suffix! " + echo "##################################################################################" + echo "" + else + FQ_INPUT+=(${name}) # append to the array + #echo ${FQ_INPUT} + fi + done + +# SE +elif [[ "$SEQ" == "SE" ]] ; then + if [[ "$SE_READ" == "RNA_5prime" ]] ; then + SE_OUTPUT="G" + RNA5="R1_5prime" + OPP="FALSE" + #MAP5="TRUE" + elif [[ "$SE_READ" == "RNA_3prime" ]] ; then + SE_OUTPUT="P" + RNA3="R1_5prime" + OPP="TRUE" + #MAP5="TRUE" + fi + if [ -z "$SE_OUTPUT" ] ; then + echo "Please specify output format for SE [-G or -P]" + exit 1 + fi + if [[ "$FQ_INPUT" == "*.fastq.gz" ]]; then + FQ_INPUT=`ls *.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3-| rev` + elif [ -z "$FQ_INPUT" ]; then + echo "No input files specified. Using *.fastq.gz" + FQ_INPUT=`ls *.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3-| rev` + elif [[ "$FQ_INPUT" == *.fastq.gz ]]; then + FQ_INPUT=`echo $FQ_INPUT | rev | cut -d \. -f 3-| rev` + fi +else + echo "Please specify the input data is SE or PE." + echo " use bash proseq2.0.bsh --help." + exit 1 +fi + + +# Check input file number +if [[ ${#FQ_INPUT[@]} == 0 ]]; then # if the length of array is 0 + echo "##################################################################################" + echo " No files is in the correct format." + echo " Paired-end files require identical prefix and end with _R1.fastq.gz and _R2.fastq.gz" + echo " eg: PREFIX_R1.fastq.gz, PREFIX_R2.fastq.gz" + echo " Please make sure you have the correct file suffix! Aborting." + echo "##################################################################################" + exit 1 +fi + +if [ -z "$OUTPUT" ]; then + now=$(date +"%m_%d_%Y") + OUTPUT=./My_proseq_output_dir-${now} + echo No output path specified. Using ${OUTPUT} +fi + +if [ ! -d $OUTPUT ]; then + mkdir $OUTPUT +fi +if [ -z "$TMPDIR" ]; then + TMPDIR="./" +fi +if [ ! -d $TMPDIR ]; then + mkdir $TMPDIR +fi +# bash generate random 32 character alphanumeric string (upper and lowercase). +tmp=`cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1` +TMPDIR=$TMPDIR/$tmp + +if [ -z "$thread" ]; then + thread=1 +fi + +if [ -z "$ADAPT_SE" ]; then + ADAPT_SE="TGGAATTCTCGGGTGCCAAGG" +fi +if [ -z "$ADAPT2" ]; then + ADAPT2="AGATCGGAAGAGCACACGTCTGAACTC" +fi +if [ -z "$ADAPT1" ]; then + ADAPT1="GATCGTCGGACTGTAGAACTCTGAACG" +fi +if [ -z "$UMI1" ]; then + UMI1=0 +elif [[ $[UMI1] -lt 0 ]]; then + echo "UMI1 only take natural number" + exit 1 +fi +if [ -z "$UMI2" ]; then + UMI2=0 +elif [[ $[UMI2] -lt 0 ]]; then + echo "UMI2 only take natural number" + exit 1 +fi +if [ -z "$Force_deduplicate" ]; then + Force_deduplicate="FALSE" +elif [[ "$Force_deduplicate" == "TRUE" || "$Force_deduplicate" == "FALSE" ]] ; then +: +else + echo "Force_deduplicate only take TRUE or FALSE" + exit 1 +fi + +if [ -z "$ADD_B1" ]; then + ADD_B1=0 +fi +if [ -z "$ADD_B2" ]; then + ADD_B2=0 +fi +if [ -z "$map_L" ]; then + map_L=0 +fi + +if [[ "$SEQ" == "PE" ]]; then + if [[ -z "$RNA5" && -z "$RNA3" ]]; then + RNA5="R1_5prime" + fi + if [[ "$RNA3" == "R1_5prime" ]]; then + RNA5="R2_5prime" + elif [[ "$RNA3" == "R2_5prime" ]]; then + RNA5="R1_5prime" + fi + + if [[ "$RNA5" == "R1_5prime" || "$RNA5" == "R2_5prime" ]] ; then + : + else + echo "--RNA5 and --RNA3 value can only be R1_5prime or R2_5prime." + echo " use bash proseq2.0.bsh --help." + exit 1 + fi +fi + +if [ -z "$OPP" ]; then + OPP="FALSE" +fi +if [ -z "$MAP5" ]; then + MAP5="TRUE" +fi + + +if [[ "${MAP5}" == "FALSE" && "$SEQ" == "SE" ]] ; then + echo "For single-end (SE), can only report the 5 prime end of reads (--map5=TRUE)" + exit 1 +fi + + +if [ "${MAP5}" == "TRUE" ] ; then + : +elif [ "${MAP5}" == "FALSE" ] ; then + : +else + echo "--map5 value can only be TRUE or FALSE." + echo " use bash proseq2.0.bsh --help." + exit 1 +fi + +## Check all the bioinformatics tools can be called from current working directory. +for tool in cutadapt fastx_trimmer seqtk prinseq-lite.pl bwa samtools bedtools bedGraphToBigWig sort-bed +do command -v ${tool} >/dev/null 2>&1 || { echo >&2 ${tool}" is required. Please make sure you can call the bioinformatics tools from your current working directoryb. Aborting."; exit 1; } +done + +#exec 1>test.log 2>&1 +exec > >(tee ${OUTPUT}/proseq2.0_Run_${tmp}.log) +exec 2>&1 +## Print out +echo " " +echo "Processing PRO-seq data ..." +echo "Command line parameters:" ${INPUT_LINE} +echo " " + +if [[ ! -z "$DREG_MODEL" ]]; then + echo "dREG compatible mode Yes" +fi + +echo "SEQ $SEQ" + +if [[ "$SEQ" == "SE" ]]; then +echo "SE_OUTPUT $SE_OUTPUT" +echo "SE_READ $SE_READ" +fi + +if [[ "$SEQ" == "PE" ]]; then +echo "Location of 5' of RNA $RNA5" +echo "Location of 3' of RNA $RNA3" +fi + +echo "Report 5' ends $MAP5" +echo "Report opposite strand $OPP" + +echo "" +echo "Input files/ paths:" +if [[ ! -z "$PREMAP_BWAIDX" ]]; then +echo "bwa index for pre-mapping $PREMAP_BWAIDX" +fi + +echo "bwa index $BWAIDX" +echo "chromInfo $CHINFO" +if [[ "$SEQ" == "SE" ]]; then +i=1 +for name in ${FQ_INPUT[@]} + do +echo "input file $i ${name}.fastq.gz" + let "i++" +done +fi +if [[ "$SEQ" == "PE" ]]; then +i=1 +for name in ${FQ_INPUT[@]} + do +echo "input file pair $i ${name}_R1.fastq.gz, ${name}_R2.fastq.gz" + let "i++" +done +fi +echo "temp folder $TMPDIR" +echo "output-dir $OUTPUT" +echo " " +echo "Optional operations:" +echo "ADAPT_SE $ADAPT_SE" +echo "ADAPT1 $ADAPT1" +echo "ADAPT2 $ADAPT2" +echo "UMI1 barcode length $UMI1" +echo "UMI2 barcode length $UMI2" +echo "ADD_B1 length $ADD_B1" +echo "ADD_B2 length $ADD_B2" +echo "number of threads $thread" +if [[ ${UMI2} != 0 || ${UMI1} != 0 || ${Force_deduplicate} == "TRUE" ]]; then + echo "Remove PCR duplicates TRUE" +else + echo "Remove PCR duplicates FALSE" +fi +if [[ ${map_L} != 0 ]]; then + echo "Trim Read length to ${map_L}nt for mapping" +fi + +## Exits .. for debugging. +#exit 1 + +## Functions + + +## DOIT! +mkdir ${TMPDIR} + +if [[ "$SEQ" == "PE" ]] ; then +############################################# +## Preprocess data. Remove adapters. Trim. +echo " " +echo "Preprocessing fastq files:" +mkdir ${TMPDIR}/noadapt +mkdir ${TMPDIR}/passQC +if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + dedup_L=30 + mkdir ${TMPDIR}/noadapt/l${dedup_L} + mkdir ${TMPDIR}/noadapt/l${dedup_L}_nodups +fi + +for name in ${FQ_INPUT[@]} + do + old_name=${name} + name=`echo ${old_name} | rev | cut -d \/ -f 1 |rev` + ## read stats + echo 'Number of original input reads:' > ${TMPDIR}/${name}.QC.log + echo ${old_name}_R1.fastq.gz >> ${TMPDIR}/${name}.QC.log + zcat ${old_name}_R1.fastq.gz | grep @ -c >> ${TMPDIR}/${name}.QC.log + echo ${old_name}_R2.fastq.gz >> ${TMPDIR}/${name}.QC.log + zcat ${old_name}_R2.fastq.gz | grep @ -c >> ${TMPDIR}/${name}.QC.log + + + + ## Remove adapter, UMI barcode, additional barcode, and low quality (q=20) base from 3prime end of reads. Keep read length >=15 after trimmming + # Remove adapter + cutadapt -a ${ADAPT2} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R1.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R1.fastq ${old_name}_R1.fastq.gz & + cutadapt -a ${ADAPT1} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim_R2.fastq --untrimmed-output=${TMPDIR}/${name}_untrim_R2.fastq ${old_name}_R2.fastq.gz & + wait + + # Read1 + # remove UMI2 and ADD_B2 from the 3 prime end of R1 + n2=$[UMI2+ADD_B2] + cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim_R1.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq -q 20 & + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R1.fastq --output=${TMPDIR}/${name}_q20trim_R1.fastq -q 20 & + wait + cat ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq | paste - - - - |LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq & + + # Read2 + # remove UMI1 and ADD_B1 from the 3 prime end of R2 + n1=$[UMI1+ADD_B1] + cutadapt --cut -${n1} --minimum-length=10 ${TMPDIR}/${name}_trim_R2.fastq --output=${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq -q 20 & + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim_R2.fastq --output=${TMPDIR}/${name}_q20trim_R2.fastq -q 20 & + wait + cat ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq | paste - - - - | LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq & + + wait + echo 'Number of reads after adapter removal and QC:' >> ${TMPDIR}/${name}.QC.log + echo R1 >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + echo R2 >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + + rm ${TMPDIR}/${name}_trim_R1.fastq ${TMPDIR}/${name}_untrim_R1.fastq ${TMPDIR}/${name}_q20trim_R1.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved_R1.fastq & + rm ${TMPDIR}/${name}_trim_R2.fastq ${TMPDIR}/${name}_untrim_R2.fastq ${TMPDIR}/${name}_q20trim_R2.fastq ${TMPDIR}/${name}_trim.${n1}Nremoved_R2.fastq & + + + ## Collapse reads using prinseq-lite.pl. if there are UMI barcodes or $Force_deduplicate=TRUE + if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then # if there is UMI barcode, Will perform deduplicates with first ${dedup_L} bp + # Remove PCR duplciates. + + # fastq with the first dedup_L nt + cat ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq & + cat ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq & + wait + + # deduplicate using the first dedup_L nt + prinseq-lite.pl -fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq -fastq2 ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq -derep 1 -out_format 3 -out_bad null -out_good ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup -min_len 15 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}_R2.fastq & + # make a list of name from deduplicated fastq + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt & + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt & + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt & + wait + rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons.fastq + + # generate fastq from the list of name + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq & + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq & + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_1_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_1_singletons.fastq & + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_2_singletons_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_2_singletons.fastq & + wait + rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup*l${dedup_L}.txt & + + # trim the UMI and additional barcode after dereplicate + prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd & + prinseq-lite.pl -trim_left ${n2} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd & + wait + rm ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq -fastq2 ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_QC_end 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq & + echo 'Number of paired reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_dedup_QC_end_1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log & + + elif [[ ${Force_deduplicate} == "TRUE" ]]; then # if there is NO UMI barcode, Will perform deduplicates with whole legnth of reads + # Remove PCR duplciates. + prinseq-lite.pl -derep 1 -fastq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq -fastq2 ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_withBarcode 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # trim the UMI and additional barcode after dereplicate + prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + prinseq-lite.pl -trim_left ${n2} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_dedup_withBarcode_1.fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode_2.fastq + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq -fastq2 ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_QC_end 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_1.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved_2.fastq + echo 'Number of paired reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_dedup_QC_end_1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + else + # trim the additional barcode ${ADD_B1} and ${ADD_B2} WITHOUT dereplicate. If no barcode, prinseq-lite.pl will remove unpair reads and reads that are length 0 + prinseq-lite.pl -trim_left ${ADD_B1} -fastq ${TMPDIR}/noadapt/${name}_noadapt_R1.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_BarcodeRemoved_1 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + prinseq-lite.pl -trim_left ${ADD_B2} -fastq ${TMPDIR}/noadapt/${name}_noadapt_R2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_BarcodeRemoved_2 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_BarcodeRemoved_1.fastq -fastq2 ${TMPDIR}/passQC/${name}_BarcodeRemoved_2.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_QC_end 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_BarcodeRemoved_1.fastq ${TMPDIR}/passQC/${name}_BarcodeRemoved_2.fastq + echo 'Number of paired reads after final QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_QC_end_1.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + fi + + cat ${OUTPUT}/${name}.prinseq-pcrDups.gd + + done + +wait +# if there is a data-set wide length cutoff map_L +if [[ ${map_L} != 0 ]]; then + mkdir ${TMPDIR}/passQC_length_${map_L} + for f in ${TMPDIR}/passQC/*_QC_end_1.fastq + do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 4-|rev` + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_1.fastq & + done + for f in ${TMPDIR}/passQC/*_QC_end_2.fastq + do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 4-|rev` + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end_2.fastq & + done + wait +fi + +wait +## Cleanup. +rm -r ${TMPDIR}/noadapt +gzip ${TMPDIR}/passQC*/*.fastq + + +############################################# +## Align reads. + +if [[ ${map_L} != 0 ]]; then + ToMapDir=${TMPDIR}/passQC_length_${map_L} +else + ToMapDir=${TMPDIR}/passQC +fi +echo "Reads for mapping is from : $ToMapDir" + +QC_INPUT=`ls ${ToMapDir}/*_QC_end_1.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3- | cut -d _ -f 2- | rev| sort | uniq` + +if [[ ${#QC_INPUT} == 0 ]]; then + echo "#########################################" + echo " Something went wrong with Preprocess." + echo " No fastq.gz files passed." + echo " Aborting." + echo "#########################################" + exit 1 +fi + +# premap if PREMAP_BWAIDX is given +if [[ ! -z "$PREMAP_BWAIDX" ]]; then + echo " " + echo "Pre-Mapping reads start ..." + mkdir ${TMPDIR}/Unmapped_to_premap + if [[ ! -z "$aln" ]]; then + for name in ${QC_INPUT} + do + bwa aln -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_1.fastq.gz > ${TMPDIR}/${name}_aln_sa1.sai + bwa aln -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_2.fastq.gz > ${TMPDIR}/${name}_aln_sa2.sai + bwa sampe -n 1 -f ${ToMapDir}/${name}_end.sam ${PREMAP_BWAIDX} ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz + samtools view -bF 0x2 ${ToMapDir}/${name}_end.sam| samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + rm ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_end.sam + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + done + else #defaul use mem + for name in ${QC_INPUT} + do + bwa mem -k 19 -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz |\ + samtools view -bF 0x2 | samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq -fq2 ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + gzip ${TMPDIR}/Unmapped_to_premap/${name}_1.fastq ${TMPDIR}/Unmapped_to_premap/${name}_2.fastq + rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + done + fi + ToMapDir=${TMPDIR}/Unmapped_to_premap + echo " " + echo "Pre-Mapping reads end" +fi + +# main map +echo " " +echo "Mapping reads:" +echo "Reads for mapping is from : $ToMapDir" + +if [[ ! -z "$aln" ]]; then # elif -aln was given, use aln + for name in ${QC_INPUT} + do + ## Align using BWA aln + bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_1.fastq.gz > ${TMPDIR}/${name}_aln_sa1.sai + bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_2.fastq.gz > ${TMPDIR}/${name}_aln_sa2.sai + bwa sampe -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz + + samtools view -bf 0x2 -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam + rm ${TMPDIR}/${name}_aln_sa1.sai ${TMPDIR}/${name}_aln_sa2.sai ${ToMapDir}/${name}_end.sam + done +else #defaul use mem + for name in ${QC_INPUT} + do + ## Align using BWA. + bwa mem -k 19 -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_1.fastq.gz ${ToMapDir}/${name}_2.fastq.gz | \ + samtools view -bf 0x2 -q 20 - | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam + done +fi + + +for name in ${QC_INPUT} + do + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & +done +wait + + + + + + + +## Cleanup +find ${TMPDIR} -name "*.sort.bam" -size -1024k -delete + +############################################# +## Write out the bigWigs. +echo " " +echo "Writing bigWigs:" +for f in ${TMPDIR}/*.sort.bam + do + j=`echo $f | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3- |rev ` + echo $j > ${OUTPUT}/${j}.align.log + if [ "${RNA5}" == "R1_5prime" ] ; then + if [ "${OPP}" == "FALSE" ] ; then + if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. Danko lab leChRO-Seq protocol is on the 5' of _R1 readl, same strand of R1 ($9) + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$9}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$9}' | gzip > ${TMPDIR}/$j.bed.gz + else ## report The 3' end of the RNA. Danko lab leChRO-Seq protocol is on the 5 prime of _R2 read, opposite strand of R2 (R2 strand $10, R1 strand $9) + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "-") {print $1,$6-1,$6,$7,$8,$9}; ($10 == "+") {print $1,$5,$5+1,$7,$8,$9}' | gzip > ${TMPDIR}/$j.bed.gz + fi + elif [ "${OPP}" == "TRUE" ] ; then + if [ "${MAP5}" == "TRUE" ] ; then ## report The 5' end of the RNA. + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$10}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$10}' | gzip > ${TMPDIR}/$j.bed.gz + else ## report The 3' end of the RNA. + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "-") {print $1,$6-1,$6,$7,$8,$10}; ($10 == "+") {print $1,$5,$5+1,$7,$8,$10}' | gzip > ${TMPDIR}/$j.bed.gz + fi + fi + elif [ "${RNA5}" == "R2_5prime" ] ; then + if [ "${OPP}" == "FALSE" ] ; then + if [ "${MAP5}" == "TRUE" ] ; then #report the 5 prime end of RNA, in Engreitz data is 5 prime end of R2, same strand + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$10}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$10}'|gzip > ${TMPDIR}/${j}.bed.gz + else ## report the 3-prime end of the RNA, in Engreitz data is the 5' end of R1 read, but opposite strand + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$10}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$10}' |gzip > ${TMPDIR}/${j}.bed.gz + fi + elif [ "${OPP}" == "TRUE" ] ; then + if [ "${MAP5}" == "TRUE" ] ; then #report the 5 prime end of RNA, in Engreitz data is 5 prime end of R2, same strand + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($10 == "+") {print $1,$5,$5+1,$7,$8,$9}; ($10 == "-") {print $1,$6-1,$6,$7,$8,$9}'|gzip > ${TMPDIR}/${j}.bed.gz + else ## report the 3-prime end of the RNA, in Engreitz data is the 5' end of R1 read, but opposite strand + bedtools bamtobed -bedpe -mate1 -i $f 2> ${TMPDIR}/${j}.kill.warnings | awk 'BEGIN{OFS="\t"} ($9 == "+") {print $1,$2,$2+1,$7,$8,$9}; ($9 == "-") {print $1,$3-1,$3,$7,$8,$9}' |gzip > ${TMPDIR}/${j}.bed.gz + fi + fi + fi + + echo 'Number of mappable reads:' >> ${OUTPUT}/${j}.align.log + readCount=`zcat ${TMPDIR}/$j.bed.gz | grep "" -c` + echo ${readCount} >> ${OUTPUT}/${j}.align.log + + ## Remove rRNA and reverse the strand (PRO-seq). + zcat ${TMPDIR}/$j.bed.gz | grep "rRNA\|chrM" -v | grep "_" -v | sort-bed - | gzip > ${TMPDIR}/$j.nr.rs.bed.gz + echo 'Number of mappable reads (excluding rRNA):' >> ${OUTPUT}/${j}.align.log + echo `zcat ${TMPDIR}/$j.nr.rs.bed.gz | grep "" -c` >> ${OUTPUT}/${j}.align.log + + ## Convert to bedGraph ... Can't gzip these, unfortunately. + bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand + > ${TMPDIR}/$j\_plus.bedGraph + bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand - > ${TMPDIR}/$j\_minus.noinv.bedGraph + + ## Invert minus strand. + cat ${TMPDIR}/$j\_minus.noinv.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,-1*$4}' > ${TMPDIR}/$j\_minus.bedGraph ## Invert read counts on the minus strand. + + ## normalized by RPM + cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph & + cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph & + wait + ## Then to bigWig (nomalized and non-nomrmalized ones) + bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw +wait + + rm ${TMPDIR}/$j.nr.rs.bed.gz ${TMPDIR}/$j.bed.gz ${TMPDIR}/$j*.bedGraph + done + +fi #*/ + + +if [[ "$SEQ" == "SE" ]] ; then +############################################# +## Preprocess data. Remove adapters. Trim. +echo " " +echo "Preprocessing fastq files:" +mkdir ${TMPDIR}/noadapt +mkdir ${TMPDIR}/passQC +if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + dedup_L=30 + mkdir ${TMPDIR}/noadapt/l${dedup_L} + mkdir ${TMPDIR}/noadapt/l${dedup_L}_nodups +fi + +for name in ${FQ_INPUT[@]} + do + old_name=${name} + name=`echo ${old_name} | rev | cut -d \/ -f 1| rev` + ## read stats + echo ${old_name} > ${TMPDIR}/${name}.QC.log + echo 'Number of original input reads:' >> ${TMPDIR}/${name}.QC.log + zcat ${old_name}.fastq.gz | grep @ -c >> ${TMPDIR}/${name}.QC.log + + ## Remove adapter, UMI barcode, additional barcode, and low quality (q=20) base from 3prime end of reads. Keep read length >=15 after trimmming + # Remove adapter + cutadapt -a ${ADAPT_SE} -e 0.10 --overlap 2 --output=${TMPDIR}/${name}_trim.fastq --untrimmed-output=${TMPDIR}/${name}_untrim.fastq ${old_name}.fastq.gz + # Read1 + # remove UMI2 and ADD_B2 from the 3 prime end of R1 + n2=$[UMI2+ADD_B2] + n1=$[UMI1+ADD_B1] + cutadapt --cut -${n2} --minimum-length=10 ${TMPDIR}/${name}_trim.fastq --output=${TMPDIR}/${name}_trim.${n2}Nremoved.fastq -q 20 & + cutadapt --minimum-length=10 ${TMPDIR}/${name}_untrim.fastq --output=${TMPDIR}/${name}_q20trim.fastq -q 20 & + wait + cat ${TMPDIR}/${name}_q20trim.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved.fastq | paste - - - - |LC_ALL=C sort --temporary-directory=${TMPDIR} --parallel=10 -k1,1 -S 10G | tr '\t' '\n' > ${TMPDIR}/noadapt/${name}_noadapt.fastq & + + wait + echo 'Number of reads after adapter removal and QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/noadapt/${name}_noadapt.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + + + rm ${TMPDIR}/${name}_trim.fastq ${TMPDIR}/${name}_untrim.fastq ${TMPDIR}/${name}_q20trim.fastq ${TMPDIR}/${name}_trim.${n2}Nremoved.fastq + + + ## Collapse reads using prinseq-lite.pl. if there are UMI barcodes + if [[ ${UMI2} != 0 || ${UMI1} != 0 ]]; then + # Remove PCR duplciates. + # fastq with the first dedup_L nt + cat ${TMPDIR}/noadapt/${name}_noadapt.fastq | fastx_trimmer -Q33 -l ${dedup_L} -o ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq + # deduplicate using the first dedup_L nt + prinseq-lite.pl -fastq ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq -derep 1 -out_format 3 -out_bad null -out_good ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup -min_len 15 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/noadapt/l${dedup_L}/${name}_noadapt_l${dedup_L}.fastq + + # make a list of name from deduplicated fastq + cat ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup.fastq | awk '(NR%4==1){print substr($1,2)}' > ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt + rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup.fastq + + # generate fastq from the list of name + seqtk subseq ${TMPDIR}/noadapt/${name}_noadapt.fastq ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt > ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq + rm ${TMPDIR}/noadapt/l${dedup_L}_nodups/${name}_dedup_l${dedup_L}.txt + + # trim the UMI and additional barcode after dereplicate + prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_QC_end 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved.fastq + echo 'Number of reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_dedup_QC_end.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + + elif [[ ${Force_deduplicate} == "TRUE" ]]; then + # Remove PCR duplciates. + prinseq-lite.pl -derep 1 -fastq ${TMPDIR}/noadapt/${name}_noadapt.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_withBarcode 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # trim the UMI and additional barcode after dereplicate + prinseq-lite.pl -trim_left ${n1} -fastq ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_dedup_QC_end 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_dedup_withBarcode.fastq ${TMPDIR}/passQC/${name}_dedup_BarcodeRemoved.fastq + echo 'Number of reads after PCR duplicates removal and QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_dedup_QC_end.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + else + # trim the additional barcode WITHOUT dereplicate. If no barcode, prinseq-lite.pl will remove unpair reads and reads that are length 0 + prinseq-lite.pl -trim_left ${ADD_B1} -fastq ${TMPDIR}/noadapt/${name}_noadapt.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_BarcodeRemoved 2>> ${OUTPUT}/${name}.prinseq-pcrDups.gd + # min_len 15 + prinseq-lite.pl -min_len 15 -fastq ${TMPDIR}/passQC/${name}_BarcodeRemoved.fastq -out_format 3 -out_bad null -out_good ${TMPDIR}/passQC/${name}_QC_end 2> ${OUTPUT}/${name}.prinseq-pcrDups.gd + rm ${TMPDIR}/passQC/${name}_BarcodeRemoved.fastq + echo 'Number of reads after final QC:' >> ${TMPDIR}/${name}.QC.log + cat ${TMPDIR}/passQC/${name}_QC_end.fastq | grep @ -c >> ${TMPDIR}/${name}.QC.log + fi + + cat ${OUTPUT}/${name}.prinseq-pcrDups.gd + wait +done + +# if there is a data-set wide length cutoff map_L +if [[ ${map_L} != 0 ]]; then + mkdir ${TMPDIR}/passQC_length_${map_L} + for f in ${TMPDIR}/passQC/*_QC_end.fastq + do name=`echo $f |rev |cut -d / -f 1 |cut -d _ -f 3-|rev` + cat ${f} | fastx_trimmer -Q33 -l ${map_L} -o ${TMPDIR}/passQC_length_${map_L}/${name}_l${map_L}_QC_end.fastq & + done + wait +fi + + +wait +## Cleanup. +rm -r ${TMPDIR}/noadapt +gzip ${TMPDIR}/passQC*/*.fastq + + +############################################# +## Align reads. + + +if [[ ${map_L} != 0 ]]; then + ToMapDir=${TMPDIR}/passQC_length_${map_L} +else + ToMapDir=${TMPDIR}/passQC +fi + +QC_INPUT=`ls ${ToMapDir}/*_QC_end.fastq.gz | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3- | cut -d _ -f 2- | rev| sort | uniq` + +if [[ ${#QC_INPUT} == 0 ]]; then + echo "#########################################" + echo " Something went wrong with Preprocess." + echo " No fastq.gz files passed." + echo " Aborting." + echo "#########################################" + exit 1 +fi + +# premap if PREMAP_BWAIDX is given +if [[ ! -z "$PREMAP_BWAIDX" ]]; then + echo "Pre-mapping start ..." + mkdir ${TMPDIR}/Unmapped_to_premap + if [[ ! -z "$mem" ]]; then + for name in ${QC_INPUT} + do + bwa mem -k 19 -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_end.fastq.gz |\ + samtools view -bf 0x4 | samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq.gz + gzip ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq.gz + rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + done + else # default use aln + for name in ${QC_INPUT} + do + echo " bwa aln -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${PREMAP_BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz" + + bwa aln -t 1 ${PREMAP_BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${PREMAP_BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz + samtools view -bf 0x4 ${ToMapDir}/${name}_end.sam| samtools sort -n -@ 1 - > ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + rm ${ToMapDir}/${name}_end.sam + bamToFastq -i ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam -fq ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq + gzip ${TMPDIR}/Unmapped_to_premap/${name}_end.fastq + rm ${TMPDIR}/Unmapped_to_premap/${name}_unmapped.bam + done + fi + echo "Pre-mapping end ..." + ToMapDir=${TMPDIR}/Unmapped_to_premap +fi + + +echo " " +echo "Mapping reads:" +echo "Reads for mapping is from : $ToMapDir" +# main map +if [[ -z "$DREG_MODEL" ]]; then + if [[ ! -z "$mem" ]]; then #use mem + for name in ${QC_INPUT} + do + ## Align using BWA. + bwa mem -k 19 -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + samtools view -b -q 20 - | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + done + wait + for name in ${QC_INPUT} + do + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + done + wait + else #default use aln + for name in ${QC_INPUT} + do + ## Align using BWA aln +echo "bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz " + + bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz & + done + wait + for name in ${QC_INPUT} + do + echo "samtools view -b -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam" + samtools view -b -q 20 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + done + wait + for name in ${QC_INPUT} + do + rm ${ToMapDir}/${name}_end.sam + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + done + wait + fi +else #dREG + echo "Aligning reads using the same parameters with dREG" + for name in ${QC_INPUT} + do + ## Align using BWA aln , same as the original dreg model. + bwa aln -t ${thread} ${BWAIDX} ${ToMapDir}/${name}_end.fastq.gz | \ + bwa samse -n 1 -f ${ToMapDir}/${name}_end.sam ${BWAIDX} - ${ToMapDir}/${name}_end.fastq.gz & + done + + wait + + for name in ${QC_INPUT} + do + samtools view -b -q 0 ${ToMapDir}/${name}_end.sam | samtools sort -n -@ ${thread} - > ${TMPDIR}/${name}.sort.bam & + done + + wait + for name in ${QC_INPUT} + do + rm ${ToMapDir}/${name}_end.sam + cp ${TMPDIR}/${name}.sort.bam ${OUTPUT} & ## Saves the sorted BAM in the output file. + done + wait +fi + +wait + + +## Cleanup +find ${TMPDIR} -name "*.sort.bam" -size -1024k -delete + + +############################################# +## Write out the bigWigs. +echo " " +echo "Writing bigWigs:" +for f in ${TMPDIR}/*.sort.bam + do +#*/ + j=`echo $f | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3- |rev ` + echo $j > ${OUTPUT}/${j}.align.log + +# in SE, MAP5 alwasys TRUE + + + #if [[ "${RNA5}" == "R1_5prime" && "${OPP}" == "FALSE" ]] ; then ## report The 5 prime end of the RNA. #like GRO-seq + if [[ "$SE_OUTPUT" == "G" ]] ; then + bedtools bamtobed -i $f 2> ${TMPDIR}/kill.warnings| awk 'BEGIN{OFS="\t"} ($5 > 0){print $0}' | \ + awk 'BEGIN{OFS="\t"} ($6 == "+") {print $1,$2,$2+1,$4,$5,$6}; ($6 == "-") {print $1,$3-1,$3,$4,$5,$6}' | gzip > ${TMPDIR}/$j.bed.gz + #elif [[ "${RNA3}" == "R1_5prime" && "${OPP}" == "TRUE" ]] ; then #like PRO-seq + elif [[ "$SE_OUTPUT" == "P" ]] ; then + bedtools bamtobed -i $f 2> ${TMPDIR}/kill.warnings| awk 'BEGIN{OFS="\t"} ($5 > 0){print $0}' | \ + awk 'BEGIN{OFS="\t"} ($6 == "+") {print $1,$2,$2+1,$4,$5,"-"}; ($6 == "-") {print $1,$3-1,$3,$4,$5,"+"}' | gzip > ${TMPDIR}/$j.bed.gz + fi + + + echo 'Number of mappable reads:' >> ${OUTPUT}/${j}.align.log + readCount=`zcat ${TMPDIR}/$j.bed.gz | grep "" -c` + echo ${readCount} >> ${OUTPUT}/${j}.align.log + + ## Remove rRNA and reverse the strand (PRO-seq). + zcat ${TMPDIR}/$j.bed.gz | grep "rRNA\|chrM" -v | grep "_" -v | sort-bed - | gzip > ${TMPDIR}/$j.nr.rs.bed.gz + echo 'Number of mappable reads (excluding rRNA):' >> ${OUTPUT}/${j}.align.log + echo `zcat ${TMPDIR}/$j.nr.rs.bed.gz | grep "" -c` >> ${OUTPUT}/${j}.align.log + + ## Convert to bedGraph ... Cannot gzip these, unfortunately. + bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand + > ${TMPDIR}/$j\_plus.bedGraph + bedtools genomecov -bg -i ${TMPDIR}/$j.nr.rs.bed.gz -g ${CHINFO} -strand - > ${TMPDIR}/$j\_minus.noinv.bedGraph + + ## Invert minus strand. + cat ${TMPDIR}/$j\_minus.noinv.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,-1*$4}' > ${TMPDIR}/$j\_minus.bedGraph ## Invert read counts on the minus strand. + + ## normalized by RPM + cat ${TMPDIR}/$j\_plus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_plus.rpm.bedGraph & + cat ${TMPDIR}/$j\_minus.bedGraph | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4*1000*1000/'$readCount'/1}' > ${TMPDIR}/$j\_minus.rpm.bedGraph & + wait + ## Then to bigWig (nomalized and non-nomrmalized ones) + bedGraphToBigWig ${TMPDIR}/$j\_plus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.rpm.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.rpm.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.rpm.bw + wait + bedGraphToBigWig ${TMPDIR}/$j\_plus.bedGraph ${CHINFO} ${OUTPUT}/$j\_plus.bw + bedGraphToBigWig ${TMPDIR}/$j\_minus.bedGraph ${CHINFO} ${OUTPUT}/$j\_minus.bw +# rm ${TMPDIR}/$j.nr.rs.bed.gz ${TMPDIR}/$j.bed.gz ${TMPDIR}/$j*.bedGraph + done + + + +fi + +echo "QC" > ${OUTPUT}/proseq2.0_read_report_${tmp}.log +for old_name in ${FQ_INPUT[@]} + do + name=`echo ${old_name} | rev | cut -d \/ -f 1 |rev` + cat ${TMPDIR}/${name}.QC.log >> ${OUTPUT}/proseq2.0_read_report_${tmp}.log + done +echo "" >> ${OUTPUT}/proseq2.0_read_report_${tmp}.log +echo "Mapping" >> ${OUTPUT}/proseq2.0_read_report_${tmp}.log +for f in ${TMPDIR}/*.sort.bam + do j=`echo $f | awk -F"/" '{print $NF}' | rev | cut -d \. -f 3- |rev ` + cat ${OUTPUT}/${j}.align.log >> ${OUTPUT}/proseq2.0_read_report_${tmp}.log + done + + + +############################################# +## CLEANUP! +#rm -Rf ${TMPDIR}