IMPORTANT:
- The Singularity image is now in a shared location (/hpc/apps/singularity/images). To use it, edit workflow_opts/singularity.json and change the line

  "singularity_container" : "~/.singularity/chip-seq-pipeline-v1.1.6.simg"

  to

  "singularity_container" : "/hpc/apps/singularity/images/chip-seq-pipeline-v1.1.6.simg"
- If your users want to try the test samples suggested on the chip-seq-pipeline2 SGE tutorial page, they'll want to use either SGE Singularity script, then make the following changes to it (a sketch of the edited script follows this list):
- Around line 7 or 8, add a "#$ -cwd" to deposit job output in the directory the job is submitted from
- After the "module load java" line, add "module load singularity/2.5.2"
- After "module load singularity/2.5.2", add the line "CROMWELL='/hpc/apps/cromwell/34/lib/cromwell.jar'"
- Change the word "shm" to "smp" everywhere in the script
- In the line at the end of the file, change "$HOME/cromwell-34.jar" to "$CROMWELL"
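
Applied together, those edits give a script whose top and bottom look roughly like the sketch below. The directive order and resource requests are illustrative; keep whatever the original script already contains and only apply the changes listed above.

```bash
#!/bin/bash
#$ -S /bin/bash
#$ -cwd                          # added around line 7 or 8: job output goes to the submission directory
#$ -pe smp 2                     # every "shm" in the script becomes "smp"
# ... remaining #$ directives from the original script ...

module load java
module load singularity/2.5.2    # added after "module load java"
CROMWELL='/hpc/apps/cromwell/34/lib/cromwell.jar'

# ... script body unchanged ...
# In the java invocation on the last line, "$HOME/cromwell-34.jar" becomes "$CROMWELL".
```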
The file "test_genome_database/hg38_chr19_chrM_local.tsv refers to a non-existent file "test_genome_database/hg38_chr19_chrM/hg38.chrom.sizes". Either change the filename to "hg38_chr19_chrM.chrom.sizes", or copy test_genome_database/hg38_chr19_chrM/hg38_chr19_chrM.chrom.sizes to test_genome_database/hg38_chr19_chrM/hg38.chrom.sizes.
- There are example JSON and batch files in the example_HPC folder.
- For long paired-end reads, use "bwa mem" for the alignment. The commands are shown below:
```bash
/common/genomics-core/anaconda2/bin/bwa mem -M -t 10 \
  /home/wangyiz/genomics/apps/chip-seq-pipeline2/genome/GRCh38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
  ./$1_R1.fastq.gz ./$1_R2.fastq.gz > $1.sam   # the reference genome must match the one used in the pipeline; be careful with the parameter settings
samtools view -b -S $1.sam > $1.bam
samtools sort --output-fmt BAM -@ 10 -n -o $1.sorted.bam $1.bam   # the sorted BAM can be used as input for the pipeline
```
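Here $1 is a sample prefix supplied to a wrapper script; assuming the three commands above are saved as align.sh (a hypothetical name), a run would look like:

```bash
bash align.sh sampleA   # expects ./sampleA_R1.fastq.gz and ./sampleA_R2.fastq.gz; produces sampleA.sorted.bam
```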
This ChIP-Seq pipeline is based on the ENCODE (phase-3) transcription factor and histone ChIP-seq pipeline specifications (by Anshul Kundaje) in this Google Doc.
- Flexibility: Support for docker, singularity and Conda.
- Portability: Support for many cloud platforms (Google/DNAnexus) and cluster engines (SLURM/SGE/PBS).
- Resumability: Resume a failed workflow from where it left off.
- User-friendly HTML report: tabulated quality metrics including alignment/peak statistics and FRiP along with many useful plots (IDR/cross-correlation measures).
- Genomes: Pre-built database for GRCh38, hg19, mm10, mm9 and additional support for custom genomes.
This pipeline supports many cloud platforms and cluster engines. It also supports docker, singularity and Conda to resolve complicated software dependencies for the pipeline. The tutorial-based instructions for each platform explain how to run the pipeline. There are special instructions for two major Stanford HPC servers (SCG4 and Sherlock).
- Cloud platforms
- Web interface
- CLI (command line interface)
- Stanford HPC servers (CLI)
- Cluster engines (CLI)
- Local computers (CLI)
Output directory specification
There are some useful tools to post-process outputs of the pipeline.
This tool recursively finds and parses all qc.json files (the pipeline's final output) under a specified root directory. It generates a TSV file with all quality metrics tabulated in rows, one per experiment and replicate. It also estimates the overall quality of a sample against a criteria-definition JSON file, which can serve as a good guideline for QC'ing experiments.
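As an illustration of what it operates on (a plain find, not the tool's own CLI), the files it collects are:

```bash
find /path/to/pipeline/outputs -name qc.json   # hypothetical root directory
```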
This tool parses the metadata JSON file from a previously failed workflow and generates a new input JSON file to restart the pipeline from where it left off.
This tool downloads any type of data (FASTQ, BAM, PEAK, ...) from the ENCODE portal. It also generates a metadata JSON file per experiment, which is very useful for making an input JSON file for the pipeline.