
hinosima

High-Noise Single-Molecule sequence Assembly pipeline

Description

A set of tools to perform an assembly from high-noise single-molecule sequences (PacBio or Nanopore).

Contents

  • src/snk_canupipe.py: Runs correction of the PacBio reads and performs a Canu assembly.
  • src/snk.quiver2.3.py: Polishes the Canu assembly using Quiver 2.3.
  • src/snk.quiver3.0.py: Polishes the Canu assembly using Quiver 3.0.
  • src/snk.pilon.py: Polishes the Canu assembly using Pilon 1.20.

snk_canupipe

First, it uses colormap or proovread to correct the PacBio reads (long reads) with Illumina reads (short reads). Since this correction takes a long time (colormap runs bwa twice during the process; proovread is faster but still slow), the script splits the PacBio reads (by default into 999 sub-files), runs the corrections in parallel, and finally merges the results.

Second, Canu performs the assembly.
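At the command level, the tail of the pipeline boils down to merging the per-chunk correction outputs and handing them to Canu. A minimal sketch, assuming Canu is on the PATH and using illustrative file names (the real names and parameters are set inside the Snakemake rules):

# merge the per-chunk correction outputs (file names are illustrative)
$ cat corrected.part_*.fasta > corrected.pacbio.fasta
# assemble the corrected long reads; adjust genomeSize to your organism (see Considerations)
$ canu -p asm -d canu_out genomeSize=450m useGrid=false -pacbio-corrected corrected.pacbio.fasta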

Considerations

  • Adjust numJobs/threadsCorrection (colormap) and genomeSize (canu) as needed.
  • The script is optimized to run on a SLURM cluster.
  • python/3.5.0 is required to run the Snakemake script.

Usage

(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=colormap -np

To use proovread instead:

(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=proovread -np
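In both cases the -np flags make Snakemake print the planned jobs without executing them (a dry run); drop them to launch the actual run, for example:

$ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=colormap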


snk_quiver2.3

Once an assembly is obtained (from Canu or Falcon, for example), the result can be polished using the PacBio reads and Quiver. The snk.quiver2.3.py script uses SMRT Analysis v2.3. The steps are as follows (a command-level sketch follows the list):

  1. Create fofn files from the input bax.h5 files.
  2. Run pbalign on each fofn file.
  3. Merge all the cmp.h5 files (pbh5tools).
  4. Sort the single merged cmp.h5 file.
  5. Run Quiver.
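Roughly, the underlying tool invocations look like the sketch below. This is an approximation rather than the exact rules in the script: file names and thread counts are illustrative, and it assumes the SMRT Analysis 2.3 tools (pbalign, cmph5tools.py, quiver) are already on the PATH:

# 1. file of filenames listing the raw bax.h5 files (the script builds one fofn per chunk)
$ ls raw_bax.h5_folder/*.bax.h5 > input.fofn
# 2. align the raw reads against the assembly, keeping the QVs Quiver needs
$ pbalign input.fofn canu.fasta aligned.cmp.h5 --forQuiver --nproc 24
# 3-4. merge the per-chunk cmp.h5 files and deep-sort the result (pbh5tools)
$ cmph5tools.py merge --outFile merged.cmp.h5 aligned_chunk*.cmp.h5
$ cmph5tools.py sort --deep --inPlace merged.cmp.h5
# 5. polish the assembly
$ samtools faidx canu.fasta
$ quiver -j 24 merged.cmp.h5 -r canu.fasta -o polished.fasta -o polished.variants.gff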

Installation notes

Download and install SMRT Analysis v2.3. Before running the script, edit the SMRTloc variable so that it points to the SMRT Analysis installation path.
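For orientation, on a default SMRT Analysis 2.3 installation the pipeline tools become available after sourcing the bundled environment script; the path below is only an example and depends on your install:

$ SMRTloc=/opt/smrtanalysis
$ source $SMRTloc/current/etc/setup.sh   # puts pbalign, cmph5tools.py and quiver on the PATH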

Requirements

  • Raw reads in bax.h5 format.
  • Assembly in FASTA format.

Usage

If the script is run on a single node, the number of threads is limited by the CPUs available on that node:

(dry run) $ snakemake --snakefile snk_quiver2.3.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

It can also be run in multi-node mode (for example, 80 jobs at once, each with 24 threads):

(dry run) $ snakemake -j 80 --snakefile snk_quiver2.3.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

Considerations

The last step (the Quiver run itself) has a high memory demand: it took ~7 days for a ~450 Mbp genome using at least 1 TB of memory.

The Snakemake cluster config file is included in the repository.

The script is based on the PacificBiosciences tutorial.

Useful links 1 2 3 4 5 6

snk_quiver3.0

There are important differences when using SMRT Analysis v3.0, since Quiver switched from the h5 format to BAM files. In this case, the steps are as follows (sketched with example commands after the list):

  1. Convert the bax.h5 files to BAM.
  2. Run pbalign on each BAM file.
  3. Merge all the pbalign output BAM files into a single BAM file (bamtools).
  4. Sort/index the BAM file and index the FASTA.
  5. Run Quiver.
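As with the 2.3 pipeline, the sketch below only approximates the underlying commands: file names and thread counts are illustrative, bax2bam is assumed as the conversion tool, and the SMRT/pitchfork tools (bax2bam, pbalign, pbindex, quiver) are assumed to be on the PATH:

# 1. convert each movie's bax.h5 files to a subreads BAM (QV flags: see Considerations below)
$ bax2bam movie.1.bax.h5 movie.2.bax.h5 movie.3.bax.h5 -o movie --subread
# 2. align each subreads BAM against the assembly
$ pbalign movie.subreads.bam canu.fasta aligned.bam --nproc 24
# 3. merge the per-movie alignments
$ bamtools merge -in aligned_1.bam -in aligned_2.bam -out merged.bam
# 4. sort/index the BAM (samtools + pbindex) and index the assembly
$ samtools sort -@ 24 -o merged.sorted.bam merged.bam
$ samtools index merged.sorted.bam
$ pbindex merged.sorted.bam
$ samtools faidx canu.fasta
# 5. polish
$ quiver -j 24 merged.sorted.bam -r canu.fasta -o polished.fasta -o polished.variants.gff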

Run notes

Download and install pitchfork. Before running the script, edit the SMRTloc variable so that it points to the SMRT Analysis installation path.

Requirements

  • Raw reads in bax.h5 format.
  • Canu/Falcon assembly in FASTA format.

Usage

If the script is run on a single node, the number of threads is limited by the CPUs available on that node:

(dry run) $ snakemake --snakefile snk_quiver3.0.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

It can also be run in multi-node mode (for example, 80 jobs at once, each with 24 threads):

(dry run) $ snakemake -j 80 --snakefile snk_quiver3.0.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

Considerations

To convert the h5 files to BAM, the script uses the recommended parameters, propagating the additional pulse features (QVs).
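For reference, the pulse features usually propagated for Quiver look like the example below; this is an assumption about the conversion call rather than a copy of the script, so check bax2bam --help for the exact option names in your build:

$ bax2bam movie.1.bax.h5 movie.2.bax.h5 movie.3.bax.h5 -o movie --subread \
    --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV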

The last step (the Quiver run itself) has a high memory demand: it took ~7 days for a ~450 Mbp genome using at least 1 TB of memory.

The Snakemake cluster config file is included in the repository.

snk.pilon

To polish using Illumina reads instead of PacBio reads, Pilon can be used. The steps are as follows (a command-level sketch follows the list):

  1. BWA alignment of the raw Illumina reads against the PacBio assembly.
  2. Samtools conversion of the alignments.
  3. Pilon.
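A minimal sketch of the three steps, assuming bwa, samtools and Java are available; the jar path, memory setting and thread counts are illustrative, not the values hard-coded in the script:

# 1. align the Illumina pairs against the assembly
$ bwa index canu.fasta
$ bwa mem -t 24 canu.fasta reads1.fq reads2.fq | samtools sort -@ 24 -o illumina.sorted.bam -
# 2. index the sorted BAM
$ samtools index illumina.sorted.bam
# 3. polish with Pilon (--frags is for paired-end fragment libraries)
$ java -Xmx100G -jar $PILONloc/pilon-1.20.jar --genome canu.fasta --frags illumina.sorted.bam --output pilon_polished --threads 24 --changes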

Requirements

  • PacBio assembly (canu, falcon...)
  • Illumina (paired) raw reads.

Run notes

Edit PILONloc with the Pilon location. Pilon was tested with java-jdk/1.8.0_20. Run on a node with 24 threads available.

Usage

(dry run) $ snakemake --snakefile snk.pilon.py -j 1 --config rawRead1=reads1.fq rawRead2=reads2.fq assembly=assembly.fasta -np

Coming soon

  • colormap pre-correction (using illumina reads).
  • pilon polishing (using illumina reads).
  • Generate a falcon assembly and merge it with the canu one.
  • Run canu without pre-correction and merge it with the corrected one.
  • Generate quiver report.
