
hinosima

High-Noise Single-Molecule sequence Assembly pipeline

Description

A set of tools to perform an assembly from high-noise single-molecule sequences (PacBio or Nanopore).

Contents

  • src/snk_canupipe.py: Runs correction of the PacBio reads and performs a Canu assembly.
  • src/snk.quiver2.3.py: Polishes the Canu assembly using Quiver 2.3.
  • src/snk.quiver3.0.py: Polishes the Canu assembly using Quiver 3.0.
  • src/snk.pilon.py: Polishes the Canu assembly using Pilon 1.20.

snk_canupipe

First, it uses colormap or proovread to correct the PacBio reads (long reads) with Illumina reads (short reads). Since this correction takes a long time (colormap runs bwa twice during the process; proovread is faster but still slow), the script splits the PacBio reads (by default into 999 sub-files), runs the corrections in parallel, and finally merges the results.

Second, Canu performs the assembly.
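At the command level, the tail of the pipeline boils down to merging the per-chunk correction outputs and handing them to Canu. A minimal sketch, assuming Canu is on the PATH and using illustrative file names (the real names and parameters are set inside the Snakemake rules):

# merge the per-chunk correction outputs (file names are illustrative)
$ cat corrected.part_*.fasta > corrected.pacbio.fasta
# assemble the corrected long reads; adjust genomeSize to your organism (see Considerations)
$ canu -p asm -d canu_out genomeSize=450m useGrid=false -pacbio-corrected corrected.pacbio.fasta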

Considerations

  • Adjust numJobs/threadsCorrection (colormap) and genomeSize (canu) as needed.
  • The script is optimized to run on a SLURM cluster.
  • python/3.5.0 is required to run the Snakemake script.

Usage

(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=colormap -np

To use proovread instead:

(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=proovread -np
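In both cases the -np flags make Snakemake print the planned jobs without executing them (a dry run); drop them to launch the actual run, for example:

$ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=colormap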


snk_quiver2.3

Once an assembly is obtained (from Canu or Falcon, for example), the result can be polished using the PacBio reads and Quiver. The snk.quiver2.3.py script uses SMRT Analysis v2.3. The steps are as follows (a command-level sketch follows the list):

  1. Create fofn files from the input bax.h5 files.
  2. Run pbalign on each fofn file.
  3. Merge all the cmp.h5 files (pbh5tools).
  4. Sort the single merged cmp.h5 file.
  5. Run Quiver.
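Roughly, the underlying tool invocations look like the sketch below. This is an approximation rather than the exact rules in the script: file names and thread counts are illustrative, and it assumes the SMRT Analysis 2.3 tools (pbalign, cmph5tools.py, quiver) are already on the PATH:

# 1. file of filenames listing the raw bax.h5 files (the script builds one fofn per chunk)
$ ls raw_bax.h5_folder/*.bax.h5 > input.fofn
# 2. align the raw reads against the assembly, keeping the QVs Quiver needs
$ pbalign input.fofn canu.fasta aligned.cmp.h5 --forQuiver --nproc 24
# 3-4. merge the per-chunk cmp.h5 files and deep-sort the result (pbh5tools)
$ cmph5tools.py merge --outFile merged.cmp.h5 aligned_chunk*.cmp.h5
$ cmph5tools.py sort --deep --inPlace merged.cmp.h5
# 5. polish the assembly
$ samtools faidx canu.fasta
$ quiver -j 24 merged.cmp.h5 -r canu.fasta -o polished.fasta -o polished.variants.gff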

Installation notes

Download and install SMRT Analysis v2.3. Before running the script, edit the SMRTloc variable so that it points to the SMRT Analysis installation path.
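For orientation, on a default SMRT Analysis 2.3 installation the pipeline tools become available after sourcing the bundled environment script; the path below is only an example and depends on your install:

$ SMRTloc=/opt/smrtanalysis
$ source $SMRTloc/current/etc/setup.sh   # puts pbalign, cmph5tools.py and quiver on the PATH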

Requirements

  • Raw reads in bax.h5 format.
  • Assembly in FASTA format.

Usage

If the script is run on a single node, the number of threads is limited by the CPUs available on that node:

(dry run) $ snakemake --snakefile snk_quiver2.3.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

It can also be run in multi-node mode (for example, 80 jobs at once, each with 24 threads):

(dry run) $ snakemake -j 80 --snakefile snk_quiver2.3.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

Considerations

The last step (the Quiver run itself) has a high memory demand: it took ~7 days for a ~450 Mbp genome using at least 1 TB of memory.

The Snakemake cluster config file is included in the repository.

The script is based on the PacificBiosciences tutorial.

Useful links 1 2 3 4 5 6

snk_quiver3.0

There are important differences when using SMRT Analysis v3.0, since Quiver switched from the h5 format to BAM files. In this case, the steps are as follows (sketched with example commands after the list):

  1. Convert the bax.h5 files to BAM.
  2. Run pbalign on each BAM file.
  3. Merge all the pbalign output BAM files into a single BAM file (bamtools).
  4. Sort/index the BAM file and index the FASTA.
  5. Run Quiver.
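As with the 2.3 pipeline, the sketch below only approximates the underlying commands: file names and thread counts are illustrative, bax2bam is assumed as the conversion tool, and the SMRT/pitchfork tools (bax2bam, pbalign, pbindex, quiver) are assumed to be on the PATH:

# 1. convert each movie's bax.h5 files to a subreads BAM (QV flags: see Considerations below)
$ bax2bam movie.1.bax.h5 movie.2.bax.h5 movie.3.bax.h5 -o movie --subread
# 2. align each subreads BAM against the assembly
$ pbalign movie.subreads.bam canu.fasta aligned.bam --nproc 24
# 3. merge the per-movie alignments
$ bamtools merge -in aligned_1.bam -in aligned_2.bam -out merged.bam
# 4. sort/index the BAM (samtools + pbindex) and index the assembly
$ samtools sort -@ 24 -o merged.sorted.bam merged.bam
$ samtools index merged.sorted.bam
$ pbindex merged.sorted.bam
$ samtools faidx canu.fasta
# 5. polish
$ quiver -j 24 merged.sorted.bam -r canu.fasta -o polished.fasta -o polished.variants.gff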

Run notes

Download and install pitchfork. Before running the script, edit the SMRTloc variable so that it points to the SMRT Analysis installation path.

Requirements

  • Raw reads in bax.h5 format.
  • Canu/Falcon assembly in FASTA format.

Usage

If the script is run on a single node, the number of threads is limited by the CPUs available on that node:

(dry run) $ snakemake --snakefile snk_quiver3.0.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

It can also be run in multi-node mode (for example, 80 jobs at once, each with 24 threads):

(dry run) $ snakemake -j 80 --snakefile snk_quiver3.0.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np

Considerations

To convert the h5 files to BAM, the script uses the recommended parameters, propagating the additional pulse features (QVs).
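For reference, the pulse features usually propagated for Quiver look like the example below; this is an assumption about the conversion call rather than a copy of the script, so check bax2bam --help for the exact option names in your build:

$ bax2bam movie.1.bax.h5 movie.2.bax.h5 movie.3.bax.h5 -o movie --subread \
    --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV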

The last step (the Quiver run itself) has a high memory demand: it took ~7 days for a ~450 Mbp genome using at least 1 TB of memory.

The Snakemake cluster config file is included in the repository.

snk.pilon

To polish using Illumina reads instead of PacBio reads, Pilon can be used. The steps are as follows (a command-level sketch follows the list):

  1. BWA alignment of the raw Illumina reads against the PacBio assembly.
  2. Samtools conversion of the alignments.
  3. Pilon.
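A minimal sketch of the three steps, assuming bwa, samtools and Java are available; the jar path, memory setting and thread counts are illustrative, not the values hard-coded in the script:

# 1. align the Illumina pairs against the assembly
$ bwa index canu.fasta
$ bwa mem -t 24 canu.fasta reads1.fq reads2.fq | samtools sort -@ 24 -o illumina.sorted.bam -
# 2. index the sorted BAM
$ samtools index illumina.sorted.bam
# 3. polish with Pilon (--frags is for paired-end fragment libraries)
$ java -Xmx100G -jar $PILONloc/pilon-1.20.jar --genome canu.fasta --frags illumina.sorted.bam --output pilon_polished --threads 24 --changes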

Requirements

  • PacBio assembly (canu, falcon...)
  • Illumina (paired) raw reads.

Run notes

Edit PILONloc with the Pilon location. Pilon was tested with java-jdk/1.8.0_20. Run on a node with 24 threads available.

Usage

(dry run) $ snakemake --snakefile snk.pilon.py -j 1 --config rawRead1=reads1.fq rawRead2=reads2.fq assembly=assembly.fasta -np

Coming soon

  • colormap pre-correction (using illumina reads).
  • pilon polishing (using illumina reads).
  • Generate a falcon assembly and merge it with the canu one.
  • Run canu without pre-correction and merge it with the corrected one.
  • Generate quiver report.
