Skip to content

splaisan/ngs-tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

(BITS-VIB) ngstools - NGS-Tools

Table of Contents

Analysis tools
Formatting tools for FASTA data
Formatting tools for FASTQ data
Formatting tools for SAM / BAM data
Formatting tool for annotation data

Analysis tools

TOP

locus2genes

A script using R packages to query biomaRt and fetch genes in a given locus (loci) before computing GO enrichment on the gene list. Please read the dedicated page for more info.

Formatting tools for FASTA data

TOP

restrict2bed.pl

The BIO-perl script restrict2bed.pl will parse a multifasta file and look for one or several combined RE sites on both strands; matches are reported in BED format to be used with BedTools (requires BioPerl and several cpan modules to work, see header). Can be used to compute nicking label count per 100kb if combined to bedtools (labeldensity.pl)

## Usage: restrict2bed.pl <-i fasta-file>
# <-n 'nicker(s) consensus', multiple allowed separated by ',')>
#  'Nt-BspQI' => 'GCTCTTC'
#  'Nt-BbvCI' => 'CCTCAGC'
#  'Nb-BsMI'  => 'GAATGC'
#  'Nb-BsrDI' => 'GCAATG'
#  'Nb-BssSI' => 'CACGAG'
# Additional optional parameters are:
# <-l minimal length for dna sequence (20000)>
# <-h to display this help>

fasta2chromsizes.pl

The BIO-perl script fasta2chromsizes.pl create a file reporting chromosome lengths from a multifasta file. Such file is required for BedTools to operate on intervals (requires BioPerl for parsing fasta).

## Usage: fasta2chromsizes.pl <-i fasta-file>
# Additional optional parameters are:
# <-l minimal length for dna sequence (20000)>
# <-h to display this help>

dedupFastaSeq.pl

The BIO-perl script dedupFastaSeq.pl will parse a multifasta file and keep only one copy of each sequence based on its name (no sequence comparison is operated). Requires BioPerl to work.

## Usage: dedupFastaSeq.pl <fasta_input file> <output file name>

fastaFindGaps.pl

The BIO-perl script fastaFindGaps.pl will parse a multifasta file and look for stretches of N's of at least 'l-length' and report the found hits in BED5 format. Gaps are good to compare to coverage results in IGV or subtract from captured or selected regions using Bedtools.

## Usage: fastaFindGaps.pl <-i fasta-file> 
# Additional optional parameters are:
# <-o BED output (optional, deduced from input file)>
# <-l minsize in bps (default to 100bps)>
# <-h to display this help>

fastaFiltLength.pl

The BIO-perl script fastaFiltLength.pl will filter a multifasta file and keep only sequence with length >min and <max values. Was created to filter genome assemblies containing multiple small files.

## Usage: fastaFiltLength.pl <-i fasta_file (required)>
# script version:2.0
# Additional optional parameters are:
# <-o outfile_name (filtered_)>
# <-m minsize (undef)>
# <-x maxsize (undef)>
# <-z zip results (default OFF)>
# <-h to display this help>

fastaRevComp.pl

The BioPerl script fastaRevComp.pl reverse and complements fasta sequences (multifasta too).

## Usage: fastaRevComp.pl <-i fasta_file (required)>
# script version:1.0
# Additional optional parameters are:
# <-o outfile_name (revcomp_)>
# <-z zip results (default OFF)>
# <-h to display this help>

fastaSortLength.pl

The BIO-perl script fastaSortLength.pl will sorts a multifasta file by decreasing or increasing order. It also allows filtering by size and exclude sequences that woulsd be too small or too large. Was created to clean input fasta files before applying Knicker (BionanoGenomics).

## Usage: fastaSortlength.pl <-i fasta-file> 
# <-o size-order ('i'=increasing | 'd'=decreasing)>
# script version:2.0
# Additional optional parameters are:
# <-m minsize (undef)>
# <-x maxsize (undef)>
# <-z zip results (default OFF)>
# <-h to display this help>

Formatting tools for FASTQ data

TOP

fastq_detect.pl

The perl script fastq_detect.pl is parsing n-lines of fastQ data to identify the range of ascii score used and matching them to what is expected for the main flavors known today. The result is a list of compatible fastQ versions.

## Usage: fastq_detect.pl <fastq file> <opt:sample-size (100)>

CLC-to-BAM

The bash script CLC-to-BAM.sh takes a BAM file and two fastq files containing unmapped reads (paired and single, all tree files were exported from a CLC genomic reference mapping experiment) and combines them all into ONE BAM file. Some validation and fix are applied but the BAM is not 100% clean (although sufficiently for GCAT analysis).

ngstools

avgQdist2linePlot.R

The R script avgQdist2linePlot.R is taking output from the popular fastx toolkit to plot a normalized line graph (PDF) of base frequencies. This once was needed to identify base bias across reads. One example output is saved here.

isFastqUniq.sh

The awk script isFastqUniq.sh is parsing fastQ data to identify duplicate read names and prints out names of reads present more than once. This is a very basic script.

deduplicateFastq.pl

The perl script deduplicateFastq.pl is parsing two paired fastQfiles (can be flat or .gz) and filters out reads found more than once based on their exact names. This script was developped for data extracted from BAM that presented the same reads multiple times due to alternate mapping results. The script will end if pair sync is not valid (same name for both mates) or if fastq 4-line structure is lost.

Formatting tools for SAM / BAM data

TOP

uniq_mappings.pl

The Perl uniq_mappings.pl is reading from a name-sorted BAM file (verified from the presence of 'SO:queryname' in the first header line) and outputting 'uniquely mapped' and 'multiple-mapped' reads to two separate SAM files with adapted headers.

Usage: This script was created to extract uniquely mapped reads from a public BAM file and convert the mapping data back to FastQ. The obtained reads where then re-mapped to another reference genome build.

bam_re-pair.pl

The Perl bam_re-pair.pl script gets data from a piped samtools command and filters paired reads only to create a new BAM file with help of samtools.

Usage: samtools view -h <name_sorted.bam> | 
	bam_re-pair.pl | 
	samtools view -bSo <name_sorted.filtered.bam> -

Formatting tool for annotation data

TOP

bed2picard-list.sh

Some Picard tools, like CalculateHsMetrics require list as input. The Picard list format is a hybrid format including a sam header and almost-BED data. The BED data has 5-columns and the start coordinate increased by 1 to reflect the 1-closed coordinate expectations of Picard. A simple bash script bed2picard-list.sh was created to streamline the process of creating a list file from a bed5 and dict files.

TOP

Please send comments and feedback to [email protected]


Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

About

Scripts to help analysis of NGS data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published