Pamir detects and genotypes novel sequence insertions in single or multiple datasets of paired-end WGS (Whole Genome Sequencing) Illumina reads by jointly analyzing one-end anchored (OEA) and orphan reads.
Prerequisite. You will need g++ 5.2 or higher to compile the source code.
The first step in installing Pamir is to download the source code from our GitHub repository. After downloading, change the current directory to the source directory `pamir` and run `make` and `make install` in a terminal to create the necessary binary files.
git clone https://github.com/vpc-ccg/pamir.git --recursive
cd pamir
make
make install
Pamir's pipeline requires a number of external programs. You can either install them manually or use Pamir's conda `environment.yaml` to install all the dependencies except the assembler:
conda env create -f environment.yaml
source activate pamir-deps
Dependencies | Version |
---|---|
Python | 3.x |
samtools | >= 1.9 |
mrsfast | >= 3.4.0 |
BLAST | >= 2.9.0+ |
bedtools | >= 2.26.0 |
bwa | >= 0.7.17 |
snakemake | >= 5.3.0 |
RepeatMasker | >= 4.0.9 |
minia | >= 3.2.0 * |
abyss | >= 2.2.3 * |
spades | >= 3.13 * |
*Note: You only need to install one of the assemblers.
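If you install the dependencies manually, a quick PATH lookup can show which ones are still missing. The following is a minimal sketch, not part of Pamir; the executable names (for example `blastn`, `abyss-pe`, `spades.py`) are assumptions about how each package exposes its binary and may differ on your system. It does not check versions.

```python
import shutil

# Executable names below are assumptions; adjust them to your installation.
CORE_TOOLS = ["samtools", "mrsfast", "blastn", "bedtools", "bwa", "snakemake", "RepeatMasker"]
ASSEMBLERS = ["minia", "abyss-pe", "spades.py"]  # only one of these is needed


def check_tools(which=shutil.which):
    """Return (missing core tools, assemblers found) based on a PATH lookup."""
    missing = [tool for tool in CORE_TOOLS if which(tool) is None]
    found_assemblers = [tool for tool in ASSEMBLERS if which(tool) is not None]
    return missing, found_assemblers


if __name__ == "__main__":
    missing, assemblers = check_tools()
    print("missing core tools:", ", ".join(missing) or "none")
    print("assemblers found:", ", ".join(assemblers) or "none (install one)")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.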
To run Pamir, you need to create a project configuration file named `config.yaml`.
This configuration consists of a number of mandatory settings and some optional advanced settings.
Below is the list of all the settings that you can set in your project.
config-parameter-name | Type | Description |
---|---|---|
path | Mandatory | Full path to project directory. |
raw-data | Mandatory | Location of the input files (CRAMs or BAMs) relative to `path`. |
population | Mandatory | Population/cohort name. Note that the name cannot contain any space characters. |
reference | Mandatory | Full path to the reference genome. |
input | Mandatory | A list of input files per individual. Pamir 2.0 accepts BAM and CRAM files as input. |
analysis-base | Optional | Location of intermediate files relative to `path`. default: {path}/analysis |
results-base | Optional | Location of final results relative to `path`. default: {path}/results |
assembler | Optional | External assembler to use (`minia`, `abyss`, `spades`). default: minia |
assembler_k | Optional | k-mer size to use for the external assembler. default: 47 |
pamir_partitition_per_thread | Optional | Number of internal Pamir jobs to be completed per thread. This is an advanced setting; a value that is too small or too large can heavily degrade performance. default: 1000 |
blastdb | Optional | Full path to the BLAST database used to remove possible contaminants from the data. |
centromeres | Optional | Full path to a BED file that contains centromere locations. Calls in these regions will not be reported. |
align_threads | Optional | Number of threads to use for alignment jobs. default: 16 |
assembly_threads | Optional | Number of threads to use for assembly jobs. default: 62 |
other_threads | Optional | Number of threads to use for other jobs. default: 16 |
minia_min_abundance | Optional | minia's internal assembly parameter. default: 5 |
min_contig_len | Deprecated | Minimum contig length from the external assembler to use. Pamir now calculates this on the fly. |
read_length | Deprecated | Read length of the input reads. Pamir now calculates this on the fly. |
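The mandatory settings and the constraints stated in the table can be checked before launching a run. The sketch below is illustrative only and is not Pamir's own validation: it assumes the configuration has already been loaded into a Python dictionary (for example with a YAML parser), and `validate_config` is a hypothetical helper name.

```python
# Mandatory setting names, as listed in the table above.
MANDATORY_KEYS = ("path", "raw-data", "population", "reference", "input")
KNOWN_ASSEMBLERS = ("minia", "abyss", "spades")


def validate_config(config):
    """Return a list of problems found in a Pamir-style config dictionary."""
    problems = [
        f"missing mandatory setting: {key}"
        for key in MANDATORY_KEYS
        if key not in config
    ]
    # The population name cannot contain any space characters.
    population = str(config.get("population", ""))
    if any(ch.isspace() for ch in population):
        problems.append("population must not contain space characters")
    # The assembler is optional and defaults to minia.
    assembler = config.get("assembler", "minia")
    if assembler not in KNOWN_ASSEMBLERS:
        problems.append(f"unknown assembler: {assembler}")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the paths exist or that the run will succeed.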
The following is an example of `config.yaml` with two individuals.
path: /full/path/to/project-directory
raw-data: raw-data
reference: /full/path/to/the/reference.fa
population: my-pop
input:
  "samplename1":
    - A.cram
  "samplename2":
    - B.bam
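For larger cohorts, writing the `input` section by hand gets tedious. The following sketch renders a configuration with only the mandatory settings from a mapping of sample names to files. `render_config` is a hypothetical helper, not part of Pamir, and it assumes that paths and names need no YAML escaping.

```python
def render_config(path, raw_data, reference, population, samples):
    """Render config.yaml text; `samples` maps sample name -> list of input files."""
    lines = [
        f"path: {path}",
        f"raw-data: {raw_data}",
        f"reference: {reference}",
        f"population: {population}",
        "input:",
    ]
    for name, files in samples.items():
        lines.append(f'  "{name}":')
        lines.extend(f"    - {file_name}" for file_name in files)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values mirroring the two-individual example.
    print(render_config(
        "/full/path/to/project-directory",
        "raw-data",
        "/full/path/to/the/reference.fa",
        "my-pop",
        {"samplename1": ["A.cram"], "samplename2": ["B.bam"]},
    ), end="")
```

Redirect the output to `config.yaml` in your project directory, then add any optional settings by hand.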
Now, to run pamir on such a config file, you have to run the following command.
pamir.sh --configfile /path/to/config.yaml
Since `pamir.sh` uses `snakemake` internally, you can pass any additional `snakemake` parameters to `pamir.sh`. Here are some examples:
pamir.sh --configfile /path/to/config.yaml -j [number of threads]
pamir.sh --configfile /path/to/config.yaml -np          # dry run
pamir.sh --configfile /path/to/config.yaml --forceall   # rerun all steps regardless of the current stage
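When Pamir is launched from a script, the same command lines can be assembled programmatically. This is a minimal sketch that assumes `pamir.sh` is on your PATH; `pamir_command` is a hypothetical helper, and it only uses the options shown in the examples above.

```python
import subprocess


def pamir_command(configfile, threads=None, dry_run=False, force_all=False):
    """Build the argument list for a pamir.sh invocation."""
    cmd = ["pamir.sh", "--configfile", configfile]
    if threads is not None:
        cmd += ["-j", str(threads)]  # number of threads
    if dry_run:
        cmd.append("-np")  # dry run
    if force_all:
        cmd.append("--forceall")  # rerun all steps regardless of the current stage
    return cmd


# Example (not executed here): run with 16 threads and fail on a non-zero exit code.
# subprocess.run(pamir_command("/path/to/config.yaml", threads=16), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the config path contains spaces.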
Pamir generates the following directory structure. The detected novel sequence insertions are reported in VCF format (events.vcf) for each individual.
[path]/
├── raw-data                   -> OR [raw-data]
│   ├── A.cram
│   ├── B.bam
├── analysis                   -> OR [analysis-base]
│   └── my-pop
└── results                    -> OR [results-base]
    └── my-pop
        ├── index.html         -> Summary of events
        ├── summary.js         -> Summary required by index.html
        ├── data.js            -> Data required by index.html
        ├── events.repeat.bed  -> annotation of repeats for detected events
        ├── events.fa          -> all the detected events with 1000bp flanking region
        ├── events.fa.fai      -> index of events.fa
        └── ind
            ├── A
            │   ├── events.bam      -> mapping of the reads in the events region
            │   ├── events.bam.bai  -> index
            │   ├── events.bed      -> location of events
            │   └── events.vcf      -> genotyped insertion calls
            ├── B
            │   ├── events.bam
            │   ├── events.bam.bai
            │   ├── events.bed
            │   └── events.vcf
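Given the layout above, the per-individual VCF files can be located and their records counted with a few lines of Python. This is an illustrative sketch: the helper names are hypothetical, the individual directory names (`A`, `B`) are taken from the tree and supplied by the caller, and the record count simply skips header lines starting with `#`.

```python
import os


def individual_vcfs(path, population, individuals, results_base="results"):
    """Map each individual to its expected events.vcf path under the results tree."""
    ind_dir = os.path.join(path, results_base, population, "ind")
    return {name: os.path.join(ind_dir, name, "events.vcf") for name in individuals}


def count_vcf_records(lines):
    """Count data records: non-empty lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


if __name__ == "__main__":
    # Placeholder project values; replace them with your own configuration.
    vcfs = individual_vcfs("/full/path/to/project-directory", "my-pop", ["A", "B"])
    for name, vcf_path in vcfs.items():
        if os.path.exists(vcf_path):
            with open(vcf_path) as handle:
                print(name, count_vcf_records(handle))
        else:
            print(name, "events.vcf not found")
```

If you set `results-base` in your configuration, pass the same value as `results_base`.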
curl -L https://ndownloader.figshare.com/files/22813988 --output example.tar.gz
tar xzvf example.tar.gz
cd example
chmod +x configure.sh
./configure.sh
pamir.sh -j16 --configfile config.yaml
`index.html` provides a quick overview of the detected events and is a friendlier alternative to working with the VCF files directly.
If IGV is running, you can jump back and forth between your events directly from `index.html`.
Discovery and genotyping of novel sequence insertions in many sequenced individuals. P. Kavak*, Y-Y. Lin*, I. Numanagić, H. Asghari, T. Güngör, C. Alkan‡, F. Hach‡. Bioinformatics (ISMB-ECCB 2017 issue), 33 (14): i161-i169, 2017.
Feel free to post any inquiries on the issue page.