Pipeline Analysing sequences of Nucleic acids DEsigned for MonItoring of sequenCing runs
This repository contains a tool primarily designed for monitoring SARS-CoV-2 sequencing runs.
config files
creates a PipelineRunner instance and runs the pipeline.
contains the PipelineRunner class, which is a wrapper around the Snakemake pipeline
the run() method of the class repeats the following steps in a loop (a sketch of this loop is shown after the list):
- create n batches (min < n < max) by linking files from a specified directory
- run Snakemake on the n new batches
- merge the processed batches into the summary
- find variants
- remove the links
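A minimal sketch of this loop in Python (illustrative only; the merge and variant-calling method names used here are placeholders, not the actual PipelineRunner API, see the method overview further below):

# Illustrative sketch of the monitoring loop, not the actual PipelineRunner code.
import time

def run(runner):
    while True:
        batches = runner.create_batches()   # link new input files into batch dirs
        if not batches:
            time.sleep(30)                  # wait for more sequencing output
            continue
        runner.run_snakemake(batches)       # process the new batches
        runner.merge_to_summary(batches)    # placeholder: merge results into the summary
        runner.call_variants()              # placeholder: variant calling from the summary
        runner.clean_batches(batches)       # remove the symlinked batch dirs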
Snakemake pipeline, run from PipelineRunner class (pipeline.run() called in main.py)
rules for the snakemake pipeline
helper scripts for the pipeline
(if you don't already have some conda distribution installed)
installers can be found at: https://docs.conda.io/en/latest/miniconda.html#linux-installers
run the .sh installer downloaded from the website
(optional: run conda config --set auto_activate_base false if you don't want to auto-activate the base conda environment each time you open the terminal)
git clone https://github.com/fmfi-compbio/pande-mic.git
go inside the cloned repository:
cd pande-mic
run:
conda env create -f env.yaml
conda activate monitoring_env
full instructions can be found here: https://github.com/fmfi-compbio/deepnano-blitz
in short:
install rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
(you may also need to run source $HOME/.cargo/env)
ask for the nightly version:
rustup default nightly-2021-01-03
IMPORTANT!: pipeline-specific step - we want to install deepnano-blitz into the environment we have created, which already contains the other tools needed for the pipeline.
to do so, activate monitoring_env:
conda activate monitoring_env
clone deepnano-blitz repo:
git clone https://github.com/fmfi-compbio/deepnano-blitz.git
go inside the cloned dir:
cd deepnano-blitz
run the installer:
python setup.py install
extract the compressed directory and note the path where it is stored; we will need it later
go inside the pande-mic/configs directory
create a new directory containing the files amplicons.json, config.yaml, mut.txt, ref.fasta, setup_guppy, summary.yaml
Note: the files amplicons.json, mut.txt, ref.fasta, setup_guppy can also be placed somewhere else (see config.yaml) and can have different names.
You can copy these from one of the pre-existing example configs.
Now you have to edit those files.
This file contains the list of amplicons in JSON format and should have a structure similar to this:
{
"name": "SARS-CoV-2 primer scheme 2000bp",
"amplicons": [
[30, 2079],
[3580, 5548],
[7092, 9123],
[10676, 12679],
[14176, 15978],
[17571, 19485],
[20883, 22996],
[24649, 26542],
[27914, 29790],
[1925, 3737],
[5394, 7255],
[8879, 10837],
[12519, 14328],
[15827, 17754],
[19310, 21241],
[22850, 24812],
[26386, 28351]
]
}
It should contain the name of the amplicon scheme and a list of amplicons, each given as a [start, end] pair of positions.
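For illustration, such a file can be read with the standard json module (a sketch, not part of the pipeline; the file name is assumed):

# Illustrative: read the amplicon scheme and print the amplicon ranges.
import json

with open("amplicons.json") as fh:
    scheme = json.load(fh)

print(scheme["name"])
for start, end in scheme["amplicons"]:
    print(f"{start}-{end} ({end - start} bp)")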
this is the main configuration file for the pipeline. Please copy this sample version and adjust the paths and parameters.
##############################
# general pipeline settings
##############################
#path to scripts directory (pipeline root dir)
scripts_dir: /<where is the clone of the repository>/pande-mic/
#where to store pipeline output
# IMPORTANT! - the path should end with /
output_dir: /<anywhere>/
# # of barcodes
# number of sequenced samples
barcodes: 12
# minimal number of new batches needed to run the pipeline
min_batches: 1
# maximal number of batches to be processed at one time
max_batches: 4
# number of files in a batch
batch_size: 25
#maximal number of cores provided for the snakemake pipeline, NOTE: recommended: max_batches <= cores
cores: 4
#whether the output directory should be cleaned before the pipeline run
# BE AWARE, setting this to True will remove any data from the directory specified as output_dir each time the application is started!
clear_annotated: False
##########################################
# settings for rules and specific scripts
##########################################
#basecalling
# IMPORTANT! - the path should end with /
input_path: /<where are the input .fast5 files>/fast5/
#where to store temporary directories with linked files (batches to be processed)
# IMPORTANT! - the path should end with /
batch_path: /<anywhere>/batches/
#deepnano blitz params
deepnanoblitz_path: /<path to the clone of deepnano-blitz repository>/deepnano-blitz/scripts/
blitz_beam_size: 5
blitz_threads: 1
blitz_network_type: 48
#debarcoding
guppy_setup_path: /<path to the config directory containing the setup-guppy config file>/setup-guppy
#guppy_arrangements_files: barcode_arrs_nb96.cfg #outdated? - depends on the version of guppy
guppy_config: configuration.cfg #guppy configuration file, it will be looked for inside the guppy dir specified in setup-guppy
guppy_barcode_kits: EXP-NBD196 #barcode kits used
guppy_threads: 1 #recommended value is 1
#alignment
#the reference genome
reference_genome: /<anywhere>/ref.fasta
#base counting script
alignment_low_cutoff: 50 #coverage below this value will be considered low
#find SNPs
coverage_threshold: 50 #coverage below this value will be considered low
coverage_fraction: 0.5 #fraction of mutated bases required at a position, relative to the coverage at that position, to call a SNP
#variant calling
mut_file: /<anywhere>/mut.txt #path to the configuration file containing definitions of SNPs for the variants of the virus
variant_calling_threshold: 50 #coverage below this value will be considered low
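For illustration, the file can be loaded with PyYAML (a sketch, not the pipeline's own loading code; PyYAML is assumed to be available in the environment):

# Illustrative: load the pipeline configuration and access a few values.
import yaml

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

print(cfg["output_dir"], cfg["barcodes"], cfg["min_batches"], cfg["max_batches"])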
configuration file for variant calling
This file specifies the variants (of SARS-CoV-2 in our case) to look for and the mutations that are specific to each variant.
Each line starts with the label of a variant, followed by exactly one space and the number of mutations that must match for a barcode to be assigned to this variant. Then follow the mutations typical for the variant, separated by spaces.
Lines starting with # are comments and are ignored when parsing the file.
UK 5 C3267T C5388A ... G28280C A28281T T28282A
This line specifies a variant that should be displayed as "UK".
The line starts with "UK", which is our label for this variant.
The label is followed by a number, 5, which means "if at least 5 of the following mutations are present in the sample, classify the sample as this variant".
The number is followed by the mutations typical for this variant, separated by spaces (for example, C3267T means that C is changed to T at position 3267 of the reference genome).
You can also provide other variants in a tree-like structure, using the keywords start_sub and end_sub on separate lines:
#UK variant
UK 5 C3267T ... T28282A
#more specific variants for UK
start_sub
UK-subvariant_1 1 A17615G
#subvariants for UK-subvariant_1
start_sub
.
.
UK-subvariant_1-Poland 4 C5301T C7420T C9693T G23811T C25350T C28677T G29348T
UK-subvariant_1-Gambia 3 T6916C T18083C G22132A C23929T
end_sub
end_sub
#CZ variant
CZ 3 G12988T G15598A G18028T T24910C T26972C
This means that we will look for the UK variant, and if we find at least 5 mutations from the list in the UK line, we will also continue searching for the more specific sub-variants.
For example, if the UK variant is matched, we check whether the mutation A17615G is also present; if it is, we check for the mutations specified in its subsection (UK-subvariant_1-Poland or UK-subvariant_1-Gambia).
The search stops when there are no more subsections or when fewer than the required number of mutations is found for a sample.
We also look for the CZ variant. It has no sub-variants specified in this example file, so no further search would be made.
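The following sketch illustrates how such a file could be parsed and matched against a set of observed mutations. It is an illustration of the format described above, not the pipeline's actual variant-calling code; the function names are hypothetical:

# Illustrative parser/matcher for the mut.txt format described above.
def parse_variants(path):
    """Parse mut.txt into a list of top-level variant nodes.
    Each node is {"name", "min_hits", "mutations", "subs"};
    start_sub/end_sub open/close a subsection of the preceding variant."""
    roots, stack, last = [], [], None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # comments and blank lines are ignored
            if line == "start_sub":
                stack.append(last)            # following variants are subs of the last one
            elif line == "end_sub":
                stack.pop()
            else:
                label, min_hits, *mutations = line.split()
                node = {"name": label, "min_hits": int(min_hits),
                        "mutations": mutations, "subs": []}
                (stack[-1]["subs"] if stack else roots).append(node)
                last = node
    return roots

def match_variants(nodes, observed):
    """Return names of variants whose mutation count reaches the threshold,
    descending into sub-variants only when the parent matched."""
    hits = []
    for node in nodes:
        found = sum(1 for m in node["mutations"] if m in observed)
        if found >= node["min_hits"]:
            hits.append(node["name"])
            hits += match_variants(node["subs"], observed)
    return hits

# example call: match_variants(parse_variants("mut.txt"), set_of_observed_mutations)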
the reference genome in FASTA format
>MN908947.3
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCT.....
Important note: for reference genomes that are split into multiple contigs, you can use the multiref_to_one.py script, as this version of the pipeline does not support genomes with multiple contigs/chromosomes. It creates a new .fasta file with all the contigs concatenated into one long sequence, separated by regions of 100 Ns to avoid alignments spanning contig boundaries. It also creates an amplicons.json file containing the start and end positions of the contigs, which can be used for marking these positions in the Flask application (the file name is misleading in this case, as the file is primarily used for another purpose: to mark the amplicons in the coverage plot).
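The core idea of the script can be sketched as follows (illustrative only, not the actual multiref_to_one.py; the output record layout mirrors the amplicons.json structure shown above):

# Illustrative sketch: concatenate contigs with 100-N spacers and record contig positions.
import json

SPACER = "N" * 100

def concat_contigs(in_fasta, out_fasta, out_json):
    contigs, name, seq = [], None, []
    with open(in_fasta) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    contigs.append((name, "".join(seq)))
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
    if name is not None:
        contigs.append((name, "".join(seq)))

    merged, positions, pos = [], [], 0
    for i, (cname, cseq) in enumerate(contigs):
        if i > 0:
            merged.append(SPACER)                        # separate contigs by 100 Ns
            pos += len(SPACER)
        positions.append([pos + 1, pos + len(cseq)])     # 1-based start/end of this contig
        merged.append(cseq)
        pos += len(cseq)

    with open(out_fasta, "w") as out:
        out.write(">merged\n" + "".join(merged) + "\n")
    with open(out_json, "w") as out:
        json.dump({"name": "contig positions", "amplicons": positions}, out)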
guppy setup file. This file is there to configure the guppy path, set the LD_LIBRARY_PATH, or anything else that you need to set up to run guppy on your machine. The file should be written in bash and is executed before the guppy debarcoder command. An example:
export GUPPYDIR=/<anywhere>/<where_is_guppy>
export LD_LIBRARY_PATH=$GUPPYDIR/lib:$LD_LIBRARY_PATH
#or this, if the above does not work: export LD_LIBRARY_PATH=$GUPPYDIR/lib:${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH=$GUPPYDIR/bin:$PATH
list of summary subdirectories (recommended to leave as is, unless you want to extend the pipeline)
summary_subdirs:
- basecalled
- debarcoded
- aligned
- variants
- coverage_per_base
- SNPs
- barcode_summary
The pande-mic dashboard is available at https://github.com/fmfi-compbio/pande-mic-dashboard; follow the instructions in that repository.
First, you have to install and configure the tool and its dependencies. If you have not done so already, please follow the instructions above.
To start the analysis, go inside the cloned repository and run this command:
python -u ./python/main.py --config_dir <path to the pipeline directory>/configs/<your configuration directory>/
The main script. It loads the main configuration file and starts the analysis.
usage: main.py [-h] -conf CONFIG_DIR
optional arguments:
-h, --help show this help message and exit
-conf CONFIG_DIR, --config_dir CONFIG_DIR
path to the config dir
An instance of this class is created in the main.py file. The purpose of the class is to create the batches and run Snakemake in a loop.
A short overview of its essential methods:
- init(): set params, [clear annotated], print initial values, create the summary dir, its subdirs, and empty .csv files
- create_config_copy(): creates a copy of the config directory inside the output directory
- create_dirs(): creates the directory and file structure for the pipeline output inside the output directory
- load_processed(): loads the list of files that are already processed, counts and sets the number of iterations and processed batches
- run(): while True: create batches, run snakemake, clean batches, merge results into the summary
- create_batches(): find unprocessed files (compare listdir with the array of previously batched files), wait until there are enough files to create a batch, create the batch(es) by linking files (os.symlink), add the files to the array of batched files (see the sketch after this list)
- run_snakemake(): run snakemake (os.system(snakemake [params])), uses scripts_dir/Pipeline and scripts_dir/rules
- clean_batches(): remove processed batch folders with symlinks
- write_batches(): write lists of processed files for each processed batch into processed.csv
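For illustration, the batch-creation step can be sketched like this (an approximation of the idea, not the actual create_batches() implementation; names and details are simplified):

# Illustrative sketch: symlink unprocessed input files into new batch directories.
import os

def create_batches(input_path, batch_path, batched, batch_size, iteration):
    """Link files from input_path that are not yet in `batched` into new
    batch directories under batch_path; returns the list of new batch names."""
    new_files = sorted(f for f in os.listdir(input_path) if f not in batched)
    full = len(new_files) - len(new_files) % batch_size   # only complete batches
    batches = []
    for i in range(0, full, batch_size):
        batch = "batch_{}_{}".format(iteration, i // batch_size)
        batch_dir = os.path.join(batch_path, batch)
        os.makedirs(batch_dir, exist_ok=True)
        for fname in new_files[i:i + batch_size]:
            os.symlink(os.path.join(input_path, fname),
                       os.path.join(batch_dir, fname))
            batched.add(fname)
        batches.append(batch)
    return batches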
used for handling keyboard interrupts in PipelineRunner
(state d.p.)
a helper script/function for merging two .csv files. It supports aggregation over one column: sum() is applied to the other columns. A simplified sketch of this aggregation idea is shown after the parameter lists below.
supported arguments:
-f1 FIRST_FILE_NAME, --first_file_name FIRST_FILE_NAME
first file to merge/or file names of multiple files with same
structure separated by comma
-f2 SECOND_FILE_NAME, --second_file_name SECOND_FILE_NAME
second file to merge
-cf COUNT_FIRST, --count_first COUNT_FIRST
columns to be summed in first file separated by ,
-gpf GROUP_BY_FIRST, --group_by_first GROUP_BY_FIRST
column to use for grouping in first file or -1
-hf HAS_HEADER_FIRST, --has_header_first HAS_HEADER_FIRST
1 if first file has a header, 0 otherwise
-cs COUNT_SECOND, --count_second COUNT_SECOND
columns to be summed in second file separated by ,
-gps GROUP_BY_SECOND, --group_by_second GROUP_BY_SECOND
column to use for grouping in second file or -1
-hs HAS_HEADER_SECOND, --has_header_second HAS_HEADER_SECOND
1 if second file has a header, 0 otherwise
-gs GROUP_START, --group_start GROUP_START
whether the output should start with group (1) or not (0)
-o OUT, --out OUT output file
function merge(group_start, count_first, count_second, has_header_first, has_header_second, group_by_first, group_by_second, fnames, fname2, output):
params:
- group_start: whether the output should start with the group column (1) or not (0)
- count_first: array of columns to be summed in the first file (0-based)
- count_second: array of columns to be summed in the second file (0-based)
- has_header_first: 1 if the first file has a header, 0 otherwise
- has_header_second: 1 if the second file has a header, 0 otherwise
- group_by_first: column to use for grouping in the first file, or -1 if none
- group_by_second: column to use for grouping in the second file, or -1 if none
- fnames: list of files (they should have the same structure) to be merged
- fname2: second file to merge
- output: output file
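A simplified sketch of the aggregation idea (this is not the actual merge script; it only shows grouping by one column and summing selected columns across input files):

# Illustrative: merge CSV files by grouping on one column and summing selected columns.
import csv
from collections import defaultdict

def merge_counts(paths, group_by, sum_cols, out_path, has_header=True):
    """Group rows of the input CSV files by column `group_by` and sum the
    columns listed in `sum_cols` (all column indices are 0-based)."""
    totals = defaultdict(lambda: [0] * len(sum_cols))
    header = None
    for path in paths:
        with open(path, newline="") as fh:
            reader = csv.reader(fh)
            if has_header:
                header = next(reader)
            for row in reader:
                key = row[group_by]
                for i, col in enumerate(sum_cols):
                    totals[key][i] += int(row[col])
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        if header:
            writer.writerow([header[group_by]] + [header[c] for c in sum_cols])
        for key, sums in sorted(totals.items()):
            writer.writerow([key] + sums)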
The Snakemake pipeline is defined in <pande-mic repository>/Pipeline; its config is <pande-mic repository>/configs/<some config dir>/config.yaml (the config directory should be specified as a parameter for the main.py script). The pipeline has 3 phases:
- for all batches in batch_path (from config.yaml) and samples = barcode01, ..., barcodeXY (XY = barcodes from config.yaml), create SNPs = output_dir/batch/SNPs/sample.csv (output_dir from config.yaml); at the end of the iteration, {batch}.done files are created in the batch_done directory for each processed batch
- for all batches that were processed in the previous iteration (i.e. the {batch}.done files exist for them), create a {batch}.summarized file in the batches_done directory; this forces the summarization process to run for each of the batches processed in the previous iteration
- variant calling from the summary directory
There are 3 directories containing the rules for the pipeline:
- the batches directory contains the rules that need to be run for each batch before summarization
- the summary directory contains the rules for summarization of the batches
- the from_summary directory contains the rules for tasks that should be done after the summarization
fast5_dir = config["batch_path"]+"{batch}"
-> fastq = config["output_dir"]+"{batch}/basecalled/{batch}.fastq"
dependency: deepnano blitz
bash: count_reads_and_bases.sh
fastq = config["output_dir"]+"{batch}/basecalled/{batch}.fastq"
-> out = directory(config["output_dir"]+"{batch}/debarcoded/{batch}")
dependency: guppy v4.4.1
bash: count_reads_and_bases.sh
debarcoding_out = config["output_dir"]+"{batch}/debarcoded/{batch}"
-> barcode_summary_files = [config["output_dir"]+"{batch}"+barcode for barcode in expand("/barcode_summary/{barcode}.fastq",barcode=barcodes)]+[config["output_dir"]+"{batch}/barcode_summary/unclassified.fastq"]
bash: concat_barcodes.sh
fastq = config["output_dir"]+"{batch}/barcode_summary/{sample}.fastq"
-> bam = config["output_dir"]+"{batch}/aligned/{sample}.bam"
,
dependency: minimap2, samtools
bam = config["output_dir"]+"{batch}/aligned/{sample}.bam"
-> observed_counts = config["output_dir"]+"{batch}/coverage_per_base/{sample}.csv"
dependency: python: count_observed_counts.py
observed_counts = config["output_dir"]+"{batch}/"+"coverage_per_base/{sample}.csv"
-> SNPs = config["output_dir"]+"{batch}/"+"SNPs/{sample}.csv"
dependency: python: find_SNPs_for_csv.py
directory {batch}/basecalled, containing the basecalled .fastq file (one for all .fast5 files in a batch together) -> {batch}/basecalled/{batch}.fastq
{batch}/basecalled/count.csv -> file path, num of reads, num of bases (one line)
directory {batch}/debarcoded/{batch}, containing up to <num of barcodes + 1> directories (one for each barcode (barcodeXY) + unclassified) with debarcoded .fastq files
barcoding_summary.txt (guppy), ignored ... maybe we can use it somehow
{batch}/debarcoded/count.csv (one line for each .fastq -> file path, num of reads, num of bases, barcode)
directory {batch}/barcode_summary, containing barcodeXY.fastq (for each barcode) and unclassified.fastq: concatenated .fastq files from the guppy output (from {batch}/debarcoded/{batch}), plus empty .fastq files for barcodes that are not in the guppy output
{batch}/barcode_summary/summary.csv (one line for each .fastq -> file path, num of reads, num of bases, barcode)
directory {batch}/aligned, containing barcodeXY.bam and barcodeXY.bam.bai for each barcode, plus unclassified.bam and unclassified.bam.bai
directory {batch}/coverage_per_base, containing barcodeXY.csv for each barcode and unclassified.csv, with lines [position,A,C,G,T] for each position in the reference genome (counts of bases mapped to the reference)
directory {batch}/SNPs, containing barcodeXY.csv for each barcode and unclassified.csv, with lines [snp,same_as_ref,mutated]: the SNP, the number of bases same as the reference, and the number of bases same as the mutation
+ log directory (<output directory>/{batch}/log/{align,basecalling,concat_barcodes,count_bases,debarcoding,find_SNPs}/)
+ timer files in the timers directory (<output_directory>/timers/...)
generated using the summary and from_summary rules of the Snakemake pipeline
<output directory>/processed.csv - processed batches (with lists of files) that were successfully merged into the summary and deleted
This directory can be used as the input for the pande-mic dashboard
aligned/mapped.csv: barcode, num of mapped reads
basecalled/count.csv: num of reads, num of bases
barcode_summary/summary.csv: num of reads, num of bases, barcode
coverage_per_base/{barcodeXY, unclassified}.csv: [position, A, C, G, T]
SNPs/{barcodeXY, unclassified}.csv: [snp, same_as_ref, mutated]
variants/{barcode, unclassified}.json
[{
"name": <variant name>,
"mutations":
[
{"position": 3267,
"from": "C",
"to": "T",
"same_as_reference": 44,
"same_as_mutation": 220},
{..}, {..}
],
"subs": [{"name:"... }]
}]