ORI (Oxford Nanopore Reads Identification) is a piece of software that uses long Nanopore reads to identify the bacteria present in a sample at the strain level.
The ORI program has two sub-parts: (1) the creation of an index containing the reference genomes of the species of interest and (2) the querying of this index with long reads from Nanopore sequencing in order to identify the strain(s) present.
The index is based on the structure implemented in HowDeSBT [1], modified to use qgrams (words derived from spaced seeds) instead of kmers.
As said above, we replaced kmers with qgrams. To get qgrams we use spaced seeds, which introduce "don't care" positions in kmers so that these positions are not affected by sequencing errors. To select the best seed pattern for the classification of long reads we used the iedera software [2][3]. The best seed for our classification task appears to be the following one, of size 15 and weight 13: 111111001111111. This seed is applied to words of size 15, and the resulting qgrams are used instead of kmers, as illustrated below.
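As an illustration only (this command is not part of ORI), applying the seed to a 15-mer amounts to keeping the 13 positions where the seed has a 1 and dropping the two "don't care" positions (here positions 7 and 8), assuming the usual convention of concatenating the kept positions:

# illustration: mask the two don't-care positions of the seed 111111001111111 in a 15-mer (GNU cut)
echo "ACGTACGTACGTACG" | cut --complement -c7,8
# prints ACGTACACGTACG, the 13-character qgram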
Preconstructed indexes for Streptococcus thermophilus strains are available in the preconstructed_indexes directory of the ORI GitHub repository.
1. Harris R. S., Medvedev P., Improved representation of sequence bloom trees, Bioinformatics, btz662. https://doi.org/10.1093/bioinformatics/btz662
2. Noe L., Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds, Algorithms for Molecular Biology, 12(1), 2017. https://doi.org/10.1186/s13015-017-0092-1
3. Kucherov G., Noe L., Roytberg M., A unifying framework for seed sensitivity and its application to subset seeds, Journal of Bioinformatics and Computational Biology, 4(2):553-569, 2006. https://doi.org/10.1142/S0219720006001977
The easiest way to install ORI is through conda.
From Conda
conda create -p ori_env
conda activate ori_env
conda install -c gsiekaniec -c conda-forge ori
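Once the environment is activated, you can check that the scripts are available (assuming the conda package puts the ORI.py entry point on your PATH, which the commands below rely on):

ORI.py --help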
The seedfile.txt file can be found in the seed directory of ORI.
You can also create seedfile.txt in a directory of your choice:
cd path/to/repertory
touch seedfile.txt
echo "111111001111111" > seedfile.txt
If your fasta files are spread over several subfolders, you have to gather them into a single directory (see the example below).
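For example, one way to gather them (a sketch, assuming your genomes end in .fasta or .fna and path/to/all_genomes is the target directory):

find path/to/subfolders \( -name '*.fasta' -o -name '*.fna' \) -exec cp {} path/to/all_genomes/ \;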
The directory containing the fasta files must be the current directory to run the following commands:
cd path/to/the/fasta/files/directory
As the last quantification step requires the size of the genomes, it is best to compute it now, while the genomes are at hand:
ORI.py length -g path/to/the/genomes -o length.txt
Parameters
Parameters | Description | Required |
---|---|---|
-g/--genomes | path to the directory containing the genomes (.fna or .fasta). | Yes |
-o/--outfile | output file containing the length of each genome. | No. Default: length.txt |
-fpr/--false_positive_rate | false positive rate (between 0 and 1) wanted for the bloom filter containing the largest genome. | No. Default: 0.01 |
-s/--seed_size | size of the spaced seed you want to use (not its weight). | No. Default: None |
The ORI.py length step can also compute an effective size for the bloom filters: use the --seed_size and --false_positive_rate options. The recommended bloom filter size is then written to the bf_min_size.txt file, as in the example below.
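For example (a sketch reusing the options from the table above; the size reported in bf_min_size.txt depends on your genomes and can then be used as the --bits value of howdesbt makebfQ):

ORI.py length -g path/to/the/genomes -o length.txt -s 15 -fpr 0.01
cat bf_min_size.txt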
Then we create the bloom filters for all genomes:
howdesbt makebfQ --k=15 --qgram=path/to/seedfile.txt --bits=0.25G *.fasta
Parameters
Parameters | Description |
---|---|
--k | length of the seed given in --qgram. If you change the seed in your seedfile.txt, adjust this parameter accordingly. |
--qgram | file containing the spaced seed to use. It's a text file with the seed on its first line (here 111111001111111). |
--bits | size of the bloom filters. A size that is too small will give too many false positives to be usable, while a size that is too large will take up too much space and greatly increase computation time. |
Then we get the names of the bf (bloom filter) files used to create the tree:
ls *.bf > leafname
Now that the bloom filters are created, it is no longer necessary to keep the fasta files; if you do not need them for anything else, they can be deleted to save space. Moreover, if the fasta files cannot all be downloaded to the machine at once due to lack of space, it is possible to download them little by little and create the filters as you go, deleting each fasta file once it has been turned into a filter (.bf), as in the sketch below.
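A minimal sketch of such an incremental workflow, assuming a hypothetical urls.txt file listing one genome URL per line (the makebfQ options are the same as above):

while read -r url; do
  wget "$url"                 # download one genome
  fasta=$(basename "$url")
  howdesbt makebfQ --k=15 --qgram=path/to/seedfile.txt --bits=0.25G "$fasta"
  rm "$fasta"                 # delete the fasta once its filter (.bf) exists
done < urls.txt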
2) If you want to cluster closely related strains, so as to identify a tight cluster of strains rather than a mixed list of single strains:
howdesbt distance --list=leafname
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
To help you choose the threshold, you can look at the distribution of distances between the strains of the index (using ORI.py threshold, see the command below). It gives a figure threshold.png as output showing this distribution (distances are multiplied by 1e05 in the figure). More generally, if your cluster of strains is too large and gives you too many identification possibilities, try a lower -t value. For example, if strains 205, 51 and 55 are very close on a phylogenetic tree but somewhat farther from strains 54 and 78, and a threshold of 0.0002 (the default value) gives you a cluster containing all 5 strains, you can lower it to -t 0.0001 to obtain two separate clusters.
ORI.py threshold -m hamming_matrix.tsv -t 0.0002
Parameters
Parameters | Description | Required |
---|---|---|
-m/--matrix | path to the hamming distance matrix. It's the output of the first howdesbt distance command. | Yes |
-t/--threshold | threshold used to merge close genomes. Be careful not to set it too high or too low. Floating point number between 0 and 1. | No. Default: 0.0002 |
The default value of 0.0002 is the one used to merge Streptococcus thermophilus strains with filters of size 0.5G. It must be adjusted when using another species and/or another filter size.
Once you have defined your own -t value, merge your strains into suitable clusters:
howdesbt distance --list=leafname --threshold=0.0002 --matrix=hamming_matrix.bin --merge
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
--matrix | path to the hamming distance matrix. It's the output of the first howdesbt distance command. |
--threshold | hamming distance threshold between bloom filters below which they are merged. Floating point number between 0 and 1. |
--merge | if set, merge the strains forming maximal cliques (at the given threshold) into single bloom filters. |
ORI.py clean_merge -n leafname -r path/to/repository/with/bf/files -o list_number_file.txt
Parameters
Parameters | Description | Required |
---|---|---|
-n/--names | list of the bloom filters names (one per line). | Yes |
-r/--repository | directory containing the bloom filter files (.bf). | Yes |
-o/--outfile | output file containing, per line, one id number, one genome name and the corresponding number of sequences. | Yes |
ls *.bf > leafname_merge
Since the genomes of some strains have been merged, the size of these clusters must also be recalculated:
ORI.py merge_length -b leafname_merge -l length.txt -c list_number_file.txt -o merge_length.txt
Parameters
Parameters | Description | Required |
---|---|---|
-b/--bflist | list of the bloom filters names (one per line). | Yes |
-l/--lengthfile | file containing the length of each genome. It's the output of ORI.py length (default: length.txt). | Yes |
-c/--correspondance | file containing the correspondence between numbers and genomes. It's the output of ORI.py clean_merge (list_number_file.txt). | Yes |
-o/--outfile | output file containing the length of each genome or genome cluster. | No. Default: length_merge.txt |
To run these commands you must be in the directory containing the .bf files.
howdesbt cluster --list=leafname/or/leafname_merge --tree=union.sbt --nodename=node{number} --cull
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
--tree | name of the tree topology file. |
--nodename | filename template for internal tree nodes; it must contain the substring {number}. |
--cull | remove nodes from the binary tree; those for which the saturation of determined is more than 2 standard deviations are removed. |
howdesbt build --howde --tree=union.sbt --outtree=howde.sbt
Parameters
Parameters | Description |
---|---|
--howde | create tree nodes as determined/how, but only store active bits; the nodes are created as rrr-compressed bit vector(s). |
--tree | name of the tree topology file. |
--outtree | name of topology file to write tree consisting of the filters built. |
Once the compressed bloom filters have been created, we can delete those that are not compressed:
ls | grep -Pv 'detbrief.rrr.' | grep '.bf' | xargs rm --
To facilitate identification, it may be wise to remove reads of poor quality. For this you can use:
ORI.py suppr_bad_reads -fq fastq -q min_quality_value -l min_length_value
Parameters
Parameters | Description | Required |
---|---|---|
-fq/--fastq | fastq file. | Yes |
-q/--qualityMin | minimum quality (Phred score) threshold to keep a read. | No. Default: 9 |
-l/--lengthMin | minimum length threshold to keep a read. | No. Default: 2000 |
--gzip | use of compressed fastq: fastq.gz. | No |
As we show in the publication (coming soon), ORI's identification is better with 4000 reads than with 16000, due to noise related to sequencing errors. It is therefore advisable to reduce the number of reads, for example as follows (a fastq record spans 4 lines, so keeping 16000 lines keeps 4000 reads):
head -n 16000 fastq_file_better_than_number.fastq > fastq_file_4000_reads.fq
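To check how many reads a fastq file contains (a simple sketch, valid for an uncompressed fastq with 4 lines per record):

echo $(( $(wc -l < fastq_file_4000_reads.fq) / 4 ))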
Then we can start the identification:
howdesbt queryQ --sort --qgram=path/to/seedfile.txt --tree=path/to/howde.sbt --threshold=0.5 fastq_file_4000_reads.fq > path/to/results_howde.txt
Parameters
Parameters | Description |
---|---|
--sort | sort matched leaves by the number of query qgrams present, and report the number of qgrams present. |
--qgram | file containing the used spaced seed. It's a text file containing the seed used on the first line (here 111111001111111). |
--tree | name of the tree topology file (howde.sbt). |
--threshold | fraction of query qgrams that must be present in a leaf to be considered as a match; this must be between 0 and 1. |
ORI.py matrix -f path/to/results/from/HowDeSBT -l path/to/leafname/or/leafname_merge -o path/to/results/matrix.tsv
Parameters
Parameters | Description | Required |
---|---|---|
-f/--file | Results file from HowDe output. | Yes |
-l/--list_name | list of the bloom filters names (one per line), in the same order as in the matrix. | Yes |
-out/--output | output {strains x reads} matrix file. | No. Default: matrice.tsv |
ORI.py identification -m path/to/matrix.tsv -f path/to/results/from/HowDeSBT -le path/to/length.txt/or/merge_length.txt -l path/to/leafname/or/leafname_merge -c path/to/clingo
With the conda installation, the clingo path can be obtained with $(which clingo).
Parameters
Parameters | Description | Required |
---|---|---|
-m/--matrix | {strains x reads} matrix file. | Yes |
-f/--file | results file from HowDe output. | Yes |
-le/--length | file with one genome and its length per line. It's the output of ORI.py length or ORI.py merge_length. | Yes |
-l/--listname | list of the bloom filters names (one per line), in the same order as in the matrix. | Yes |
-c/--clingo_path | clingo path. With the conda installation this path is given by $(which clingo). | Yes |
-o/--output | output results file. | No. Default: out.txt |
-t/--threshold | minimum percent value in the matrix for an association between a read and a species (between 0 and 100). | No. Default: 50 |
-n/--nbchoices | only the nbchoices maximum values of a row are considered. Warning: must be less than or equal to the number of species. | No. Default: 12 |
If the genomes of close strains were not merged during the creation of the index, create the number-to-name correspondence file (lnf.txt) yourself:
num=0; for i in `cat path/to/leafname`;do echo -e "${num} \t ${i}" >> lnf.txt; let num++; done
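lnf.txt then contains one numbered bloom filter name per line, for example (hypothetical filter names):

0 	 strain_A.bf
1 	 strain_B.bf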
The raw results are not very readable (especially when close strains have been merged); to get cleaner results, it is possible to run:
ORI.py beautiful_results -f path/to/results/from/ORI -n path/to/lnf.txt/or/list_number_file.txt --pie_chart
Parameters
Parameters | Description | Required |
---|---|---|
-f/--file | results file from ORI. | Yes |
-n/--number_name_list | file containing the correspondence between numbers and genomes. It's the output of ORI.py clean_merge (list_number_file.txt) or the lnf.txt file created above. | Yes |
-o/--output | output file. | No. Default: clean_results.txt |
--pie_chart | create a pie chart of the results in png format. | No |