Bacterial genome analysis using short and long reads.
________ _______________ ______
___ __ )_____ _________ ____/__ /___ _____ ____ /______
__ __ | __ `/ ___/_ /_ __ /_ / / /_ |/_/_ / ___/ /_
_ /_/ // /_/ // /__ _ __/ _ / / /_/ /__> < _ /__/_ __/
/_____/ \__,_/ \___/ /_/ /_/ \__,_/ /_/|_| /_____//_/
BacFluxL+ v1.1.2
August 2024
AIT Austrian Institute of Technology, Center for Health & Bioresources
- Livio Antonielli
- Dominik K. Großkinsky
- Hanna Koch
- Friederike Trognitz
IPK Leibniz Institute of Plant Genetics and Crop Plant Research, Cryo and Stress Biology
- Manuela Nagel
- Alexa Sanchez Mejia
BacFluxL+ is a comprehensive and automated bioinformatics workflow specifically designed for the processing and analysis of bacterial genomic data sequenced with Illumina and Oxford Nanopore Technologies. It leverages the advantages offered by both short and long reads, implementing a series of rules as a Snakemake script.
The pipeline accepts paired-end reads along with long reads as input. These are subjected to a series of analyses, including quality control, error correction, replicon sequence re-orientation, assessment of genome completeness and contamination, taxonomic placement, and annotation. Additionally, it infers secondary metabolites, screens for antimicrobial resistance and virulence genes, and investigates the presence of plasmids.
BacFluxL+ is an enhanced version of BacFlux.
- Quick Start
- Rationale
- Description
- Installation
- Configuration
- Running BacFluxL+
- Output
- Acknowledgements
- Citation
- References
Here is a quick guide to get started with BacFluxL+:
- Download the latest release:
      # git command
      git clone https://github.com/iLivius/BacFluxLplus.git
- Configure the config.yaml:
  - Specify the input directory containing:
    - Illumina reads: paired-end files, e.g., strain-1_R1.fq.gz, strain-1_R2.fq.gz.
    - ONT reads: the long-read counterpart, e.g., strain-1_ont.fq.gz.
  - Provide the desired location for the analysis outputs and the paths to the following databases:
    - blast_db: path to the NCBI core nt database directory
    - eggnog_db: path to the eggNOG diamond database directory
    - gtdbtk_db: path to the GTDB R220 database directory
    - bakta_db: path to the Bakta database directory
    - platon_db: path to the Platon database directory
- Install Snakemake (if not installed already) and activate the environment:
      # optional, if not installed already
      mamba create -c conda-forge -c bioconda -n snakemake snakemake
      # activate the Snakemake environment
      conda activate snakemake
- Run BacFluxL+. Within the main workflow directory, launch the pipeline as follows:
      snakemake --sdm conda --keep-going --ignore-incomplete --keep-incomplete --cores 50
This command uses the following options:
--sdm conda: uses Conda for dependency management
--keep-going: continues with independent jobs if a job fails
--ignore-incomplete: does not check for incomplete output files
--keep-incomplete: does not remove incomplete output files left by failed jobs
--cores 50: caps the number of local CPU cores at this value (adjust as needed).
Now you're all set to run BacFluxL+! Refer to the installation, configuration and running BacFluxL+ sections for detailed instructions.
The analysis of bacterial Whole Genome Sequencing (WGS) data requires the integration of multiple bioinformatics tools. BacFluxL+ is a follow-up version of BacFlux that takes this process a step further by leveraging the strengths of both Illumina short reads and Oxford Nanopore Technologies (ONT) long reads. Integrating short and long reads can improve the accuracy and completeness of the assembled genomes.
BacFluxL+ incorporates established bioinformatics tools into a comprehensive and automated Snakemake workflow, allowing researchers to focus on interpreting the biological significance of their data, rather than on the technical aspects of data analysis.
Here's a breakdown of the BacFluxL+ workflow:
- Preprocessing of Short Reads:
- Assembly of Short Reads:
  - Filtered reads are assembled into contigs with SPAdes.
- Quality Control, Decontamination and Long Read Correction:
  - Contigs are filtered based on a minimum length of 500 bp and a minimum coverage of 2x.
  - Filtered reads are mapped back to contigs using bowtie2 and samtools. The resulting BAM file is analyzed with QualiMap.
  - Local alignments of contigs are performed against the NCBI nt database using BLAST+.
  - Contaminant contigs are checked with BlobTools. Unless otherwise specified (see the configuration section for more details), the output of this step is parsed automatically to discard contaminants based on the relative taxonomic composition of the contigs.
  - Genome assembly quality is evaluated with Quast.
  - Filtered reads are mapped back to the selected contigs using bowtie2 and samtools. Reads matching the selected contigs are used in the next step for long read correction.
  - ONT reads are trimmed and error corrected using Illumina reads with Filtlong.
- Assembly of Long Reads:
  - Error-corrected ONT reads are assembled using Flye.
- Correction of Contigs:
  - Medaka is used to generate a consensus sequence from the assembled contigs and the original long reads. This consensus sequence should have a higher accuracy than the original assembled contigs, but this is not always the case, especially if ONT reads were base-called with the latest versions of the super-accurate model of Dorado. For a deeper insight, please refer to Ryan Wick's bioinformatics blog.
- Reorientation of Replicons:
  - Bacterial chromosomes are reoriented using dnaapler, to start canonically with the dnaA sequence. Other replicons like plasmids and bacteriophages are also reoriented, using repA and terL, respectively, as the starting point.
- Polishing with Short Reads:
  - Reoriented replicons are polished with short reads using Polypolish.
- Differences between long-read assembly and short-read assembly:
  - Decontaminated contigs obtained from the short-read assembly are used as reference. Differences such as SNPs and indels between the reference and each of the following long-read assembled contigs are inspected with Snippy: a) Contigs output by Flye; b) Medaka long-read curated contigs; c) Replicons reoriented by dnaapler; d) Polypolish short-read corrected contigs.
- Evaluation of Completeness and Contamination:
  - Genome completeness and contamination of short-read assembled and long-read assembled bacterial chromosomes are assessed with CheckM using taxon-specific markers.
- Taxonomic Analysis:
- Annotation:
- Antimicrobial Resistance (AMR):
- Plasmids:
- Prophages:
  - Contigs are screened for viral sequences with VirSorter2, followed by CheckV for refinement.
- Reporting:
  - Results are parsed and aggregated to generate a report using MultiQC.
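The contig filter mentioned in the Quality Control step above (minimum length 500 bp, minimum coverage 2x) can be illustrated with a short shell sketch. This is not the workflow's own rule. It assumes SPAdes-style FASTA headers of the form `>NODE_<n>_length_<bp>_cov_<x>`; the function name `filter_contigs` and the inclusive thresholds are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of BacFluxL+.
# Keeps contigs whose header reports length >= min_len and coverage >= min_cov.
# Assumes headers like: >NODE_1_length_1200_cov_15.3
filter_contigs() {
    local fasta="$1" min_len="${2:-500}" min_cov="${3:-2}"
    awk -v min_len="$min_len" -v min_cov="$min_cov" '
        /^>/ {
            # Reset per record, then read the values that follow "length" and "cov"
            len = 0; cov = 0
            n = split($0, f, "_")
            for (i = 1; i < n; i++) {
                if (f[i] == "length") len = f[i + 1] + 0
                if (f[i] == "cov")    cov = f[i + 1] + 0
            }
            keep = (len >= min_len && cov >= min_cov)
        }
        # Print the header and all sequence lines of a kept record
        keep { print }
    ' "$fasta"
}

# Example call (file names are placeholders):
# filter_contigs contigs.fasta 500 2 > filtered_contigs.fasta
```

Reading the values from the header avoids recomputing sequence lengths, but it only works while the assembler's header format stays as assumed here.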
BacFluxL+ automatically downloads all dependencies and several databases. However, some external databases require manual download before running the workflow.
- Download BacFluxL+:
  Head over to the Releases section of the repository. Download the latest archive file (typically in .zip or .tar.gz format). This archive contains the BacFluxL+ Snakefile script, the config.yaml configuration file, and an envs environment directory. Extract the downloaded archive into your desired location. This will create a directory structure with the necessary files and directories. Alternatively, download via command line as:
      # git command
      git clone https://github.com/iLivius/BacFluxLplus.git
- Install Snakemake:
  BacFluxL+ relies on Snakemake to manage the workflow execution. Find the official and complete set of instructions here. To install Snakemake as a Conda environment:
      # install Snakemake in a new Conda environment
      mamba create -c conda-forge -c bioconda -n snakemake snakemake
- Databases:
  While BacFluxL+ automates the installation of all software dependencies, some external databases need to be downloaded manually. If you have installed them already, skip this paragraph and jump to the configuration section. Here are the required databases and instructions for obtaining them.
  - NCBI core nt database, adapted from here:
        # create a list of all core nt links in the directory designated to host the database (recommended)
        rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/core_nt.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_links.list
        # alternatively, create a list of nt links for bacteria only
        rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt_prok.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_prok_links.list
        # download in parallel, without overdoing it
        cat nt*.list | parallel -j4 'rsync -h --progress rsync://{} .'
        # decompress with multiple CPUs
        find . -name '*.gz' | parallel -j4 'echo {}; tar -zxf {}'
        # get NCBI taxdump
        wget -c 'ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz'
        tar -zxvf taxdump.tar.gz
        # get NCBI BLAST taxonomy
        wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz'
        tar -zxvf taxdb.tar.gz
        # get NCBI accession2taxid file
        wget -c 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz'
        gunzip nucl_gb.accession2taxid.gz
    NOTE: the complete NCBI core nt database and taxonomy-related files should take around 223 GB of hard drive space.
  - eggNOG diamond database:
        # the easiest way is to install a Conda environment with eggnog-mapper, first
        conda create -n eggnog-mapper eggnog-mapper=2.1.12
        # activate the environment
        conda activate eggnog-mapper
        # then, create a directory where you want to install the diamond database for eggnog-mapper
        mkdir /data/eggnog_db  # replace /data/eggnog_db with your actual path
        # finally, download the diamond db in the newly created directory
        download_eggnog_data.py --data_dir /data/eggnog_db -y
    NOTE: the eggNOG database requires ~50 GB of space.
  - GTDB database:
        # move first inside the directory where you want to place the database, then download and decompress either the full package or the split package version
        # full package
        wget -c https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
        tar xzvf gtdbtk_r220_data.tar.gz
        rm gtdbtk_r220_data.tar.gz
        # split package (alternative)
        base_url="https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/split_package/gtdbtk_r220_data.tar.gz.part_"
        suffixes=(aa ab ac ad ae af ag ah ai aj ak)
        printf "%s\n" "${suffixes[@]}" | xargs -n 1 -P 11 -I {} wget "${base_url}{}"
        cat gtdbtk_r220_data.tar.gz.part_* > gtdbtk_r220_data.tar.gz
        tar xzvf gtdbtk_r220_data.tar.gz
        rm gtdbtk_r220_data.tar.gz
    NOTE: compressed archive size ~102 GB, decompressed archive size ~108 GB.
  - Bakta database:
        # the Bakta database comes in two flavours. To download the full database, use the following link (recommended):
        wget -c https://zenodo.org/records/10522951/files/db.tar.gz
        tar -xzf db.tar.gz
        rm db.tar.gz
        # alternatively, download a lighter version
        wget https://zenodo.org/record/10522951/files/db-light.tar.gz
        tar -xzf db-light.tar.gz
        rm db-light.tar.gz
        # if the AMRFinderPlus db gives an error, update it by activating the Bakta Conda env and running the following command, targeting the Bakta db directory:
        amrfinder_update --force_update --database db/amrfinderplus-db/
    NOTE: according to the source, the light version should take 1.4 GB compressed and 3.4 GB decompressed, whereas the full database should take 37 GB compressed and 71 GB decompressed.
  - Platon database:
        # download the database in a directory of your choice
        wget https://zenodo.org/record/4066768/files/db.tar.gz
        tar -xzf db.tar.gz
        rm db.tar.gz
    NOTE: according to the source, the database occupies 1.6 GB compressed and 2.8 GB decompressed.
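Given the sizes quoted in the NOTE lines above, it can be worth checking free disk space before starting the downloads. The following is a minimal sketch, not part of the workflow: the function name `has_free_space` and the example path are hypothetical, and the figure of roughly 455 GB is simply the sum of the decompressed sizes listed above (223 + 50 + 108 + 71 + 2.8 GB), which does not account for the extra room needed while archives are being unpacked.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of BacFluxL+.
# Returns 0 if the filesystem holding DIR has at least NEED_GB gigabytes free.
has_free_space() {
    local dir="$1" need_gb="$2" avail_kb
    # df -P gives one line per filesystem; with -k, column 4 is the free space in 1 KiB blocks
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Example call (/data is a placeholder for the directory that will host the databases):
# has_free_space /data 455 || echo "Not enough free space for all databases"
```

If the databases are spread across several filesystems, the check would need to be run once per target directory with the matching size.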
Before running BacFluxL+, you must edit the config.yaml file with a text editor. The file is organized in four sections: links, directories, resources and parameters.
- links
  This section should function as expected without modifications. Therefore, it is recommended to change the links only if they are not working or if there is a need to update the database versions:
  - phix_link: Path to the PhiX genome reference used by Illumina for sequencing control.
  - card_link: Path to the Comprehensive Antibiotic Resistance Database (CARD).
  - checkv_link: Path to the CheckV database for viral genome quality assessment.
- directories
  Update paths based on your file system:
  - fastq_dir: This is the directory containing the Illumina paired-end reads and the ONT long reads for each provided genome, in FASTQ format. You can provide as many files as you like, subject to the following conditions:
    - Files must have one of the following extensions: fastq, fq, fastq.gz or fq.gz.
    - You can provide multiple samples, but all files should have the same extension. In other words, do not mix files with different extensions.
    - Sample names should be formatted as follows: strain-1_R1.fq, strain-1_R2.fq, and strain-1_ont.fq. In this example, BacFluxL+ will interpret the name of the strain as “strain-1”. Strain names cannot contain underscores. Also, the name of each strain should be followed by “_R1”, “_R2”, and “_ont”, to identify Illumina PE reads and ONT reads, respectively. Here’s an example of how your input directory might look if it contains the PE reads and ONT reads of one strain, CDRTa11:
          ahab@pequod:~/data$ ls -lh
          total 4,8G
          -rw-rw-r-- 1 ahab ahab 1.8G May 8 12:00 CDRTa11_ont.fastq
          -rw-rw-r-- 1 ahab ahab 1.6G May 8 12:00 CDRTa11_R1.fastq
          -rw-rw-r-- 1 ahab ahab 1.6G May 8 12:00 CDRTa11_R2.fastq
  - out_dir: This directory will serve as the storage location for all output files generated by BacFluxL+. By default, all necessary software and databases will be installed in this directory, within Conda environments. Reusing this output directory for future runs avoids reinstalling everything from scratch.
  - blast_db: Path to the whole NCBI nt (recommended) or prokaryotic database only, and related taxonomic dependencies, see installation.
  - eggnog_db: Path to the diamond database for eggNOG.
  - gtdbtk_db: Path to the R220 release of GTDB.
  - bakta_db: Path to either the light or full (recommended) database of Bakta.
  - platon_db: Path to the Platon database.
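The fastq_dir naming rules described above (accepted extensions, a single extension per directory, `<strain>_R1`, `<strain>_R2` and `<strain>_ont` suffixes, no underscores in strain names) can be checked before a run with a sketch like the one below. It is an illustration under stated assumptions rather than anything shipped with the workflow: the function name `check_fastq_dir` is hypothetical, and strain names are assumed to contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of BacFluxL+.
# Prints one line per problem found in DIR and returns non-zero if there is any.
check_fastq_dir() {
    local dir="$1" status=0 exts="" strains=""
    local f base ext stem tag strain
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        base=$(basename "$f")
        # Only the four extensions listed above are accepted
        case "$base" in
            *.fastq.gz) ext="fastq.gz" ;;
            *.fq.gz)    ext="fq.gz" ;;
            *.fastq)    ext="fastq" ;;
            *.fq)       ext="fq" ;;
            *) echo "unsupported extension: $base"; status=1; continue ;;
        esac
        stem="${base%.$ext}"
        # The name must end in _R1, _R2 or _ont, with a non-empty strain name before it
        case "$stem" in
            ?*_R1|?*_R2|?*_ont) ;;
            *) echo "expected <strain>_R1, _R2 or _ont: $base"; status=1; continue ;;
        esac
        strain="${stem%_*}"    # text before the last underscore
        case "$strain" in
            *_*) echo "strain name contains an underscore: $base"; status=1; continue ;;
        esac
        exts="$exts $ext"
        strains="$strains $strain"
    done
    # All files must share one extension
    if [ "$(printf '%s\n' $exts | sort -u | awk 'END { print NR }')" -gt 1 ]; then
        echo "mixed file extensions in $dir"; status=1
    fi
    # Every strain needs its R1, R2 and ont file
    for strain in $(printf '%s\n' $strains | sort -u); do
        for tag in R1 R2 ont; do
            if ! ls "$dir/${strain}_${tag}".* >/dev/null 2>&1; then
                echo "strain $strain: missing _${tag} file"; status=1
            fi
        done
    done
    return "$status"
}

# Example call (the path is a placeholder):
# check_fastq_dir ~/data
```

Any file in the directory that is not a read file, such as a README, is reported as an unsupported extension, so the sketch is best pointed at a directory that holds reads only.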
- resources
  In this section you can specify the hardware resources available to the workflow:
  - threads: Max number of CPUs used by each rule.
  - ram_gb: Max amount of RAM used (SPAdes only).
- parameters
  - Database selection: BacFluxL+ requires specifying the version of the NCBI nt database for BLAST operations. You can choose between the core_nt and nt_prok versions. By default, the config.yaml configuration file is set to use the core_nt database. For instructions on installing the BLAST database, refer to the installation section.
  - Medaka model: This is the medaka_model parameter, which refers to the model used for basecalling the long reads. If left blank, the default used by Medaka v1.11.3 is r1041_e82_400bps_sup_v4.3.0.
  - Genus filtering: BacFluxL+ includes an optional parameter to specify the bacterial genus of the contigs you wish to retain in the final assembly. If left blank, BacFluxL+ will automatically keep contigs associated with the most abundant taxon, based on the relative composition determined through BLAST analysis. While this approach generally works well, it has limitations, such as reduced resolution at the species level due to reliance on the cumulative best scores of BLAST hits. Additionally, this method may be problematic if the contaminant organism belongs to the same genus as your target organism, or if you are working with co-cultured closely related species or strains. If the genus parameter introduces more issues than benefits, simply remove the genus option from the config.yaml file.
    - Using the genus parameter: if a contaminant turns out to be more abundant than your target organism, you can re-run the workflow after reviewing the assembly output. Specify the genus of the bacterial taxon you want to keep for the re-run.
    - Disabling the genus filtering: if neither the automatic inference of contaminant contigs nor the manual selection of the desired taxon works for you, simply delete the genus option from the parameters. In this case, only contigs tagged as "no-hit" after the BLAST search will be filtered out.
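The idea behind the automatic genus selection described above, keeping the contigs of the taxon with the highest cumulative score, can be sketched in a few lines of shell. This is an illustration only and does not reproduce how the workflow parses BlobTools output: the three-column tab-separated input (contig, genus, cumulative score), the function names, and the choice to leave "no-hit" contigs out of the ranking are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of BacFluxL+.
# Input: tab-separated lines of <contig> <genus> <cumulative_score> (made-up format).

# Print the genus with the highest summed score, ignoring "no-hit" entries.
pick_dominant_genus() {
    awk -F'\t' '
        $2 != "no-hit" { total[$2] += $3 }
        END {
            best = ""
            for (g in total)
                if (best == "" || total[g] > total[best]) best = g
            print best
        }
    ' "$1"
}

# Print the names of the contigs assigned to the given genus.
keep_contigs_of_genus() {
    awk -F'\t' -v genus="$2" '$2 == genus { print $1 }' "$1"
}

# Example calls (composition.tsv is a placeholder file name):
# genus=$(pick_dominant_genus composition.tsv)
# keep_contigs_of_genus composition.tsv "$genus"
```

The sketch also shows why the approach can fail: if a contaminant accumulates a higher total score than the target organism, it becomes the "dominant" genus, which is the situation the genus parameter is meant to correct.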
BacFluxL+ can be executed like any other Snakemake workflow. Please refer to the official Snakemake documentation for more details.
# first, activate the Snakemake Conda environment
conda activate snakemake
# navigate inside the directory where the BacFluxL+ archive was downloaded and decompressed
# launch the workflow
snakemake --sdm conda --cores 50
NOTE: Starting from Snakemake version 8.4.7, the --use-conda option has been deprecated. Instead, you should now use --software-deployment-method conda or --sdm conda.
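Before a full run, it can help to confirm that the database and input directories named in config.yaml exist, and then to ask Snakemake for a dry run, which lists the planned jobs without executing them. The sketch below is a rough pre-flight check, not part of the workflow. The function names are hypothetical, and the YAML parsing is deliberately naive: it assumes each key sits on its own `key: value` line, with no multi-line values and no `#` inside paths. out_dir is left out of the check on the assumption that it may not exist before the first run.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of BacFluxL+.

# Print the value of the first "key: value" line for KEY in FILE,
# stripping trailing comments, trailing blanks and surrounding quotes.
yaml_value() {
    local key="$1" file="$2"
    sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" \
        | head -n 1 \
        | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
              -e "s/^[\"']//" -e "s/[\"']\$//"
}

# Check that every directory key named in the configuration section exists on disk.
check_config_dirs() {
    local file="$1" key path status=0
    for key in fastq_dir blast_db eggnog_db gtdbtk_db bakta_db platon_db; do
        path=$(yaml_value "$key" "$file")
        if [ -z "$path" ] || [ ! -d "$path" ]; then
            echo "config: $key does not point to an existing directory: '$path'"
            status=1
        fi
    done
    return "$status"
}

# Example: run the check, then a dry run (--dry-run only lists the planned jobs):
# check_config_dirs config.yaml && snakemake --sdm conda --cores 50 --dry-run
```

A proper YAML parser would be more robust; grep-style parsing is used here only to keep the sketch free of extra dependencies.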
The workflow output reflects the steps described in the description section. Here's a breakdown of the subdirectories created within the main output folder, along with explanations of their contents:
- 01.pre-processing: QC and statistics of Illumina raw reads before and after quality filtering and trimming, by fastp (v0.23.4).
- 02.Illumina_assembly: Content produced by SPAdes (v4.0.0). In addition to the raw contigs, you will also find filtered contigs (greater than 500 bp and with at least 2x coverage) and decontaminated contigs chosen after a BLAST search (see parameters in the configuration section above). The completeness of these selected contigs will be examined later. They will also be used for Antimicrobial Resistance (AMR) detection, as detailed in the following sections.
- 03.post-processing: Contains the following sub-directories:
  - mapping_evaluation: QualiMap (v2.3) output based on short-read assembled filtered contigs.
  - contaminants: Short-read assembled contig selection based on BLAST+ (v2.15.0) search and BlobTools (v1.1.1) analysis. Check the composition text file for a quick overview of the relative composition of your assembly.
  - assembly_evaluation: Quast (v5.2.0) output based on short-read assembled selected contigs.
  - completenness_evaluation: CheckM (v1.2.3) output based on short-read assembled contigs and long-read assembled contigs, after decontamination, re-orientation, and error correction.
- 04.ONT_assembly: Long-read assembly performed by Flye (v2.9.4) after sequence filtering and short-read correction with Filtlong (v0.2.1).
- 05.ONT_consensus: Long-read assembled contigs are error corrected with long reads using Medaka (v1.11.3).
- 06.fix_start: Replicons are reoriented by dnaapler (v0.8.0) as follows: bacterial chromosomes will start with the dnaA gene, plasmids with repA and phages with terL.
- 07.Illumina_correction: Contains the reoriented long-read assembled contigs after curation with short reads using Polypolish (v0.6.0).
- 08.SNPs: Variants (SNPs and indels) are identified with Snippy (v4.6.0) between the refined short-read assembled contigs and the following: a) Long-read assembly; b) Long-read assembly corrected with long reads; c) Reoriented replicons; d) Reoriented replicons corrected with short reads.
- 09.taxonomy: Taxonomic placement of short-read assembled selected contigs and long-read assembled curated contigs, performed by GTDB-Tk (v2.4.0).
- 10.annotation: Based on long-read assembled, reoriented, error corrected contigs. Contains the following sub-directories:
- 11.AMR: Antimicrobial resistance features are investigated with two complementary approaches:
  - AMR_mapping: short reads filtered by fastp (v0.23.4) are mapped to the CARD database (v3.2.9) using BBMap (v39.06) with minimum identity = 0.99. Mapping results are parsed and features with a covered length of at least 70% are reported in the AMR legend file.
  - ABRicate: short-read assembled selected contigs are screened for the presence of AMR elements and virulence factors, using ABRicate (v1.0.1).
- 12.plasmids: Curated long-read assembled contigs are screened for the presence of plasmid replicons with Platon, and the results are verified by BLAST search to avoid false positives. Contigs ascertained as plasmids are reported in the verified plasmids file.
- 13.phages: Short-read assembled filtered contigs are screened for the presence of viral sequences using VirSorter2 (v2.2.4), followed by CheckV (v1.0.3) for refinement:
  - virsorter: Following the instructions provided here, viral groups (i.e. dsDNA phage, NCLDV, RNA, ssDNA, and lavidaviridae) are detected with a loose cutoff of 0.5 for maximal sensitivity. Original sequences of circular and (near) fully viral contigs are preserved and passed to the next tool.
  - checkv: This second step quality controls the results of the previous step, to avoid the presence of non-viral sequences (false positives) and to trim potential host regions left at the ends of proviruses.
- 14.report: MultiQC (v1.23) is used to parse and aggregate the results of the following tools:
This work was supported by the Austrian Science Fund (FWF) [Project I6030-B].
Antonielli, L., Nagel, M., Sanchez Mejia, A., Koch, H., Trognitz, F., & Großkinsky, D. K. (2024). BacFluxL+: Bacterial genome analysis using short and long reads. Zenodo. https://doi.org/10.5281/zenodo.11199081
- Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., … McArthur, A. G. (2023). CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690–D699. https://doi.org/10.1093/nar/gkac920
- Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A., & Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477. https://doi.org/10.1089/cmb.2012.0021
- Blin, K., Shaw, S., Augustijn, H. E., Reitz, Z. L., Biermann, F., Alanjary, M., Fetter, A., Terlouw, B. R., Metcalf, W. W., Helfrich, E. J. N., van Wezel, G. P., Medema, M. H., & Weber, T. (2023). antiSMASH 7.0: New and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Research, 51(W1), W46–W50. https://doi.org/10.1093/nar/gkad344
- Bouras, G., Grigson, S. R., Papudeshi, B., Mallawaarachchi, V., & Roach, M. J. (2024). Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93), 5968. https://doi.org/10.21105/joss.05968
- Bushnell, B. (2014). BBMap: A Fast, Accurate, Splice-Aware Aligner. https://escholarship.org/uc/item/1h3515gn
- Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10, 421. https://doi.org/10.1186/1471-2105-10-421
- Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Molecular Biology and Evolution, 38(12), 5825–5829. https://doi.org/10.1093/molbev/msab293
- Challis, R., Richards, E., Rajan, J., Cochrane, G., & Blaxter, M. (2020). BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes|Genomes|Genetics, 10(4), 1361–1374. https://doi.org/10.1534/g3.119.400908
- Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P., & Parks, D. H. (2022). GTDB-Tk v2: Memory friendly classification with the genome taxonomy database. Bioinformatics, 38(23), 5315–5316. https://doi.org/10.1093/bioinformatics/btac672
- Chen, L., Zheng, D., Liu, B., Yang, J., & Jin, Q. (2016). VFDB 2016: Hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Research, 44(D1), D694-697. https://doi.org/10.1093/nar/gkv1239
- Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
- Doster, E., Lakin, S. M., Dean, C. J., Wolfe, C., Young, J. G., Boucher, C., Belk, K. E., Noyes, N. R., & Morley, P. S. (2020). MEGARes 2.0: A database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research, 48(D1), D561–D569. https://doi.org/10.1093/nar/gkz1010
- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England), 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
- Feldgarden, M., Brover, V., Haft, D. H., Prasad, A. B., Slotta, D. J., Tolstoy, I., Tyson, G. H., Zhao, S., Hsu, C.-H., McDermott, P. F., Tadesse, D. A., Morales, C., Simmons, M., Tillman, G., Wasilenko, J., Folster, J. P., & Klimke, W. (2019). Validating the AMRFinder Tool and Resistance Gene Database by Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of Isolates. Antimicrobial Agents and Chemotherapy, 63(11), e00483-19. https://doi.org/10.1128/AAC.00483-19
- Guo, J., Bolduc, B., Zayed, A. A., Varsani, A., Dominguez-Huerta, G., Delmont, T. O., Pratama, A. A., Gazitúa, M. C., Vik, D., Sullivan, M. B., & Roux, S. (2021). VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome, 9(1), 37. https://doi.org/10.1186/s40168-020-00990-y
- Gupta, S. K., Padmanabhan, B. R., Diene, S. M., Lopez-Rojas, R., Kempf, M., Landraud, L., & Rolain, J.-M. (2014). ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212–220. https://doi.org/10.1128/AAC.01310-13
- Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: Quality assessment tool for genome assemblies. Bioinformatics (Oxford, England), 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085
- Ingle, D. J., Valcanis, M., Kuzevski, A., Tauschek, M., Inouye, M., Stinear, T., Levine, M. M., Robins-Browne, R. M., & Holt, K. E. (2016). In silico serotyping of E. coli from short read data identifies limited novel O-loci but extensive diversity of O:H serotype combinations within and between pathogenic lineages. Microbial Genomics, 2(7), e000064. https://doi.org/10.1099/mgen.0.000064
- Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., Doshi, S., Courtot, M., Lo, R., Williams, L. E., Frye, J. G., Elsayegh, T., Sardar, D., Westman, E. L., Pawlowski, A. C., … McArthur, A. G. (2017). CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566–D573. https://doi.org/10.1093/nar/gkw1004
- Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P. A. (2019). Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37(5), 540–546. https://doi.org/10.1038/s41587-019-0072-8
- Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), Article 4. https://doi.org/10.1038/nmeth.1923
- Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., & Köster, J. (2021). Sustainable data analysis with Snakemake (10:33). F1000Research. https://doi.org/10.12688/f1000research.29032.2
- Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5), Article 5. https://doi.org/10.1038/s41587-020-00774-7
- Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566
- Oxford Nanopore Technologies. (2023). Medaka [Python]. https://github.com/nanoporetech/medaka
- Oxford Nanopore Technologies. (2025). Dorado [C++]. https://github.com/nanoporetech/dorado
- Parks, D. H., Chuvochina, M., Rinke, C., Mussig, A. J., Chaumeil, P.-A., & Hugenholtz, P. (2022). GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50(D1), D785–D794. https://doi.org/10.1093/nar/gkab776
- Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
- Schwengers, O., Barth, P., Falgenhauer, L., Hain, T., Chakraborty, T., & Goesmann, A. (2020). Platon: Identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10), mgen000398. https://doi.org/10.1099/mgen.0.000398
- Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11), 000685. https://doi.org/10.1099/mgen.0.000685
- Seemann, T. (2014). Prokka: Rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069. https://doi.org/10.1093/bioinformatics/btu153
- Seemann, T. (2023). ABRicate [Perl]. https://github.com/tseemann/abricate
- Wick, R. (2021). Filtlong [C++]. https://github.com/rrwick/Filtlong
- Wick, R. (2023). Yet another ONT accuracy test: Dorado v0.5.0. Ryan Wick’s Bioinformatics Blog. https://doi.org/10.5281/zenodo.10397818
- Wick, R., & Holt, K. E. (2022). Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLOS Computational Biology, 18(1), e1009802. https://doi.org/10.1371/journal.pcbi.1009802
- Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., Aarestrup, F. M., & Larsen, M. V. (2012). Identification of acquired antimicrobial resistance genes. The Journal of Antimicrobial Chemotherapy, 67(11), 2640–2644. https://doi.org/10.1093/jac/dks261