A workflow for bacterial short reads assembly, QC, annotation, and more.
BacFlux v1.1.8
July 2024
AIT Austrian Institute of Technology, Center for Health & Bioresources
- Livio Antonielli
- Dominik K. Großkinsky
- Hanna Koch
- Friederike Trognitz
IPK Leibniz Institute of Plant Genetics and Crop Plant Research, Cryo and Stress Biology
- Manuela Nagel
- Alexa Sanchez Mejia
BacFlux is a comprehensive and automated bioinformatics workflow designed specifically for the processing and analysis of bacterial genomic data sequenced with Illumina technology. It integrates several powerful tools, each performing a specific task, into a seamless workflow managed by a Snakemake script.
The pipeline accepts paired-end reads as input and subjects them to a series of analyses including steps for quality control, assessment of genome completeness and contamination, taxonomic placement, annotation, inference of secondary metabolites, screening for antimicrobial resistance and virulence genes, and investigation of plasmid presence.
- Quick Start
- Rationale
- Description
- Installation
- Configuration
- Running BacFlux
- Output
- Acknowledgements
- Citation
- References
This section gets you started with BacFlux. Here's a quick guide:
- Download the latest release:
    # git command
    git clone https://github.com/iLivius/BacFlux.git
- Configure the config.yaml: specify the input directory containing the raw sequencing data (e.g. paired-end FASTQ files: strain-1_R1.fq.gz, strain-1_R2.fq.gz) and the desired location for the analysis outputs. BacFlux relies on external databases for some analyses. Some of them are not installed automatically, so the config.yaml must be edited with the paths to the following downloaded databases:
  - blast_db: path to the NCBI core nt database directory
  - eggnog_db: path to the eggNOG diamond database directory
  - gtdbtk_db: path to the GTDB database directory
  - bakta_db: path to the Bakta database directory
  - platon_db: path to the Platon database directory
- Install Snakemake (if not installed already) and activate the environment:
    # optional, if not installed already
    conda create -c conda-forge -c bioconda -n snakemake snakemake
    # activate Snakemake environment
    conda activate snakemake
  Now you're all set to run BacFlux! Refer to the installation, configuration and running BacFlux sections for detailed instructions.
- Here is an example. Within the main workflow directory, launch the pipeline as follows:
    snakemake --sdm conda --keep-going --ignore-incomplete --keep-incomplete --cores 50
  This command uses the following options:
  --sdm conda: uses Conda for dependency management
  --keep-going: continues with independent jobs if a job fails
  --ignore-incomplete: does not check for incomplete output files
  --keep-incomplete: does not remove incomplete output files left by failed jobs
  --cores 50: caps the number of local CPU cores at this value (adjust as needed).
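Before launching, it can save time to confirm that the database directories listed in the configuration step actually exist. The following is a minimal sketch, not part of BacFlux: the function name check_dirs and the example paths are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Pre-flight check (illustrative helper, not shipped with BacFlux).
# Prints one "MISSING: <path>" line per absent directory and returns
# a non-zero status if at least one directory is missing.
check_dirs() {
    local status=0 dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "MISSING: $dir"
            status=1
        fi
    done
    return "$status"
}

# Example call; these paths are placeholders, not BacFlux defaults:
# check_dirs /data/blast_db /data/eggnog_db /data/gtdbtk_db /data/bakta_db /data/platon_db
```

Pass the same paths you entered in config.yaml; an empty output and a zero exit status mean every directory was found.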
The analysis of bacterial WGS data often involves a complex series of steps using various bioinformatic tools. Manual execution of this process can be time-consuming, error-prone, and difficult to reproduce. BacFlux addresses these challenges by providing a comprehensive and automated Snakemake workflow that streamlines bacterial genomic data analysis.
BacFlux integrates several best-in-class bioinformatic tools into a cohesive pipeline, automating tasks from quality control and assembly to annotation, taxonomic classification, and identification of resistance genes and viral sequences.
By providing a user-friendly and automated solution, BacFlux empowers researchers to focus on interpreting the biological meaning of their data.
Here's a breakdown of the BacFlux workflow:
- Preprocessing:
- Assembly:
  - Assembles filtered reads into contigs with SPAdes.
- Quality Control, Contamination and Completeness Assessment:
  - Filters contigs based on minimum length (at least 500 bp) and coverage (at least 2x).
  - Maps filtered reads back to contigs, using bowtie2 and samtools, and analyzes the resulting BAM file with QualiMap.
  - Performs local alignments of contigs against the NCBI core nt database using BLAST+.
  - Checks for contaminant contigs with BlobTools. Unless otherwise specified (see configuration section for more details), the output of this step will be parsed automatically to discard contaminants based on the relative taxonomic composition of the contigs.
  - Evaluates genome assembly quality with Quast.
  - Assesses genome completeness and contamination with CheckM using taxon-specific markers.
- Taxonomic Analysis:
- Annotation:
- Antimicrobial Resistance (AMR):
- Plasmids:
- Prophages:
  - Screens contigs for viral sequences with VirSorter2, followed by CheckV for refinement.
- Reporting:
  - Parses and aggregates results to generate a report using MultiQC.
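To make the contig filtering step of the quality control stage concrete, here is a rough sketch of a length and coverage filter. It assumes SPAdes-style FASTA headers of the form NODE_1_length_1000_cov_10.5 and reads both values from the header; the function name filter_contigs is hypothetical, and BacFlux's own implementation may work differently.

```bash
#!/usr/bin/env bash
# Keep contigs whose header reports length >= min_len and coverage >= min_cov.
# Assumes SPAdes-style headers: >NODE_<id>_length_<bp>_cov_<x>
# usage: filter_contigs <contigs.fasta> [min_len] [min_cov]
filter_contigs() {
    awk -v min_len="${2:-500}" -v min_cov="${3:-2}" '
        /^>/ {
            # split "NODE_1_length_1000_cov_10.5" on "_":
            # f[4] is the length, f[6] is the coverage
            split(substr($0, 2), f, "_")
            keep = (f[4] + 0 >= min_len && f[6] + 0 >= min_cov)
        }
        # print header and sequence lines of contigs that passed
        keep { print }
    ' "$1"
}
```

With the default thresholds, a 400 bp contig or a contig with 1.2x coverage is dropped, while a 1000 bp contig at 10.5x is kept together with all of its sequence lines.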
BacFlux automatically downloads all dependencies and several databases. However, some external databases must be downloaded manually before running the workflow.
- Download BacFlux:
  Head over to the Releases section of the repository. Download the latest archive file (typically in .zip or .tar.gz format). This archive contains the BacFlux Snakefile script, the config.yaml configuration file, and an envs environment directory. Extract the downloaded archive into your desired location. This will create a directory structure with the necessary files and directories. Alternatively, download via command line:
    # git command
    git clone https://github.com/iLivius/BacFlux.git
- Install Snakemake:
  BacFlux relies on Snakemake to manage the workflow execution. See the official Snakemake documentation for the complete installation instructions. To install Snakemake as a Conda environment:
    # install Snakemake in a new Conda environment (alternatively, use mamba)
    conda create -c conda-forge -c bioconda -n snakemake snakemake
- Databases:
  While BacFlux automates the installation of all software dependencies, some external databases need to be downloaded manually. If you have installed them already, skip this paragraph and jump to the configuration section. Here are the required databases and instructions for obtaining them.
- NCBI core nt database, adapted from here:
    # create a list of all core nt links in the directory designated to host the database (recommended)
    rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/core_nt.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_links.list
    # alternatively, create a list of nt links for bacteria only
    rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt_prok.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_prok_links.list
    # download in parallel, without overdoing it
    cat nt*.list | parallel -j4 'rsync -h --progress rsync://{} .'
    # decompress with multiple CPUs
    find . -name '*.gz' | parallel -j4 'echo {}; tar -zxf {}'
    # get NCBI taxdump
    wget -c 'ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz'
    tar -zxvf taxdump.tar.gz
    # get NCBI BLAST taxonomy
    wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz'
    tar -zxvf taxdb.tar.gz
    # get NCBI accession2taxid file
    wget -c 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz'
    gunzip nucl_gb.accession2taxid.gz
  NOTE: the complete NCBI core nt database and taxonomy-related files should take around 223 GB of hard drive space.
- eggNOG diamond database:
    # the easiest way is to install a Conda environment with eggnog-mapper, first
    conda create -n eggnog-mapper eggnog-mapper=2.1.12
    # activate the environment
    conda activate eggnog-mapper
    # then, create a directory where you want to install the diamond database for eggnog-mapper
    # (replace /data/eggnog_db with your actual path)
    mkdir /data/eggnog_db
    # finally, download the diamond db in the newly created directory
    download_eggnog_data.py --data_dir /data/eggnog_db -y
  NOTE: the eggNOG database requires ~50 GB of space.
- GTDB database:
    # move first inside the directory where you want to place the database,
    # then download and decompress either the full package or the split package version
    # full package
    wget -c https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
    tar xzvf gtdbtk_r220_data.tar.gz
    rm gtdbtk_r220_data.tar.gz
    # split package (alternative)
    base_url="https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/split_package/gtdbtk_r220_data.tar.gz.part_"
    suffixes=(aa ab ac ad ae af ag ah ai aj ak)
    printf "%s\n" "${suffixes[@]}" | xargs -n 1 -P 11 -I {} wget "${base_url}{}"
    cat gtdbtk_r220_data.tar.gz.part_* > gtdbtk_r220_data.tar.gz
    tar xzvf gtdbtk_r220_data.tar.gz
    rm gtdbtk_r220_data.tar.gz
  NOTE: compressed archive size ~102 GB, decompressed archive size ~108 GB.
- Bakta database:
    # the Bakta database comes in two flavours. To download the full database, use the following link (recommended):
    wget -c https://zenodo.org/records/10522951/files/db.tar.gz
    tar -xzf db.tar.gz
    rm db.tar.gz
    # alternatively, download a lighter version
    wget https://zenodo.org/record/10522951/files/db-light.tar.gz
    tar -xzf db-light.tar.gz
    rm db-light.tar.gz
    # if the AMRFinderPlus db gives an error, update it by activating the Bakta Conda env
    # and running the following command, targeting the Bakta db directory:
    amrfinder_update --force_update --database db/amrfinderplus-db/
  NOTE: according to the source, the light version should take 1.4 GB compressed and 3.4 GB decompressed, whereas the full database should take 37 GB compressed and 71 GB decompressed.
- Platon database:
    # download the database in a directory of your choice
    wget https://zenodo.org/record/4066768/files/db.tar.gz
    tar -xzf db.tar.gz
    rm db.tar.gz
  NOTE: according to the source, the database occupies 1.6 GB compressed and 2.8 GB decompressed.
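Adding up the decompressed sizes quoted in the NOTEs above gives a rough idea of the disk space to plan for. The sketch below only restates those figures (core nt 223 GB, eggNOG 50 GB, GTDB 108 GB, Bakta full 71 GB, Platon 2.8 GB rounded up to 3 GB); the helper names avail_gb and enough_space and the example directory are assumptions, not part of BacFlux.

```bash
#!/usr/bin/env bash
# Approximate decompressed database sizes in GB, taken from the NOTEs above.
required_gb=$((223 + 50 + 108 + 71 + 3))

# Free space, in whole GB, of the file system holding <dir> (POSIX df output).
avail_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds when <available_gb> is at least the required amount.
# usage: enough_space <available_gb> [required_gb]
enough_space() {
    [ "$1" -ge "${2:-$required_gb}" ]
}

# Example (hypothetical target directory):
# enough_space "$(avail_gb /data)" || echo "less than ${required_gb} GB free"
```

The total is a lower bound: compressed archives such as the ~102 GB GTDB package also need temporary room until they are removed after extraction.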
Before running BacFlux, you must edit the config.yaml file with a text editor. The file is organized in four sections: links, directories, resources and parameters.
- links
  This section should work fine as it is. It is therefore recommended to change the links only if they stop working or to update the database versions:
  - phix_link: Path to the PhiX genome reference used by Illumina for sequencing control.
  - card_link: Path to the Comprehensive Antibiotic Resistance Database (CARD).
  - checkv_link: Path to the CheckV database for viral genome quality assessment.
- directories
  Update paths based on your file system:
  - fastq_dir: directory containing the paired-end reads of your sequenced strains, in FASTQ format. You can provide as many samples as you like, under the following conditions:
    - Files can only have one of the following extensions: fastq, fq, fastq.gz or fq.gz.
    - The extension must be the same for all files, so don't mix files with different extensions.
    - Sample names should be formatted as follows: mystrain_R1.fq and mystrain_R2.fq, strain-1_R1.fq and strain-1_R2.fq, strain2_R1.fq and strain2_R2.fq. In this example, BacFlux will interpret the name of each strain as "mystrain", "strain-1", and "strain2", respectively. Strain names cannot contain underscores. See another example below:
        # the input dir contains the PE reads of two strains, PE212-1 and PE253-B
        ahab@pequod:~/data$ ls -lh
        total 1,6G
        -rw-rw-r-- 1 ahab ahab 379M Apr 8 16:48 PE212-1_R1.fastq.gz
        -rw-rw-r-- 1 ahab ahab 385M Apr 8 16:48 PE212-1_R2.fastq.gz
        -rw-rw-r-- 1 ahab ahab 421M Apr 8 16:48 PE253-B_R1.fastq.gz
        -rw-rw-r-- 1 ahab ahab 433M Apr 8 16:48 PE253-B_R2.fastq.gz
  - out_dir: This directory will store all output files generated by BacFlux. Additionally, by default, BacFlux will install required software and databases here, within Conda environments. Reusing this output directory for subsequent runs avoids reinstalling everything from scratch.
  - blast_db: path to the whole NCBI core nt (recommended) or prokaryotic database only, and related taxonomic dependencies, see installation.
  - eggnog_db: path to the diamond database for eggNOG.
  - gtdbtk_db: path to the R220 release of GTDB.
  - bakta_db: path to either the light or full (recommended) database of Bakta.
  - platon_db: path to the Platon database.
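The file naming rules for fastq_dir can be checked before a run. The following sketch encodes one reading of those rules (a strain name without underscores, then _R1 or _R2, then one of the four accepted extensions); the function name strain_name is hypothetical and BacFlux's own parsing may differ in detail.

```bash
#!/usr/bin/env bash
# Print the strain name encoded in a FASTQ file name, or fail with a message.
# Accepted pattern (an interpretation of the rules above):
#   <strain without underscores>_R1|_R2 . fastq|fq [.gz]
# usage: strain_name <file>
strain_name() {
    local base re='^([^_]+)_R[12]\.(fastq|fq)(\.gz)?$'
    base=$(basename "$1")
    if [[ $base =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "unexpected file name: $base" >&2
        return 1
    fi
}

# Example: list the strains found in a (hypothetical) input directory
# for f in /data/reads/*_R1.*; do strain_name "$f"; done
```

For instance, PE212-1_R1.fastq.gz yields PE212-1, whereas my_strain_R1.fq is rejected because the strain name contains an underscore.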
- resources
  In this section you can specify the hardware resources available to the workflow:
  - threads: max number of CPUs used by each rule
  - ram_gb: max amount of RAM used (SPAdes only).
- parameters
  - Database selection: BacFlux requires specifying the version of the NCBI nt database for BLAST operations. You can choose between the core_nt and nt_prok versions. By default, the config.yaml configuration file is set to use the core_nt database. For instructions on installing the BLAST database, refer to the installation section.
  - Genus filtering: BacFlux includes an optional parameter to specify the bacterial genus of contigs you wish to retain in the final assembly. If left blank, BacFlux will automatically keep contigs associated with the most abundant taxon, based on relative composition determined through BLAST analysis. While this approach generally works well, it has limitations, such as reduced resolution at the species level due to reliance on the cumulative best scores of BLAST hits. Additionally, this method may be problematic if the contaminant organism belongs to the same genus as your target organism, or if you are working with co-cultured closely related species or strains. If the genus parameter introduces more issues than benefits, simply remove the genus option from the config.yaml file.
    - Using the genus parameter: if a contaminant turns out to be more abundant than your target organism, you can re-run the workflow after reviewing the assembly output. Specify the genus of the bacterial taxon you want to keep for the re-run.
    - Disabling the genus filtering: if neither the automatic inference of contaminant contigs nor the manual selection of the desired taxon works for you, simply delete the genus option from the parameters. In this case, only contigs tagged as "no-hit" after the BLAST search will be filtered out.
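The idea of keeping the most abundant taxon by cumulative score can be illustrated with a toy example. The input here is a hypothetical three-column, tab-separated table (contig, genus, score) invented for this sketch; it is not the actual BlobTools or BacFlux output format, and the function name top_genus is likewise an assumption.

```bash
#!/usr/bin/env bash
# Sum the scores per genus and print the genus with the highest total.
# Input (hypothetical format): contig <TAB> genus <TAB> score
# Rows labelled "no-hit" are ignored.
# usage: top_genus <table.tsv>
top_genus() {
    awk -F'\t' '
        $2 != "no-hit" { total[$2] += $3 }
        END {
            for (g in total)
                if (best == "" || total[g] > max) { best = g; max = total[g] }
            print best
        }
    ' "$1"
}
```

The toy version also shows the limitation described above: two species of the same genus are pooled under one label, so a contaminant from the same genus as the target cannot be told apart this way.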
BacFlux is executed like any other Snakemake workflow, from the directory containing the Snakefile. Please refer to the official Snakemake documentation for more details.
# first, activate the Snakemake Conda environment
conda activate snakemake
# navigate inside the directory where the BacFlux archive was downloaded and decompressed
# launch the workflow
snakemake --sdm conda --cores 50
NOTE: Starting from Snakemake version 8.4.7, the --use-conda option has been deprecated. Instead, you should now use --software-deployment-method conda or --sdm conda.
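If the same launch script has to work across Snakemake installations, the flag can be chosen from the installed version. This is a small sketch that takes the 8.4.7 threshold from the note above at face value; the function name conda_flag is an assumption, and the comparison relies on sort -V (version sort), which is available in GNU coreutils.

```bash
#!/usr/bin/env bash
# Print the Conda deployment flag matching a given Snakemake version.
# Threshold 8.4.7 is taken from the NOTE above.
# usage: conda_flag <version>, e.g. conda_flag "$(snakemake --version)"
conda_flag() {
    local version=$1 threshold="8.4.7" lowest
    # the lower of the two versions comes first after a version sort
    lowest=$(printf '%s\n%s\n' "$threshold" "$version" | sort -V | head -n 1)
    if [ "$lowest" = "$threshold" ]; then
        echo "--sdm conda"
    else
        echo "--use-conda"
    fi
}

# Example:
# snakemake $(conda_flag "$(snakemake --version)") --cores 50
```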
The workflow output reflects the steps described in the description section. Here's a breakdown of the subdirectories created within the main output folder, along with explanations of their contents:
- 01.pre-processing: QC and statistics of raw reads and trimmed reads, produced by fastp (v0.23.4).
- 02.assembly: Content output by SPAdes (v4.0.0). In addition to the raw contigs, you will also find the filtered contigs (>500bp and at least 2x) and the selected contigs, which are the contigs retained after BLAST search and decontamination (see parameters in the configuration section above). The downstream applications used during the workflow will use either the selected contigs (e.g. for annotation purposes) or the raw, filtered and selected contigs (e.g. to evaluate genome completeness and contamination).
- 03.post-processing: Contains the following sub-directories:
  - mapping_evaluation: QualiMap (v2.3) output based on filtered contigs.
  - contaminants: Contig selection based on BLAST+ (v2.15.0) search and BlobTools (v1.1.1) analysis. Check the composition text file for a quick overview of the relative composition of your assembly.
  - assembly_evaluation: Quast (v5.2.0) output based on selected contigs.
  - completenness_evaluation: CheckM (v1.2.3) output based on raw, filtered and selected contigs.
- 04.taxonomy: Taxonomic placement of raw, filtered and selected contigs, performed by GTDB-Tk (v2.4.0).
- 05.annotation: Contains the following sub-directories:
- 06.AMR: Antimicrobial resistance features are investigated with two complementary approaches:
  - AMR_mapping: reads filtered by fastp (v0.23.4) are mapped to the CARD database (v3.2.9) using BBMap (v39.06) with minimum identity = 0.99. Mapping results are parsed and features with a covered length of at least 70% are reported in the AMR legend file.
  - ABRicate: selected contigs are screened for the presence of AMR elements and virulence factors, using ABRicate (v1.0.1).
- 07.plasmids: Selected contigs are screened for the presence of plasmid replicons with Platon (v1.7) and results are verified by BLAST search to avoid false positives. Contigs confirmed as plasmids are reported in the verified plasmids file.
- 08.phages: Filtered contigs are screened for the presence of viral sequences using VirSorter2 (v2.2.4), followed by CheckV (v1.0.3) for refinement:
  - virsorter: Following the instructions provided here, viral groups (i.e. dsDNA phage, NCLDV, RNA, ssDNA, and lavidaviridae) are detected with a loose cutoff of 0.5 for maximal sensitivity. Original sequences of circular and (near) fully viral contigs are preserved and passed to the next tool.
  - checkv: This second step quality-controls the results of the previous step, to exclude non-viral sequences (false positives) and to trim potential host regions left at the ends of proviruses.
- 09.report: MultiQC (v1.23) is used to parse and aggregate the results of the following tools:
This work was supported by the Austrian Science Fund (FWF) [Project I6030-B].
Antonielli, L., Großkinsky, D. K., Koch, H., Trognitz, F., Sanchez Mejia, A., & Nagel, M. (2024). BacFlux: A workflow for bacterial short reads assembly, QC, annotation, and more. Zenodo. https://doi.org/10.5281/zenodo.11143917
- Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., … McArthur, A. G. (2023). CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690–D699. https://doi.org/10.1093/nar/gkac920
- Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A., & Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477. https://doi.org/10.1089/cmb.2012.0021
- Blin, K., Shaw, S., Augustijn, H. E., Reitz, Z. L., Biermann, F., Alanjary, M., Fetter, A., Terlouw, B. R., Metcalf, W. W., Helfrich, E. J. N., van Wezel, G. P., Medema, M. H., & Weber, T. (2023). antiSMASH 7.0: New and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Research, 51(W1), W46–W50. https://doi.org/10.1093/nar/gkad344
- Bushnell, B. (2014). BBMap: A Fast, Accurate, Splice-Aware Aligner. https://escholarship.org/uc/item/1h3515gn
- Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10, 421. https://doi.org/10.1186/1471-2105-10-421
- Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Molecular Biology and Evolution, 38(12), 5825–5829. https://doi.org/10.1093/molbev/msab293
- Challis, R., Richards, E., Rajan, J., Cochrane, G., & Blaxter, M. (2020). BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes|Genomes|Genetics, 10(4), 1361–1374. https://doi.org/10.1534/g3.119.400908
- Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P., & Parks, D. H. (2022). GTDB-Tk v2: Memory friendly classification with the genome taxonomy database. Bioinformatics, 38(23), 5315–5316. https://doi.org/10.1093/bioinformatics/btac672
- Chen, L., Zheng, D., Liu, B., Yang, J., & Jin, Q. (2016). VFDB 2016: Hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Research, 44(D1), D694–D697. https://doi.org/10.1093/nar/gkv1239
- Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
- Doster, E., Lakin, S. M., Dean, C. J., Wolfe, C., Young, J. G., Boucher, C., Belk, K. E., Noyes, N. R., & Morley, P. S. (2020). MEGARes 2.0: A database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research, 48(D1), D561–D569. https://doi.org/10.1093/nar/gkz1010
- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
- Feldgarden, M., Brover, V., Haft, D. H., Prasad, A. B., Slotta, D. J., Tolstoy, I., Tyson, G. H., Zhao, S., Hsu, C.-H., McDermott, P. F., Tadesse, D. A., Morales, C., Simmons, M., Tillman, G., Wasilenko, J., Folster, J. P., & Klimke, W. (2019). Validating the AMRFinder Tool and Resistance Gene Database by Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of Isolates. Antimicrobial Agents and Chemotherapy, 63(11), e00483-19. https://doi.org/10.1128/AAC.00483-19
- Guo, J., Bolduc, B., Zayed, A. A., Varsani, A., Dominguez-Huerta, G., Delmont, T. O., Pratama, A. A., Gazitúa, M. C., Vik, D., Sullivan, M. B., & Roux, S. (2021). VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome, 9(1), 37. https://doi.org/10.1186/s40168-020-00990-y
- Gupta, S. K., Padmanabhan, B. R., Diene, S. M., Lopez-Rojas, R., Kempf, M., Landraud, L., & Rolain, J.-M. (2014). ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212–220. https://doi.org/10.1128/AAC.01310-13
- Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: Quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085
- Ingle, D. J., Valcanis, M., Kuzevski, A., Tauschek, M., Inouye, M., Stinear, T., Levine, M. M., Robins-Browne, R. M., & Holt, K. E. (2016). In silico serotyping of E. coli from short read data identifies limited novel O-loci but extensive diversity of O:H serotype combinations within and between pathogenic lineages. Microbial Genomics, 2(7), e000064. https://doi.org/10.1099/mgen.0.000064
- Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., Doshi, S., Courtot, M., Lo, R., Williams, L. E., Frye, J. G., Elsayegh, T., Sardar, D., Westman, E. L., Pawlowski, A. C., … McArthur, A. G. (2017). CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566–D573. https://doi.org/10.1093/nar/gkw1004
- Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
- Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10, 33. https://doi.org/10.12688/f1000research.29032.2
- Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5), 578–585. https://doi.org/10.1038/s41587-020-00774-7
- Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566
- Parks, D. H., Chuvochina, M., Rinke, C., Mussig, A. J., Chaumeil, P.-A., & Hugenholtz, P. (2022). GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50(D1), D785–D794. https://doi.org/10.1093/nar/gkab776
- Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
- Schwengers, O., Barth, P., Falgenhauer, L., Hain, T., Chakraborty, T., & Goesmann, A. (2020). Platon: Identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10), mgen000398. https://doi.org/10.1099/mgen.0.000398
- Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11), 000685. https://doi.org/10.1099/mgen.0.000685
- Seemann, T. (2014). Prokka: Rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069. https://doi.org/10.1093/bioinformatics/btu153
- Seemann, T. (2020). ABRicate. https://github.com/tseemann/abricate
- Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., Aarestrup, F. M., & Larsen, M. V. (2012). Identification of acquired antimicrobial resistance genes. The Journal of Antimicrobial Chemotherapy, 67(11), 2640–2644. https://doi.org/10.1093/jac/dks261