Integrated workflow for fungal genome assembly and annotation.
FunFlux v1.0.3
August 2024
AIT Austrian Institute of Technology, Center for Health & Bioresources
- Livio Antonielli
- Günter Brader
- Stéphane Compant
FunFlux is a Snakemake workflow for the assembly and annotation of fungal genomes from short reads sequenced with Illumina technology. It also supports the analysis of pre-assembled contigs. The workflow includes contig selection and decontamination, genome completeness assessment, ITS extraction with taxonomic assignment, and gene prediction and annotation.
- Rationale
- Description
- Installation
- Configuration
- Running FunFlux
- Output
- Acknowledgements
- Citation
- References
The analysis of fungal whole-genome sequencing (WGS) data involves a complex series of bioinformatic steps that can be challenging to execute manually. This process is often time-consuming, prone to errors, and difficult to reproduce. FunFlux addresses these challenges by offering a comprehensive and automated Snakemake workflow specifically designed for fungal genomic data analysis.
FunFlux is designed to streamline the annotation process with funannotate, even in the absence of RNA sequencing evidence. It relies on ab initio annotation and incorporates protein FASTA sequences from organisms of the same species or genus to enhance the accuracy of gene prediction and annotation.
Here's a breakdown of the FunFlux workflow:
- Preprocessing:
- Assembly:
  - Filtered reads are assembled into contigs with SPAdes.
- QC, Decontamination, Completeness Assessment, and ITS extraction:
  - Contigs are filtered based on a minimum length of 500 bp and a minimum coverage of 2x.
  - Filtered reads are mapped back to contigs using bowtie2 and samtools. The resulting BAM file is analyzed with QualiMap.
  - Local alignments of contigs are performed against the NCBI core nt database using BLAST+.
  - Contaminant contigs are checked with BlobTools. Unless otherwise specified (see configuration section for more details), the output of this step will be parsed automatically to discard contaminants based on the relative taxonomic composition of the contigs.
  - Genome assembly quality is evaluated with Quast.
  - Genome completeness is assessed with BUSCO using taxon-specific markers.
  - ITS markers are detected and extracted with ITSx.
  - ITS2 taxonomic assignment is performed with SINTAX re-implemented in VSEARCH using the UNITE database as reference.
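The length and coverage filter described in the QC step can be illustrated with a short Python sketch. It is not the FunFlux implementation: it assumes SPAdes-style contig headers such as `NODE_1_length_600_cov_10.5` (where the value is the k-mer coverage reported by SPAdes), treats both thresholds as inclusive, and uses the hypothetical helper names `read_fasta` and `filter_contigs`.

```python
import re

# Assumption: contig headers follow the SPAdes naming scheme,
# e.g. "NODE_1_length_600_cov_10.5". This pattern is illustrative only.
COV_RE = re.compile(r"_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_len=500, min_cov=2.0):
    """Keep contigs that are at least min_len bp long with coverage >= min_cov."""
    kept = []
    for header, seq in records:
        match = COV_RE.search(header)
        if match is None:
            continue  # assumption: contigs without a coverage field are dropped
        if len(seq) >= min_len and float(match.group(1)) >= min_cov:
            kept.append((header, seq))
    return kept


example = [
    ">NODE_1_length_600_cov_10.5", "A" * 600,
    ">NODE_2_length_300_cov_50.0", "C" * 300,   # too short
    ">NODE_3_length_800_cov_1.2", "G" * 800,    # coverage too low
]
kept = filter_contigs(read_fasta(example))
print([header for header, _ in kept])
# → ['NODE_1_length_600_cov_10.5']
```

Reading the sequence length from the sequence itself rather than from the header keeps the check valid even if headers are renamed later in the workflow.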
- Gene Prediction:
  FunFlux is optimized to leverage the funannotate pipeline when RNA sequencing data are not available. Instead, it uses external protein evidence along with ab initio prediction methods to produce gene models for fungal genomes. Below is a step-by-step breakdown of the workflow:
  - Preprocessing the genome assembly
    - N50 calculation and contig duplication checking: As part of the cleaning process, the N50 value is calculated, and contigs shorter than this value are checked for duplication. Only unique, non-redundant contigs are retained.
    - Sorting and renaming FASTA headers: The assembled contigs are sorted by length and headers are renamed to ensure compatibility with follow-up tools.
    - Repeat masking: Before gene prediction, the genome assembly is soft-masked with tantan to mark repetitive elements, which helps prevent spurious gene predictions in these regions.
  - Incorporating protein evidence
    - Protein alignment: DIAMOND is used to quickly search for homologies between the genome and the provided protein sequences of closely related taxa, as well as the UniProt database. These matches are then refined with Exonerate, which aligns the protein sequences to the genome with high precision, providing evidence for gene structures.
  - Ab initio gene prediction
    - GeneMark-ES: This tool performs self-training on the genome sequence to predict genes without the need for external training data, making it especially useful for identifying genes in regions lacking homology-based evidence.
  - Ortholog detection and model training
    - BUSCO: Based on conserved orthologous genes, it provides high-quality evidence for training gene prediction tools. Conserved genes are passed to Augustus to improve its predictive accuracy.
    - Augustus training: It works with the closest taxon model available, as well as the evidence from BUSCO, DIAMOND/Exonerate, and the outputs from other ab initio predictors like SNAP and GlimmerHMM.
  - Combining predictions with EVidenceModeler
    - EVidenceModeler (EVM): The predictions from the various ab initio tools, such as Augustus, SNAP, and GlimmerHMM, are combined to generate consensus gene models.
  - Refining steps
    - Gene model filtering: The gene models generated by EVM are filtered further to remove short, low-confidence predictions, models spanning gaps, and potential transposable elements.
    - tRNA prediction: tRNA genes are predicted using tRNAscan-SE, so that both protein-coding and non-coding genes are annotated.
    - NCBI submission preparation: An NCBI-compatible annotation table (.tbl format) is generated and converted to GenBank format using tbl2asn. The workflow also includes a validation step to parse NCBI error reports and alert users to any gene models that need manual correction.
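The N50 value used in the assembly preprocessing step can be computed as in the Python sketch below. It only illustrates the definition and the selection of contigs shorter than the N50 as candidates for the duplication check; the duplication check itself is performed by funannotate and is not reproduced here. The function names `n50` and `short_contigs` are hypothetical.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def short_contigs(lengths):
    """Return the contig lengths below the N50, i.e. the candidates that
    would be checked for duplication."""
    threshold = n50(lengths)
    return [length for length in lengths if length < threshold]


lengths = [100, 200, 300, 400, 500]
print(n50(lengths))            # → 400
print(short_contigs(lengths))  # → [100, 200, 300]
```

In the example the total size is 1500 bp, so the running sum reaches the halfway point (750 bp) at the second-longest contig, which gives an N50 of 400 bp.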
- Gene Annotation:
  A comprehensive gene annotation process assigns functional information to the predicted genes. This process integrates multiple annotation tools and ends with a final annotation round performed by funannotate. Below is an overview of the workflow:
  - InterProScan (v5.65-97.0): This tool is employed to assign protein domains and predict functional sites within the gene models. It integrates data from multiple databases such as Pfam, SMART, PANTHER and PROSITE, providing a rich set of functional annotations.
  - EggNOG-mapper (v2.1.12): This software is used to predict orthology and functional annotations based on the EggNOG database (v5.0). It helps in assigning Gene Ontology (GO) terms, enzyme codes, and pathway annotations to the gene models, offering insights into the biological roles of the proteins.
  - antiSMASH (v7.1): For fungal genomes, secondary metabolite gene clusters related to antibiotics or toxins are of particular interest.
  - HMMer for PFAM database (v36.0).
  - CAZyme annotation with dbCAN (v12.0).
- Report:
  - Results are parsed and aggregated to generate a report using MultiQC.
FunFlux automatically downloads all dependencies and several databases. However, some external databases require manual download before running the workflow.
- Download FunFlux:
  Download via command line as:
  # Clone the directory
  git clone https://github.com/iLivius/FunFlux.git
- Install Snakemake:
  FunFlux relies on Snakemake to manage the workflow execution. Find the official and complete set of instructions in the Snakemake documentation. To install Snakemake as a Conda environment:
  # Install Snakemake in a new Conda environment
  mamba create -c conda-forge -c bioconda -n snakemake snakemake
- Databases:
  While FunFlux automates the installation of all software dependencies, some external databases need to be downloaded manually. If you have already installed these databases, you can skip this paragraph and proceed to the configuration section.
  Here are the required databases and software that need manual installation.
  - NCBI core nt database, adapted from here:
    # Create a list of all core nt links in the directory designated to host the database (recommended)
    rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/core_nt.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_links.list
    # Alternatively, create a list of nt links for bacteria only
    rsync --list-only rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt_prok.*.gz | grep '.tar.gz' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt_prok_links.list
    # Download in parallel, without overdoing it
    cat nt*.list | parallel -j4 'rsync -h --progress rsync://{} .'
    # Decompress with multiple CPUs
    find . -name '*.gz' | parallel -j4 'echo {}; tar -zxf {}'
    # Get NCBI taxdump
    wget -c 'ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz'
    tar -zxvf taxdump.tar.gz
    # Get NCBI BLAST taxonomy
    wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz'
    tar -zxvf taxdb.tar.gz
    # Get NCBI accession2taxid file
    wget -c 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz'
    gunzip nucl_gb.accession2taxid.gz
    NOTE: Skip the download if you provide assembled contigs as input. The complete NCBI core nt database and taxonomy-related files should take around 223 GB of hard drive space.
  - UNITE database:
    - Visit the UNITE USEARCH/UTAX release for eukaryotes.
    - Download the utax_reference_dataset_all_04.04.2024.fasta.gz file and decompress it.
    - Add the path to the FASTA file in the config.yaml file. See the configuration section.
  - eggNOG diamond database:
    # Create a Conda environment with eggnog-mapper first
    conda create -n eggnog-mapper eggnog-mapper=2.1.12
    # Activate the environment
    conda activate eggnog-mapper
    # Create a directory where you want to install the diamond database for eggnog-mapper (example)
    mkdir /data/eggnog_db
    # Finally, download the diamond db in the newly created directory
    download_eggnog_data.py --data_dir /data/eggnog_db -y
    NOTE: the eggNOG database requires ~50 GB of space.
  - Download and set up GeneMark-ES/ET:
    - Visit the GeneMark download page.
    - Follow the instructions to download GeneMark-ES/ET.
    - Change the shebang line in Perl scripts, as follows:
      # After downloading, navigate to the GeneMark directory (example):
      cd /gmes_linux_64_4
      # Change the shebang line in all Perl scripts to use /usr/bin/env perl:
      find . -type f -name "*.pl" -exec sed -i '1s|^#!/usr/bin/perl|#!/usr/bin/env perl|' {} +
      # Test the software:
      ./gmes_petap.pl
  - Download and set up InterProScan:
    # Download this version; more recent ones will probably work as well:
    wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.65-97.0/interproscan-5.65-97.0-64-bit.tar.gz
    # Extract the tarball:
    tar -pxvzf interproscan-5.65-97.0-64-bit.tar.gz
    # From inside the InterProScan directory, index the HMM models:
    python3 setup.py -f interproscan.properties
    # Check the shell script, inside the InterProScan directory:
    ./interproscan.sh
Before running FunFlux, you must edit the config.yaml file with a text editor. The file is organized in different sections: links, directories, files, resources and parameters.
If your input consists of FASTA files of previously assembled genomes, you must edit the config_funnotator.yaml file instead.
- links
  This section should work as it is, so it is recommended to change the links only if necessary or to update the database versions:
  - phix_link: Path to the PhiX genome reference used by Illumina for sequencing control. It is not needed in config/config_funnotator.yaml.
- directories
  Update paths based on your file system:
  - fastq_dir: Directory containing the paired-end reads of your sequenced strains, in FASTQ format. You can provide as many as you like, under the following conditions:
    - Files can only have the following extensions: fastq, fq, fastq.gz, or fq.gz.
    - You can provide multiple samples but the extension should be the same for all files. So, don't mix files with different extensions.
    - No underscores are allowed in sample names.
    - Use _R1 and _R2 to define the paired-end reads of each sample, e.g. sample_R1.fastq.gz, sample_R2.fastq.gz.
  - input_dir: Alternatively, if you want to analyze already available contigs, provide the directory path to the FASTA files in config/config_funnotator.yaml. Also in this case, you can provide as many genomes as you like, under the following conditions:
    - Files can only have the following extensions: fasta, fa, or fna.
    - You can provide multiple samples but the extension should be the same for all files. So, don't mix files with different extensions.
    - No underscores are allowed in sample names. See example: sample-1.fasta, sample2.fasta.
  - out_dir: This directory will store all output files generated by FunFlux. Additionally, by default, FunFlux will install required software and databases here, within Conda environments. Reusing this output directory for subsequent runs avoids reinstalling everything from scratch.
  - blast_db: Path to the whole NCBI core nt database. See installation. Not needed for config/config_funnotator.yaml, if you work with FASTA files of previously assembled contigs.
  - eggnog_db: Path to the diamond database for eggNOG. Download details in installation.
  - genemark_dir: Path to the gmes_linux_64_4 directory. To get and configure GeneMark-ES/ET, see the installation section above.
  - funannotate_db: Provide a path for funannotate to automatically install the following databases:
    $ funannotate database
    Funannotate Databases currently installed:
    Database         Type       Version     Date        Num_Records  Md5checksum
    merops           diamond    12.0        2017-10-04  5009         a6dd76907896708f3ca5335f58560356
    uniprot          diamond    2024_01     2024-01-24  570830       c7507ea16b3c4807971c663994cad329
    dbCAN            hmmer3     12.0        2023-08-02  699          fb112af319a5001fbf547eac29e7c3b5
    pfam             hmmer3     36.0        2023-07     20795        0725495ccf049a4f198fcc0a92f7f38c
    repeats          diamond    1.0         2024-03-12  11950        4e8cafc3eea47ec7ba505bb1e3465d21
    go               text       2024-01-17  2024-01-17  47729        7e6b9974184dda306e6e07631f1783af
    mibig            diamond    1.4         2024-03-12  31023        118f2c11edde36c81bdea030a0228492
    interpro         xml        98.0        2024-01-25  40768        502ea05009761b893dedb56d5ea89c48
    busco_outgroups  outgroups  1.0         2024-03-12  8            6795b1d4545850a4226829c7ae8ef058
    gene2product     text       1.92        2023-10-02  34459        32a4a80987720e0872377de3207dc0f5
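The naming rules listed for fastq_dir above can be checked before launching the workflow. The following Python sketch is one possible pre-flight check and is not part of FunFlux: the function names `split_name` and `check_fastq_names` are hypothetical, and the check assumes that every sample must have exactly one `_R1` and one `_R2` file.

```python
from collections import defaultdict

# Extensions accepted for FASTQ input, as listed for fastq_dir.
EXTENSIONS = ("fastq.gz", "fq.gz", "fastq", "fq")


def split_name(filename):
    """Split 'sample_R1.fastq.gz' into ('sample', 'R1', 'fastq.gz')."""
    for ext in EXTENSIONS:
        suffix = "." + ext
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            sample, sep, read = stem.rpartition("_")
            if sep and read in ("R1", "R2"):
                return sample, read, ext
            raise ValueError(f"{filename}: missing _R1/_R2 tag")
    raise ValueError(f"{filename}: unsupported extension")


def check_fastq_names(filenames):
    """Validate file names and return the sorted list of sample names."""
    reads_per_sample = defaultdict(set)
    extensions = set()
    for name in filenames:
        sample, read, ext = split_name(name)
        if not sample or "_" in sample:
            raise ValueError(f"{name}: invalid sample name (no underscores allowed)")
        reads_per_sample[sample].add(read)
        extensions.add(ext)
    if len(extensions) > 1:
        raise ValueError(f"mixed extensions: {sorted(extensions)}")
    incomplete = sorted(s for s, reads in reads_per_sample.items() if reads != {"R1", "R2"})
    if incomplete:
        raise ValueError(f"samples without both reads: {incomplete}")
    return sorted(reads_per_sample)


files = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "s-2_R1.fastq.gz", "s-2_R2.fastq.gz"]
print(check_fastq_names(files))
# → ['s-2', 's1']
```

A name such as `my_sample_R1.fq` is rejected because the part before `_R1` still contains an underscore.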
- files
  - its_db: Path to the decompressed FASTA file of the UNITE USEARCH/UTAX release for eukaryotes. Find the download instructions in the installation paragraph.
  - annotation_params: Path to a tab-delimited annotation parameter file, as displayed below. An example is provided in the config directory as annotation_parameters.tsv:
    #Sample         Species                     Proteins               Model
    ARSEF3097       Beauveria bassiana          /path/to/proteins.faa  fusarium_graminearum
    150-1           Lecanicillium fungicola     /path/to/proteins.faa  fusarium_graminearum
    FJII-L10-SW-P1  Parengyodontium torokii     /path/to/proteins.faa  fusarium_graminearum
    HWLR35          Lecanicillium psalliotae    /path/to/proteins.faa  fusarium_graminearum
    JC-1038         Gamszarea kalimantanensis   /path/to/proteins.faa  fusarium_graminearum
    MBC-099         Lecanicillium aphanocladii  /path/to/proteins.faa  fusarium_graminearum
    MBC-350         Akanthomyces uredinophilus  /path/to/proteins.faa  fusarium_graminearum
    MBC-401         Cordyceps farinosa          /path/to/proteins.faa  fusarium_graminearum
    MBC-695         Akanthomyces uredinophilus  /path/to/proteins.faa  fusarium_graminearum
    MBC-701         Akanthomyces dipterigenus   /path/to/proteins.faa  fusarium_graminearum
  - iprscan: Path to the InterProScan shell script. See the installation section for more details.
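Because the annotation parameter file is tab-delimited and the species column contains a space, it helps to confirm that every row really has four tab-separated fields. The Python sketch below shows one way to read the file; it assumes the four columns shown in the example (Sample, Species, Proteins, Model) and a header line starting with `#`. The function name `read_annotation_params` is hypothetical and is not part of FunFlux.

```python
import csv
import io

FIELDS = ("sample", "species", "proteins", "model")


def read_annotation_params(handle):
    """Parse a tab-delimited annotation parameter file into a dict keyed by sample."""
    params = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#Sample ...' header
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} tab-separated columns, got {len(row)}: {row}")
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        params[record["sample"]] = record
    return params


# In-memory stand-in for annotation_parameters.tsv
example = io.StringIO(
    "#Sample\tSpecies\tProteins\tModel\n"
    "ARSEF3097\tBeauveria bassiana\t/path/to/proteins.faa\tfusarium_graminearum\n"
    "150-1\tLecanicillium fungicola\t/path/to/proteins.faa\tfusarium_graminearum\n"
)
params = read_annotation_params(example)
print(params["150-1"]["species"])
# → Lecanicillium fungicola
```

A row in which spaces were typed instead of tabs collapses into a single field and is reported with a `ValueError`, which is easier to diagnose here than inside a running workflow.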
- resources
  In this section you can specify the hardware resources available to the workflow:
  - threads: maximum number of CPUs used by each rule.
  - ram_gb: maximum amount of RAM used by SPAdes. Not necessary in config/config_funnotator.yaml.
- parameters
  Genus filtering: FunFlux includes an optional parameter to specify the fungal genus of contigs you wish to retain in the final assembly. If left blank, FunFlux will automatically keep contigs associated with the most abundant taxon, based on relative composition determined through BLAST analysis. While this approach generally works well, it has limitations, such as reduced resolution at the species level due to reliance on the cumulative best scores of BLAST hits. Additionally, this method may be problematic if the contaminant organism belongs to the same genus as your target organism, or if you are working with co-cultured closely related species or strains. If the genus parameter introduces more issues than benefits, simply remove the genus option from the config.yaml file. This feature is not available in config/config_funnotator.yaml.
  - Using the genus parameter: if a contaminant turns out to be more abundant than your target organism, you can re-run the workflow after reviewing the assembly output. Specify the genus of the fungal taxon you want to keep during the re-run.
  - Disabling the genus filtering: if neither the automatic inference of contaminant contigs nor the manual selection of the desired taxon works for you, simply delete the genus option from the parameters. In this case, only contigs tagged as "no-hit" after the BLAST search will be filtered out.
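The three behaviours of the genus option described above (automatic selection of the most abundant taxon, manual selection, and disabled filtering) can be summarized in a small Python sketch. This is a simplified illustration, not the code FunFlux runs on the BlobTools output: the input structure (a mapping from contig name to a genus label and a cumulative BLAST score), the function name `select_contigs` and its arguments are all assumptions made for the example.

```python
from collections import defaultdict

NO_HIT = "no-hit"


def select_contigs(hits, genus=None, use_genus_filter=True):
    """Return the sorted names of contigs to keep.

    hits maps contig name -> (genus label, cumulative BLAST score);
    contigs without any BLAST hit carry the label 'no-hit'.
    """
    if not use_genus_filter:
        # Genus option removed: only 'no-hit' contigs are discarded.
        return sorted(c for c, (g, _) in hits.items() if g != NO_HIT)
    if genus is None:
        # Genus left blank: pick the taxon with the highest cumulative score.
        totals = defaultdict(float)
        for g, score in hits.values():
            if g != NO_HIT:
                totals[g] += score
        if not totals:
            return []
        genus = max(totals, key=totals.get)
    return sorted(c for c, (g, _) in hits.items() if g == genus)


hits = {
    "c1": ("Beauveria", 900.0),
    "c2": ("Beauveria", 700.0),
    "c3": ("Bacillus", 1200.0),
    "c4": (NO_HIT, 0.0),
}
print(select_contigs(hits))                          # → ['c1', 'c2']
print(select_contigs(hits, genus="Bacillus"))        # → ['c3']
print(select_contigs(hits, use_genus_filter=False))  # → ['c1', 'c2', 'c3']
```

The example also shows why a manual genus can be needed: a single contaminant contig with a high score (`c3`) does not outweigh the target here, but it would if its cumulative score exceeded that of the target genus.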
FunFlux can be executed like any other Snakefile. Please refer to the official Snakemake documentation for more details.
# First, activate the Snakemake Conda environment.
conda activate snakemake
# Navigate inside the FunFlux downloaded directory.
# Customize the "config.yaml" configuration file in the "config" sub-directory
# Launch the workflow
snakemake --sdm conda --cores 50 --jobs 2
IMPORTANT: If you need to analyze previously assembled fungal genomes, provided as FASTA files, use Funnotator instead:
# Activate the Snakemake Conda environment
conda activate snakemake
# Navigate inside the FunFlux directory
# Customize the "config_funnotator.yaml" configuration file in the "config" sub-directory
# Launch the workflow specifying the "Funnotator" Snakefile
snakemake --snakefile workflow/Funnotator --sdm conda --cores 50 --jobs 2
Here's a breakdown of the sub-directories created by FunFlux within the main output folder, along with explanations of their contents. Please note that Funnotator will produce a similar, simplified output.
├── 01.pre-processing
├── 02.assembly
├── 03.post-processing
├── 04.annotation
├── logs
└── report
- 01.pre-processing: QC and statistics of raw reads and trimmed reads, produced by fastp (v0.23.4).
- 02.assembly: Content output by SPAdes (v4.0.0). In addition to the raw contigs, you will also find the filtered contigs (>500 bp and at least 2x) and the selected contigs, which are the contigs retained after BLAST search and decontamination (see parameters in the configuration section above). The follow-up applications used during the workflow will either use selected contigs (e.g. for annotation purposes) or raw, filtered and selected contigs (e.g. to evaluate the genome completeness and contamination).
- 03.post-processing: Contains the following sub-directories:
  - mapping_evaluation: QualiMap (v2.3) output based on filtered contigs.
  - contaminants: Contig selection based on BLAST+ (v2.15.0) search and BlobTools (v1.1.1) analysis. Check the composition text file for a quick overview of the relative composition of your assembly.
  - assembly_evaluation: Quast (v5.2.0) output based on selected contigs.
  - completenness_evaluation: BUSCO (v5.5.0) output based on selected contigs.
  - ITS_extraction: ITSx (v1.1.3) output based on raw contigs and classified using the SINTAX algorithm re-implemented in VSEARCH (v2.28.1). It is advisable to use the latest UNITE database as reference.
- 04.annotation: Contains the following sub-directories:
  - iprscan: Annotation output by InterProScan (v5.65-97.0), in XML format.
  - eggnog: Functional annotation produced by eggNOG-mapper (v2.1.12).
  - antismash: Secondary metabolites inferred by antiSMASH (v7.1.0).
  - funannotate: Prediction and annotation directories output by funannotate (v1.8.15).
    ├── annotate_misc
    ├── annotate_results
    ├── logfiles
    ├── predict_misc
    └── predict_results
- report: MultiQC (v1.23) is used to parse and aggregate the results of the following tools:
This work was supported by BeXyl (Beyond Xylella, Integrated Management Strategies for Mitigating Xylella fastidiosa impact in Europe) - HORIZON-CL6-2021-FARM2FORK-01-04, grant ID 101060593.
Antonielli, L., Brader, G., & Compant, S. (2024). FunFlux: Integrated workflow for fungal genome assembly and annotation. Zenodo. https://doi.org/10.5281/zenodo.13612159
-
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A., & Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477. https://doi.org/10.1089/cmb.2012.0021
-
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., … Nilsson, R. H. (2013). Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution, 4(10), 914–919. https://doi.org/10.1111/2041-210X.12073
-
Blin, K., Shaw, S., Augustijn, H. E., Reitz, Z. L., Biermann, F., Alanjary, M., Fetter, A., Terlouw, B. R., Metcalf, W. W., Helfrich, E. J. N., van Wezel, G. P., Medema, M. H., & Weber, T. (2023). antiSMASH 7.0: New and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Research, 51(W1), W46–W50. https://doi.org/10.1093/nar/gkad344
-
Blum, M., Chang, H.-Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G. A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D. H., Letunic, I., Marchler-Bauer, A., … Finn, R. D. (2021). The InterPro protein families and domains database: 20 years on. Nucleic Acids Research, 49(D1), D344–D354. https://doi.org/10.1093/nar/gkaa977
-
Borodovsky, M., & Lomsadze, A. (2011). Eukaryotic Gene Prediction Using GeneMark.hmm-E and GeneMark-ES. Current Protocols in Bioinformatics / Editoral Board, Andreas D. Baxevanis ... [et Al.], CHAPTER, Unit-4.610. https://doi.org/10.1002/0471250953.bi0406s35
-
Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176
-
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10, 421. https://doi.org/10.1186/1471-2105-10-421
-
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Molecular Biology and Evolution, 38(12), 5825–5829. https://doi.org/10.1093/molbev/msab293
-
Challis, R., Richards, E., Rajan, J., Cochrane, G., & Blaxter, M. (2020). BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes|Genomes|Genetics, 10(4), 1361–1374. https://doi.org/10.1534/g3.119.400908
-
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
-
Edgar, R. C. (2016). SINTAX: A simple non-Bayesian taxonomy classifier for 16S and ITS sequences (p. 074161). bioRxiv. https://doi.org/10.1101/074161
-
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England), 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
-
Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., & Punta, M. (2014). Pfam: The protein families database. Nucleic Acids Research, 42(Database issue), D222–D230. https://doi.org/10.1093/nar/gkt1223
-
Frith, M. C. (2011). A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Research, 39(4), e23. https://doi.org/10.1093/nar/gkq1212
-
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: Quality assessment tool for genome assemblies. Bioinformatics (Oxford, England), 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086
-
Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., White, O., Buell, C. R., & Wortman, J. R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology, 9(1), R7. https://doi.org/10.1186/gb-2008-9-1-r7
-
Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085
-
Jonathan M. Palmer, & Jason Stajich. (2020). Funannotate v1.8.1: Eukaryotic genome annotation [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.4054262
-
Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A. F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S.-Y., Lopez, R., & Hunter, S. (2014). InterProScan 5: Genome-scale protein function classification. Bioinformatics, 30(9), 1236–1240. https://doi.org/10.1093/bioinformatics/btu031
-
KorfLab/SNAP. (2024). [C]. The Korf Lab. https://github.com/KorfLab/SNAP (Original work published 2017)
-
Köster, J., & Rahmann, S. (2012). Snakemake—A scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
-
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), Article 4. https://doi.org/10.1038/nmeth.1923
-
Letunic, I., Khedkar, S., & Bork, P. (2021). SMART: Recent updates, new developments and status in 2020. Nucleic Acids Research, 49(D1), D458–D460. https://doi.org/10.1093/nar/gkaa937
-
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
-
Lowe, T. M., & Eddy, S. R. (1997). tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25(5), 955–964.
-
Majoros, W. H., Pertea, M., & Salzberg, S. L. (2004). TigrScan and GlimmerHMM: Two open source ab initio eukaryotic gene-finders. Bioinformatics (Oxford, England), 20(16), 2878–2879. https://doi.org/10.1093/bioinformatics/bth315
-
Nilsson, R. H., Larsson, K.-H., Taylor, A. F. S., Bengtsson-Palme, J., Jeppesen, T. S., Schigel, D., Kennedy, P., Picard, K., Glöckner, F. O., Tedersoo, L., Saar, I., Kõljalg, U., & Abarenkov, K. (2019). The UNITE database for molecular identification of fungi: Handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research, 47(D1), D259–D264. https://doi.org/10.1093/nar/gky1022
-
Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566
-
Rawlings, N. D., Waller, M., Barrett, A. J., & Bateman, A. (2014). MEROPS: The database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Research, 42(D1), D503–D509. https://doi.org/10.1093/nar/gkt953
-
Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: A versatile open source tool for metagenomics. PeerJ, 4, e2584. https://doi.org/10.7717/peerj.2584
-
Sigrist, C. J. A., de Castro, E., Cerutti, L., Cuche, B. A., Hulo, N., Bridge, A., Bougueleret, L., & Xenarios, I. (2013). New and continuing developments at PROSITE. Nucleic Acids Research, 41(Database issue), D344-347. https://doi.org/10.1093/nar/gks1067
-
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351
-
Slater, G. S. C., & Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6(1), 31. https://doi.org/10.1186/1471-2105-6-31
-
Stanke, M., Keller, O., Gunduz, I., Hayes, A., Waack, S., & Morgenstern, B. (2006). AUGUSTUS: Ab initio prediction of alternative transcripts. Nucleic Acids Research, 34(Web Server issue), W435–W439. https://doi.org/10.1093/nar/gkl200
-
The UniProt Consortium. (2023). UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. https://doi.org/10.1093/nar/gkac1052
-
Thomas, P. D., Ebert, D., Muruganujan, A., Mushayahama, T., Albou, L.-P., & Mi, H. (2022). PANTHER: Making genome-scale phylogenetics accessible to all. Protein Science, 31(1), 8–22. https://doi.org/10.1002/pro.4218
-
Zheng, J., Ge, Q., Yan, Y., Zhang, X., Huang, L., & Yin, Y. (2023). dbCAN3: Automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research, 51(W1), W115–W121. https://doi.org/10.1093/nar/gkad328