Skip to content

Latest commit

 

History

History
201 lines (142 loc) · 9.82 KB

VIRUSBreakend_Readme.md

File metadata and controls

201 lines (142 loc) · 9.82 KB

Release Language [Language] [License]

VIRUSBreakend - Viral Integration Recognition Using Single Breakends

VIRUSBreakend is a high-speed viral integration detection tool designed to be incorporated in the whole genome sequence piplines with minimal additional cost. VIRUSBreakend takes on average 1 hour to run on a 100x coverage human sample. Recommended job/VM size is 4 core / 64Gb memory.

This tool is part of GRIDSS - the Genomic Rearrangement IDentification Software Suite.

Citation

Please cite:

Pre-requisites

VIRUSBreakend is part of the GRIDSS software suite. It can be downloaded from https://github.com/PapenfussLab/gridss/releases

All tools used by VIRUSBreakend must be on PATH including:

  • java
  • GRIDSS
  • Kraken2
  • RepeatMasker
  • samtools
  • bcftools
  • bwa

Set the GRIDSS_JAR environment variable to the location of the GRIDSS jar file

Reference data

The VIRUSBreakend reference data can be downloaded from AWS: https://virusbreakend.s3.us-east-2.amazonaws.com/virusbreakenddb_20210401.tar.gz

The VIRUSBreakend reference data can also be downloaded from https://melbourne.figshare.com/articles/dataset/virusbreakenddb_20210401_tar_gz/14782299

Reference data contents

VIRUSBreakend uses a standard Kraken2 database augmented with NCBI viral neighbour genomes and associated metadata files.

The reference file database contains the following files:

file description
library/viral/library.fna NCBI RefSeq viral genomes
library/added/*.fna NCBI viral neighbour genomes
*.fai samtools index of associated .fna file
*.fna.dict Picard tools sequence dictionary of associated .fna file
hash.k2d Kraken2 database file
taxo.k2d Kraken2 database file
opts.k2d Kraken2 database file
taxonomy/nodes.dmp Snapshot of NCBI taxonomy nodes at time of database creation
taxonomy/names.dmp Snapshot of NCBI taxonomy names at time of database creation. Not technically required by VIRUSBreakend but useful for looking up the name of the assigned taxid
seqid2taxid.map Kraken2 mapping of sequences to NCBI taxonomy IDs
taxid10239.nbr Table of viral neighbour genomes and their associated RefSeq genome

The exact sources for these files can be determined from the paths and queries in virusbreakend-build.

Regenerating the VIRUSBreakend database

Run virusbreakend-build --db virusbreakenddb to download and generate the reference data. This download the NCBI taxonomic information, sequences, virushostdb, the kraken2-build build process, and generates indexes. The index is around 54GB in size. Be aware that the kraken2 build process requires around 150GB of intermediate disk space to download from NCBI and build the index.

WARNING: downloading from NCBI fails regularly. It is likely you will have to manually rerun each failing step in virusbreakend-build to get a functioning database.

virusbreakend-build requires:

  • An internet connection
  • GRIDSS
  • samtools
  • Kraken2
    • dustmasker (part of the NCBI BLAST+ package)
  • EDirect utilities

These can be installed manually, or the BioConda samtools kraken2 and entrez-direct packages can be used.

The the generated directory (virusbreakenddb) contains the files used in the build process. The file virusbreakend.db.virusbreakenddb.tar.gz contains the minimal subset of the virusbreakenddb data required by VIRUSBreakend.

Running VIRUSBreakend

To run VIRUSbreakend, ensure that the database has been build and run the following command:

virusbreakend \
	--kraken2db virusbreakenddb \
	--output sample.virusbreakend.vcf \
	--reference host_reference.fa \
	sample.bam

Reference genomes with viral sequences

Some reference genomes include viral sequence. hs37d5, hs38DH, and GRCh38_full_analysis_set_plus_decoy_hla all include EBV. For VIRUSBreakend to report EBV infection, reads aligned to the EBV sequence ( or chrEBV)

By default, VIRUSBreakend considered any read mapped if it maps to chrEBV or any viral reference sequence in the VIRUSBreakend database (library/viral/*.fna.fai or library/added/*.fna.fai). When checking for matches, VIRUSBreakend uses the exact name, the kraken stripped name, and also the version stripped named. For example, the kraken2 database contains kraken:taxid|10376|NC_007605.1 so VIRUSBreakend will exclude kraken:taxid|10376|NC_007605.1, NC_007605.1, and NC_007605 (thus picking up the NC_007605 EBV reference in hs37d5.fa)

A custom list can be supplied with the --viralreferences command line parameter.

Output

VIRUSBreakend outputs:

  • A VCF containing the integration breakpoints
  • The kraken2 report of the virus(es) for which viral integration was run upon
  • Coverage statistics of the virus(es) for which viral integration was run upon

summary.tsv files

field meaning
taxid_genus NCBI taxonomy ID of genus of viral reference
name_genus NCBI taxonomy genus name
reads_genus_tree Number of reads assigned to any virus in that genus by kraken2
taxid_species NCBI taxonomy ID of species of viral reference
reads_species_tree Number of reads assigned to that species (and any sub-species) by kraken2
name_species NCBI taxonomy species name
taxid_assigned NCBI taxonomy ID of viral reference
name_assigned NCBI taxonomy name of viral reference
reads_assigned_tree Number of reads assigned to the viral reference or taxonomic descendents by kraken2
reads_assigned_direct Number of reads assigned to the viral reference taxid by kraken2
reference name of viral reference used.
reference_taxid NCBI taxonomy ID of viral reference used. This can be a child taxid of taxid_assigned.
reference_kmer_count count of read kmers matching reference used
alternate_kmer_count count of read kmers matching next best reference candidate
#rname name of adjusted viral reference
startpos start position of adjusted viral reference. Always 1
endpos end position of adjusted viral reference. Always equal to the viral contig length
numreads Number of reads mapped to adjusted viral reference
covbases Number of bases with at least 1 read mapped
coverage Percentage of viral sequence with at least one read mapped
meandepth Mean alignment depth
meanbaseq Mean base quality of bases mapped to adjusted viral reference
meanmapq Mean mapping quality of reads mapped to adjusted viral reference
integrations Number of integration breakpoints found.
QCStatus QC status of viral integration

Each viral integration should have 2 integration breakpoints (one for the start, one for the end) although there may be more if the integration site is rearranged or less if one side of the integration site was missed.

QCStatus

QCStatus can be one of the following:

  • "" The empty string indicates no warning or errors associated with this virus

  • LOW_VIRAL_COVERAGE

Less than 10% of the virus has any coverage. This typically occurs when the viral sequence has a short sequence homology with non-viral sequence. This record is likely a false positive and should be ignored.

  • EXCESSIVE_VIRAL_COVERAGE

Regions of the virus have depth of coverage so high, that GRIDSS was unable to call integrations in these regions. The exact bounds of excluded regions can be found in the *.coverage.blacklist.bed files in the virusbreakend.working directory

  • ASSEMBLY_DOWNSAMPLED

Viral coverage was sufficiently high that some regions were downsampled in the GRIDSS assembly process. With the assembly downsampling changes in GRIDSS 2.11.1, viral integration have in these region have probably been correctly called but the QUAL score is lower than it should be. The exact bounds of excluded regions can be found in the *.bed.excluded_*.bed files in the virusbreakend.working directory

  • CHILD_TAXID_REFERENCE

The viral genome chosen by VIRUSBreakend does not match the NCBI taxonomic classification assigned by Kraken2. This typically occurs in scenarios such as HPV-45 where Kraken2 assigns reads to the parent taxon due to sequence commonality between HPV-45 and HPV-18. This is expected behavour for these viral taxa. The taxon assigned reference_taxid

  • UNCLEAR_TAXID_ASSIGNMENT

Less than 60% of reads assigned to nominal kraken2 assigned taxon were directly assigned (the rest were assigned to child taxa). The taxonomic assignment may be unclear - possibly due to co-infection by multiple viruses within the genus.

Ambiguous insertions

The key differentiator of VIRUSBreakend is the ability to detect and classify integration sites in repetative sequences such as centromeres. Due to the repetative nature of these region, such integration sites cannot be unambigously placed in the host genome. In such cases, the mapq encoded in the BEALN field will be 0 and the field may contain multiple candidicate integration sites. Integration sites in which the reported position is ambiguous have a LOW_MAPQ FILTER applied.

The INSRM field contains the repeat sequences identifed in the integration site host sequences. These annotations can be used to classify ambigous integration sites.

By default, VIRUSBreakend filters out potential integrations with signficant overlap with simple or low complexity repeats since such integrations are typically false positives. If a more sensitive call set is required, this filter can be removed by rerunning the final VirusBreakendFilter steps with the MINIMUM_REPEAT_OVERLAP parameter set to any value greater than one (ie require more than 100% repeat overlap).