Merge pull request #78 from raufs/main

Update to v1.5.0
Kalan-Lab · Oct 25, 2024 · 68ab76f
2 parents 6d996c1 + df30bb6
commit 68ab76f
Showing 28 changed files with 1,837 additions and 415 deletions.
diff --git a/README.md b/README.md
@@ -1,58 +1,50 @@
 # *zol (& fai)*
 
-[![Preprint](https://img.shields.io/badge/Preprint-bioRxiv-darkblue?style=flat-square&maxAge=2678400)](https://www.biorxiv.org/content/10.1101/2023.06.07.544063v2)
+[![Preprint](https://img.shields.io/badge/Preprint-bioRxiv-darkblue?style=flat-square&maxAge=2678400)](https://www.biorxiv.org/content/10.1101/2023.06.07.544063v3)
 [![Documentation](https://img.shields.io/badge/Documentation-Wiki-darkgreen?style=flat-square&maxAge=2678400)](https://github.com/Kalan-Lab/zol/wiki)
 [![Docker](https://img.shields.io/badge/Docker-DockerHub-darkred?style=flat-square&maxAge=2678400)](https://hub.docker.com/r/raufs/zol)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/zol/README.html) [![Conda](https://img.shields.io/conda/dn/bioconda/zol.svg)](https://anaconda.org/bioconda/zol/files)
 [![Anaconda-Server Badge](https://anaconda.org/bioconda/zol/badges/latest_release_date.svg)](https://anaconda.org/bioconda/zol)
 [![Anaconda-Server Badge](https://anaconda.org/bioconda/zol/badges/platforms.svg)](https://anaconda.org/bioconda/zol)
 [![Anaconda-Server Badge](https://anaconda.org/bioconda/zol/badges/license.svg)](https://anaconda.org/bioconda/zol)
 
-*zol (& fai)* are tools to search for gene clusters (sets of co-located genes - e.g. viruses/phages or biosynthetic gene clusters) in a target set of (meta-)genomes and to subsequently simplify the identification of interesting functional, evolutionary, and conservation patterns through creating detailed and color-formatted XLSX spreadsheets that can summarize information across 100s to 1000s of homologous gene cluster instances where visualization-based approaches might be overwhelming or computationally intensive to render.
+***zol (& fai)*: tools for targeted searching and evolutionary investigations of gene clusters (sets of co-located genes - e.g. biosynthetic gene clusters, viruses/phages, operons, etc.).**
 
-1. [Program Descriptions](#program-description)
-2. [Installation](#installation)
-3. [Overview of Major Results](https://github.com/Kalan-Lab/zol/wiki/0.-overview-of-major-result-files)
-4. [Short note on resource requirements](#short-note-on-resource-requirements)
-5. [Test Case](#test-case)
-6. [Documetation](https://github.com/Kalan-Lab/zol/wiki)
-7. [Example Usages](https://github.com/Kalan-Lab/zol/wiki/4.-basic-usage-examples)
-8. [Tutorial with Tips and Tricks](https://github.com/Kalan-Lab/zol/wiki/5.-tutorial-%E2%80%90-a-detailed-walkthrough)
-9. [Premade Target Genome Databases](https://github.com/Kalan-Lab/zol/wiki/7.-premade-prepTG-dbs)
-10. [Dependencies](https://github.com/Kalan-Lab/zol/wiki/6.-dependencies)
-11. [Assessing the conservation of a focal sample's BGC-ome, phage-ome, and plasmid-ome using abon, atpoc, and apos](https://github.com/Kalan-Lab/zol/wiki/0.-overview-of-major-result-files#abon-atpoc-and-apos-results)
-12. [(***New***) Summary visualization of 1000s of gene clusters using cgc](https://github.com/Kalan-Lab/zol/wiki/5.3-visualization-of-1000s-of-gene-clusters-using-cgc)
-13. [(***New***) Assessing support for lateral gene transfer using salt](https://github.com/Kalan-Lab/zol/wiki/5.4-horizontal-or-lateral-transfer-assessment-of-gene-clusters-using-salt)
-
-![image](https://github.com/Kalan-Lab/zol/assets/4260723/23c8eae2-ed2f-4c58-bf69-89506c258d9a)
+First, fai allows users to search for homologous/orthologous instances of a query gene cluster in a database of (meta-)genomes. Other tools similar to fai exist, including convenient webservers (which we highlight and recommend as alternatives on [this documentation page](https://github.com/Kalan-Lab/zol/wiki/5.1-tutorial-for-using-zol-with-output-from-fast.genomics-and-CAGECAT)), but fai also has some unique or rarer options. Mainly, fai checks whether gene cluster hits in target (meta-)genomes lie on scaffold/contig edges and takes this into account during both detection and downstream assessment. E.g. fai will mark individual coding genes and gene cluster instances that are on the edge of a scaffold/contig, which can then be used as a filter in zol. *This is important for calculating the conservation of genes across homologous gene clusters!*
 
+After finding homologous sets of gene clusters - using fai or other software - users often wish to investigate their similarity. This is often performed using pairwise similarity assessment via visualization with tools such as clinker, gggenomes, etc. While these tools are great, **if you found 100s or 1000s of gene cluster instances** such visualizations can get overwhelming and computationally expensive to render. To simplify the identification of interesting functional, evolutionary, and conservation patterns across 100s to 1000s of homologous gene cluster instances, we developed zol to be able to perform *de novo* ortholog group predictions and create detailed color-formatted XLSX spreadsheets summarizing information. More recently, we have also introduced scalable visualization tools (*cgc & cgcg*) that allow for simpler assessment of information represented across thousands of homologous gene cluster instances.
 
-**Citation:**
-```
-zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters
-
-R Salamzade, PQ Tran, C Martin, AL Manson, 
-MS Gilmore, AM Earl, K Anantharaman, LR Kalan
-bioRxiv 2023.06.07.544063; doi: https://doi.org/10.1101/2023.06.07.544063
-```
+<p align="center">
+<img src="https://github.com/user-attachments/assets/b0ec16bf-f302-4018-a7eb-91ff8a8b7817" width="600">
+</p>
 
-In addition, please cite important [dependency software or databases](https://github.com/Kalan-Lab/zol/wiki/6.-dependencies) for your specific analysis accordingly.
+### Citation:
+> [zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters](https://www.biorxiv.org/content/10.1101/2023.06.07.544063v3). *bioRxiv 2023.* Rauf Salamzade, Patricia Q Tran, Cody Martin, Abigail L Manson, Michael S Gilmore, Ashlee M Earl, Karthik Anantharaman, Lindsay R Kalan
 
-## Program Descriptions:
+*In addition, please cite important [dependency software or databases](https://github.com/Kalan-Lab/zol/wiki/6.-dependencies) for your specific analysis accordingly.*
 
-### Prepare Target Genomes (prepTG)
+## Main Contents:
 
-**`prepTG`** processes and performs gene-calling or gene-mapping on an input set of genomes to ease and optimize downstream searches using fai.
+1. [Documentation](https://github.com/Kalan-Lab/zol/wiki)
+2. [Overview of Major Results](https://github.com/Kalan-Lab/zol/wiki/0.-overview-of-major-result-files)
+3. [Short note on resource requirements](#short-note-on-resource-requirements)
+4. [Installation](#installation)
+5. [Test Case](#test-case)
+6. [Example Usages](https://github.com/Kalan-Lab/zol/wiki/4.-basic-usage-examples)
+7. [Tutorial with Tips and Tricks](https://github.com/Kalan-Lab/zol/wiki/5.-tutorial-%E2%80%90-a-detailed-walkthrough)
 
-### Find Additional Instances (fai)
+### Auxiliary tools within the suite:
+* [abon, atpoc, and apos: Assessing the conservation of a focal sample's BGC-ome, phage-ome, and plasmid-ome](https://github.com/Kalan-Lab/zol/wiki/0.-overview-of-major-result-files#abon-atpoc-and-apos-results)
+* [(***New***) cgc: Summary visualization of 1000s of gene clusters](https://github.com/Kalan-Lab/zol/wiki/5.3-visualization-of-1000s-of-gene-clusters-using-cgc)
+* [(***New***) cgcg: Network visualization of ortholog groups across 1000s of gene clusters](https://github.com/Kalan-Lab/zol/wiki/5.3-visualization-of-1000s-of-gene-clusters-using-cgc)
+* [(***New***) salt: Assessing support for lateral gene transfer](https://github.com/Kalan-Lab/zol/wiki/5.4-horizontal-or-lateral-transfer-assessment-of-gene-clusters-using-salt)
 
-**`fai`** is a program to search for additional instances of a gene-cluster or genomic locus in some set of target genomes. Inspired by cblaster, CORASON, ClusterFinder, MultiGeneBlast, etc. It leverages DIAMOND alignment similar to [cblaster](https://github.com/gamcil/cblaster) and runs fairly rapidly (allowing it to scale to thousands of genomes and even work on metagenomic assemblies). fai features some key differentiating options relative to other software: (i) can assess syntenic similarity of candidate homologous gene clusters to the query gene cluster, (ii) can allow for looser criteria thresholds for gene cluster detection in target genomes if multiple neighborhoods are identified as homologous and on scaffold edges (thus improving fragmented gene cluster identification due to assembly issues) - similar to lsaBGC-Expansion, (iii) filter secondary neighborhoods - e.g. homologous gene neighborhoods to the query which meet the criteria but are not the best match.
 
-### Zoom on Locus (zol)
+## Short Note on Resource Requirements:
 
-**`zol`** is a program to create table reports showing ortholog group conservation, annotation, and evolutionary stats for any gene-cluster or locus of interest. At it's core it performs ortholog group inference de novo across gene-cluster instances similar to [CORASON](https://github.com/nselem/corason), but uses an InParanoid-like algorithm. Tables are similar but currently more in-depth and feature some different statistics than lsaBGC-PopGene reports. zol produces a basic heatmap, but for visualizations of gene-clusters we recommend other tools such as [clinker](https://github.com/gamcil/clinker), [pyGenomeViz](https://github.com/moshi4/pyGenomeViz), [CORASON](https://github.com/nselem/corason), and [gggenomes](https://github.com/thackl/gggenomes), which we think the in-depth spreadsheet complements nicely. We also provide examples of how zol and skani can be used to select representative gene clusters for such visual investigations. 
+Different programs in the zol suite have different resource requirements. Moving forward, the default settings in the `zol` program itself should usually allow for low memory usage and faster runtime. For thousands of gene cluster instances, we recommend either using the dereplication/reinflation approach (see manuscript for a comparison of evolutionary statistics between this approach and full processing) or using CD-HIT clustering (a greedy incremental clustering approach, which is nicely illustrated/explained on the [MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki#clustering-modes)) to determine protein clusters/families (not true ortholog groups). Disk space is generally not a huge concern for zol analysis, but when working with thousands of gene clusters, intermediate files can temporarily get large.
 
-Critically, ***with the development of some key options, together, fai and zol enable high-throughput detection of orthologs across multi-species datasets comprising of thousands of genomes.***
+Available disk space, however, is the primary concern for `fai` and `prepTG`. This is mostly the case for users interested in the construction and searching of large databases (containing over a thousand genomes). Generally, `prepTG` and `fai` are designed to work on metagenomic as well as genomic datasets and do not have high memory usage, but genomic files take up a lot of space and DIAMOND alignment files can get quite large as well.
 
 ## Installation:
 
@@ -68,12 +60,13 @@ conda create -n zol_env -c conda-forge -c bioconda zol
 conda activate zol_env
 
 # 2. depending on internet speed, this can take 20-30 minutes
-# end product will be ~36 GB! You can also run in minimal mode
-# (which will only download PGAP HMM models < 5 GB) using -m. 
+# end product will be ~40 GB! You can also run in minimal mode
+# (which will only download PGAP HMM models ~8.5 GB) using -m. 
 setup_annotation_dbs.py
 ```
 
->Note, when you create a conda environment using `-n`, the environment will typically be stored in your home directory. However, because the databases can be large, you might prefer to instead setup the conda environment somewhere else with more space on your system using `-p`. For instance, `conda create -p /path/to/drive_with_more_space/zol_conda_env/ -c conda-forge -c bioconda zol`. Then, next time around you would simply activate this environment by providing the path to it: `conda activate /path/to/drive_with_more_space/zol_conda_env/`
+> [!NOTE]
+> When you create a conda environment using `-n`, the environment will typically be stored in your home directory. However, because the databases can be large, you might prefer to instead setup the conda environment somewhere else with more space on your system using `-p`. For instance, `conda create -p /path/to/drive_with_more_space/zol_conda_env/ -c conda-forge -c bioconda zol`. Then, next time around you would simply activate this environment by providing the path to it: `conda activate /path/to/drive_with_more_space/zol_conda_env/`
 
 #### Docker:
 
@@ -92,34 +85,6 @@ chmod a+x ./run_ZOL.sh
 ./run_ZOL.sh
 ```
 
-#### Conda Manual:
-
-```bash
-# 1. clone Git repo and change directories into it!
-git clone https://github.com/Kalan-Lab/zol
-cd zol/
-
-# 2. create conda environment using yaml file and activate it!
-conda env create -f zol_env.yml -n zol_env
-conda activate zol_env
-
-# 3. complete python installation with the following commands:
-python setup.py install
-pip install -e .
-
-# 4. depending on internet speed, this can take 20-30 minutes
-# end product will be 28 GB! You can also run in minimal mode
-# (which will only download PGAP HMM models < 5 GB) using -m.
-# within zol Git repo with conda environment activated, run:
-setup_annotation_dbs.py
-```
-
-## Short Note on Resource Requirements:
-
-Different programs in the zol suite have different resource requirements. Moving forward, the default settings in the `zol` program itself should usually allow for low memory usage and faster runtime. For thousands of gene cluster instances, we recommend to either use the dereplication/reinflation approach (see manuscript for comparison on evolutionary statistics between this approach and a full processing) or using CD-HIT clustering (a greedy incremental clustering approach - which is nicely illustrated/explained on the [MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki#clustering-modes)) to determine protein clusters/families (not true ortholog groups). Disk space is generally not a huge concern for zol analysis, but if working with thousands of gene clusters things can temporarily get large. 
-
-Available disk space is the primary concern however for `fai` and `prepTG`. This is mostly the case for users interested in the construction and searching of large databases (containing over a thousand genomes). Generally, `prepTG` and `fai` are designed to work on metagenomic as well as genomic datasets and do not have a high memory usage, but genomic files stack up in space and DIAMOND alignment files can quite get large as well.
-
 ## Test case:
 
 Following installation, you can run a provided test case focused on a subset of Enterococcal polysaccharide antigen instances in *E. faecalis* and *E. faecium* as such:
@@ -150,14 +115,6 @@ chmod a+x ./test_docker.sh
 
 Note, the script `test_docker.sh` must be run in the same folder as run_ZOL.sh!
 
-#### Conda Manual:
-
-Within the zol GitHub repo, run the following:
-
-```bash
-bash run_tests.sh
-```
-
 ## License:
 
 ```

diff --git a/bin/atpoc b/bin/atpoc
@@ -38,16 +38,13 @@
 import os
 import sys
 import argparse
-import subprocess
 from time import sleep
 from zol import util, fai
-import _pickle as cPickle
 from Bio import SeqIO
-from operator import itemgetter
 from collections import defaultdict
-import traceback
 import pandas as pd
 import math 
+import traceback
 
 def create_parser():
 	""" Parse arguments """
@@ -141,6 +138,17 @@ def atpoc():
 	sys.stdout.write('Parsing phage prediction results\n')
 	logObject.info('Parsing phage prediction results')
 
+
+	scaff_lengths = {}
+	try:
+		with open(sample_genome) as osg:
+			for rec in SeqIO.parse(osg, 'fasta'):  # assumes the sample genome is in FASTA format
+				scaff_lengths[rec.id] = len(rec.seq)
+	except Exception:
+		msg = 'Issue processing genome file.'
+		sys.stderr.write(msg + '\n')
+		sys.stderr.write(traceback.format_exc() + '\n')
+
 	sample_phages = []
 	if vibrant_results_dir != None and os.path.isdir(vibrant_results_dir):
 		for root, dirs, files in os.walk(vibrant_results_dir):
@@ -180,10 +188,14 @@ def atpoc():
 							phage_id, phage_length, topology, coords, n_genes, genetic_code, virus_score, fdr, n_hallmarks, marker_enrichment, taxonomy = line.split('\t')
 							scaff = phage_id.split('|')[0]
 							phage_id = phage_id.replace('|', '_')
-							start_coord, end_coord = coords.split('-')
+							if coords != 'NA':
+								start_coord, end_coord = coords.split('-')
+							else:
+								start_coord = 1 
+								end_coord = scaff_lengths[scaff]
 							additional_info = '; '.join(['virus_score=' + virus_score, 'topology=' + topology, 'genetic_code=' + genetic_code, 'n_hallmarks=' + n_hallmarks, 'marker_enrichment=' + marker_enrichment, 'taxonomy=' + taxonomy])
 							sample_phages.append(['geNomad', 'GN-' + phage_id, scaff, start_coord, end_coord, phage_length, additional_info])
-
+	
 	sys.stdout.write('Found %d prophage predictions!\n' % len(sample_phages))
 	logObject.info('Found %d prophage predictions!' % len(sample_phages))
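
The geNomad hunk above falls back to whole-scaffold coordinates when the `coords` field is `NA`, using scaffold lengths collected from the sample genome. The following is a minimal, standard-library-only sketch of that logic, not the code in `bin/atpoc`. The function names `scaffold_lengths` and `resolve_coords` are hypothetical, the genome is assumed to be FASTA, and coordinates are returned as integers here for consistency (the diff keeps the split values as strings).

```python
import io


def scaffold_lengths(fasta_handle):
    """Return {scaffold_id: sequence_length} for a FASTA handle.

    Hypothetical stdlib-only stand-in for looping over SeqIO.parse(handle, 'fasta').
    """
    lengths = {}
    current = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Use the header text up to the first whitespace as the record ID.
            parts = line[1:].split()
            current = parts[0] if parts else ''
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def resolve_coords(coords, scaff, lengths):
    """Return (start, end) as ints; 'NA' is treated as the whole scaffold."""
    if coords != 'NA':
        start, end = coords.split('-')
        return int(start), int(end)
    # Raises KeyError if the scaffold was not seen in the genome file.
    return 1, lengths[scaff]


demo = io.StringIO(">scaffold_1 demo\nACGTAC\nGT\n>scaffold_2\nAAAA\n")
lengths = scaffold_lengths(demo)
print(resolve_coords('3-7', 'scaffold_1', lengths))  # (3, 7)
print(resolve_coords('NA', 'scaffold_2', lengths))   # (1, 4)
```

In this sketch a scaffold missing from the genome file raises `KeyError`. In the diff, a failure to read the genome is only reported on stderr, so `scaff_lengths` may be empty when the `NA` branch is reached.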