List of computational resources for analyzing microbial sequencing data.

stevetsa/awesome-microbes

awesome-microbes

List of resources, including software packages (and the people developing these methods), for microbiome (16S), metagenomics (WGS, shotgun sequencing), and pathogen identification/detection/characterization. Contributions welcome...

Inspired by awesome-single-cell

Long-read Sequencing Tools

Canu - [Perl/C] - scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

BulkVis - [Python] - An app written in Python3 using Bokeh to visualise raw squiggle data from Oxford Nanopore Technologies (ONT) bulk files. Documentation: https://bulkvis.readthedocs.io

HGAP - [?] - The Hierarchical Genome Assembly Process (HGAP) for long single pass reads generated by the PacBio® Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial sized genomes.

metaFlye - [C++] - scalable long-read metagenome assembly using repeat graphs.

MetaMaps - [perl/R] - Strain-level metagenomic assignment and compositional estimation for long reads.

Minimap - [C] - an experimental tool to efficiently find multiple approximate mapping positions between two sets of long sequences, such as between reads and reference genomes, between genomes and between long noisy reads.

Shasta - [C++] - The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Microbiome (16S)

Explicet - [?] - a free to use, open source software package (GPLv3) available for Windows, Mac, and Linux that facilitates the exploration and visualization of taxonomy-based microbiome datasets (a.k.a. OTU tables).

LotuS - [?] - aimed at scientists and bioinformaticians who want a simple pipeline streamlined to the core functionality of creating OTU and taxa abundance tables at very fast speeds (e.g. processing an 8 GB 16S MiSeq run takes ~30 min on a laptop). LotuS does not include numerical analysis of samples; instead, LotuS output is designed to be easily integrated into existing workflows in, e.g., statistical programming languages like R, QIIME/mothur, or Matlab.

METAREP - [?] - high-performance comparative metagenomics. It provides a suite of web based tools to help scientists to view, query, browse and compare metagenomic annotation data derived from ORFs called on metagenomics reads or assemblies.

Microbiome Util - [perl] - NASTiEr - Sequence Alignment; WigeoN - Chimera detection; TreeChopper - OTU binning; AMOSScmp - Sequence assembly.

mothur - [C++] - OTU-based analysis of 16S data.

Otupipe - [?] - a bash script for OTU clustering based on USEARCH. This page is retained for historical interest because a script based on otupipe was used to create the published QIIME results for the Human Microbiome Project (HMP).

Puma - [Ruby] - Program for Unifying Microbiome Analyses (PUMA) - a novel tool for comprehensive and efficient streamlining of 16S rRNA microbiome taxonomy data for analysis and visualization (CLI/GUI).

QIIME - [Python] - QIIME is designed to take users from raw sequencing data generated on the Illumina or other platforms through publication quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations.

Qiita (cheetah) - [?] - microbiome storage and analysis resource that can run on everything from your laptop to a supercomputer. It is built on top of the widely used QIIME package, and enables the exploration of -omics data.

UPARSE - [?] - generates OTUs that are far superior to state-of-the-art methods including QIIME, mothur and AmpliconNoise on mock community tests. OTU representative sequences are more accurate predictions of biological sequences, and the number of OTUs is much closer to the number of species.

USEARCH - [?] - a high-throughput sequencing tool that offers read processing, clustering (ESTs, OTUs, +more), and diversity and taxonomy analysis algorithms in a single package. USEARCH's database search feature is 10-100 times faster than BLAST, and the documentation is thorough and user-friendly. The 32-bit version is free, including for commercial use. A paid 64-bit version is also available.

Metagenomics (WGS, Shotgun sequencing)

bioBakery-MetaPhlAn - [python] - a virtual environment platform that provides meta'omic analysis tools.

KrakenUniq - [C++] - fast and accurate k-mer-based metagenomic binning tool. Requires considerable RAM.
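As a rough illustration of what "k-mer-based" assignment means, the sketch below indexes every k-mer of a few reference sequences in memory and assigns a read to the reference that shares the most k-mers with it. This is a toy under stated assumptions, not how any tool listed here is implemented: the function names (`kmers`, `build_index`, `classify`), the sequences, and the majority-vote rule are all hypothetical. Production classifiers use far more compact data structures and taxonomy-aware assignment.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Return the label sharing the most k-mers with the read, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for label in index.get(kmer, ()):
            hits[label] += 1
    if not hits:
        return None  # no k-mer of the read occurs in any reference
    return hits.most_common(1)[0][0]


# Toy references (made-up sequences, not real genomes).
references = {
    "taxon_A": "ACGTACGTGGCCTTAA",
    "taxon_B": "TTGACCATGCAAGGTC",
}
index = build_index(references, k=5)
print(classify("CGTGGCCTT", index, k=5))
```

Holding the whole index in a Python dict also hints at why exact k-mer lookup against large reference collections tends to be memory-hungry.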

Mauve - [?] - a system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignments provide a basis for research into comparative genomics and the study of genome-wide evolutionary dynamics.

MetaKallisto - [Python] - Pseudoalignment for metagenomic read assignment.

MetaMeta - [python] - Integrates metagenome analysis tools to improve taxonomic profiling. doi

SMART Metagenomics Classifier - [C++] - Metagenomics aligner pipeline that ingests WGS FASTQ files and uses a highly parallel k-mer search strategy to search against all of NCBI GenBank. Link to manuscript

Wochenende - [Python3, BASH] - Metagenomics alignment pipeline for short and long read metagenomic analysis. SLURM enabled for cluster usage.

R/Bioconductor tools

curatedMetagenomicData - a curated database of standardized metadata and human microbiome data from shotgun metagenomics, from the Human Microbiome Project and numerous published datasets. Provides metabolic functional potential (HUMAnN) and taxonomic profiles (MetaPhlAn), including linked phylogenetic information using TreeSummarizedExperiment, distributed via Bioconductor's ExperimentHub with a command-line interface also available.

HMP16SData - HMP16SData is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions 1–3 and 3–5. Data are as downloaded from the HMP Data Analysis and Coordination Center. Processed data is provided as SummarizedExperiment class objects via ExperimentHub.

bugsigdbr - The bugsigdbr package implements convenient access to bugsigdb.org from within R/Bioconductor. The goal of the package is to facilitate import of BugSigDB data into R/Bioconductor, provide utilities for extracting microbe signatures, and enable export of the extracted signatures to plain text files in standard file formats such as GMT.

mia - mia implements tools for microbiome analysis based on the SummarizedExperiment, SingleCellExperiment and TreeSummarizedExperiment infrastructure. Data wrangling and analysis in the context of taxonomic data is the main scope. Additional functions for common tasks are implemented, such as community indices calculation and summarization.
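For readers unfamiliar with the term "community indices", here is a minimal sketch of two common ones, observed richness and the Shannon diversity index H = -Σ pᵢ ln pᵢ, computed from per-taxon counts for a single sample. It is written in plain Python for illustration only and is not the mia API (mia is an R/Bioconductor package); the function names and the counts are hypothetical.

```python
import math


def observed_richness(counts):
    """Number of taxa with a nonzero count."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Made-up counts for one sample; taxon names are only labels here.
sample = {"Bacteroides": 50, "Prevotella": 30, "Faecalibacterium": 20, "Escherichia": 0}
counts = list(sample.values())
print(observed_richness(counts), round(shannon_index(counts), 3))
```

A sample dominated by one taxon gives H near 0, while an even spread over n taxa gives the maximum, ln(n).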

TreeSummarizedExperiment - TreeSummarizedExperiment is a Bioconductor data structure extending SingleCellExperiment to include hierarchical information on the rows or columns of the rectangular data, such as a Newick phylogenetic tree.

phyloseq - phyloseq provides a set of classes and tools to facilitate the import, storage, analysis, and graphical display of microbiome census data.

Microbe (pathogen, bacterial, viral) Identification/Detection/Characterization

GenomeGraphR - [R] - A user-friendly open-source web application for foodborne pathogen Whole Genome Sequencing data integration, analysis, and visualization.

Pathoscope - [python] - Species identification and strain attribution with unassembled sequencing data.

MetaHIT - [?] - DNA microarrays and high-throughput DNA re-sequencing technology for structural and functional analysis of microbial populations.

POPSICLE - [R] - a software suite to determine population structure and Ancestral Determinants of Phenotypes using Whole Genome Sequencing data.

Visualization

Krona - [python] - Interactive metagenomic visualization in a Web browser.

MEGAN - [?] - The most powerful interactive microbiome analysis tool. Analyse metagenome, metatranscriptome and amplicon sequences from multiple sources.

Other Tools

AmpliconNoise - [?] - a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.

BLAST - [C/C++] - Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. Introduced in 2009, BLAST+ is an improved version of BLAST command line applications.

BLAST+ Docker - [NA] - Docker image for BLAST.

gist, genepuddle, and gargle - [?] - filter contaminants from RNA-seq sequences to make them suitable for alignment and analysis.

MLSTEZ - [python] - MLSTEZ is designed for next generation sequencing technology (PacBio CCS or Roche 454 platform) based MLST methods.

Workflows

SAMSA2 - [NA] - Simple Analysis of Metatranscriptomes by Sequence Annotation.

Web Portals

CosmosID - Exploring the Universe of Microbes.

One Codex - One Codex is a data platform for applied microbial genomics.

NCI Cloud Resources - A cloud platform to analyze microbe data. Currently available pipelines - BLAST-based microbe identification and characterization, mothur, QIIME, and MetaPhlAn with built-in visualization. Contact @stevetsa for more information.

MG-RAST - Metagenomics Analysis Server.

Nephele - Microbiome Analysis without Boundaries.

PATRIC - Bacterial database and analysis platform.

ViPR - Viral pathogen database and analysis platform.

BugSigDB - A Comprehensive Database of Published Microbial Signatures. BugSigDB is a community-editable Semantic Mediawiki knowledge management system that standardizes, enables analysis, and allows bulk download of differential abundance results from all types of metagenomic studies.

Journal Articles, etc.

BIOM Format - is designed to be a general-use format for representing biological sample by observation contingency tables. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium supported project.
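To make "sample by observation contingency table" concrete, the sketch below turns a small dense count matrix (observations as rows, samples as columns) into a sparse list of `[row, column, value]` triples and serializes it as JSON. The field names are loosely modeled on the BIOM 1.0 JSON layout as an assumption; the result is deliberately incomplete and is not a valid BIOM file, so consult the format specification before relying on any key. The helper name `to_sparse_table` and the OTU and sample IDs are hypothetical.

```python
import json


def to_sparse_table(counts, observations, samples):
    """Build a sparse observation-by-sample table from dense rows of counts."""
    data = [
        [i, j, value]
        for i, row in enumerate(counts)
        for j, value in enumerate(row)
        if value != 0  # keep only nonzero cells
    ]
    return {
        "rows": [{"id": obs} for obs in observations],
        "columns": [{"id": sample} for sample in samples],
        "matrix_type": "sparse",
        "shape": [len(observations), len(samples)],
        "data": data,
    }


# Made-up counts: 3 observations (e.g. OTUs) across 2 samples.
counts = [
    [5, 0],
    [0, 3],
    [2, 7],
]
table = to_sparse_table(counts, ["OTU_1", "OTU_2", "OTU_3"], ["S1", "S2"])
print(json.dumps(table, indent=2))
```

Storing only the nonzero cells matters in practice because microbiome count tables are typically dominated by zeros.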

Experimental and analytical tools for studying the human microbiome - Nature Reviews Genetics 13, 47-58 (January 2012) | doi:10.1038/nrg3129.

Metagenomics - Tools and other Points of Interest - list compiled by @bioinformer

Coronavirus Disease (COVID-19) Outbreak Resources

International Government Organization

US Center for Disease Control - Situation Summary - "The Centers for Disease Control and Prevention (CDC) is closely monitoring an outbreak of respiratory illness caused by a novel (new) coronavirus first identified in Wuhan, Hubei Province, China. Chinese authorities identified the new coronavirus, which has resulted in thousands of confirmed cases in China, including cases outside Wuhan City. Additional cases have been identified in a growing number of other international locations, including the United States. There are ongoing investigations to learn more."

CDC - Testing for the 2019 Novel Coronavirus - “Centers for Disease Control and Prevention (CDC) 2019-Novel Coronavirus (2019-nCoV) Real-Time Reverse Transcriptase (RT)-PCR Diagnostic Panel.” It is intended for use with the Applied Biosystems 7500 Fast DX Real-Time PCR Instrument with SDS 1.4 software. CDC is shipping the test kits to laboratories CDC has designated as qualified, including U.S. state and local public health laboratories, Department of Defense (DOD) laboratories and select international laboratories. The test kits are bolstering global laboratory capacity for detecting SARS-CoV-2.

US National Institutes of Health - Coronavirus Disease 2019 (COVID-19) Situation Summary.

US National Institute of Allergy and Infectious Diseases - "Current studies at NIAID-funded institutions and by scientists in NIAID laboratories include efforts that build on previous work on SARS- and MERS-CoVs. For example, researchers are developing diagnostic tests to rapidly detect 2019-nCoV infection and exploring the use of broad-spectrum anti-viral drugs to treat 2019-nCoVs, the authors note. NIAID researchers also are adapting approaches used with investigational SARS and MERS vaccines to jumpstart candidate vaccine development for 2019-nCoV. Advances in technology since the SARS outbreak have greatly compressed the vaccine development timeline, the authors write. They indicate that a candidate vaccine for 2019-nCoV could be ready for early-stage human testing in as little as three months as compared to 20 months for early-stage development of an investigational SARS vaccine."

US Food and Drug Administration - "FDA is working with U.S. Government partners, including the U.S. Centers for Disease Control and Prevention (CDC), and international partners to closely monitor an outbreak caused by a novel (new) coronavirus first identified in Wuhan City, Hubei Province, China."

World Health Organization - "This page is a one-stop shop for all information and guidance from WHO regarding the current outbreak of novel coronavirus (2019-nCoV) that was first reported from Wuhan, China, on 31 December 2019. Please visit this page for daily updates. WHO is working closely with global experts, governments and partners to rapidly expand scientific knowledge on this new virus, to track the spread and virulence of the virus, and to provide advice to countries and individuals on measures to protect health and prevent the spread of this outbreak." Situation Reports

Data Repositories

C-I-TASSER - "This page contains 3D structural models and function annotation for all proteins in the 2019-nCoV genome. The structure models are generated by the C-I-TASSER pipeline, which utilizes deep convolutional network based contact-map predictions to guide the I-TASSER fragment assembly simulations. Benchmark and blind CASP tests showed that C-I-TASSER generates models with a higher accuracy than I-TASSER does, especially for the protein targets lack of homologous templates."

Chinese National Genomics Data Center - 2019 Novel Coronavirus Resource.

Coronavirus Structural Task Force (Thorn Lab) - This repository is a global public resource for the structures from beta-coronavirus with a focus on SARS-CoV and SARS-CoV-2. You can find here:

  • The original files for 19 of the 28 proteins in SARS-CoV and SARS-CoV-2, over 300 different structures.
  • Re-refined structures from different contributors.
  • Validation statistics for these and the original structural models.
  • Diagnostic data for the quality of the experimental data.

The COVID Tracking Project - The COVID Tracking Project collects and publishes the most complete testing data available for US states and territories. This project was launched out of The Atlantic to fill a major gap in publicly available COVID-19 testing data. Johns Hopkins University maintains a comprehensive case count, but no governmental or institutional source is publishing complete testing data—including not just identified cases, but how many people have been tested, and where. Without this data, we can't make informed decisions or accurately communicate risks.

Global Initiative on Sharing All Influenza Data (GISAID) - The GISAID Initiative, also known as a Global Initiative on Sharing All Influenza Data, involves public-private-partnerships between the Initiative's administrative arm Freunde of GISAID e.V., a registered non-profit association, and governments of the Federal Republic of Germany, the official host of the GISAID platform and EpiFlu™ database, Singapore and the United States of America, with support from private and corporate philanthropy.

NCBI GenBank - Entrez Nucleotide/RefSeq sequences, BLAST against a custom Betacoronavirus database, the new NCBI Virus resource, SRA sequences, reference genome, and PubMed articles.

Protein Data Bank - Coronavirus protein structures.

Radiological Society of North America (RSNA) - RSNA is committed to connecting radiologists and the radiology community to the most timely and useful COVID-19 information and resources. Bookmark this page to access the latest guidance, original research, image collection and more. Today, RSNA is announcing an agreement to collaborate closely with the European Imaging COVID-19 AI initiative, supported by the European Society of Medical Imaging Informatics.

Virus Pathogen Database and Analysis Resource (ViPR) - Nucleotide sequences, preliminary comparative genomics, tools for sequence alignment, phylogenetic, SNP, BLAST, annotation and analysis.

Global community (data science) efforts to combat COVID-19

Awesome-coronavirus - Another community-curated list of COVID-19 related resources.

Call for expression of interest for contribution to the Linked Open Data for Global Disaster Risk Research - The global pandemic is a powerful reminder of the necessity of the international community’s intensified and sustained commitment to emergency preparedness. We are thus inviting experts in disaster risk reduction data and policy issues to collaborate on preparing these documents.

CodeVsCOVID-19 - The world’s brightest minds collaborate in a 72h non-profit online hackathon to fight the COVID-19 crisis. This is an initiative under the patronage of the Swiss Federal Department of Economic Affairs, Education and Research (EAER) and the Federal Department of Home Affairs (FDHA). The first edition starts March 27, 5pm CET. You need to sign up by Friday, March 27, 4pm at the latest.

COVID-19 Global Hackathons - The COVID-19 Global Hackathon is an opportunity for developers to build software solutions that drive social impact, with the aim of tackling some of the challenges related to the current coronavirus (COVID-19) pandemic. We’re encouraging YOU - innovators around the world - to #BuildforCOVID19 using technologies of your choice across a range of suggested themes and challenge areas - some of which have been sourced through health partners including the World Health Organization and scientists at the Chan Zuckerberg Biohub. The hackathon welcomes locally and globally focused solutions, and is open to all developers - with support from technology companies and platforms including Facebook, Giphy, Microsoft, Pinterest, Slack, TikTok, Twitter and WeChat, who will be sharing resources to support participants throughout the submission period.

COVID-19 Open Research Dataset Challenge (CORD-19) - In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses... We are issuing a call to action to the world's artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions.

Folding@Home - This repository will contain all input files and generated datasets for the Folding@home efforts to better understand how the SARS-CoV-2 virus that causes COVID-19 can be targeted with small molecule and antibody therapeutics. This repository will be continuously updated to share results that are being generated on Folding@home. You can follow along with news updates on the Folding@home blog and Folding@home twitter feed. More information on this project can be found on this Folding@home news post.

FoldIt - COVID Spike protein - You don't have to be a scientist to do science! Download and play Foldit and you can help researchers discover new antiviral drugs that might stop coronavirus! The most promising solutions will be manufactured and tested at the University of Washington Institute for Protein Design in Seattle. Foldit is run by academic research scientists.

Hack the Crisis Norway - March 27-29, 2020 - Hack the Crisis Norway is an online hackathon set up to meet the challenges our society is facing as a result of the coronavirus. The event is organised by volunteers from the tech and startup community in Norway.

Hack COVID-19 - Join us for a global online hackathon hosted by Starfish. It’s imperative that we create solutions to address Coronavirus challenges and keep people as safe and healthy as possible! We’re looking to raise $20K for prizes, production, and organizational costs for Hack Covid-19. In addition, we’re looking for subject matter experts, donations of time, and appreciate support from all verticals.

Hack Quarantine - March 23-April 12, 2020 - Major League Hacking - A fully-online, people-focused hackathon bringing people together to use their skills to help combat the issues the world is facing with the COVID-19 pandemic. By working with medical professionals and industry, we’ll provide the knowledge and tools to empower hackers to work towards improving health, remote working and helping vulnerable populations.

Hack for Wuhan - March 5-7, 2020 - Hack for Wuhan, by Wuhan2020 unites developers, designers, builders, and creators all over the world to use technology to come up with solutions to help fight the current outbreak of coronavirus disease (COVID-19).

Harmony Hacks - HarmonyHacks is a high school hackathon—a 12-hour programming event where students work in teams to make hardware and software projects. This year, we aim to have around 150 attendees with a 1:1 gender ratio.

MachineHack Covid-19 - In the coming weeks and months, we at MachineHack (an Analytics India Magazine initiative) along with our community members will ominously examine how the coronavirus could affect different nations. Thereby, we invite MachineHackers to predict potential COVID-19 cases across all the globe on an everyday basis. The objective of the hackathon is to gauge COVID-19 on three metrics- confirmed cases, recovered cases and death events for the next day using historical data as on a given date.

Openlink Progressive SARS-CoV-2 (COVID-19) Outbreak Knowledge Graph Generation & Exploitation - Along with the rest of the world, we have been anxiously following the development of the virus COVID-19 (aka coronavirus). Crucial to fighting against this disease is the need to harness the power of Data (Measurements), Information (Metrics), and Knowledge (Insights) via open data access oriented infrastructure such as what’s provided by the Linked Open Data (LOD) Cloud – the world’s largest Knowledge Graph comprising massive enclaves associated with BioInformatics, Genomics, Life Sciences, Biology, Molecular Biology, Chemistry, etc.

UGA COVID-19 Virtual Hackathon - March 25, 2020 - The hackathon’s aim is to produce meaningful public health education and preparedness information for Georgia. The deliverables could be an infographic, a video for social media, a brief one-pager, an op-ed, a radio ad…get creative! CPH faculty and staff will be on hand to support your efforts. You can sign up to work as individuals or in teams up to five people. Multidisciplinary teams are highly encouraged. All degree levels are welcome.

Zindi/AI4D - Predict the Global Spread of COVID-19 - Accurately modelling the spread of these viral diseases is critical for policymakers and health workers to take appropriate actions to contain and mitigate the impact of these diseases. This challenge asks data scientists on Zindi to accurately predict the spread of COVID-19 around the world over the next few months. Solutions will be evaluated against future data. The effects of COVID-19 have yet to emerge as the situation is evolving rapidly. With this challenge we hope to contribute to the global body of knowledge which will help stem the impact of pandemics such as this one as well as those in the future. This challenge is sponsored by the Artificial Intelligence for Development Africa (AI4D-Africa) Network.

Testing

CDC Guideline on Testing - CDC provides the test kits for public health laboratories (PHLs) to perform real-time RT-polymerase chain reaction (rRT-PCR) detection of the SARS-CoV-2 virus (the virus that causes COVID-19) in respiratory specimens. These test kits are available through the International Reagent Resource (IRR).

Currently, genomic RNA material can be used for validation purposes at biosafety level 2 laboratories (BSL-2). Genomic RNA material is available through BEI Resources. BEI Resources was established by NIAID/NIH to provide reagents, tools and information for studying Category A, B, and C priority pathogens and other microbes.

Visualization

Epidemic Calculator - At the time of writing, the coronavirus disease of 2019 remains a global health crisis of grave and uncertain magnitude. To the non-expert (such as myself), contextualizing the numbers, forecasts and epidemiological parameters described in the media and literature can be challenging. I created this calculator as an attempt to address this gap in understanding.

JHU CSSE Near-real-time mapping of 2019-nCoV - Near-real-time mapping of the pathogen.

How the Virus Got Out - New York Times, accessed March 22, 2020.

Nextstrain - Real-time tracking of pathogen evolution. Chinese Version/中文版

Tableau COVID-19 Visualizations - Stay up to date with the most impactful Coronavirus visualizations from the Tableau Community.

Publications

American Society for Microbiology - ASM is providing free access to more than 100 research articles published over the last year in ASM’s 18 scholarly journals to support research efforts and communications about the novel coronavirus (2019-nCoV).

bioRxiv - 2019nCoV articles.

Elsevier - Elsevier’s free health and medical research on novel coronavirus (2019-nCoV).

medRxiv - 2019nCoV articles.

PubMed - PubMed articles on the pathogen.

Science Magazine - 2019-nCoV Q&A.

The Lancet - 2019-nCoV Resource Centre.

The New England Journal of Medicine - A collection of articles and other resources on the 2019 Novel Coronavirus outbreak, including clinical reports, management guidelines, and commentary.

Wiley - In response to the outbreak in China, we’re providing free access to all Wiley published articles related to #coronavirus.

Clinical Trials for 2019 Novel Coronavirus

ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world.

National Institutes of Health News - NIH clinical trial of remdesivir to treat COVID-19 begins. Study enrolling hospitalized adults with COVID-19 in Nebraska.

Twitter hashtags

#COVID19 #COVID2019 #COVID-19 #FlattenTheCurve #2019ncov #ncov2019 #ncov
