GitHub

filtering and identifying non-host DNA pipeline for 03-713 Bioinformatics Practicum at Carnegie Mellon University, Spring 2021 by Jeremy Fisher, Simon Levine, Sid Reed and Tomas Matteson

fainhD does the following

Filter RNA-seq data to remove host sequences
Assemble unknown sequences into contigs
BLAST virus sequences against known viruses
Predict functional ORFs in viral sequences
Search for structural elements in virus sequences

It takes an Illumina paired-read sequencing files and report these into a json-lines file with the following columns:

query_name: arbitrary name assigned by the contig assembly software
contiq_seq: contiguous DNA sequence itself
rfam_e_value: expect value for a query into the rfam structure database
blast_results: paired loci and expect values from BLAST, with keys sseqid for the loci and evalue for the expect value

This report, once completed, is deposited in report/{sample}.json, where {sample} is the name of library.

Useful intermediary files are available in the data/processed directory, including:

{sample}_blast_results.csv: comma-delimited table of BLAST alignments to determine the identity of viral sequences
{sample}_orf.fasta: open reading frame predictions
{sample}_structural_predictions.cmscan visualized alignments against the rfam database, demonstrating RNA secondary structure (if any is found)
{sample}_structural_predictions.tblout alignments against the rfam database, summarized

Installation

We organize the pipeline using Snakemake.

Please refer to Snakemake's installation instructions, as Snakemake is the only dependency. It will, in turn, handle the pipelines software and data dependencies (besides the fastq files themselves).

Note that you may need to run pip install ftputil if you plan to use remote files (as is the default).

Then, clone this repository and run as follows:

To run

After cloning, cd into this repository. Change the input.yaml to refer to the proper dataset (as described below). There are two pipelines: one for single-ended sequencing (workflow/pe.smk) and one for pair-ended sequencing (workflow/se.smk). Depending on the sequencing reaction, invoke as follows (with, for example, twelve cores):

snakemake --use-conda -j12 -s workflow/pe.smk
# or...
snakemake --use-conda -j12 -s workflow/se.smk

To use an HPC, an example job script (for pair-ended datasets) is supplied at run.job:

sbatch run.job

Using custom data

Local

Only paired-end read are supported. If those data are available locally (e.g., if they were downloaded before-hand), put them in the data/raw/reads folder and specify the library name in the input.yaml file. For example, if the read files were "foo_R1.fastq" and "foo_R2.fastq.fastq" and specify the sample name as:

sample: foo

Sequence Read Archive

We facilitate using any paired-end RNAseq dataset from SRA. Simply specify the accession number as the sample:

sample: SRR8250770

This would run the pipeline on the HPV-positive experiment under the accession number SRR8250770

Batch Invocation

Snakemake is built to produce target files from source files. By specifying a sample, internally the pipeline requests the target file report/{sample}.json. This, in turn, produces the other dependencies in the pipeline like, for example, the contigous reads fasta file.

This makes it simple to analyze multiple libraries simultaneously. For example, to analyze the SRA experiments SRR8250770 and SRR8250765 (which are HPV-positive and HPV-negative, respectively):

snakemake --use-conda -j12 -s workflow/pe.smk reports/{SRR8250770,SRR8250765}.json

It shall be noted that this is simply a bash expansion. The shell actually runs the following

snakemake --use-conda -j12 workflow/pe.smk reports/SRR8250770.json reports/SRR8250765.json

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
data		data
deliverables		deliverables
docs		docs
reports		reports
workflow		workflow
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
LICENSE.md		LICENSE.md
README.md		README.md
fainhD.png		fainhD.png
input.yaml		input.yaml
requirements-docs.txt		requirements-docs.txt
sbatch_example.job		sbatch_example.job

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

To run

Using custom data

Local

Sequence Read Archive

Batch Invocation

About

Releases

Packages

Contributors 3

Languages

License

simonlevine/fainhD

Folders and files

Latest commit

History

Repository files navigation

Installation

To run

Using custom data

Local

Sequence Read Archive

Batch Invocation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages