filtering and identifying non-host DNA pipeline for 03-713 Bioinformatics Practicum at Carnegie Mellon University, Spring 2021 by Jeremy Fisher, Simon Levine, Sid Reed and Tomas Matteson
fainhD does the following
- Filter RNA-seq data to remove host sequences
- Assemble unknown sequences into contigs
- BLAST virus sequences against known viruses
- Predict functional ORFs in viral sequences
- Search for structural elements in virus sequences
It takes an Illumina paired-read sequencing files and report these into a json-lines
file with the following columns:
query_name
: arbitrary name assigned by the contig assembly softwarecontiq_seq
: contiguous DNA sequence itselfrfam_e_value
: expect value for a query into the rfam structure databaseblast_results
: paired loci and expect values from BLAST, with keyssseqid
for the loci andevalue
for the expect value
This report, once completed, is deposited in report/{sample}.json
, where {sample}
is the name of library.
Useful intermediary files are available in the data/processed
directory, including:
{sample}_blast_results.csv
: comma-delimited table of BLAST alignments to determine the identity of viral sequences{sample}_orf.fasta
: open reading frame predictions{sample}_structural_predictions.cmscan
visualized alignments against the rfam database, demonstrating RNA secondary structure (if any is found){sample}_structural_predictions.tblout
alignments against the rfam database, summarized
We organize the pipeline using Snakemake.
Please refer to Snakemake's installation instructions, as Snakemake is the only dependency. It will, in turn, handle the pipelines software and data dependencies (besides the fastq
files themselves).
Note that you may need to run pip install ftputil
if you plan to use remote files (as is the default).
Then, clone this repository and run as follows:
After cloning, cd
into this repository. Change the input.yaml
to refer to the proper dataset (as described below). There are two pipelines: one for single-ended sequencing (workflow/pe.smk
) and one for pair-ended sequencing (workflow/se.smk
). Depending on the sequencing reaction, invoke as follows (with, for example, twelve cores):
snakemake --use-conda -j12 -s workflow/pe.smk
# or...
snakemake --use-conda -j12 -s workflow/se.smk
To use an HPC, an example job script (for pair-ended datasets) is supplied at run.job
:
sbatch run.job
Only paired-end read are supported. If those data are available locally (e.g., if they were downloaded before-hand), put them in the data/raw/reads
folder and specify the library name in the input.yaml
file. For example, if the read files were "foo_R1.fastq" and "foo_R2.fastq.fastq" and specify the sample name as:
sample: foo
We facilitate using any paired-end RNAseq dataset from SRA. Simply specify the accession number as the sample:
sample: SRR8250770
This would run the pipeline on the HPV-positive experiment under the accession number SRR8250770
Snakemake is built to produce target files from source files. By specifying a sample
, internally the pipeline requests the target file report/{sample}.json
. This, in turn, produces the other dependencies in the pipeline like, for example, the contigous reads fasta
file.
This makes it simple to analyze multiple libraries simultaneously. For example, to analyze the SRA experiments SRR8250770
and SRR8250765
(which are HPV-positive and HPV-negative, respectively):
snakemake --use-conda -j12 -s workflow/pe.smk reports/{SRR8250770,SRR8250765}.json
It shall be noted that this is simply a bash expansion. The shell actually runs the following
snakemake --use-conda -j12 workflow/pe.smk reports/SRR8250770.json reports/SRR8250765.json