HASSL is a snakemake pipeline that takes RNA-Seq SRA run accession numbers and outputs both sorted BAM alignments and raw read counts for each run, quickly and uniformly.
```
snakemake -s hassl.py resources -j
echo SRR1200675 > accessions.txt
snakemake -s hassl.py -j --config ACCESSION_FILE='accessions.txt'
```
This will produce the following output files:

- `SRR1200675.featureCounts.counts` - featureCounts raw count file
- `SRR1200675.hisat.crsm` - Picard CollectRnaSeqMetrics output file
- `SRR1200675.pass` - created when, based on the Picard output, the BAM file passed the QC cutoffs
- various log files found in `log/`
HASSL includes a handy `resources` tool to gather references and build the HISAT index for you. This will put the GRCh38 reference and annotation files in a reference directory (default is `/mnt/resources`) and build the HISAT index there as well. Make sure you have write access to this directory; if you want to change it, set the `REFERENCE_DIR` variable in `hassl.py` to a directory that you have write access to.
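As a sketch, the relevant line in `hassl.py` might look like this (the exact placement in the file may differ; `REFERENCE_DIR` is the variable named above):

```python
# in hassl.py: directory where 'snakemake -s hassl.py resources' downloads
# the GRCh38 reference/annotation files and builds the HISAT index
REFERENCE_DIR = "/mnt/resources"  # change to a directory you can write to
```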
In order to set up your environment, you'll need to install the following programs and then edit the `HISAT_BUILD` variable in `hassl.py` so HASSL knows where to find the HISAT build executable:

- snakemake
- hisat - follow the HISAT directions to compile it with SRA support
- Other dependencies: gunzip, wget

Then run HASSL to set up your environment: `snakemake -s hassl.py resources -j`
Be aware of the computational resources required to get these files: unpacking the reference FASTA file will take more than 30 GB of disk space, and indexing will need at least 10 GB of RAM.
Before running HASSL, you will need to install the following programs:
- featureCounts
- Picard
- [samtools rocks](https://github.com/dnanexus/samtools)
- samtools
Edit `hassl.py` to reflect the location of the HASSL install directory (`HASSL`), your reference files, and the executables that HASSL will use (see the EXECUTABLE LOCATIONS section). The default location under the `/mnt` directory is where `snakemake -s hassl.py resources` will try to put them. Also, you may want to adjust the thread count (`THREADS`) to be equal to or less than the number of threads on your computer.
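For illustration, the variables mentioned above might be set like this (the paths are placeholders, not defaults; check your copy of `hassl.py` for the full EXECUTABLE LOCATIONS section):

```python
# in hassl.py -- paths below are illustrative placeholders, not real defaults
HASSL = "/home/me/HASSL"                     # HASSL install directory
HISAT_BUILD = "/usr/local/bin/hisat-build"   # HISAT index builder, with SRA support
THREADS = 8   # set equal to or less than the CPU threads on your machine
```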
Isolate the SRA run_accession IDs you want to run and put them in a file, one per line. Do not leave any lines blank. The pipeline looks for this input file as `accessions.txt` by default.
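If you are assembling the accession list programmatically, a minimal sketch (this snippet is illustrative, not part of HASSL):

```python
# Write SRA run accessions one per line, with no blank lines.
# SRR1200675 is the run used in the quick-start example above.
runs = ["SRR1200675"]

with open("accessions.txt", "w") as fh:
    fh.write("\n".join(runs) + "\n")
```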
Run it!

```
snakemake -s hassl.py -j
```
You'll probably want to be on a fairly large machine for this (e.g., 16 CPUs). Refer to the snakemake documentation on how to (trivially!) run snakemake efficiently in a cluster environment that requires job submission.
HASSL automatically deletes the BAM files to save space. If you want to keep all your bam and bam.bai files, edit `hassl.py` and remove the `temp()` wrapper around the output file name in the `index_bam` and `sort_bam` rules.
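A sketch of that change (the rule names `sort_bam` and `temp()` come from the text above, but the output file name here is illustrative; the actual rules in `hassl.py` will differ):

```
# before: BAM is marked temporary and deleted once downstream rules finish
rule sort_bam:
    output: temp("{sample}.sorted.bam")

# after: BAM is kept on disk
rule sort_bam:
    output: "{sample}.sorted.bam"
```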
After you run HASSL on a set of SRA accessions, collate all the Picard and counts output with `scripts/collate.pl accessions.txt`. This will create two files, `qc/collate.qc.tsv` and `counts/collate.counts.tsv`. You can then create QC plots by running `cd qc; Rscript ../scripts/qc_plots.R` (requires the ggplot2 and grid R packages). This will create a `qc_histogram.jpg` file with several graphs of important QC metrics.
To plot the count distributions of high-coverage genes, run `Rscript ../scripts/densityplots_genes.R` from the `counts` directory (requires the DESeq2 and ggplot2 R packages).
Dependencies:

- snakemake
- HISAT - follow the HISAT directions to compile it with SRA support
- featureCounts
- Picard
- [samtools rocks](https://github.com/dnanexus/samtools)
- samtools
- Other dependencies: gunzip, wget, perl