HASSL is a snakemake pipeline that takes RNA-Seq SRA run accession numbers and outputs both sorted BAM alignments and raw read counts for each run, quickly and uniformly.
```
snakemake -s hassl.py resources -j
echo SRR1200675 > accessions.txt
snakemake -s hassl.py -j --config ACCESSION_FILE='accessions.txt'
```
This will produce the following output files:

- `SRR1200675.featureCounts.counts` - featureCounts raw count file
- `SRR1200675.hisat.crsm` - Picard CollectRnaSeqMetrics output file
- `SRR1200675.pass` - created when, based on the Picard output, the BAM file passed the QC cutoffs
- various log files found in `log/`
HASSL includes a handy `resources` tool to gather references and build the HISAT index for you. This will put the GRCh38 reference and annotation files in a reference directory (default is `/mnt/resources`) and build the HISAT index there as well. Make sure you have write access to this directory; if you want to change it, set the `REFERENCE_DIR` variable in `hassl.py` to a directory that you have write access to.
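As a sketch, the relevant line in `hassl.py` might look like this (the exact placement in the file may differ; `REFERENCE_DIR` is the variable named above):

```python
# in hassl.py: directory where 'snakemake -s hassl.py resources' downloads
# the GRCh38 reference/annotation files and builds the HISAT index
REFERENCE_DIR = "/mnt/resources"  # change to a directory you can write to
```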
In order to set up your environment, you'll need to install the following programs and then edit the `HISAT_BUILD` variable in `hassl.py` so HASSL knows where to find the HISAT build executable:

- snakemake
- hisat - follow the HISAT directions to compile it with SRA support
- Other dependencies: gunzip, wget

Then run HASSL to set up your environment: `snakemake -s hassl.py resources -j`
Be aware of the computational resources required to get these files: unpacking the reference FASTA file will take more than 30 GB of disk space, and indexing will need at least 10 GB of RAM.
Before running HASSL, you will need to install the following programs:
- featureCounts
- Picard
- [samtools rocks](https://github.com/dnanexus/samtools)
- samtools
Edit `hassl.py` to reflect the location of the HASSL install directory (`HASSL`), your reference files, and the executables that HASSL will use (see the EXECUTABLE LOCATIONS section). The default location under the `/mnt` directory is where `snakemake -s hassl.py resources` will try to put them. Also, you may want to adjust the thread count (`THREADS`) to be equal to or less than the number of threads on your computer.
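For illustration, the variables mentioned above might be set like this (the paths are placeholders, not defaults; check your copy of `hassl.py` for the full EXECUTABLE LOCATIONS section):

```python
# in hassl.py -- paths below are illustrative placeholders, not real defaults
HASSL = "/home/me/HASSL"                     # HASSL install directory
HISAT_BUILD = "/usr/local/bin/hisat-build"   # HISAT index builder, with SRA support
THREADS = 8   # set equal to or less than the CPU threads on your machine
```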
Isolate the SRA run_accession IDs you want to run and put them in a file, one per line. Do not leave any lines blank. The pipeline looks for this input file as `accessions.txt` by default.
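If you are assembling the accession list programmatically, a minimal sketch (this snippet is illustrative, not part of HASSL):

```python
# Write SRA run accessions one per line, with no blank lines.
# SRR1200675 is the run used in the quick-start example above.
runs = ["SRR1200675"]

with open("accessions.txt", "w") as fh:
    fh.write("\n".join(runs) + "\n")
```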
Run it!

```
snakemake -s hassl.py -j
```
You'll probably want to be on a fairly large machine for this (e.g., 16 CPUs). Refer to the snakemake documentation on how to (trivially!) run snakemake efficiently in a cluster environment that requires job submission.
HASSL automatically deletes the BAM files to save space. If you want to keep all your bam and bam.bai files, edit `hassl.py` and remove the `temp()` wrapper around the output file name in the `index_bam` and `sort_bam` rules.
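A sketch of that change (the rule names `sort_bam` and `temp()` come from the text above, but the output file name here is illustrative; the actual rules in `hassl.py` will differ):

```
# before: BAM is marked temporary and deleted once downstream rules finish
rule sort_bam:
    output: temp("{sample}.sorted.bam")

# after: BAM is kept on disk
rule sort_bam:
    output: "{sample}.sorted.bam"
```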
After you run HASSL on a set of SRA accessions, collate all the Picard and counts output with `scripts/collate.pl accessions.txt`. This will create two files, `qc/collate.qc.tsv` and `counts/collate.counts.tsv`. You can then create QC plots by running `cd qc; Rscript ../scripts/qc_plots.R` (requires the ggplot2 and grid R packages). This will create a `qc_histogram.jpg` file with several graphs of important QC metrics.
To plot the count distributions of high-coverage genes, run `Rscript ../scripts/densityplots_genes.R` from the `counts` directory (requires the DESeq2 and ggplot2 R packages).
Dependencies:

- snakemake
- HISAT - follow the HISAT directions to compile it with SRA support
- featureCounts
- Picard
- [samtools rocks](https://github.com/dnanexus/samtools)
- samtools
- Other dependencies: gunzip, wget, perl