This pipeline consists of several separate slurm/perl/python scripts, and we probably want to clean it up at some point.
For this version, you will need a file DEFINITIONS.sh
that defines some variables that the scripts use. Notably, the $FILEEND
variable is used to remove the .fastq.gz
and other parts of the filename so that you end up with meaningful names. We use this for the R1 filename only.
For example, you might want one of these:
FILEEND=_S34_R1.fastq.gz
FILEEND=_S34_L001_R1.fastq.gz
FILEEND=_R1.fastq.gz
Currently, we read these variables from DEFINITIONS.sh
:
export FILEEND=_L001_R1_001.fastq.gz
export FAFILEEND=_L001_R1_001.fasta.gz
export SOURCE=fastq
export HOSTREMOVED=fastq_fastp
- Set the source of the scripts
Clone the atavide lite repository to your home directory, and make the compiled code
git clone https://github.com/linsalrob/atavide_lite.git
cd atavide_lite/bin
make
SRC=~/atavide_lite/slurm
0b. Set up the conda environment
mamba env create -f ~/atavide_lite/atavide_lite.yaml
mamba env create -f ~/atavide_lite/atavide_lite_vamb.yaml
0c. Create some directories for the slurm output
mkdir -p slurm_output/host_slurm slurm_output/megahit_slurm slurm_output/mmseqs_slurm slurm_output/vamb_slurm
- Create a list of R1 files.
Several of the steps will use the file R1_reads.txt
to know which reads we have. Of course, snakemake
doesn't need this, but at the moment I'm being very conservative and running each command separately via slurm
. In addition, this allows me to use $BGFS
which I don't know how to use with snakemake
!
find fastq -name \*R1\* -printf "%f\n" > R1_reads.txt
NUM_R1_READS=$(wc -l R1_reads.txt | cut -f 1 -d ' ')
echo There are $NUM_R1_READS R1 readsA
if [[ $(find fastq -name \*R2\* | awk 's++{} END {print s}') != $NUM_R1_READS ]]; then echo "There are a different number of R1 and R2 reads"; fi
- Quality control of the sequences
JOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 $SRC/fastp.slurm)
- HOST REMOVAL (optional)
If you don't want to do host removal just set HOSTREMOVED=fastq_fastp
in DEFINITIONS.sh
HOSTJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$JOB $SRC/host_removal.slurm)
- Convert to fasta for mmseqs
FAJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB $SRC/fastq2fasta.slurm)
- Run mmseqs taxonomy
MMSEQSJOB=$(sbatch --parsable --dependency=afterok:$FAJOB $SRC/mmseqs_easy_taxonomy_submit.slurm)
4a. Create a taxonomy table
sbatch --dependency=afterok:$MMSEQSJOB $SRC/mmseqs_taxonomy.slurm
4b. Add the subsystems to the taxonomy
SSJOB=$(sbatch --parsable --dependency=afterok:$MMSEQSJOB --array=1-$NUM_R1_READS:1 $SRC/mmseqs_add_subsystems.slurm)
sbatch --dependency=afterok:$SSJOB $SRC/count_subsystems.slurm
- Run megahit
MEGAHITJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB --array=1-$NUM_R1_READS:1 $SRC/megahit.slurm
- Combine assemblies for VAMB
Note: this takes a few (<10) minutes, and so I run it on the short queue
VCJOB=$(sbatch --parsable --dependency=afterok:$MEGAHITJOB $SRC/vamb_concat.slurm)
- Map vamb reads.
VMJOB=$(sbatch --parsable --dependency=afterok:$VCJOB --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap.slurm)
- Run VAMB
sbatch --dependency=afterok:$VMJOB $SRC/vamb.slurm
There comes a time when we want to run vamb on a subset of reads, and so we have a grouped
version of vamb.
We use a simple .tsv
file to designate the groups, and then we run each of the components of vamb on that group.
For example, if we have a .tsv
file with this data:
--- | --- Host | Sample Angelshark | GSV317_ Angelshark | GSV326_ Angelshark | GSV329_ Angelshark | GSV342_ Eagle_Ray | ER128RecapASuck_ Eagle_Ray | ER151Suck_ Eagle_Ray | ER247Suck_
We will separate GSV317, GSV326, GSV329, GSV342 into AngelShark and ER128RecapASuck, ER151Suck, ER247Suck into Eagle_Ray directories, and then vamb them separately.
The main trick here is to make it so that the sample names only give 0 or 1 results when grepping through R1_reads.txt which is why I append the _
to these names.
sbatch $SRC/vamb_concat_groups.slurm groups_sample.tsv
sbatch --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap_group.slurm
sbatch $SRC/vamb_group.slurm
All three lines as one go:
VCONCAT=$(sbatch --parsable $SRC/vamb_concat_groups.slurm groups_sample.tsv)
VMAP=$(sbatch --parsable --dependency=afterok:$VCONCAT --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap_group.slurm)
sbatch --parsable --dependency=afterok:$VMAP $SRC/vamb_group.slurm
mkdir -p slurm_output/host_slurm slurm_output/megahit_slurm slurm_output/mmseqs_slurm slurm_output/vamb_slurm slurm_output/fastp_slurm
find fastq -name \*R1\* -printf "%f\n" > R1_reads.txt
export NUM_R1_READS=$(wc -l R1_reads.txt | cut -f 1 -d ' ')
SRC=~/atavide_lite/slurm
JOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 $SRC/fastp.slurm)
HOSTJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$JOB $SRC/host_removal.slurm)
FAJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB $SRC/fastq2fasta.slurm)
MMSEQSJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$FAJOB $SRC/mmseqs_easy_taxonomy.slurm)
sbatch --dependency=afterok:$MMSEQSJOB $SRC/mmseqs_taxonomy.slurm
SSJOB=$(sbatch --parsable --dependency=afterok:$MMSEQSJOB --array=1-$NUM_R1_READS:1 $SRC/mmseqs_add_subsystems.slurm)
sbatch --dependency=afterok:$SSJOB $SRC/count_subsystems.slurm
MEGAHITJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB --array=1-$NUM_R1_READS:1 $SRC/megahit.slurm)
VCJOB=$(sbatch --parsable --dependency=afterok:$MEGAHITJOB $SRC/vamb_concat.slurm)
VMJOB=$(sbatch --parsable --dependency=afterok:$VCJOB --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap.slurm)
sbatch --dependency=afterany:$VMJOB $SRC/vamb.slurm