tags |
---|
ggg, ggg2022, ggg201b |
Contents:
[toc]
I'm going to send you off into 5 breakout rooms (randomly assigned) for ~20 minutes.
In these breakout rooms, please do the following:
- try to assemble one or more sentences from the shotgun sequencing data present in the two PDFs posted to the lab 6 announcement in canvas. (~10-15 min)
- choose a spokesperson to report answers to the following questions: (5 minutes)
- what strategy or strategies did you use to assemble the data?
- if you had an unlimited supply of undergrads to throw at assembly, what would you tell them to do? what would be the most efficient use of your resources?
- what kinds of things do you think your assembly strategy would have a hard time recovering from data?
- what kind of additional data would you like in order to build your assembly?
- what made/makes assembling english text easier than assembling DNA sequence?
Variant calling is disk and CPU intensive, and the mapping / pileup phase can be split across multiple computers.
Assembly is memory intensive (and sometimes CPU intensive). It is much harder to split across multiple computers.
Today we're going to explore the results of our assembly from last week using two common "swiss army knife" tools, BLAST and bedtools. BLAST is used for sequence search, and bedtools is used for genome interval arithmetic.
Onwards!
Log into farm, request a node, and change into week 5's result directory:
srun --nodes=1 -p high2 -t 2:00:00 -c 4 --mem 6GB --pty /bin/bash
and then
cd ~/ggg201-week5/
If you didn't finish running prokka last week, you might want to copy lab5 results over from my directory: ::::spoiler
mkdir -p ~/ggg201-week5/
cd ~/ggg201-week5/
unzip -o ~ctbrown/transfer/2022-ggg-201-week5.zip
::::
::::info Make sure you've already installed the software from lab 5 into the assembly environment. ::::
Now let's install bedtools and NCBI BLAST:
conda activate assembly
mamba install -y bedtools blast
(mamba should actually say they've already been installed, because they were installed by prokka).
We're going to start by doing a gene search in the assembly, which is a pretty common way to explore and validate and use assembled sequences.
There are a bunch of ways to search for genes and we'll try a few of them -
- search for a gene by name in annotations
- search for a gene by sequence in annotations
- search for a gene by sequence in the genome contigs
You can use the 'grep' search tool to look for genes with a specific string in their name in an annotation; here, we're searching for the name 'crp', for cAMP response protein:
grep -i crp SRR2584857_annot/SRR2584857_annot.faa
This is, however, dependent on having an good set of annotation names. Can we search for it by sequence? Yes!
First, we'll need a query gene. We'll go with CRP.
Let's grab the CRP gene sequence from uniprot - I found this by googling 'salmonella crp'.
Go to https://www.uniprot.org/uniprot/P0A2T6, then click on Format at the top, then select 'FASTA'. You'll see some protein sequence:
>sp|P0A2T6|CRP_SALTY cAMP-activated global transcriptional regulator CRP OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) OX=99287 GN=crp PE=1 SV=1
MVLGKPQTDPTLEWFLSHCHIHKYPSKSTLIHQGEKAETLYYIVKGSVAVLIKDEEGKEM
ILSYLNQGDFIGELGLFEEGQERSAWVRAKTACEVAEISYKKFRQLIQVNPDILMRLSSQ
MARRLQVTSEKVGNLAFLDVTGRIAQTLLNLAKQPDAMTHPDGMQIKITRQEIGQIVGCS
RETVGRILKMLEDQNLISAHGKTIVVYGTR
Let's put that in a file crp.faa
: copy the sequence from the Web page, then run
nano crp.faa
to create the file, and then paste the sequence in, and use CTRL-X, y, ENTER to save.
Now:
cat crp.faa
should show the same sequence as above.
Now, let's search for the gene in the assembled contigs.
First, build a BLAST database from your genome:
makeblastdb -dbtype nucl -in SRR2584857-assembly.fa
Now do a search with a protein query against a nucleotide database:
tblastn -query crp.faa -db SRR2584857-assembly.fa
We an also search for the CRP gene in the annotation sequence files; this is something you might want to do if you have a candidate gene for something that is unlikely to be "properly" named.
Build a BLAST database for the output protein files from the prokka annotation:
makeblastdb -dbtype prot -in SRR2584857_annot/SRR2584857_annot.faa
and search a protein database with a protein query:
blastp -query crp.faa -db SRR2584857_annot/SRR2584857_annot.faa
You'll note a lot of ~spurious matches above. This is because BLAST by default uses very sensitive but not very specific reporting parameters. What happens if we make the reporting parameters more stringent? (This is similar to what we did with the VCF files.)
tblastn -query crp.faa -db SRR2584857-assembly.fa -evalue 1e-6
Let's get the GFF entry for the CRP gene:
grep CRP SRR2584857_annot/SRR2584857_annot.gff
If we put that information into a gene feature format file (GFF):
grep CRP SRR2584857_annot/SRR2584857_annot.gff > crp.gff
we can then use bedtools to get the gene sequence out from the assembly:
bedtools getfasta -fi SRR2584857-assembly.fa -bed crp.gff
Suppose we have a file that specifies some features of interest - this might be variants, or something else. (Yes, I should have shown you this back in the variants labs :).
Here I've built a file in BED Format that specifies a location close to the CRP gene:
k119_74 15192 15193 position +
Let's pretend that we don't know what genes are close to this position, and see if we can get bedtools to tell us.
Per this issue, we need to remove the trailing FASTA sequences in the GFF file:
sed '/^##FASTA$/,$d' SRR2584857_annot/SRR2584857_annot.gff > just-features.gff
and then sort it:
bedtools sort -i just-features.gff > sorted-features.gff
Now we can use the BED file I created above:
cp ~ctbrown/transfer/crp-position.bed .
to run a query:
bedtools closest -a crp-position.bed -b sorted-features.gff
and you should see the same ol' CRP gene. yay!
This is a common kind of workflow to run through - generate various kinds of intermediate files to get ever closer to a question you actually want to answer.
Hopefully that gives you some idea of the things you can do with assembled sequences/contigs and their annotations - search them by text string sequence, and manipulate them by annotated features.
Now let's get back to my favorite topic - automation with snakemake!
OK, now, finally, let's start building a snakefile to automate the assembly and annotation stages.
Run nano -ET4 Snakefile
and copy/paste in the stuff below:
rule assemble:
shell: """
megahit -1 SRR2584857_1.fastq.gz -2 SRR2584857_2.fastq.gz -f -m 5e9 -t 4 -o SRR2584857_assembly
"""
rule copy_contigs:
shell: """
cp SRR2584857_assembly/final.contigs.fa SRR2584857-assembly.fa
"""
rule quast:
shell: """
quast SRR2584857-assembly.fa
"""
rule annotate:
shell: """
prokka --prefix SRR2584857_annot SRR2584857-assembly.fa
"""
Stepwise approach to building a snakefile:
- start with the shell commands
- add input/output to link the rules
- specify a default rule
- replace filenames in shell blocks with
{input}
and{output}
template variables - start thinking about wildcards
and save and check snakefile formatting with snakemake -n
periodically!