Skip to content

Latest commit

 

History

History
249 lines (178 loc) · 8.53 KB

lab-6.md

File metadata and controls

249 lines (178 loc) · 8.53 KB
tags
ggg, ggg2022, ggg201b

Lab 6 outline - more assembly and annotation - GGG 201(b) lab, Feb 11, 2022

hackmd-github-sync-badge

(Permanent link on github)

Contents:

[toc]

Breakout sessions

I'm going to send you off into 5 breakout rooms (randomly assigned) for ~20 minutes.

In these breakout rooms, please do the following:

  • try to assemble one or more sentences from the shotgun sequencing data present in the two PDFs posted to the lab 6 announcement in canvas. (~10-15 min)
  • choose a spokesperson to report answers to the following questions: (5 minutes)
    • what strategy or strategies did you use to assemble the data?
    • if you had an unlimited supply of undergrads to throw at assembly, what would you tell them to do? what would be the most efficient use of your resources?
    • what kinds of things do you think your assembly strategy would have a hard time recovering from data?
    • what kind of additional data would you like in order to build your assembly?
    • what made/makes assembling english text easier than assembling DNA sequence?

Discussion of assembly strategies

Notes on computational resources

Variant calling is disk and CPU intensive, and the mapping / pileup phase can be split across multiple computers.

Assembly is memory intensive (and sometimes CPU intensive). It is much harder to split across multiple computers.

Back to the command line!

Today we're going to explore the results of our assembly from last week using two common "swiss army knife" tools, BLAST and bedtools. BLAST is used for sequence search, and bedtools is used for genome interval arithmetic.

Onwards!

Log into farm

Log into farm, request a node, and change into week 5's result directory:

srun --nodes=1 -p high2 -t 2:00:00 -c 4 --mem 6GB --pty /bin/bash

and then

cd ~/ggg201-week5/

If you didn't finish running prokka last week, you might want to copy lab5 results over from my directory: ::::spoiler

mkdir -p ~/ggg201-week5/
cd ~/ggg201-week5/
unzip -o ~ctbrown/transfer/2022-ggg-201-week5.zip

::::

Install some software

::::info Make sure you've already installed the software from lab 5 into the assembly environment. ::::

Now let's install bedtools and NCBI BLAST:

conda activate assembly
mamba install -y bedtools blast

(mamba should actually say they've already been installed, because they were installed by prokka).

We're going to start by doing a gene search in the assembly, which is a pretty common way to explore and validate and use assembled sequences.

There are a bunch of ways to search for genes and we'll try a few of them -

  • search for a gene by name in annotations
  • search for a gene by sequence in annotations
  • search for a gene by sequence in the genome contigs

Searching for the CRP gene in the annotation with grep

You can use the 'grep' search tool to look for genes with a specific string in their name in an annotation; here, we're searching for the name 'crp', for cAMP response protein:

grep -i crp SRR2584857_annot/SRR2584857_annot.faa

This is, however, dependent on having an good set of annotation names. Can we search for it by sequence? Yes!

Getting the CRP gene sequence from Salmonella

First, we'll need a query gene. We'll go with CRP.

Let's grab the CRP gene sequence from uniprot - I found this by googling 'salmonella crp'.

Go to https://www.uniprot.org/uniprot/P0A2T6, then click on Format at the top, then select 'FASTA'. You'll see some protein sequence:

>sp|P0A2T6|CRP_SALTY cAMP-activated global transcriptional regulator CRP OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) OX=99287 GN=crp PE=1 SV=1
MVLGKPQTDPTLEWFLSHCHIHKYPSKSTLIHQGEKAETLYYIVKGSVAVLIKDEEGKEM
ILSYLNQGDFIGELGLFEEGQERSAWVRAKTACEVAEISYKKFRQLIQVNPDILMRLSSQ
MARRLQVTSEKVGNLAFLDVTGRIAQTLLNLAKQPDAMTHPDGMQIKITRQEIGQIVGCS
RETVGRILKMLEDQNLISAHGKTIVVYGTR

Let's put that in a file crp.faa: copy the sequence from the Web page, then run

nano crp.faa

to create the file, and then paste the sequence in, and use CTRL-X, y, ENTER to save.

Now:

cat crp.faa

should show the same sequence as above.

Searching for the CRP gene in the assembly with BLAST

Now, let's search for the gene in the assembled contigs.

First, build a BLAST database from your genome:

makeblastdb -dbtype nucl -in SRR2584857-assembly.fa 

Now do a search with a protein query against a nucleotide database:

tblastn -query crp.faa -db SRR2584857-assembly.fa

Searching for the CRP gene in the annotation with BLAST

We an also search for the CRP gene in the annotation sequence files; this is something you might want to do if you have a candidate gene for something that is unlikely to be "properly" named.

Build a BLAST database for the output protein files from the prokka annotation:

makeblastdb -dbtype prot -in SRR2584857_annot/SRR2584857_annot.faa

and search a protein database with a protein query:

blastp -query crp.faa -db SRR2584857_annot/SRR2584857_annot.faa

playing with e-value cutoffs

You'll note a lot of ~spurious matches above. This is because BLAST by default uses very sensitive but not very specific reporting parameters. What happens if we make the reporting parameters more stringent? (This is similar to what we did with the VCF files.)

tblastn -query crp.faa -db SRR2584857-assembly.fa -evalue 1e-6

Using bedtools to get gene sequences out

Let's get the GFF entry for the CRP gene:

grep CRP SRR2584857_annot/SRR2584857_annot.gff 

If we put that information into a gene feature format file (GFF):

grep CRP SRR2584857_annot/SRR2584857_annot.gff > crp.gff

we can then use bedtools to get the gene sequence out from the assembly:

bedtools getfasta -fi SRR2584857-assembly.fa -bed crp.gff

Finding nearby features with bedtools

Suppose we have a file that specifies some features of interest - this might be variants, or something else. (Yes, I should have shown you this back in the variants labs :).

Here I've built a file in BED Format that specifies a location close to the CRP gene:

k119_74 15192   15193   position +

Let's pretend that we don't know what genes are close to this position, and see if we can get bedtools to tell us.

Per this issue, we need to remove the trailing FASTA sequences in the GFF file:

sed '/^##FASTA$/,$d' SRR2584857_annot/SRR2584857_annot.gff > just-features.gff

and then sort it:

bedtools sort -i just-features.gff  > sorted-features.gff

Now we can use the BED file I created above:

cp ~ctbrown/transfer/crp-position.bed .

to run a query:

bedtools closest -a crp-position.bed -b sorted-features.gff  

and you should see the same ol' CRP gene. yay!

This is a common kind of workflow to run through - generate various kinds of intermediate files to get ever closer to a question you actually want to answer.

Conclusion: working with contigs and annotations

Hopefully that gives you some idea of the things you can do with assembled sequences/contigs and their annotations - search them by text string sequence, and manipulate them by annotated features.

Now let's get back to my favorite topic - automation with snakemake!

Building an assembly Snakefile

OK, now, finally, let's start building a snakefile to automate the assembly and annotation stages.

Run nano -ET4 Snakefile and copy/paste in the stuff below:

rule assemble:
    shell: """
        megahit -1 SRR2584857_1.fastq.gz -2 SRR2584857_2.fastq.gz -f -m 5e9 -t 4 -o SRR2584857_assembly
    """
    
rule copy_contigs:
    shell: """
        cp SRR2584857_assembly/final.contigs.fa SRR2584857-assembly.fa
    """

rule quast:
    shell: """
        quast SRR2584857-assembly.fa
    """
    
rule annotate:
    shell: """
        prokka --prefix SRR2584857_annot SRR2584857-assembly.fa
    """

Stepwise approach to building a snakefile:

  • start with the shell commands
  • add input/output to link the rules
  • specify a default rule
  • replace filenames in shell blocks with {input} and {output} template variables
  • start thinking about wildcards

and save and check snakefile formatting with snakemake -n periodically!