tags
ggg, ggg2023, ggg201b

GGG 201(b), Lab Homework 2 - 2023

(Permanent link on github)

Due by 10pm, Wed Mar 22nd.

I'll be available on e-mail before Wed, but not available for meetings.
Please make sure to work on this in advance so we can resolve technical problems quickly!
I will happily "pre-grade" things that are ready to be reviewed.

Please note that you may freely work together with others, but you must hand it in separately.

Goals

This homework extends Lab 7 and Lab 8 to do a bit more with assembly.

The goal of this homework is to examine and evaluate at what coverage genome assembly collapses. To this end, you will calculate assemblies for several different subset sizes of reads using a Snakefile, and report the result to a central form.

Warning: the computation here will take about 20-30 minutes to run once you have the Snakefile all set up.

1. Connect to GitHub Classroom and clone the hw2 repository

As per Lab homework #1, connect to GitHub classroom for HW #2 and clone this assignment's repository to farm.

You'll also need to copy in the data files as per lab 8:

cp ~ctbrown/data/ggg201b/SRR2584857_*.fastq.gz .

2. Edit the Snakefile and save to GitHub

Update the Snakefile to calculate at least three different subset assemblies, choosing estimated coverages between 2x and 60x. The default target for the Snakefile should start from a directory containing only the two read files (SRR2584857_1.fastq.gz and SRR2584857_2.fastq.gz) and the Snakefile, and compute all three assemblies plus their quast statistics and prokka annotations.

Get the Snakefile working, verify that it works in an empty directory (using --delete-all-output), and then commit and push, as in homework #1.

(You do not need to download the read files in the Snakefile - assume they are in the directory already.)

3. Submit assembly results to this Google form

Submit at least three different assembly entries to this form, which will ask for the following information:

Your GitHub ID.
The filename of one assemby produced by your Snakefile.
The number of reads used in the assembly.
Your estimate of the coverage (use 4.5 Mb as the genome size).
The N50 of the assembly.
The total bp in contigs > 1kb for the assembly.
The total number of contigs > 1kb for the assembly.
The total number of protein coding genes (records in the annotation .faa file) for the assembly.

Note that the total number of lines in a single gzipped FASTQ file can be calculated like so:

gunzip -c FILENAME | wc -l

and the total number of records in a FASTA file (e.g. the .faa file output by prokka) can be calculated by doing

grep ^'>' FILENAME | wc -l

You can inspect the records in a FASTA file with

gunzip -c FILENAME | head

which will let you calculate the sequence length - note that all the sequences in the FASTQ file are the same length.

4. Relax in knowledge of a job well done.

Please do inspect the Snakefile on github via the Web interface to make sure it has all your changes!

You must turn this in via GitHub, and I will primarily be looking at GitHub for grading.

Questions? Problems?

If you run into any trouble with submission, that's ok - just let me know.

You can reach me via e-mail at ctbrown@ucdavis.edu, or in UC Davis slack under @ctitusbrown.

Appendix - snakemake details

If you're interested in understanding some of what's going on in the Snakefile, here is some reading!

wildcards and expand

General background reading on what we're doing with wildcards in this Snakefile: Wildcards are determined by the desired output.

"|| true"

The || true in the subset_reads rule is explained in this draft chapter: Never fail me - how to make shell commands always succeed.

`{subset}` - how it works and what it's used for

In the subset_reads rule, there's some tricky stuff with {subset} being used to specify the number of lines being extracted with head. See Using wildcards to determine parameters to use in the shell block for an explanation of this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hw-2.md

hw-2.md

GGG 201(b), Lab Homework 2 - 2023

Goals

1. Connect to GitHub Classroom and clone the hw2 repository

2. Edit the Snakefile and save to GitHub

3. Submit assembly results to this Google form

4. Relax in knowledge of a job well done.

Questions? Problems?

Appendix - snakemake details

wildcards and expand

"|| true"

`{subset}` - how it works and what it's used for

Files

hw-2.md

Latest commit

History

hw-2.md

File metadata and controls

GGG 201(b), Lab Homework 2 - 2023

Goals

1. Connect to GitHub Classroom and clone the hw2 repository

2. Edit the Snakefile and save to GitHub

3. Submit assembly results to this Google form

4. Relax in knowledge of a job well done.

Questions? Problems?

Appendix - snakemake details

wildcards and expand

"|| true"

{subset} - how it works and what it's used for

`{subset}` - how it works and what it's used for