tags |
---|
ggg, ggg2022, ggg201b |
Due by 10pm, Tuesday March 1st
As per Lab homework #1, connect to GitHub classroom for HW #2 and clone this assignment's repository to farm.
Update the Snakefile to calculate at least three different subset assemblies, choosing estimated coverages between 2x and 60x. The default target for the Snakefile should start from a directory containing only the two read files (SRR2584857_1.fastq.gz and SRR2584857_2.fastq.gz) and the Snakefile, and compute all three assemblies plus their quast statistics and prokka annotations.
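To pick subset sizes for a target coverage, a rough calculation like the one below may help. It is only a sketch: the read length of 150 bp is an assumption (check your own reads), the function names are made up for illustration, and it assumes every read pair contributes two reads of that length. The 4.5 Mb genome size is the one used for the coverage estimate in this assignment.

```python
import math

GENOME_SIZE = 4_500_000  # bp; the 4.5 Mb genome size used in this assignment


def estimate_coverage(n_pairs, read_len, genome_size=GENOME_SIZE):
    """Estimated coverage = total sequenced bases / genome size."""
    return n_pairs * 2 * read_len / genome_size


def pairs_for_coverage(target_cov, read_len, genome_size=GENOME_SIZE):
    """Smallest number of read pairs that reaches the target coverage."""
    bases_needed = target_cov * genome_size
    return math.ceil(bases_needed / (2 * read_len))


def fastq_lines(n_pairs):
    """Lines to take from EACH read file (a FASTQ record is 4 lines)."""
    return 4 * n_pairs


if __name__ == "__main__":
    READ_LEN = 150  # assumption, not taken from the assignment
    for cov in (5, 20, 50):
        pairs = pairs_for_coverage(cov, READ_LEN)
        print(cov, pairs, fastq_lines(pairs), estimate_coverage(pairs, READ_LEN))
```

The line count from `fastq_lines` is the kind of number you could pass to `head -n` when subsetting each read file; taking the first reads of each file is just one way to build a subset.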
Get the Snakefile working, verify that it works in an empty directory (using --delete-all-output), and then commit and push as in homework #1.
Submit at least three different assembly entries to this form, which will ask for the following information:
- Your GitHub ID.
- The filename of one assembly produced by your Snakefile.
- The number of reads used in the assembly.
- Your estimate of the coverage (use 4.5 Mb as the genome size).
- The N50 of the assembly.
- The total bp in contigs > 1kb for the assembly.
- The total number of contigs > 1kb for the assembly.
- The total number of protein coding genes (records in the annotation .faa file) for the assembly.
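If you are unsure what the N50 and the "contigs > 1kb" numbers mean, the sketch below computes them from a plain list of contig lengths. It is an illustration of the usual definitions, not a replacement for quast: the function names are hypothetical, it treats "> 1kb" as "at least 1000 bp" (an assumption), and quast applies its own minimum contig length before computing N50, so its numbers may differ from this sketch.

```python
def n50(lengths):
    """Length of the contig at which the running total of the longest
    contigs first reaches at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs at all


def contig_summary(lengths, min_len=1000):
    """Count and total size of contigs of at least min_len bp (assumed cutoff)."""
    kept = [n for n in lengths if n >= min_len]
    return {"n_contigs": len(kept), "total_bp": sum(kept)}


if __name__ == "__main__":
    example = [2000, 1500, 1000, 500, 300]  # made-up contig lengths
    print(n50(example))
    print(contig_summary(example))
```

For the form, report the values from the quast output; the sketch is only meant to make clear what those values measure.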
Note that the total number of lines in a single gzipped FASTQ file (each read takes up four lines, so divide by 4 to get the number of reads) can be calculated like so:
gunzip -c FILENAME | wc -l
and the total number of records in a FASTA file (e.g. the .faa file output by prokka) can be calculated by doing
grep ^'>' FILENAME | wc -l
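If you would rather do the counting in Python (for example inside a script called from the Snakefile), something like the following should give the same numbers as the two shell commands above. It is a sketch under the assumption that the FASTQ file uses the standard four-lines-per-record layout; the function names are just for illustration.

```python
import gzip


def count_fastq_reads(path):
    """Number of reads in a gzipped FASTQ file (line count divided by 4)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_fasta_records(path):
    """Number of records in an uncompressed FASTA file (lines starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Both functions read the file line by line, so they should cope with large read files without loading everything into memory.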
Please do inspect the Snakefile on GitHub via the Web interface to make sure it has all your changes!
If you run into any trouble with submission, that's ok - just let me know.
You can reach me via e-mail at [email protected], or in the UC Davis Slack under @ctitusbrown.