Merge branch 'main' of github.com:kth-gt/cb2442
percolator committed Aug 15, 2023
2 parents 87bdfe9 + 23edeee commit 2babb07
Showing 11 changed files with 138,275 additions and 235 deletions.
106 changes: 37 additions & 69 deletions lab/b1/readme.md
@@ -1,8 +1,6 @@
# LAB B1: Gene finding, Blast and sequence alignment

## LAB PREPARATION

### The scenario
## The scenario

There is an outbreak of mysterious infectious diseases in your town. Doctors do not know
what is causing it, but they do know it is spreading fast. Patients come into the hospital
@@ -28,7 +26,7 @@ bases, called contigs. A bioinformatician at the sequencing centre has already d
work. It is now up to you and your colleagues to find out as much as possible about this
pathogen. The patients are counting on you!

### Preparation questions
## Preparation questions

We will start each lab discussing a few preparatory questions. You will not gain or lose points
from them, but you might be called to discuss them in front of your classmates, so be
@@ -39,148 +37,118 @@ prepared!
1. What does it mean to “align two sequences”? What is the goal?
1. What is a p-value? How is that different from an e-value (expectation value) in BLAST?

## Lab Instructions
## Instructions and questions

Select one of the files [`bacteria1.fasta`](bacteria1.fasta), [`bacteria2.fasta`](bacteria2.fasta), or [`bacteria3.fasta`](bacteria3.fasta), which
correspond to diseases 1, 2 and 3, respectively. Make a folder called bioinformatics in your
local computer account, download one of the fasta files and save it there.

### Questions
local computer account, download the fasta file and save it there.

#### Q1
**Q1** Which of the 3 unknown bacteria have you chosen to work with?

Which of the 3 unknown bacteria have you chosen to work with?
Open the fasta file in a text editor such as gedit. Alternatively, view it through the command
line.

#### Q2
**Q2** How is a fasta file organised? What information can be found in it? Is this a practical format? Why/why not?

How is a fasta file organised? What information can be found in it? Is this a practical
format? Why/why not?
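
When exploring how the file is laid out, it can help to read it programmatically. The following is a minimal sketch, assuming the usual convention that each record starts with a header line beginning with `>` followed by one or more sequence lines. The function name `read_fasta` and the example records are made up for illustration and are not part of the lab files.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example, not taken from the lab files.
example = """>contig_1 hypothetical example
ACGTACGT
ACGT
>contig_2
GGCC
""".splitlines()

records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

To try it on the file you downloaded, pass an open file handle instead of `example`, for instance `with open("bacteria1.fasta") as handle: records = list(read_fasta(handle))`.
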
To understand the metabolism and life-cycle of an unknown species based on its DNA
content, we have to study the functions of its genes. The first step for doing so is finding the
gene sequences within the genome. Fortunately, there are tools that can find the genes inside
a genome, based on certain sequence characteristics. Check the Bioinformatics Tools booklet
a genome, based on certain sequence characteristics. Check the [Bioinformatics Tools Booklet](../biotoolsbooklet.md)
and look for online tools for gene finding.

#### Q3
**Q3** Which tools did you find?

Which tools did you find?
Take a look at their websites. Feel free to explore them for a few minutes. Then, pick one tool
to use in this assignment. It’s important to have the nucleotide sequences as output.

#### Q4

Which tool did you choose? Why? Did you change any parameters from the default
**Q4** Which tool did you choose? Why? Did you change any parameters from the default
settings? Which, how, and why?

Make a fasta file of all the candidate genes you’ve found. Make sure to erase all comments or
other lines that don’t fit the fasta format.

#### Q5

How many genes did this tool find? Is this a good estimate for the number of genes in
**Q5** How many genes did this tool find? Is this a good estimate for the number of genes in
this organism? Why/why not?

Now download the total set of predicted genes. If this is not possible, select everything with
the mouse, paste it to a plain text document and save it.
Now that we have identified the genes (at least the most likely genes according to the gene
finder you employed), we can start studying their functions. One approach for doing this is to
compare these new sequences with sequences from better known organisms. A very popular
tool for doing this is Blast. Check the Bioinformatics Tools booklet on instructions in how to
compare these new sequences with annotated sequences from other organisms. A very popular
tool for doing this is Blast. Check the [Bioinformatics Tools Booklet](../biotoolsbooklet.md) for instructions on how to
use online Blast for nucleotides. Blast the first 5 genes you have found.

#### Q6

Which Blast variant have you chosen? Why? Did you change any parameters from the
**Q6** Which Blast variant have you chosen? Why? Did you change any parameters from the
default settings? Which, how, and why?

#### Q7

How many hits did you find for each gene? What do they correspond to? Does this
**Q7** How many hits did you find for each gene? What do they correspond to? Does this
make sense?
_Note: if there are too many hits, just describe the top ones!_

#### Q8

Considering how many genes you have found, is it practical to examine the function of
**Q8** Considering how many genes you have found, is it practical to examine the function of
each corresponding protein by online Blast?

Blast can also be run locally, through the command line interface. This allows whole genomes
or collections of genomes to be scanned very fast. However, it takes a bit more bioinformatics
expertise to go through the very large files that are produced, so we'll do this in a slightly
simplified way this time. Let's look for RNA-polymerases, that is, the enzymes that transcribe
DNA into RNA. They have already been downloaded from NCBI as described [here](https://www.youtube.com/watch?v=OC74-DpkWjE), using "Bacteria" and "RNA polymerase"
as keywords. You can retrieve this file directly as [`polymerases.fasta`](polymerases.fasta)
DNA into RNA. They have already been downloaded from NCBI as described [here](https://www.youtube.com/watch?v=OC74-DpkWjE), using "Bacteria" and "RNA polymerase" as keywords. You can retrieve this file directly as [`polymerases.fasta`](polymerases.fasta).
Now look into this fasta file. A lot of the sequences are described as “CDS”.

#### Q9

What does that mean? What is the difference between CDS, EST and ORF?
**Q9** What does that mean? What is the difference between CDS, EST and ORF?

#### Q10
**Q10** How many sequences are there in the fasta file? This can be quickly counted through
the command line.

How many sequences are there in the fasta file? This can be quickly counted through
the command line.
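
If you prefer a script over a shell one-liner, the same count can be sketched in Python. This assumes that every record has exactly one header line starting with `>`. The function name `count_fasta_records` and the example lines are hypothetical.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up example; with the lab file you would pass an open file handle instead.
example = [">seq1 CDS", "ATGAAA", ">seq2 CDS", "ATGCCC", "TTT", ">seq3", "ATG"]
n_records = count_fasta_records(example)
print(n_records)
```

Counting headers rather than sequence lines matters because one sequence is often wrapped over several lines.
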
The first step to running Blast through the command line is to prepare a database. You have
the necessary fasta files to compare, i.e., your unknown bacterium and the collection of
bacterial RNA-polymerases. One of these files is going to be your database, and the other
one contains all of your queries (the sequences to be identified).

#### Q11

Which of these files should be the database, and which one should be the query?
**Q11** Which of these files should be the database, and which one should be the query?
Why? What would happen if you did it the other way around?
Look into the Bioinformatics Tools booklet to see how to prepare a Blast database.

#### Q12

Which command did you run? Describe what each part of it does.
Now that the database is ready, it's time to run nucleotide Blast.
Look into the [Bioinformatics Tools Booklet](../biotoolsbooklet.md) to see how to prepare a Blast database.

#### Q13
**Q12** Which command did you run? Describe what each part of it does.

Which command did you run? Describe what each part of it does.
Now that the database is ready, it's time to run nucleotide Blast.

#### Q14
**Q13** Which command did you run? Describe what each part of it does.

From all the results you got, can you pick RNA polymerases within your bacterial
**Q14** From all the results you got, can you pick RNA polymerases within your bacterial
genome? Give its position within the genome, together with its e-value, length and bit-score.
_Hint: You might get several hits for the same region, but take into account that a good hit
should have more or less the same length as the sequence it is matching to!_
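
The hint above can be turned into a simple filter. The sketch below assumes that Blast was run with tabular output (`-outfmt 6`), whose default columns are query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value and bit score. The names `Hit`, `parse_hits` and `full_length_hits`, the 0.8 coverage threshold and the example rows are all made up for illustration. The `use_query` switch is there because which file acts as query depends on your answer to Q11.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    subject_start: int
    subject_end: int
    evalue: float
    bit_score: float


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated Blast hits (default 12-column tabular layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[8]), int(f[9]), float(f[10]), float(f[11])))
    return hits


def full_length_hits(hits: Iterable[Hit], reference_lengths: Dict[str, int],
                     min_coverage: float = 0.8, use_query: bool = True) -> List[Hit]:
    """Keep hits whose alignment covers most of the reference sequence.

    reference_lengths maps a sequence ID to its length; use_query selects
    whether that ID is looked up in the query or the subject column.
    """
    kept = []
    for hit in hits:
        ref_id = hit.query_id if use_query else hit.subject_id
        coverage = hit.alignment_length / reference_lengths[ref_id]
        if coverage >= min_coverage:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.evalue)  # best e-value first


# Made-up hits: one nearly full-length match and one short fragment.
rows = [
    ["polA", "contig_3", "92.5", "950", "40", "3", "1", "950", "12000", "12949", "0.0", "1500"],
    ["polA", "contig_7", "88.0", "120", "10", "0", "400", "519", "300", "419", "1e-20", "95.3"],
]
hits = parse_hits("\t".join(row) for row in rows)
kept = full_length_hits(hits, {"polA": 1000})
for hit in kept:
    print(hit.subject_id, hit.subject_start, hit.subject_end, hit.evalue, hit.bit_score)
```

The sequence lengths needed for `reference_lengths` are not part of the default tabular output, so they would have to come from the fasta file itself.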

Now download the file [`fewer_polymerases.fasta`](fewer_polymerases.fasta). This file contains only 50 of the
sequences from the polymerase database. Format a Blast database from this file, too, and
run Blast using it.

#### Q15

Which commands did you run? Are there any differences compared to what you did
**Q15** Which commands did you run? Are there any differences compared to what you did
before? Did you get the same hits as you had before?
Now look at the e-values and compare them to what you had before.

#### Q16
Now look at the e-values and compare them to what you had before.

Is there any change between the e-values you had before and what you got now? How
**Q16** Is there any change between the e-values you had before and what you got now? How
do you explain that?
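
One way to explore this question numerically is the textbook relation between bit score and e-value, E = m · n · 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. The sketch below is a simplification: real Blast uses effective lengths with edge corrections, so the numbers will not match its output exactly. The function name `expected_hits` and all values are made up.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate e-value from a bit score, ignoring effective-length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up numbers: the same hit scored against two databases of different size.
bit_score = 50.0
query_length = 1000
e_large = expected_hits(bit_score, query_length, database_length=1_000_000)
e_small = expected_hits(bit_score, query_length, database_length=500_000)
print(e_large, e_small, e_large / e_small)
```

Try varying `database_length` while keeping the bit score fixed, and compare the trend with what you observe in your own results.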

Not all polymerases are the same. Let's compare a few of the ones in the database. You can
find all of them in the smaller dataset. For this, you can go back to using online Blast.

For the following sequences, justify your answer in your own words, but also include a dot plot and
at least part of the sequence alignment.
Compare the first two sequences in the file, `gi|328835344` and `gi|328835342`.
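
To see what a dot plot encodes, here is a minimal text-based sketch. It marks position (i, j) with `*` when the word of length `window` starting at i in the first sequence is identical to the word starting at j in the second. This is only an illustration under that exact-match assumption; the dot plots produced by online tools may use different scoring and windowing. The function name `dot_plot` and the example sequence are made up.

```python
from typing import List


def dot_plot(seq_a: str, seq_b: str, window: int = 3) -> List[str]:
    """Return the rows of a text dot plot using exact word matches."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        word_a = seq_a[i:i + window]
        row = "".join(
            "*" if seq_b[j:j + window] == word_a else "."
            for j in range(len(seq_b) - window + 1)
        )
        rows.append(row)
    return rows


# Made-up example: a short sequence compared with itself gives a clean diagonal.
plot = dot_plot("ACGTAC", "ACGTAC", window=3)
print("\n".join(plot))
```

Similar sequences show up as long diagonals, while insertions, deletions and rearrangements break or shift them.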

#### Q17
**Q17** Which type of Blast did you run? Why?

Which type of Blast did you run? Why?
**Q18** How similar are these sequences? Looking at the sequence headers, is this expected?

#### Q18

How similar are these sequences? Looking at the sequence headers, is this expected?
Now compare the first sequence, `gi|328835344`, with the one that has ID `gi|619834969`.

#### Q19
**Q19** How similar are these sequences? Looking at the sequence headers, is this expected?

How similar are these sequences? Looking at the sequence headers, is this expected?
For the final test, compare the first sequence to the one with ID `gi|627755269`.

#### Q20

Considering what you learned from statistics, but also your biological
**Q20** Considering what you learned from statistics, but also your biological
knowledge, is this match relevant? Why/why not?