Merge branch 'main' of https://github.com/kth-gt/cb2442 into main

kth-gt · Oct 3, 2024 · d98bf02 · d98bf02
2 parents 1aa0aee + 399b587
commit d98bf02
Show file tree

Hide file tree

Showing 3 changed files with 54 additions and 4 deletions.
diff --git a/lab/b3/readme.md b/lab/b3/readme.md
@@ -53,7 +53,7 @@ We could go directly to the UniProtKB website and run a Blast search there, but
 
 Run a Blast protein search (blastp) on the antibiotic resistance genes against the “UniProtKB/SwissProt” database for each protein serparately. Use the accession number (only the first part, without the version number (XXXX not XXXXX.1) or if you don't find it, the name and species of your top hit from blast protein and search for it in the UniProtKB website (uniprot.org) to get more information on the hit. **Hint: look under "Family & Domains"
 
-**Q7** Which domain is present in multiple (2 or more) proteins? How does this domain do its job?
+**Q7** Which domain is present in multiple (2 or more) proteins? How does this domain do its job? Hint: consider common protein domains and check for acronyms. Examples are SH2 (Src Homology), SH3, HTH (Helix-Turn-Helix), PH (Pleckstrin Homology), PAS domain, PDZ domain.
 
 The fact that we observe this domain already multiple times in such a small set of proteins shows the power of using recurring protein domains to infer the functionality of (parts of) proteins.
 
@@ -67,7 +67,7 @@ We will use PSI-BLAST (a type of protein blast, https://www.ebi.ac.uk/jdispatche
 
 **Q10** What is the best match and what is its E-value?
 
-Now keep running iterations until the algorithm has converged, i.e. until very few (< 2) new hits highlighted in yellow show up in the top 50.
+Now keep running iterations until the algorithm has converged, i.e. until very few (< 2) new hits show up in the top 50. Hint: search for "next iteration", and make sure you iterate on a new job (search for "view jobs"). Each job has its unique ID.
 
 **Q11** What is the best match now and what is its E-value? Why has it changed?
 

diff --git a/lab/b4/readme.md b/lab/b4/readme.md
@@ -58,7 +58,7 @@ Open the Cuffdiff program. Go to condition and put "Untreated" and "Treated" as
 
 [^1]: Isoforms arise from alternative splicing sites in eukaryote genomes. This means that the same gene can give rise to more than one transcript and, if it is protein coding, give rise to more than one protein [(see the wikipedia entry)](http://en.wikipedia.org/wiki/Gene_isoform).
 
-**Q8** What is the length of the longest and shortest transcripts in this dataset and what are the names of the transcripts? (The Filter and Sort programs can be helpful here)
+**Q8** What is the length of the longest and shortest transcripts in this dataset and what are the names of the transcripts? (The sorting tools, found in the Text Manipulation section or by using the search field, can be helpful here)
 
 **Q9** What is miRNA?
 

diff --git a/prog/p4/readme.md b/prog/p4/readme.md
@@ -1,6 +1,6 @@
 # Programming Lab P4 - Making a tree using UPGMA clustering
 
-Hierarchical clustering is widely used in bioinformatics. In the CB2442 course, you have encountered it both in the Sequence feature module and in the Phylogenetics module. In this lab, your task is to write a function that, given a pairwise distance matrix and a list of names of the corresponding objects (in this case, sequences), performs hierarchical clustering using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) method. The function should output the result as a tree in the Newick format.
+Hierarchical clustering is widely used in bioinformatics. In the CB2442 course, you have encountered it both in the Sequence feature module and in the Phylogenetics module. In this lab, your task is to write a function that, given a pairwise distance matrix and a list of names of the corresponding objects (in this case, sequences), performs hierarchical clustering using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) method. The function should output the result as a tree in the Newick format. **Before starting, please read this thoroughly**
 
 ### Installation
 
@@ -22,6 +22,56 @@ def upgma(dist_matr, names_list):
 ```
 that takes a pairwise distance matrix (as a 2-dimensional [numpy array](https://www.w3schools.com/python/numpy/numpy_creating_arrays.asp)) and a list of sequence names (as a [list](https://www.w3schools.com/python/python_lists.asp)) as input and should return a tree in [Newick](https://en.wikipedia.org/wiki/Newick_format#:~:text=In%20mathematics%2C%20Newick%20tree%20format,Maddison%2C%20Christopher%20Meacham%2C%20F.) format (as a [string](https://www.w3schools.com/python/python_strings.asp)). The output tree should not include branch lengths. Also, set the list `authors` in the beginning of the file to contain the group members' names.  
 
+
+### Algorithm Steps
+1. **Initialization**:
+
+   - **`nwk_list`**: A list containing the names of the sequences (e.g., `['S1', 'S2', 'S3', ...]`). 
+   - **`cluster_list`**: A list of lists, each with a single sequence index (e.g., `[[0], [1], [2], ...]`). This is the same as the **`nwk_list`** but it has indices instead of names.
+   - **`updated_dist_matr`**: A copy of the original distance matrix (`dist_matr`) that will be **modified** during clustering.
+
+2. **Iterative Clustering**:
+
+   - **While** there is more than one cluster in `cluster_list`:
+     - **Find the pair of clusters with the smallest distance** in `updated_dist_matr`. Remember that each row/column represents a cluster!
+     - **Merge Clusters**:
+       - Create a new cluster by combining the indices of the two closest clusters. `[0,1]`
+       - Remove the clusters that you merged from `cluster_list`.  `[[0], [1], [2], ...]` --> `[[2], ...]`
+       - Append the new merged cluster to `cluster_list`. `[[2], [0,1], ...]`
+     - **Update `nwk_list`**:
+       - Combine the Newick strings of the two clusters (e.g., `'(S1,S2)'`).
+       - Remove the old entries from `nwk_list`.
+       - Append the new Newick string to `nwk_list`.
+     - **Update `updated_dist_matr`**:
+       - Remove rows and columns of the merged clusters.
+       - Add a new row and column for the new cluster initialized with zeros (this step is already done for you).
+       - Compute the average distances between the new cluster and the remaining clusters. This way you can fill the new row and column added previously. If you find this tricky refer to the "updating the distance matrix" section.
+### Debugging Tips
+- **Printing statements** for the `cluster_list`, the `nwk_list`, and the `updated_dist_matr` to check they are changing according to your plan.
+- Think about what the smallest cluster distance means. Does it make sense to take the smallest distance for `i=j`?
+- Think about the distance matrix being symmetric. Does it mean there is redundancy in the data?
+- Make sure you know how to get array and list indices. Stack overflow can be a good source
+- Make sure you know how to add and delete elements from a list/array in Python. When you delete an element, be careful with the index order!
+- Be careful with the iterators in the `for` loops, and make sure they are behaving as you want. Python starts at `0` and finishes at `len-1` by default, but it can be changed.
+
+### Updating the Distance Matrix
+The last step is to update the distance matrix by filling the row and column initialized at values of `0`.
+
+For each other cluster:
+
+1. Look at every sequence in the new cluster.
+2. Look at every sequence in the other cluster.
+3. Find the distance between each pair (one from the new cluster and one from the other cluster) **from the original distance matrix** (`dist_matr`).
+4. Add all these distances together.
+5. Divide the total by the number of comparisons (i.e., the number of pairs). 
+
+This gives you the average distance between the new cluster and the other cluster. 
+
+
+
+To be more clear: If the new cluster consists of the original indices `[0,1]`, you start by computing the average distance between `0` and `2`, and then between `1` and `2` (where `2` is `other_cluster`). This average distance is a value that goes in the `updated_dist_matr` at the index `[new_cluster, other_cluster]`... Remember that the matrix is symmetric though! Repeat for every other cluster.
+
+
 ### Test
 
 You can make an initial execution of your `upgma` function by running the main function of the python file itself by executing the line