Merge branch 'master' of https://github.com/eead-csic-compbio/get_hom…

…ologues
eead-csic-compbio · Aug 28, 2018 · a449fb6 · a449fb6
2 parents 9b2ee98 + 7f04775
commit a449fb6

Showing 5 changed files with 59 additions and 0 deletions.
diff --git a/user_utils/normalize/README.md b/user_utils/normalize/README.md
@@ -69,3 +69,62 @@ perl compare_clusters.pl -m -o intersection -d \
   your_sequences_homologues/clusters_0taxa_algOMCL_e0_raw,your_sequences_homologues/clusters_0taxa_algOMCL_e0_norm \
   &> log.intersection
 ```
+
+## Effect on the clustering
+
+__1) Protein datasets__
+
+Sequence clusters produced by the standard and the normalized versions of GET_HOMOLOGUES-EST
+on the peptides of 119 isolates of the bacterial genus Stenotrophomonas showed some differences.
+There were 23,249 identical clusters in both runs, but 1,190 and 1,359 clusters unique 
+to the standard and normalized setups, respectively. The number of clusters containing a single sequence
+(singletons) was greater (329 vs 225) and more evenly distributed across peptide length after the normalization step (Figure 1).
+
+![singleton_dist_prot](images/singleton_len_prot.png)
+
+*Figure 1. Length distribution of singletons in the original and normalized clustering. After
+normalization, singletons are more evenly distributed across peptide length.*
+
+Some sequences originally found in clusters were classified as singletons after normalization.
+Subtracting those sequences did not have any effect on the overall percentage of sequence identity of the clusters,
+which indicates that these may be misclustered after normalization. However, some outliers among long sequences
+increased cluster mean identity when moved to singleton clusters, suggesting that the normalization process is useful
+for building high-quality clusters of long sequences to be used in phylogenetic analyses (Figure 2).
+
+![diff_identity](images/effect_prot_id.png)
+
+*Figure 2. Difference in cluster % sequence identity before and after removing the sequences split off by
+the normalization process, in different length regions. Positive values indicate an increase in identity
+once a sequence is removed following normalization. Length was measured as the mean alignment length
+reported by BLASTP for sequences within the cluster.*
+
+__2) Nucleotide datasets__
+
+The cluster sets produced by the standard and normalized runs of the
+GET_HOMOLOGUES-EST protocol were very different for the transcripts of 11 species of the genus Oryza.
+In particular, there were 111,964 identical gene clusters, and 14,779 and 38,524 unique clusters in the original
+and normalized results, respectively. Moreover, the number of singletons calculated by the standard
+program was 612, whereas after normalization it increased to 18,126. The singletons
+were, as in the protein dataset example, more evenly distributed across nucleotide length after the
+normalization step (Figure 3).
+
+![singleton_dist_nucl](images/singleton_len_nucl.png)
+
+*Figure 3. Length distribution of singletons in the original and normalized clustering. After
+normalization, singletons are more evenly distributed across nucleotide length.*
+
+The mean BLAST coverage values of the clusters usually increased after the subtraction of
+sequences because of the normalization. Manual inspection of some cases revealed that long sequences
+were subtracted from the original clusters and classified as singletons even when some regions
+aligned without mismatches with other sequences of the cluster (Figure 4). This effect of the normalization
+process might not be desired if users want to cluster CDS and transcripts together, even if they only
+share some particular regions, such as exons, but not others, such as introns, which are only present
+in transcripts. In most cases, subtracting sequences because of normalization did not have an effect on
+the overall identity of the clusters.
+
+![effect_cov_nucl](images/effect_nucl_cov.png)
+
+*Figure 4. Coverage of the clusters before (x-axis) and after (y-axis) new singletons were subtracted
+from original clusters because of the normalization process.*
+
+
diff --git a/user_utils/normalize/images/effect_nucl_cov.png b/user_utils/normalize/images/effect_nucl_cov.png
diff --git a/user_utils/normalize/images/effect_prot_id.png b/user_utils/normalize/images/effect_prot_id.png
diff --git a/user_utils/normalize/images/singleton_len_nucl.png b/user_utils/normalize/images/singleton_len_nucl.png
diff --git a/user_utils/normalize/images/singleton_len_prot.png b/user_utils/normalize/images/singleton_len_prot.png
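
The comparison reported in the README section added by this commit (identical clusters, clusters unique to each run, singleton counts, and sequences that become singletons only after normalization) can be sketched in a few lines of Python. This is a minimal illustration on toy data and not a description of how `compare_clusters.pl` works internally. The function names `compare_cluster_sets` and `new_singletons`, and the representation of a cluster as a list of sequence identifiers, are assumptions made for this example.

```python
def compare_cluster_sets(standard, normalized):
    """Count identical, unique and singleton clusters in two clusterings.

    Each argument is an iterable of clusters; each cluster is an iterable
    of sequence identifiers (a hypothetical in-memory representation).
    """
    std = {frozenset(c) for c in standard}
    norm = {frozenset(c) for c in normalized}
    return {
        "identical": len(std & norm),          # same membership in both runs
        "unique_standard": len(std - norm),    # only in the standard run
        "unique_normalized": len(norm - std),  # only in the normalized run
        "singletons_standard": sum(1 for c in std if len(c) == 1),
        "singletons_normalized": sum(1 for c in norm if len(c) == 1),
    }


def new_singletons(standard, normalized):
    """Sequences clustered with others in the standard run but alone after normalization."""
    clustered_before = {seq for c in standard if len(set(c)) > 1 for seq in c}
    alone_after = {seq for c in normalized if len(set(c)) == 1 for seq in c}
    return clustered_before & alone_after


# Toy data (invented for illustration, not taken from the datasets above)
standard = [["a1", "b1", "c1"], ["a2", "b2"], ["a3"]]
normalized = [["a1", "b1"], ["c1"], ["a2", "b2"], ["a3"]]

print(compare_cluster_sets(standard, normalized))
print(sorted(new_singletons(standard, normalized)))
```

Comparing clusters as sets of members makes the result independent of the order in which sequences or clusters are listed. The sequences returned by `new_singletons` are the ones whose removal is evaluated in Figures 2 and 4.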