eead-csic-compbio committed Aug 28, 2018
2 parents 9b2ee98 + 7f04775 commit a449fb6
Showing 5 changed files with 59 additions and 0 deletions.
59 changes: 59 additions & 0 deletions user_utils/normalize/README.md
Expand Up @@ -69,3 +69,62 @@ perl compare_clusters.pl -m -o intersection -d \
your_sequences_homologues/clusters_0taxa_algOMCL_e0_raw,your_sequences_homologues/clusters_0taxa_algOMCL_e0_norm \
&> log.intersection
```

## Effect on the clustering

__1) Protein datasets__

The standard and normalized versions of GET_HOMOLOGUES-EST produced somewhat different sequence clusters
when run on the peptides of 119 isolates of the bacterial genus Stenotrophomonas.
Both runs shared 23,249 identical clusters, while 1,190 and 1,359 clusters were unique
to the standard and normalized setups, respectively. After the normalization step, clusters containing a single sequence
(singletons) were more numerous (329 vs 225) and more evenly distributed across peptide length (Figure 1).

![singleton_dist_prot](images/singleton_len_prot.png)

*Figure 1. Length distribution of singletons in the original and normalized clustering. After normalization,
singletons are more evenly distributed across peptide length.*
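
As a rough illustration of the kind of comparison summarized above, the following sketch treats each cluster as a set of sequence identifiers and counts identical clusters, clusters unique to each run, and singletons. It is not the logic of `compare_clusters.pl`; it assumes that two clusters are identical only when they contain exactly the same members, and the function name and toy identifiers are hypothetical.

```python
def compare_cluster_sets(standard, normalized):
    """Compare two clusterings, each given as an iterable of collections of sequence IDs."""
    # Assumption: a cluster is fully described by its set of member IDs.
    std = {frozenset(cluster) for cluster in standard}
    norm = {frozenset(cluster) for cluster in normalized}
    return {
        "identical": len(std & norm),
        "unique_standard": len(std - norm),
        "unique_normalized": len(norm - std),
        "singletons_standard": sum(1 for c in std if len(c) == 1),
        "singletons_normalized": sum(1 for c in norm if len(c) == 1),
    }


# Toy data (hypothetical IDs): normalization splits c1 off its original cluster.
standard = [["a1", "b1", "c1"], ["a2", "b2"], ["a3"]]
normalized = [["a1", "b1"], ["c1"], ["a2", "b2"], ["a3"]]

summary = compare_cluster_sets(standard, normalized)
print(summary)
```

In the toy data, the sequence `c1` leaves its cluster and becomes a new singleton, so the original three-member cluster is counted as unique to the standard run.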

Some sequences originally found in clusters were classified as singletons after normalization.
Subtracting those sequences generally had no effect on the overall percentage of sequence identity of the clusters,
which suggests that they may be misclustered after normalization. However, removing some outliers among long sequences
increased the mean identity of their clusters, which suggests that the normalization process can be useful
for building high-quality clusters of long sequences for phylogenetic analyses (Figure 2).

![diff_identity](images/effect_prot_id.png)

*Figure 2. Difference in cluster % sequence identity before and after removing the sequences that became
singletons after normalization, across different length ranges. Positive values indicate an increase in identity
once a sequence is removed. Length was measured as the mean alignment length
reported by BLASTP for sequences within the cluster.*
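
The quantity plotted in Figure 2 can be sketched as follows. The README does not state how cluster identity was computed, so this example assumes it is the mean of pairwise % identity values among cluster members; the function names, the lookup table and the toy values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def mean_identity(members, identity):
    """Mean pairwise % identity among members, or None with fewer than two members."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return None
    return mean(identity[frozenset(pair)] for pair in pairs)


def identity_change(members, removed, identity):
    """Cluster identity after removing one sequence minus identity before removal."""
    before = mean_identity(members, identity)
    after = mean_identity([m for m in members if m != removed], identity)
    if before is None or after is None:
        return None
    return after - before


# Toy pairwise identities (hypothetical): s3 is an outlier in the cluster.
identity = {
    frozenset(("s1", "s2")): 98.0,
    frozenset(("s1", "s3")): 80.0,
    frozenset(("s2", "s3")): 80.0,
}

delta = identity_change(["s1", "s2", "s3"], "s3", identity)
print(delta)
```

A positive value means that the cluster became more homogeneous once the sequence was removed, which corresponds to the positive values described in the figure caption.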

__2) Nucleotide datasets__

The cluster sets produced by the standard and normalized versions of the
GET_HOMOLOGUES-EST protocol differed markedly for the transcripts of 11 species of the genus Oryza.
In particular, there were 111,964 identical gene clusters, and 14,779 and 38,524 unique clusters in the original
and normalized results, respectively. Moreover, the standard program produced 612 singletons,
whereas normalization increased this number to 18,126. As in the protein dataset example,
singletons were more evenly distributed across nucleotide length after the
normalization step (Figure 3).

![singleton_dist_nucl](images/singleton_len_nucl.png)

*Figure 3. Length distribution of singletons in the original and normalized clustering. After normalization,
singletons are more evenly distributed across nucleotide length.*
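
A length distribution such as those in Figures 1 and 3 can be tabulated by binning singleton lengths. The sketch below is only an illustration under assumed inputs; the bin width, function name and toy lengths are hypothetical and are not taken from the analysis behind the figures.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100):
    """Count sequences per length bin; each key is the lower bound of its bin."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# Toy singleton lengths in nucleotides (hypothetical values).
singleton_lengths = [120, 180, 250, 980, 1020, 1500]

hist = length_histogram(singleton_lengths, bin_width=500)
print(hist)
```

Running this on the singleton lengths of the standard and normalized runs separately would give two tables that can be plotted side by side.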

The mean BLAST coverage of the clusters usually increased after sequences were subtracted
because of the normalization. Manual inspection of some cases revealed that long sequences
were subtracted from the original clusters and classified as singletons even when some of their regions
aligned without mismatches to other sequences in the cluster (Figure 4). This effect of the normalization
process might be undesirable for users who want to cluster CDS and transcripts together, even when they
share only some regions, such as exons, but not others, such as introns, which are present only
in transcripts. In most cases, subtracting sequences because of normalization had no effect on
the overall identity of the clusters.

![effect_cov_nucl](images/effect_nucl_cov.png)

*Figure 4. Coverage of the clusters before (x-axis) and after (y-axis) new singletons were subtracted
from the original clusters because of the normalization process.*


Binary file added user_utils/normalize/images/effect_nucl_cov.png
Binary file added user_utils/normalize/images/effect_prot_id.png
