Bug fixes from first run #23

taylorreiter · 2024-03-15T21:30:28Z

PR checklist

Describe the changes you've made.
Describe any tests you have conducted to confirm that your changes behave as expected.
If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
If you've added new functionality, make sure that the documentation is updated accordingly.
If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR description

This PR addresses two bugs that came up for me when running this pipeline on real data (I'm done with t. urticae and human).

The first bug is in the combine script at the end of the pipeline. It deals with an edge case where there are multiple blast hits against peptipedia and reduces it down to 1. It then allows many matches against deepsig bc I found that peptides could be chains and signals. I also decided to split the peptide predictions out from the peptide annotations (as outline in #16) bc deepsig and peptipedia can produce many valid annotations, but this way each peptide only occurs once in the predictions dataframe.

The next bug is much bigger. When running on human data, I found non-standard amino acids broke nlpprecursor. I added a script to filter those out before running nlpprecursor. However, while debugging, I noticed that the two predict scripts do not output predictions in the same order as I had assumed they do 😱
Here is an example output:

class_predictions
[{'class_predictions': [{'class': 'NONRIPP', 'score': 0.7995544672012329}], 'name': 'NP_000006.2'},
{'class_predictions': [{'class': 'LASSO_PEPTIDE', 'score': 0.9999966621398926}], 'name': 'nonproblematic_sequence'}, 
{'class_predictions': [{'class': 'NONRIPP', 'score': 0.9863235950469971}], 'name': 'NP_000008.1'},
{'class_predictions': [{'class': 'NONRIPP', 'score': 0.8869850635528564}], 'name': 'NP_000007.1'}]
>>> cleavage_predictions
[{'name': 'nonproblematic_sequence', 'cleavage_prediction': {'sequence': 'LAGGTAKDLLGGHQLI', 'start': 19, 'stop': 35, 'score': -19735.994140625, 'name': 'nonproblematic_sequence', 'status': 'success'}},
{'name': 'NP_000006.2', 'cleavage_prediction': {'sequence': 'ISLGRNLVPKPGDGSLTI', 'start': 272, 'stop': 290, 'score': -18402.669921875, 'name': 'NP_000006.2', 'status': 'success'}},
{'name': 'NP_000007.1', 'cleavage_prediction': {'sequence': 'DAKIYQIYEGTSQIQRLIVAREHIDKYKN', 'start': 392, 'stop': 421, 'score': -17646.052734375, 'name': 'NP_000007.1', 'status': 'success'}},
{'name': 'NP_000008.1', 'cleavage_prediction': {'sequence': 'IYEGTSEIQRLVIAGHLLRSYRS', 'start': 389, 'stop': 412, 'score': -17679.158203125, 'name': 'NP_000008.1', 'status': 'success'}}]

My changes to the nlpprecursor script deal with this. I've confirmed that they work correctly on my test data set now.

Also, I altered the script so the second predict only runs on positive classification (not on NONRIPP) to save on run time.

I confirm that these changes run on my demo data, and that the nlpprecursor script runs on a more specialized test data set (see below so it's recorded somewhere).

No new software added!

>problematic_sequence
MCAARLAAAAAAAQSVYAFSARPLAGGEPVSLGSLRGKVLLIENVASLUGTTVRDYTQMNELQRRLGPRGLVVLGFPCNQ
FGHQENAKNEEILNSLKYVRPGGGFEPNFMLFEKCEVNGAGAHPLFAFLREALPAPSDDATALMTDPKLITWSPVCRNDV
AWNFEKFLVGPDGVPLRRYSRRFQTIDIEPDIEALLSQGPSCA
>nonproblematic_sequence
MTYERPTLSKAGGFRKTTGLAGGTAKDLLGGHQLI
>NP_000006.2 arylamine N-acetyltransferase 2 [Homo sapiens]
MDIEAYFERIGYKNSRNKLDLETLTDILEHQIRAVPFENLNMHCGQAMELGLEAIFDHIVRRNRGGWCLQVNQLLYWALT
TIGFQTTMLGGYFYIPPVNKYSTGMVHLLLQVTIDGRNYIVDAGSGSSSQMWQPLELISGKDQPQVPCIFCLTEERGIWY
LDQIRREQYITNKEFLNSHLLPKKKHQKIYLFTLEPRTIEDFESMNTYLQTSPTSSFITTSFCSLQTPEGVYCLVGFILT
YRKFNYKDNTDLVEFKTLTEEEVEEVLRNIFKISLGRNLVPKPGDGSLTI
>NP_000007.1 medium-chain specific acyl-CoA dehydrogenase, mitochondrial isoform a precursor [Homo sapiens]
MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQKEFQATARKFAREEIIPVAAEYDKTGEYPVPLIRR
AWELGLMNTHIPENCGGLGLGTFDACLISEELAYGCTGVQTAIEGNSLGQMPIIIAGNDQQKKKYLGRMTEEPLMCAYCV
TEPGAGSDVAGIKTKAEKKGDEYIINGQKMWITNGGKANWYFLLARSDPDPKAPANKAFTGFIVEADTPGIQIGRKELNM
GQRCSDTRGIVFEDVKVPKENVLIGDGAGFKVAMGAFDKTRPVVAAGAVGLAQRALDEATKYALERKTFGKLLVEHQAIS
FMLAEMAMKVELARMSYQRAAWEVDSGRRNTYYASIAKAFAGDIANQLATDAVQILGGNGFNTEYPVEKLMRDAKIYQIY
EGTSQIQRLIVAREHIDKYKN
>NP_000008.1 short-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
MAAALLARASGPARRALCPRAWRQLHTIYQSVELPETHQMLLQTCRDFAEKELFPIAAQVDKEHLFPAAQVKKMGGLGLL
AMDVPEELGGAGLDYLAYAIAMEEISRGCASTGVIMSVNNSLYLGPILKFGSKEQKQAWVTPFTSGDKIGCFALSEPGNG
SDAGAASTTARAEGDSWVLNGTKAWITNAWEASAAVVFASTDRALQNKGISAFLVPMPTPGLTLGKKEDKLGIRGSSTAN
LIFEDCRIPKDSILGEPGMGFKIAMQTLDMGRIGIASQALGIAQTALDCAVNYAENRMAFGAPLTKLQVIQFKLADMALA
LESARLLTWRAAMLKDNKKPFIKEAAMAKLAASEAATAISHQAIQILGGMGYVTEMPAERHYRDARITEIYEGTSEIQRL
VIAGHLLRSYRS

should produce this tsv:

peptide_id      start   end     peptide_type    peptide_class   prediction_tool nlpprecursor_class_score        nlpprecursor_cleavage_sequence  nlpprecursor_cleavage_score
nonproblematic_sequence_start19_end35   19      35      cleavage        LASSO_PEPTIDE   nlpprecursor    0.9999966621398926      LAGGTAKDLLGGHQLI        -19735.994140625

keithchev

LGTM! one minor comment.

keithchev · 2024-03-18T17:20:30Z

scripts/filter_protein_sequences_with_nonstandard_amino_acids.py

+
+    args = parser.parse_args()
+
+    allowed_amino_acids = {


since this is hard-coded I'd add a comment explain what this list corresponds to (e.g. whether this is the standard 20 amino acids).

Is there a reason this is a variable and not a constant that is documented?

Not totally -- I was thinking we could substitute selanocysteine for just cysteine or something in the future if we wanted to, and then we could change the variable to have e.g. "U" in it, but I didn't think it through that much. I think it just happened? I can change it if it's better as a hard coded constant.

merging this PR now, please open an issue if you think this should be switched to a documented constant!

taylorreiter added 6 commits March 13, 2024 21:04

rearrange combine for real data

cdf78a9

change to run once per seq and catch key error

9a5d5b3

roll back exception check and fix nlp reporting problem

11040ea

add a script to filter nonstandard aa char not accepted by nlppr

16f62d6

add rule to filter nonstandard char

75e03ed

format

a8ff743

taylorreiter requested a review from keithchev March 15, 2024 21:31

taylorreiter added 5 commits March 16, 2024 17:15

break up output files into two

d1b1b1c

update combine run command

d662201

rename df

8f053cb

woops wrong df pointer

6b6183e

typo

68e8254

keithchev approved these changes Mar 18, 2024

View reviewed changes

add note about 20 standard amino acids

50a16b4

taylorreiter mentioned this pull request Mar 18, 2024

Consider outputting two TSV files instead of one from combine_peptide_annotations.R #16

Closed

taylorreiter merged commit c934030 into main Mar 21, 2024
2 checks passed

taylorreiter deleted the ter/fixes-from-first-run branch March 21, 2024 13:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fixes from first run #23

Bug fixes from first run #23

taylorreiter commented Mar 15, 2024 •

edited

Loading

keithchev left a comment

keithchev Mar 18, 2024

mertcelebi Mar 19, 2024

taylorreiter Mar 19, 2024

taylorreiter Mar 21, 2024

Bug fixes from first run #23

Bug fixes from first run #23

Conversation

taylorreiter commented Mar 15, 2024 • edited Loading

PR checklist

PR description

keithchev left a comment

Choose a reason for hiding this comment

keithchev Mar 18, 2024

Choose a reason for hiding this comment

mertcelebi Mar 19, 2024

Choose a reason for hiding this comment

taylorreiter Mar 19, 2024

Choose a reason for hiding this comment

taylorreiter Mar 21, 2024

Choose a reason for hiding this comment

taylorreiter commented Mar 15, 2024 •

edited

Loading