Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes from first run #23

Merged
merged 12 commits into from
Mar 21, 2024
Merged

Bug fixes from first run #23

merged 12 commits into from
Mar 21, 2024

Conversation

taylorreiter
Copy link
Member

@taylorreiter taylorreiter commented Mar 15, 2024

PR checklist

  • Describe the changes you've made.
  • Describe any tests you have conducted to confirm that your changes behave as expected.
  • If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
  • If you've added new functionality, make sure that the documentation is updated accordingly.
  • If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR description

This PR addresses two bugs that came up for me when running this pipeline on real data (I'm done with t. urticae and human).

The first bug is in the combine script at the end of the pipeline. It deals with an edge case where there are multiple blast hits against peptipedia and reduces it down to 1. It then allows many matches against deepsig bc I found that peptides could be chains and signals. I also decided to split the peptide predictions out from the peptide annotations (as outline in #16) bc deepsig and peptipedia can produce many valid annotations, but this way each peptide only occurs once in the predictions dataframe.

The next bug is much bigger. When running on human data, I found non-standard amino acids broke nlpprecursor. I added a script to filter those out before running nlpprecursor. However, while debugging, I noticed that the two predict scripts do not output predictions in the same order as I had assumed they do 😱
Here is an example output:

class_predictions
[{'class_predictions': [{'class': 'NONRIPP', 'score': 0.7995544672012329}], 'name': 'NP_000006.2'},
{'class_predictions': [{'class': 'LASSO_PEPTIDE', 'score': 0.9999966621398926}], 'name': 'nonproblematic_sequence'}, 
{'class_predictions': [{'class': 'NONRIPP', 'score': 0.9863235950469971}], 'name': 'NP_000008.1'},
{'class_predictions': [{'class': 'NONRIPP', 'score': 0.8869850635528564}], 'name': 'NP_000007.1'}]
>>> cleavage_predictions
[{'name': 'nonproblematic_sequence', 'cleavage_prediction': {'sequence': 'LAGGTAKDLLGGHQLI', 'start': 19, 'stop': 35, 'score': -19735.994140625, 'name': 'nonproblematic_sequence', 'status': 'success'}},
{'name': 'NP_000006.2', 'cleavage_prediction': {'sequence': 'ISLGRNLVPKPGDGSLTI', 'start': 272, 'stop': 290, 'score': -18402.669921875, 'name': 'NP_000006.2', 'status': 'success'}},
{'name': 'NP_000007.1', 'cleavage_prediction': {'sequence': 'DAKIYQIYEGTSQIQRLIVAREHIDKYKN', 'start': 392, 'stop': 421, 'score': -17646.052734375, 'name': 'NP_000007.1', 'status': 'success'}},
{'name': 'NP_000008.1', 'cleavage_prediction': {'sequence': 'IYEGTSEIQRLVIAGHLLRSYRS', 'start': 389, 'stop': 412, 'score': -17679.158203125, 'name': 'NP_000008.1', 'status': 'success'}}]

My changes to the nlpprecursor script deal with this. I've confirmed that they work correctly on my test data set now.

Also, I altered the script so the second predict only runs on positive classification (not on NONRIPP) to save on run time.

I confirm that these changes run on my demo data, and that the nlpprecursor script runs on a more specialized test data set (see below so it's recorded somewhere).

No new software added!

>problematic_sequence
MCAARLAAAAAAAQSVYAFSARPLAGGEPVSLGSLRGKVLLIENVASLUGTTVRDYTQMNELQRRLGPRGLVVLGFPCNQ
FGHQENAKNEEILNSLKYVRPGGGFEPNFMLFEKCEVNGAGAHPLFAFLREALPAPSDDATALMTDPKLITWSPVCRNDV
AWNFEKFLVGPDGVPLRRYSRRFQTIDIEPDIEALLSQGPSCA
>nonproblematic_sequence
MTYERPTLSKAGGFRKTTGLAGGTAKDLLGGHQLI
>NP_000006.2 arylamine N-acetyltransferase 2 [Homo sapiens]
MDIEAYFERIGYKNSRNKLDLETLTDILEHQIRAVPFENLNMHCGQAMELGLEAIFDHIVRRNRGGWCLQVNQLLYWALT
TIGFQTTMLGGYFYIPPVNKYSTGMVHLLLQVTIDGRNYIVDAGSGSSSQMWQPLELISGKDQPQVPCIFCLTEERGIWY
LDQIRREQYITNKEFLNSHLLPKKKHQKIYLFTLEPRTIEDFESMNTYLQTSPTSSFITTSFCSLQTPEGVYCLVGFILT
YRKFNYKDNTDLVEFKTLTEEEVEEVLRNIFKISLGRNLVPKPGDGSLTI
>NP_000007.1 medium-chain specific acyl-CoA dehydrogenase, mitochondrial isoform a precursor [Homo sapiens]
MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQKEFQATARKFAREEIIPVAAEYDKTGEYPVPLIRR
AWELGLMNTHIPENCGGLGLGTFDACLISEELAYGCTGVQTAIEGNSLGQMPIIIAGNDQQKKKYLGRMTEEPLMCAYCV
TEPGAGSDVAGIKTKAEKKGDEYIINGQKMWITNGGKANWYFLLARSDPDPKAPANKAFTGFIVEADTPGIQIGRKELNM
GQRCSDTRGIVFEDVKVPKENVLIGDGAGFKVAMGAFDKTRPVVAAGAVGLAQRALDEATKYALERKTFGKLLVEHQAIS
FMLAEMAMKVELARMSYQRAAWEVDSGRRNTYYASIAKAFAGDIANQLATDAVQILGGNGFNTEYPVEKLMRDAKIYQIY
EGTSQIQRLIVAREHIDKYKN
>NP_000008.1 short-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
MAAALLARASGPARRALCPRAWRQLHTIYQSVELPETHQMLLQTCRDFAEKELFPIAAQVDKEHLFPAAQVKKMGGLGLL
AMDVPEELGGAGLDYLAYAIAMEEISRGCASTGVIMSVNNSLYLGPILKFGSKEQKQAWVTPFTSGDKIGCFALSEPGNG
SDAGAASTTARAEGDSWVLNGTKAWITNAWEASAAVVFASTDRALQNKGISAFLVPMPTPGLTLGKKEDKLGIRGSSTAN
LIFEDCRIPKDSILGEPGMGFKIAMQTLDMGRIGIASQALGIAQTALDCAVNYAENRMAFGAPLTKLQVIQFKLADMALA
LESARLLTWRAAMLKDNKKPFIKEAAMAKLAASEAATAISHQAIQILGGMGYVTEMPAERHYRDARITEIYEGTSEIQRL
VIAGHLLRSYRS

should produce this tsv:

peptide_id      start   end     peptide_type    peptide_class   prediction_tool nlpprecursor_class_score        nlpprecursor_cleavage_sequence  nlpprecursor_cleavage_score
nonproblematic_sequence_start19_end35   19      35      cleavage        LASSO_PEPTIDE   nlpprecursor    0.9999966621398926      LAGGTAKDLLGGHQLI        -19735.994140625

Copy link
Member

@keithchev keithchev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! one minor comment.


args = parser.parse_args()

allowed_amino_acids = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is hard-coded I'd add a comment explain what this list corresponds to (e.g. whether this is the standard 20 amino acids).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is a variable and not a constant that is documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not totally -- I was thinking we could substitute selanocysteine for just cysteine or something in the future if we wanted to, and then we could change the variable to have e.g. "U" in it, but I didn't think it through that much. I think it just happened? I can change it if it's better as a hard coded constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merging this PR now, please open an issue if you think this should be switched to a documented constant!

@taylorreiter taylorreiter merged commit c934030 into main Mar 21, 2024
2 checks passed
@taylorreiter taylorreiter deleted the ter/fixes-from-first-run branch March 21, 2024 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants