
Add rule to classify peptide bioactivity with the autopeptideml tool #10

Merged: 7 commits into main from ter/autopeptideml on Feb 23, 2024

Conversation


@taylorreiter (Member) commented Feb 21, 2024

PR checklist

  • Describe the changes you've made.
  • Describe any tests you have conducted to confirm that your changes behave as expected.
  • If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
  • If you've added new functionality, make sure that the documentation is updated accordingly.
  • If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR Description

This PR adds a rule to run the binary classifier AutoPeptideML. I chose to use the models that the authors trained for their preprint; however, as noted in a docstring, we could instead use the labels in the peptipedia database, train new models in a separate snakefile (like the nrps one), and then make them available for download. I prefer using the models they built for their preprint because they and other experts put thought into the labels and use cases.

The models were supplied to me by the author of the paper via email. They said they are working on a solution to make them available for download, so I added a TODO item to the rule so we can switch to downloading them once that happens.

The output of the script looks like this (first few lines), where the AB column is the name of the model and the value is that model's prediction for the bioactivity.

ID      sequence        AB
Transcript_1000626.p1_NONRIPP_49_105_nlpprecursor       YYSGLVTDSRNMQGTVIKRKRQVKRCLAKVRTNKCVCLCQQRIVLQRCAATTFPSL        0.6666666666666666
Transcript_0.p1_CLASS_I_LANTIPEPTIDE_134_180_nlpprecursor       HLRTHTGECPYKCDHCDSSFFEKGNLKQHPCTHTGERPYKCDHCDS  0.3333333333333333
Transcript_100036.p2_NONRIPP_55_96_nlpprecursor RSVAEGTTLTPWKERKKAAAIVFASKRFPHLSAHSFLLPPP       0.3333333333333333

Testing

The changes run successfully on the demo data set and I confirmed that pytorch can find the GPU in the snakemake-built conda environment.

Documentation

punt again...but getting very close to actually doing this!

next PR

My next PR will clean up some of the issues with peptide header names and collect all of the annotation information produced since the peptipedia PR.

Update

I'm working on a summary script to put together all of the annotation data, which I'm in part hoping to use to determine whether a peptide is real or not. As part of this, I was looking at the autopeptideml predictions, and they look something like this:

ID sequence AB ACE ACP AF AMAP AMP AOX APP AV BBP DPPIV MRSA Neuro QS TOX TTCA total
Transcript_1000463.p1_start95_end131 HNLIAESTIGAALAVMEAMQTTYAVRGKLVVLGTPA 0.33 0.33 0.66 0 0 0.66 1 0.33 1 1 0.33 0 1 0.33 1 0.66 8.67
Transcript_100028.p1_start77_end112 LRGQSLGSVAFLDTASAYPLVDSTAGLHVSAIAPV 0 0.33 0.33 0 0 0.33 1 0 1 1 0.33 0 1 0.33 0.66 1 7.33
Transcript_1001336.p1_start33_end79 GEVGETEDLEVLASFRVSSYLVSPVIAEDSFHVTSQATSLGAAATR 0 0.66 0 0 0.33 0.33 1 0 1 1 0.33 0 1 0 1 0.66 7.33
Transcript_1000535.p1_start68_end92 MFSSNRGTVPVSLDMPFQVVRQVD 0 0.66 0 0 0 0.33 1 0 0.66 0.66 0.66 0 1 0.33 0.66 1 7
Transcript_1000655.p1_start55_end108 SYVRKLCFPEGNPVLDVEDLKHGGHYVALLPHESFKKPSSKIPNNYMRTYETL 0 0.66 0 0 0 0 1 0.33 0.66 1 0.66 0 1 0.66 0.66 0.33 7
Transcript_1.p1_start84_end120 DHIRIHTGEKPYHCHLCPMAFAQNSGLYHHLRRHKN 0.33 0 0 1 0 1 1 1 0.33 1 0 0 0 0 1 0 6

This feels somewhat concerning: there are so many predictions that the tool is certainly over-predicting. Since this isn't a labelled dataset (it's just the first 200 rows of transcripts from the Amblyomma transcriptome), we don't know the ground truth here. However, imagine being presented with this information...what do you do with it?? I was sort of hoping that there wouldn't be quite so much over-prediction, so that we could use this information as a filter for peptides that are more likely to be real. I don't think we can do that now, but I do think this is still worth including.

I'm going to start an issue on thinking through how to filter down to peptides that are potentially real.
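As a side note, the `total` column in the table above is just the sum of the per-bioactivity scores, and that sum is the kind of quantity a crude first-pass filter could threshold on. A minimal sketch with a couple of abridged rows from the table (only a subset of the model columns; the cutoff value is hypothetical, not from the pipeline):

```python
import csv
import io

# Two rows abridged from the table above (a subset of the model columns).
tsv = """ID\tAB\tACE\tACP\tAF
Transcript_1000463.p1_start95_end131\t0.33\t0.33\t0.66\t0
Transcript_1.p1_start84_end120\t0.33\t0\t0\t1
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
score_cols = [c for c in rows[0] if c != "ID"]

# Total score per peptide: the sum of the per-model predictions.
totals = {r["ID"]: sum(float(r[c]) for c in score_cols) for r in rows}

# A crude filter: drop peptides whose summed score suggests over-prediction.
CUTOFF = 8.0  # hypothetical threshold
kept = [pid for pid, total in totals.items() if total < CUTOFF]
```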

@taylorreiter taylorreiter marked this pull request as draft February 21, 2024 21:34
@keithchev (Member) left a comment:
Made a few inline comments but nothing major.

Some questions/comments:

  • should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.
  • the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).
  • we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.
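On the embedding point: one way to avoid re-generating the ESM embeddings per model would be to compute each sequence's embedding once and reuse it across all classifiers. A minimal sketch of that caching pattern (the `embed` and `make_model` functions here are illustrative stand-ins, not the autopeptideml API):

```python
# Sketch: embed each sequence once and share the result across all
# downstream binary classifiers, instead of re-embedding per model.
calls = {"embed": 0}

def embed(sequence):
    """Stand-in for an expensive ESM forward pass."""
    calls["embed"] += 1
    return (len(sequence), sequence.count("A") / len(sequence))

def make_model(threshold):
    """Stand-in for one trained binary classifier over embeddings."""
    return lambda emb: 1.0 if emb[1] > threshold else 0.0

models = {"AB": make_model(0.05), "ACE": make_model(0.2)}
sequences = ["YYSGLVTDSA", "HLRTHTGECA"]

# Compute embeddings once...
embeddings = {seq: embed(seq) for seq in sequences}

# ...then score every model against the cached embeddings.
predictions = {
    name: {seq: model(embeddings[seq]) for seq in sequences}
    for name, model in models.items()
}
# embed() ran len(sequences) times, not len(sequences) * len(models).
```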

output:
tsv=OUTPUT_DIR / "annotation/autopeptideml/autopeptideml_{autopeptideml_model_name}.tsv",
params:
modelsdir=INPUT_DIR / "models/autopeptideml/HPO_NegSearch_HP/",
@keithchev (Member) commented:
I think that params can contain wildcards, so this could be the full path to the model, which imo would be clearer and would help make the command shorter.

@taylorreiter (Member Author) replied:

It has to be a lambda function for params to contain a wildcard (at least historically, unless this has changed), and when I tried to write it as a lambda function it gave an error about mixing a path and a string :(

See an example of a lambda function in params referencing a wildcard here: https://github.com/Arcadia-Science/prehgt/blob/9a99b641c0130ba05c3608a71b976040e81e4579/Snakefile#L117

@keithchev (Member) replied:

oh, right. I remember this now lol. I wonder if this was changed in snakemake 8?

fwiw I'd bet that error was because pathlib.Path objects can't be added to strings (e.g. you have to write INPUT_DIR / "dir" and not INPUT_DIR + "/dir").

but in any case, very much nbd in this context, imo.
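For what it's worth, the pathlib behavior in question can be demonstrated directly (directory names here are illustrative, not from the repo):

```python
from pathlib import Path

INPUT_DIR = Path("inputs")

# Path objects support "/" for joining...
models_dir = INPUT_DIR / "models/autopeptideml"

# ...but "+" with a plain string raises TypeError, which is the likely
# source of the path-plus-string error inside the params lambda.
try:
    INPUT_DIR + "/models/autopeptideml"
    concat_ok = True
except TypeError:
    concat_ok = False

# An f-string (or str()) sidesteps the issue inside a lambda:
joined = f"{INPUT_DIR}/models/autopeptideml"
```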

@taylorreiter (Member Author) replied:

  • should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.

The model folders are ~400 MB, so I didn't upload them here. My hope is that the person who shared them with me will make them available for download soon, and I'll incorporate them into the pipeline with a download link then. My plan is to punt on hosting them anywhere until this happens, but I put some comments into the snakefile as reminders to do that. If the authors don't make them available for download soon, I'll put them on OSF for download.

  • the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).

This is a really good point. I'll make an issue for this. It might be worth just running all twelve models in the same script (I think it probably is, but I'll make an issue and think on it more!)

  • we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.

Also a great point, I'll add it to the issue.

taylorreiter and others added 2 commits February 23, 2024 13:16
Co-authored-by: Keith Cheveralls <[email protected]>
Signed-off-by: Taylor Reiter <[email protected]>
@taylorreiter taylorreiter merged commit 57f1bad into main Feb 23, 2024
2 checks passed
@taylorreiter taylorreiter deleted the ter/autopeptideml branch February 23, 2024 18:59