
Add rule to classify peptide bioactivity with the autopeptideml tool #10

Merged: 7 commits into main from ter/autopeptideml on Feb 23, 2024

Conversation


@taylorreiter (Member) commented Feb 21, 2024

PR checklist

  • Describe the changes you've made.
  • Describe any tests you have conducted to confirm that your changes behave as expected.
  • If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
  • If you've added new functionality, make sure that the documentation is updated accordingly.
  • If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR Description

This PR adds a rule to run the binary classifier AutoPeptideML. I chose to use the models that the authors trained for their preprint; however, as noted in a docstring, we could instead use the labels in the peptipedia database, train new models in a separate snakefile (like the nrps one), and then make them available for download. I prefer using the models they built for their preprint because they and other experts put thought into the labels and use cases.

The models were supplied to me by the author of the paper via email. They said they are working on a solution to make them available for download, so I added a TODO item to the rule so we can switch to downloading them once that happens.

The output of the script looks like this (first few lines), where the AB column is the name of the model and the value is that model's prediction for the bioactivity.

ID      sequence        AB
Transcript_1000626.p1_NONRIPP_49_105_nlpprecursor       YYSGLVTDSRNMQGTVIKRKRQVKRCLAKVRTNKCVCLCQQRIVLQRCAATTFPSL        0.6666666666666666
Transcript_0.p1_CLASS_I_LANTIPEPTIDE_134_180_nlpprecursor       HLRTHTGECPYKCDHCDSSFFEKGNLKQHPCTHTGERPYKCDHCDS  0.3333333333333333
Transcript_100036.p2_NONRIPP_55_96_nlpprecursor RSVAEGTTLTPWKERKKAAAIVFASKRFPHLSAHSFLLPPP       0.3333333333333333

Testing

The changes run successfully on the demo data set and I confirmed that pytorch can find the GPU in the snakemake-built conda environment.

Documentation

punt again...but getting very close to actually doing this!

next PR

My next PR will clean up some of the issues with peptide header names and collect all of the annotation information produced since the peptipedia PR.

Update

I'm working on a summary script to put together all of the annotation data, which I'm in part hoping to use to determine whether a peptide is real or not. As part of this, I was looking at the autopeptideml predictions, and they look something like this:

ID sequence AB ACE ACP AF AMAP AMP AOX APP AV BBP DPPIV MRSA Neuro QS TOX TTCA total
Transcript_1000463.p1_start95_end131 HNLIAESTIGAALAVMEAMQTTYAVRGKLVVLGTPA 0.33 0.33 0.66 0 0 0.66 1 0.33 1 1 0.33 0 1 0.33 1 0.66 8.67
Transcript_100028.p1_start77_end112 LRGQSLGSVAFLDTASAYPLVDSTAGLHVSAIAPV 0 0.33 0.33 0 0 0.33 1 0 1 1 0.33 0 1 0.33 0.66 1 7.33
Transcript_1001336.p1_start33_end79 GEVGETEDLEVLASFRVSSYLVSPVIAEDSFHVTSQATSLGAAATR 0 0.66 0 0 0.33 0.33 1 0 1 1 0.33 0 1 0 1 0.66 7.33
Transcript_1000535.p1_start68_end92 MFSSNRGTVPVSLDMPFQVVRQVD 0 0.66 0 0 0 0.33 1 0 0.66 0.66 0.66 0 1 0.33 0.66 1 7
Transcript_1000655.p1_start55_end108 SYVRKLCFPEGNPVLDVEDLKHGGHYVALLPHESFKKPSSKIPNNYMRTYETL 0 0.66 0 0 0 0 1 0.33 0.66 1 0.66 0 1 0.66 0.66 0.33 7
Transcript_1.p1_start84_end120 DHIRIHTGEKPYHCHLCPMAFAQNSGLYHHLRRHKN 0.33 0 0 1 0 1 1 1 0.33 1 0 0 0 0 1 0 6

This feels somewhat concerning: there are so many predictions that the tool is certainly over-predicting. Since this isn't a labelled dataset (it's just the first 200 rows of transcripts from the Amblyomma transcriptome), we don't know the ground truth here. However, imagine being presented with this information...what do you do with it?? I was sort of hoping that there wouldn't be quite so much over-prediction, so that we could use this information as a filter for peptides that are more likely to be real. I don't think we can do that now, but I do think this is still worth including.

I'm going to start an issue on thinking through how to filter down to peptides that are potentially real.
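As a side note, the `total` column in the table above is just the sum of the per-bioactivity scores, and that sum is the kind of quantity a crude first-pass filter could threshold on. A minimal sketch with a couple of abridged rows from the table (only a subset of the model columns; the cutoff value is hypothetical, not from the pipeline):

```python
import csv
import io

# Two rows abridged from the table above (a subset of the model columns).
tsv = """ID\tAB\tACE\tACP\tAF
Transcript_1000463.p1_start95_end131\t0.33\t0.33\t0.66\t0
Transcript_1.p1_start84_end120\t0.33\t0\t0\t1
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
score_cols = [c for c in rows[0] if c != "ID"]

# Total score per peptide: the sum of the per-model predictions.
totals = {r["ID"]: sum(float(r[c]) for c in score_cols) for r in rows}

# A crude filter: drop peptides whose summed score suggests over-prediction.
CUTOFF = 8.0  # hypothetical threshold
kept = [pid for pid, total in totals.items() if total < CUTOFF]
```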

@taylorreiter taylorreiter marked this pull request as draft February 21, 2024 21:34
@keithchev (Member) left a comment:
Made a few inline comments but nothing major.

Some questions/comments:

  • should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.
  • the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).
  • we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.
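On the embedding point: one way to avoid re-generating the ESM embeddings per model would be to compute each sequence's embedding once and reuse it across all classifiers. A minimal sketch of that caching pattern (the `embed` and `make_model` functions here are illustrative stand-ins, not the autopeptideml API):

```python
# Sketch: embed each sequence once and share the result across all
# downstream binary classifiers, instead of re-embedding per model.
calls = {"embed": 0}

def embed(sequence):
    """Stand-in for an expensive ESM forward pass."""
    calls["embed"] += 1
    return (len(sequence), sequence.count("A") / len(sequence))

def make_model(threshold):
    """Stand-in for one trained binary classifier over embeddings."""
    return lambda emb: 1.0 if emb[1] > threshold else 0.0

models = {"AB": make_model(0.05), "ACE": make_model(0.2)}
sequences = ["YYSGLVTDSA", "HLRTHTGECA"]

# Compute embeddings once...
embeddings = {seq: embed(seq) for seq in sequences}

# ...then score every model against the cached embeddings.
predictions = {
    name: {seq: model(embeddings[seq]) for seq in sequences}
    for name, model in models.items()
}
# embed() ran len(sequences) times, not len(sequences) * len(models).
```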

output:
tsv=OUTPUT_DIR / "annotation/autopeptideml/autopeptideml_{autopeptideml_model_name}.tsv",
params:
modelsdir=INPUT_DIR / "models/autopeptideml/HPO_NegSearch_HP/",
@keithchev (Member) commented:
I think that params can contain wildcards, so this could be the full path to the model, which imo would be clearer and would help make the command shorter.

@taylorreiter (Member Author) replied:

It has to be a lambda function for params to contain a wildcard (at least historically, unless this has changed), and when I tried to write it as a lambda function it gave an error about mixing a path and a string :(

See an example of a lambda function in params referencing a wildcard here: https://github.com/Arcadia-Science/prehgt/blob/9a99b641c0130ba05c3608a71b976040e81e4579/Snakefile#L117

@keithchev (Member) replied:

oh, right. I remember this now lol. I wonder if this was changed in snakemake 8?

fwiw I'd bet that error was because pathlib.Path objects can't be added to strings (e.g. you have to write INPUT_DIR / "dir" and not INPUT_DIR + "/dir").

but in any case, very much nbd in this context, imo.
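For what it's worth, the pathlib behavior in question can be demonstrated directly (directory names here are illustrative, not from the repo):

```python
from pathlib import Path

INPUT_DIR = Path("inputs")

# Path objects support "/" for joining...
models_dir = INPUT_DIR / "models/autopeptideml"

# ...but "+" with a plain string raises TypeError, which is the likely
# source of the path-plus-string error inside the params lambda.
try:
    INPUT_DIR + "/models/autopeptideml"
    concat_ok = True
except TypeError:
    concat_ok = False

# An f-string (or str()) sidesteps the issue inside a lambda:
joined = f"{INPUT_DIR}/models/autopeptideml"
```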

@taylorreiter (Member Author) replied:

  • should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.

The model folders are ~400 MB, so I didn't upload them here. My hope is that the person who shared them with me will make them available for download soon, and I'll incorporate them into the pipeline with a download link then. My plan is to punt on hosting them anywhere until this happens, but I put some comments into the snakefile as reminders to do that. If the authors don't make them available for download soon, I'll put them on OSF for download.

  • the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).

This is a really good point. I'll make an issue for this. It might be worth just running all twelve models in the same script (I think it probably is, but I'll make an issue and think on it more!)

  • we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.

Also a great point, I'll add it to the issue.

taylorreiter and others added 2 commits February 23, 2024 13:16
Co-authored-by: Keith Cheveralls <[email protected]>
Signed-off-by: Taylor Reiter <[email protected]>
@taylorreiter taylorreiter merged commit 57f1bad into main Feb 23, 2024
2 checks passed
@taylorreiter taylorreiter deleted the ter/autopeptideml branch February 23, 2024 18:59