Add rule to classify peptide bioactivity with the autopeptideml tool #10
Conversation
Made a few inline comments but nothing major.
Some questions/comments:
- should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.
- the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).
- we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.
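One way to keep Snakemake from launching all twelve model jobs on the GPU at once is to declare a custom resource and cap it at runtime. This is only a sketch (the rule and resource names are placeholders, not from this PR):

```
rule run_autopeptideml:
    resources:
        gpu=1  # each job claims the single GPU
    ...

# then invoke with:
#   snakemake --resources gpu=1
# so GPU-claiming jobs run one at a time
```

This serializes the GPU jobs at the scheduler level rather than relying on CUDA to multiplex them.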
output:
    tsv=OUTPUT_DIR / "annotation/autopeptideml/autopeptideml_{autopeptideml_model_name}.tsv",
params:
    modelsdir=INPUT_DIR / "models/autopeptideml/HPO_NegSearch_HP/",
I think that params can contain wildcards, so this could be the full path to the model, which imo would be clearer and would help make the command shorter.
it has to be a lambda function for params to contain a wildcard (at least historically, unless this changed), and when I tried to write it as a lambda function it gave an error about mixing a path and a string :(
see example of lambda function in params to ref a wildcard here: https://github.com/Arcadia-Science/prehgt/blob/9a99b641c0130ba05c3608a71b976040e81e4579/Snakefile#L117
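For concreteness, here's a sketch of the lambda pattern (the rule context is omitted; the `str()` wrapper is my guess at how to sidestep the path-vs-string error, and the simulated wildcards object just stands in for what Snakemake passes at runtime):

```python
from pathlib import Path
from types import SimpleNamespace

INPUT_DIR = Path("inputs")

# a lambda lets params reference a wildcard; wrapping the joined Path in
# str() avoids mixing a pathlib.Path with Snakemake's string handling
modelsdir = lambda wildcards: str(
    INPUT_DIR / "models/autopeptideml/HPO_NegSearch_HP" / wildcards.autopeptideml_model_name
)

# simulate the wildcards object Snakemake would pass in
wc = SimpleNamespace(autopeptideml_model_name="AB")
print(modelsdir(wc))  # inputs/models/autopeptideml/HPO_NegSearch_HP/AB
```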
oh, right. I remember this now lol. I wonder if this was changed in snakemake 8?
fwiw I'd bet that error was because pathlib.Path objects can't be added to strings (e.g. you have to write INPUT_DIR / "dir" and not INPUT_DIR + "/dir").
but in any case, very much nbd in this context, imo.
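A quick illustration of the Path-vs-string behavior described above (directory names are arbitrary):

```python
from pathlib import Path

INPUT_DIR = Path("inputs")

# the / operator joins a Path and a string just fine:
ok = INPUT_DIR / "models/autopeptideml"

# but + does not -- it raises TypeError:
try:
    INPUT_DIR + "/models/autopeptideml"
    raised = False
except TypeError:
    raised = True

print(ok, raised)  # inputs/models/autopeptideml True
```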
The model folders are ~400mb, so I didn't upload them here. My hope is that the person who shared them with me will make them available for download soon, and I'll incorporate them with a download link in the pipeline then. My plan is to punt on putting them anywhere until this happens, but I put some comments into the snakefile as reminders to do that. If the authors don't make them available for download soon, I'll put them on OSF for download.
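Whenever the hosting gets sorted out, the download step could look something like this sketch (the URL is a deliberate placeholder, and the rule name and archive layout are assumptions):

```
rule download_autopeptideml_models:
    output:
        directory(INPUT_DIR / "models/autopeptideml/HPO_NegSearch_HP")
    shell:
        """
        # placeholder URL -- swap in the real OSF/S3 link once available
        curl -JLo models.tar.gz <MODELS_URL>
        tar -xzf models.tar.gz -C {output}
        """
```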
This is a really good point. I'll make an issue for this. It might be worth just running all twelve models in the same script (I think it probably is, but I'll make an issue and think on it more!)
Also a great point, I'll add it to the issue.
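If the twelve models do end up in one script, the embedding reuse could follow a simple cache pattern. This is a generic sketch of the idea only: `embed_batch` is a hypothetical stand-in for the real ESM embedding call, not autopeptideml's API.

```python
def embed_batch(sequences):
    # hypothetical stand-in: real code would run the ESM model here
    return {seq: [float(len(seq))] for seq in sequences}

class EmbeddingCache:
    """Compute each sequence's embedding once, reuse it across all models."""

    def __init__(self):
        self._cache = {}

    def get(self, sequences):
        missing = [s for s in sequences if s not in self._cache]
        if missing:
            self._cache.update(embed_batch(missing))
        return [self._cache[s] for s in sequences]

cache = EmbeddingCache()
peptides = ["MKTAY", "GLFDV"]
emb_first = cache.get(peptides)   # embeddings computed here
emb_again = cache.get(peptides)   # served from cache for the other models
```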
Co-authored-by: Keith Cheveralls <[email protected]> Signed-off-by: Taylor Reiter <[email protected]>
PR checklist
conda environments.

PR Description
This PR adds a rule to run the binary classifier AutoPeptideML. I chose to use the models that the authors trained in their preprint; however, as noted in a docstring, we could instead use the labels in the peptipedia database, train new models in a separate snakefile (like the nrps one), and then make them available for download. I prefer using the models they built in their preprint because they and other experts put thought into the labels and use cases.
The models were supplied to me by the author of the paper via email. They said they are working on a solution to make them available/downloadable, so I added a TODO item to the rule to add a download step when I can.
The output of the script looks like this (first few lines), where the AB column is the name of the model and the value is the prediction of that bioactivity.

Testing
The changes run successfully on the demo data set and I confirmed that pytorch can find the GPU in the snakemake-built conda environment.
Documentation
punt again...but getting very close to actually doing this!
next PR
My next PR will clean up some of the issues with peptide header names and collect all of the annotation information produced since the peptipedia PR.
Update
I'm working on a summary script to put together all of the annotation data, which I'm in part hoping to use to determine whether a peptide is real or not. As part of this, I was looking at the autopeptideml predictions, and they look something like this:
This feels somewhat concerning: there are so many positive predictions that the models are certainly over-predicting. Since this isn't a labelled dataset (it's just the first 200 rows of transcripts from the Amblyomma transcriptome), we don't know the ground truth here. However, imagine being presented with this information...what do you do with it?? I was sort of hoping that there wouldn't be quite so much overprediction, so that we could use this information as a filter for peptides that are more likely to be real. I don't think we can do that now, but I do think this is still worth including.
I'm going to start an issue on thinking through how to filter down to peptides that are potentially real.