-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fixes from first run #23
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! one minor comment.
|
||
args = parser.parse_args() | ||
|
||
allowed_amino_acids = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is hard-coded I'd add a comment explain what this list corresponds to (e.g. whether this is the standard 20 amino acids).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this is a variable and not a constant that is documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not totally -- I was thinking we could substitute selanocysteine for just cysteine or something in the future if we wanted to, and then we could change the variable to have e.g. "U" in it, but I didn't think it through that much. I think it just happened? I can change it if it's better as a hard coded constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merging this PR now, please open an issue if you think this should be switched to a documented constant!
PR checklist
conda
environments.PR description
This PR addresses two bugs that came up for me when running this pipeline on real data (I'm done with t. urticae and human).
The first bug is in the combine script at the end of the pipeline. It deals with an edge case where there are multiple blast hits against peptipedia and reduces it down to 1. It then allows many matches against deepsig bc I found that peptides could be chains and signals. I also decided to split the peptide predictions out from the peptide annotations (as outline in #16) bc deepsig and peptipedia can produce many valid annotations, but this way each peptide only occurs once in the predictions dataframe.
The next bug is much bigger. When running on human data, I found non-standard amino acids broke nlpprecursor. I added a script to filter those out before running nlpprecursor. However, while debugging, I noticed that the two predict scripts do not output predictions in the same order as I had assumed they do 😱
Here is an example output:
My changes to the nlpprecursor script deal with this. I've confirmed that they work correctly on my test data set now.
Also, I altered the script so the second predict only runs on positive classification (not on NONRIPP) to save on run time.
I confirm that these changes run on my demo data, and that the nlpprecursor script runs on a more specialized test data set (see below so it's recorded somewhere).
No new software added!
should produce this tsv: