Build a new model for the RNAsamba tool to use for predicting coding vs. noncoding RNAs #1
Merged
neevor
reviewed
Feb 9, 2024
Some of my questions might already be answered in some of the other documentation so sorry if that's the case. I'm still reading over it all.
Co-authored-by: Erin <[email protected]> Signed-off-by: Taylor Reiter <[email protected]>
…/peptigate into ter/build-rnasamba-model
neevor
approved these changes
Feb 12, 2024
neevor
reviewed
Feb 13, 2024
keithchev
approved these changes
Feb 13, 2024
PR Description
This PR creates a snakefile that records the data curation and commands I ran to build a new RNAsamba model to use for predicting coding vs. noncoding RNA. RNAsamba provides a model that was trained on human coding vs. noncoding transcripts. A recent benchmarking paper showed that RNAsamba accuracy increased on non-human transcripts when it used a model trained on diverse model organisms instead of just human. The paper did not make these models available, and the methods were somewhat vague. However, I think the authors used training data from the CPPred tool to train this model. The methods in that paper were again somewhat vague about exactly how the authors curated their training data, so I decided to re-create the model and record all of the commands that I ran to produce it. I also validated this model using a validation set produced by the benchmarking paper linked above.
The model is based on Ensembl transcripts, which are pre-labeled as coding or noncoding. I removed homology between the training, testing, and validation sets by clustering all sequences at 80% similarity. I also downsampled the input data so that there were balanced classes in both the training and testing data.
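As a rough illustration of the class-balancing step, here is a minimal sketch of downsampling every class to the size of the smallest one. It is not the code in the snakefile: the function name, the `(sequence_id, label)` record shape, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def downsample_to_balance(records, seed=42):
    """Downsample so every label has as many records as the rarest label.

    `records` is a list of (sequence_id, label) tuples. This helper and its
    input shape are hypothetical; they are not taken from the snakefile.
    """
    # Group sequence IDs by their label (e.g. "coding" / "noncoding").
    by_label = defaultdict(list)
    for seq_id, label in records:
        by_label[label].append(seq_id)

    # The smallest class sets the per-class sample size.
    n_per_class = min(len(ids) for ids in by_label.values())

    # A seeded generator keeps the sampling reproducible between runs.
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for seq_id in rng.sample(by_label[label], n_per_class):
            balanced.append((seq_id, label))
    return balanced
```

Sampling without replacement per class keeps the majority class from dominating training; the same call could be applied separately to the training and testing sets.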
The model is ~90% accurate on the testing dataset, but only 11% accurate on the validation dataset. This is expected behavior: the validation dataset is intentionally composed of tricky coding/noncoding transcripts, in particular many short open reading frames. In the benchmarking paper, most tools had <10% accuracy on this dataset. I plan to use this model to classify coding vs. noncoding transcripts for long transcripts only (>300 nt). For short transcripts, I am going to create a new tool (smallesm) to predict these, and also train an RNAsamba classifier specifically on short transcripts to see how the two perform.
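To make the long vs. short routing concrete, here is a minimal sketch of splitting transcripts at the 300 nt cutoff. The function name and the dict-of-sequences input are assumptions for illustration, and treating exactly 300 nt as "short" is my reading of ">300nt"; peptigate may handle this differently.

```python
LONG_TRANSCRIPT_MIN_NT = 300  # cutoff from the description; transcripts must be strictly longer


def split_by_length(transcripts, threshold=LONG_TRANSCRIPT_MIN_NT):
    """Split {name: sequence} into (long, short) dicts by sequence length.

    Hypothetical helper: long transcripts (> threshold nt) would go to the
    RNAsamba model, short ones to a separate short-transcript classifier.
    """
    long_transcripts = {}
    short_transcripts = {}
    for name, sequence in transcripts.items():
        if len(sequence) > threshold:
            long_transcripts[name] = sequence
        else:
            short_transcripts[name] = sequence
    return long_transcripts, short_transcripts
```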
I don't expect many people to run this snakefile again. Its primary purpose is to document my data curation steps and how I ran the model. Most people will use the peptigate tool itself (which is still under construction), using the models that are trained in that tool.
I plan to document all of this in a separate documentation PR once peptigate is a bit further along in development.
Request for feedback: The model is 8.5 MB. Should I add it to this GitHub repository, create an OSF repo and upload it there, or place it on Zenodo? If either of the last two options is best, I'll document the model locations in the README.