Build a new model for the RNAsamba tool to use for predicting coding vs. noncoding RNAs #1

Merged
taylorreiter merged 35 commits into main from ter/build-rnasamba-model
Feb 14, 2024


taylorreiter
Member

PR checklist

  • Describe the changes you've made.
  • Describe any tests you have conducted to confirm that your changes behave as expected.
  • If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
  • If you've added new functionality, make sure that the documentation is updated accordingly.

PR Description

This PR creates a snakefile that records the data curation and commands I ran to build a new RNAsamba model for predicting coding vs. noncoding RNA. RNAsamba provides a model that was trained on human coding vs. noncoding transcripts. A recent benchmarking paper showed that RNAsamba accuracy increased on non-human transcripts when it used a model trained on diverse model organisms instead of just human. The paper did not make these models available, and the methods were somewhat vague. However, I think the authors used training data from the CPPred tool to train this model. The methods in the CPPred paper were also somewhat vague about exactly how the authors curated their training data, so I decided to re-create the model and record all of the commands that I ran to produce it. I also validated this model using a validation set produced by the benchmarking paper linked above.

The model is based on Ensembl transcripts, which are pre-labeled as coding or non-coding. I removed homology between training, testing, and validation sets by clustering all sequences at 80% similarity. I also downsampled the input data so that there were balanced classes in both the training and testing data.
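As a rough illustration of the class-balancing step described above, here is a minimal Python sketch. It is not the code in this PR (the actual processing lives in the snakefiles and the R script); the function name `downsample_to_balance`, the `(sequence_id, label)` input shape, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def downsample_to_balance(records, seed=0):
    """Randomly downsample so that every label has the same number of records.

    `records` is a list of (sequence_id, label) tuples, e.g. ("tx1", "coding").
    This input shape is hypothetical and chosen only for illustration.
    """
    # Group sequence IDs by their class label.
    by_label = defaultdict(list)
    for seq_id, label in records:
        by_label[label].append(seq_id)

    # The smallest class sets the target size for every class.
    target = min(len(ids) for ids in by_label.values())

    # A seeded generator keeps the subsample reproducible.
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for seq_id in rng.sample(by_label[label], target):
            balanced.append((seq_id, label))
    return balanced
```

Applied separately to the training and testing sets, this would leave each with equal numbers of coding and noncoding sequences.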

The model is ~90% accurate on the testing dataset, but only 11% accurate on the validation dataset. This is expected behavior. The validation dataset is intentionally composed of tricky cases of coding/noncoding transcripts, in particular many short open reading frames. In the benchmarking paper, most tools had <10% accuracy on this dataset. I plan to use this model to classify coding vs. noncoding transcripts for long transcripts only (>300 nt). For short transcripts, I am going to create a new tool (smallesm) to predict these, and also train an RNAsamba classifier specifically on short transcripts, then see how the two perform.
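The length-based routing described above could look something like the following Python sketch. The helper name `split_by_length` and the dictionary input are hypothetical; only the 300 nt threshold comes from the description.

```python
def split_by_length(transcripts, min_length=300):
    """Split transcripts into long (> min_length nt) and short (<= min_length nt).

    `transcripts` maps a transcript name to its nucleotide sequence.
    The name and input shape are assumptions made for this example.
    """
    long_transcripts = {}
    short_transcripts = {}
    for name, sequence in transcripts.items():
        if len(sequence) > min_length:
            # Candidates for the RNAsamba model built in this PR.
            long_transcripts[name] = sequence
        else:
            # Left for a separate short-transcript classifier.
            short_transcripts[name] = sequence
    return long_transcripts, short_transcripts
```

Using a strict `>` comparison mirrors the ">300 nt" wording, so a transcript of exactly 300 nt is treated as short.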

I don't expect many people to run this snakefile again. Its primary purpose is to document my data curation steps and how I ran the model. Most people will use the peptigate tool itself (which is still under construction), using the models that are trained in that tool.

I plan to document all of this in a separate documentation PR once peptigate is a bit further along in dev.

Request for feedback: The model is 8.5 MB. Should I add it to this GitHub repository, create an OSF repo and upload it there, or place it on Zenodo? If either of the last two options is best, I'll document the model location in the README.

Contributor

@neevor neevor left a comment


Some of my questions might already be answered in some of the other documentation so sorry if that's the case. I'm still reading over it all.

Resolved review threads (outdated): build_rnasamba_euk_model.snakefile (5), scripts/process_sequences_into_nonoverlapping_sets.R (1)
Co-authored-by: Erin <[email protected]>
Signed-off-by: Taylor Reiter <[email protected]>
Resolved review threads (outdated): curate_datasets.snakefile (5), envs/rnasamba.yml (1), scripts/process_sequences_into_nonoverlapping_sets.R (2)
@taylorreiter taylorreiter merged commit 45df8f9 into main Feb 14, 2024
2 checks passed
@taylorreiter taylorreiter deleted the ter/build-rnasamba-model branch February 14, 2024 13:46
3 participants