Build a new model for the RNAsamba tool to use for predicting coding vs. noncoding RNAs #1

taylorreiter · 2024-02-08T15:23:44Z

PR checklist

Describe the changes you've made.
Describe any tests you have conducted to confirm that your changes behave as expected.
If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
If you've added new functionality, make sure that the documentation is updated accordingly.

PR Description

This PR creates a snakefile that records the data curation steps and commands I ran to build a new RNAsamba model for predicting coding vs. noncoding RNA. RNAsamba provides a model that was trained on human coding vs. noncoding transcripts. A recent benchmarking paper showed that RNAsamba accuracy increased on non-human transcripts when it used a model trained on diverse model organisms instead of just human. The paper did not make these models available, and the methods were somewhat vague. However, I think the authors used training data from the CPPred tool to train this model. The methods in that paper were again somewhat vague about exactly how the authors curated their training data, so I decided to re-create the model and record all of the commands that I ran to produce it. I also validated this model using a validation set produced by the benchmarking paper mentioned above.

The model is based on Ensembl transcripts, which are pre-labeled as coding or noncoding. I removed homology between the training, testing, and validation sets by clustering all sequences at 80% similarity. I also downsampled the input data so that there were balanced classes in both the training and testing data.
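As a rough illustration of the two curation ideas above (keeping homologous sequences together and balancing classes), here is a minimal Python sketch. It is not the code in `scripts/process_sequences_into_nonoverlapping_sets.R`; the function names, the `cluster_of` and `label_of` mappings, the split fractions, and the seed are all assumptions made for the example. It assumes clustering has already been run and that each sequence ID maps to one cluster ID.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions, seed=42):
    """Assign whole clusters to named sets so no cluster spans two sets.

    cluster_of: dict mapping sequence ID -> cluster ID (hypothetical input,
                e.g. parsed from the output of a clustering tool).
    fractions:  dict mapping set name -> fraction of clusters; the last
                set listed receives whatever clusters remain.
    """
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Shuffle cluster IDs reproducibly, then carve them into consecutive slices.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    sets = {}
    names = list(fractions)
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(cluster_ids)
        else:
            end = start + int(len(cluster_ids) * fractions[name])
        sets[name] = [s for c in cluster_ids[start:end] for s in members[c]]
        start = end
    return sets


def downsample_to_balance(seq_ids, label_of, seed=42):
    """Randomly drop sequences from larger classes until all classes are equal in size.

    Requires at least one sequence in seq_ids.
    """
    by_label = defaultdict(list)
    for seq_id in seq_ids:
        by_label[label_of[seq_id]].append(seq_id)

    target = min(len(ids) for ids in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced
```

Splitting on cluster IDs rather than on individual sequences is what keeps near-identical transcripts from landing in both the training and testing data. Balancing is then applied within each set separately.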

The model is ~90% accurate on the testing dataset, but only 11% accurate on the validation dataset. This is expected behavior: the validation dataset is intentionally composed of tricky coding/noncoding transcripts, in particular many short open reading frames. In the benchmarking paper, most tools had <10% accuracy on this dataset. I plan to use this model to classify coding vs. noncoding transcripts for long transcripts only (>300 nt). For short transcripts, I am going to create a new tool (smallesm) to predict these, as well as train a short RNAsamba classifier specifically for them, and see how these perform.
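To make the accuracy figures and the length cutoff concrete, here is a small Python sketch. It is only an illustration: the function names and the dictionary-of-sequences input are assumptions, not part of the snakefile or of RNAsamba. Accuracy here means the fraction of transcripts whose predicted label matches the true label, and the 300 nt threshold follows the ">300 nt" rule stated above.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def route_by_length(sequences, min_long_nt=300):
    """Split {sequence ID: nucleotide string} into long and short transcripts.

    Transcripts strictly longer than min_long_nt go to the long set (the
    ones this model is meant for); everything else goes to the short set.
    """
    long_seqs = {k: v for k, v in sequences.items() if len(v) > min_long_nt}
    short_seqs = {k: v for k, v in sequences.items() if len(v) <= min_long_nt}
    return long_seqs, short_seqs
```

For example, `route_by_length({"t1": "A" * 301, "t2": "A" * 300})` places `t1` in the long set and `t2` in the short set, since the cutoff is strict.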

I don't expect many people to run this snakefile again. Its primary purpose is to document my data curation steps and how I ran the model. Most people will use the peptigate tool itself (which is still under construction), using the models that are trained in that tool.

I plan to document all of this in a separate documentation PR once peptigate is a bit further along in development.

Request for feedback: The model is 8.5 MB. Should I add it to this GitHub repository, create an OSF repo and upload it there, or place it on Zenodo? If either of the last two options is best, I'll document the model location in the README.

neevor

Some of my questions might already be answered in some of the other documentation so sorry if that's the case. I'm still reading over it all.

build_rnasamba_euk_model.snakefile

scripts/process_sequences_into_nonoverlapping_sets.R

curate_datasets.snakefile

envs/rnasamba.yml

taylorreiter added 12 commits February 2, 2024 13:51

add ensembl genome links for rnasamba model building

4379976

check in cluster all and dumb split code

5c0cde3

prelim processing into sets finished (still need to refactor and add stats)

28477ff

add summary stats

bf20a00

swap out wildcard names to be more descriptive

d20938a

simplify output of set creation

6a759e8

add rule to build rnasamba model

b0aba84

bump snakemake version and add pandas to dev

d2ef62a

add early stopping epochs, typos

7eca248

add rules and code to assess accuracy of new RNAsamba model

e56261b

missing eof new line

198fff9

rm comments

f50f19c

taylorreiter requested review from neevor and keithchev February 8, 2024 17:56

taylorreiter added 3 commits February 8, 2024 23:13

add in comparison to existing human model to show improvement

6989b64

diversify snakefmt file endings for CI

3dcdf2b

update snakefmt file endings and run linting locally

aa15c46

neevor reviewed Feb 9, 2024

Apply suggestions from code review

12985a8

Co-authored-by: Erin <[email protected]> Signed-off-by: Taylor Reiter <[email protected]>

taylorreiter requested a review from neevor February 12, 2024 13:10

taylorreiter added 2 commits February 12, 2024 13:13

indentation

b31317f

Merge branch 'ter/build-rnasamba-model' of github.com:Arcadia-Science/peptigate into ter/build-rnasamba-model

93f1666

neevor approved these changes Feb 12, 2024

taylorreiter added 6 commits February 13, 2024 09:10

sample with replacement to augment noncoding to coding numbers

eaee034

add new test data set links

0a3c232

update train and test sets to be different species

8972760

update pthas

55cc055

linting

aace13f

try update rnasamba env for gpu

b54f0cb

neevor reviewed Feb 13, 2024

curate_datasets.snakefile (outdated, resolved)

taylorreiter added 6 commits February 13, 2024 18:57

deal with rnasamba install for gpu

6879d04

update file pointers for data processing

a073c13

linting

98c1a98

fix typos

43e2eae

add a benchmark for model building

f614cbd

missing new line eof

1a7d863

keithchev approved these changes Feb 13, 2024

taylorreiter added 5 commits February 14, 2024 00:52

woops filepath typo

3e24ad5

clean up versions around rnasamba env

f97ade0

clean up class weights comment

ee3c279

add note about order of arguments

dcb8b2a

change snakefile name since we were able to keep rnasamba build commands

1287e5c

taylorreiter merged commit 45df8f9 into main Feb 14, 2024
2 checks passed

taylorreiter deleted the ter/build-rnasamba-model branch February 14, 2024 13:46
