Change the input files to only take one transcriptome assembly file instead of two #47

taylorreiter · 2024-06-05T15:22:24Z

PR checklist

Tag the issue(s) or milestones this PR fixes (e.g. Fixes #123, Resolves #456).
Describe the changes you've made.
Describe any tests you have conducted to confirm that your changes behave as expected.
If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
If you've added new functionality, make sure that the documentation is updated accordingly.
If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR description

This PR addresses #42 (Consider requiring one file as input for the transcriptome assembly instead of splitting between short and long transcripts).
It refactors the code and input files, and rewrites the relevant documentation, so that the pipeline takes only one transcriptome input file.
I ran the changes through the demo data and confirmed we got the same results.

keithchev

Just a couple minor comments/edits for clarity.

README.md

config.yml

keithchev · 2024-06-05T20:09:18Z

demo/README.md

-We pulled the "short contigs" file from an internal S3 bucket.
-It contains contigs that were filtered from the Amblyomma transcriptome prior to txome merging.
+We also pulled short contigs (less than 75 bp) from an internal S3 bucket and added these contigs to the `contigs.fa` file (50 contigs).
+These are contigs that were filtered from the *Amblyomma* transcriptome prior to transcriptome merging (mid assembly pipeline).


Consider clarifying what "mid assembly pipeline" means (afaict, it's also not clear from this readme what pipeline this is referring to)

keithchev · 2024-06-05T20:10:12Z

demo/config.yml

+#   ORFs should be predicted from the same transcriptome assembly as the "contigs" input file.
+#   ORFs should have the same name (before the first period in the name) as the contigs in the
+#   "contigs" input file. TransDecoder provides files in the proper format.
+#   Used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases, and to
+#   remove coding transcripts from the transcriptome assembly before sORF prediction.
+# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Should contain the same ORFs as
+#   "orfs_amino_acids" but in nucleotide format. TransDecoder also provides this file in the proper
+#   format. If this file contains short ORFs (< 300 nucleotides), they will not be reported as sORFs
+#   as they are already annotated in the input.
+# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs
+#   are coding or non-coding.
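
For illustration, the naming rule in the comments above (each ORF name, up to its first period, should match a contig name) could be checked before running the pipeline. This is a minimal sketch, not part of this PR or the pipeline: the function names `read_fasta_names` and `orfs_without_contig` and the example record names are hypothetical.

```python
def read_fasta_names(lines):
    """Return record names (the first token after '>') from FASTA-formatted lines."""
    names = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers that have no name
                names.append(tokens[0])
    return names


def orfs_without_contig(contig_names, orf_names):
    """Return ORF names whose prefix (text before the first period) is not a contig name."""
    contigs = set(contig_names)
    return [name for name in orf_names if name.split(".", 1)[0] not in contigs]


# Hypothetical example records, for illustration only.
contig_lines = [">contig1 len=500", "ACGT", ">contig2 len=80", "ACGT"]
orf_lines = [">contig1.p1 type:complete", "MKV", ">contig3.p1 type:complete", "MKL"]

unmatched = orfs_without_contig(
    read_fasta_names(contig_lines), read_fasta_names(orf_lines)
)
print(unmatched)
```

An empty list would mean that every ORF traces back to a contig in the "contigs" input file. Note that this sketch assumes contig names themselves contain no periods, which is implied by the rule as written in the config comments.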


This is unrelated to this PR, but fwiw I feel you could delete all of these comments in this demo config (so that they only appear in one place, in the main top-level config.yml). This would avoid duplicating them (and avoid the need to keep them in sync when they are changed)

oh i love this, I'll do that, thank you!

Co-authored-by: Keith Cheveralls <[email protected]> Signed-off-by: Taylor Reiter <[email protected]>

taylorreiter added 2 commits June 5, 2024 10:47

remove short and longer contigs input

5344849

lint and format

74fc391

taylorreiter marked this pull request as draft June 5, 2024 15:22

rev order of demo data cat

fe21372

taylorreiter marked this pull request as ready for review June 5, 2024 15:46

taylorreiter requested a review from keithchev June 5, 2024 15:46

taylorreiter mentioned this pull request Jun 5, 2024

Weird sORF results when changing the order of the input contigs file #50

Closed

keithchev approved these changes Jun 5, 2024


taylorreiter and others added 2 commits June 5, 2024 18:27

Apply suggestions from code review

6db82c3

Co-authored-by: Keith Cheveralls <[email protected]> Signed-off-by: Taylor Reiter <[email protected]>

suggestions from code review

ec0bf2e

taylorreiter merged commit 1d425ed into main Jun 5, 2024
2 checks passed

taylorreiter deleted the ter/change-input-files branch June 5, 2024 22:33
