generated from Sydney-Informatics-Hub/template-nf
Revising trycycler and select assembly implementations #61
Open: fredjaya wants to merge 34 commits into main from issue-54
Modularising assembly steps for incoming revision of select_assembly. Current implementation branches channels according to the trycycler_cluster output. Move counting of contigs per assembly prior to trycycler so all trycycler (sub)processes can be run directly one after the other.
Refactoring iteratively so everything doesn't break
TIL the clustering step filters out small contigs (default < 5000 nt) and will run into < 2 contigs error in trycycler_classify again.
Consider flye and unicycler assemblies as "de novo". This impacts which data are grouped together for processing and channelling. Previously, medaka was hardcoded to polish only trycycler and flye assemblies. This commit adds an assembly-agnostic module for medaka polishing (WIP). Introduce val(assembler_name) for tagging, reporting etc.
Rename channels and comments to clarify differences between de novo vs. consensus assemblies. combined assemblies should be both de novo and consensus (all).
Add more flexible quast module, some temp medaka changes to keep old implementation running for now
To better align with best practices and readability. Be more explicit with error strategy and process script def. This required some additional groovy in workflow{} though
- Remove manual file moving or bash conditionals in process script def
- Remove if/else channel operators/groovy
- Output dir no longer has barcode, might re-add later, but might be ok because it's output with the barcode tuple
Change process outputs to recurse through barcode and cluster directories (e.g. **/out_file)
Mainly tidying medaka denovo and consensus implementations to look for the polished assembly in the process outputs.
Fixes inconsistent publishing for assemblies and qc results
Diffs I have intentionally kept separate - a lot of things add temporary tweaks to get this current version running during development.
Some comment tidying
Also fix `trycycler_reconcile_new` inconsistent `2_all_seqs.fasta` process output
For trycycler and flye-specific downstream processes, modules, and config. Commenting out existing "chromosome" implementations and will re-add progressively.
`select_assembly_new` didn't cache properly as it was outputting to `stdout` - best assembly is now stored in a text file. Update bakta and amrfinderplus processes for chromosome annotation to handle new metadata, reduce `mkdir` and file movement within script, and decouple output definitions from hardcoded paths etc. No longer need `helper.patch`
Clarify map syntax, module tags, publishDir handling
Modules suffixed with `*_new` replace existing modules
This is what the current
Tested and works on all Vibrio and Tenacibaculum barcodes. Currently on `/scratch/er01/fj9712/2411_wholetest` - to be moved.
Things to discuss and address either in this PR or later ones:
Implementation
Reference-free chromosome assembly selection
Addresses #54, #23
For the chromosomal assembly, every barcode is assembled by flye and unicycler, then polished. The single "best" polished assembly out of flye, unicycler, and optionally trycycler (the consensus assembly) is selected for downstream annotation and analyses.
To avoid biasing assemblies to published references, the assembly with the most complete BUSCOs is considered the best one. This now allows unicycler assemblies to be considered too. QUAST is also run but not used for selecting the best assembly.
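A minimal sketch of the selection rule described above, assuming the complete-BUSCO counts have already been parsed into a mapping of assembler name to count. The function name `select_best_assembly`, the example counts, and the alphabetical tie-break are all hypothetical and are not taken from the pipeline code:

```python
from typing import Dict


def select_best_assembly(complete_buscos: Dict[str, int]) -> str:
    """Return the assembler name with the most complete BUSCOs.

    Ties are broken alphabetically by assembler name. This tie-break is an
    assumption for illustration; the PR does not specify one.
    """
    if not complete_buscos:
        raise ValueError("no assemblies to choose from")
    # Rank by highest count first, then by name so ties are deterministic.
    return min(complete_buscos, key=lambda name: (-complete_buscos[name], name))


# Hypothetical counts for one barcode; trycycler is optional and may be absent.
counts = {"flye": 120, "unicycler": 123, "trycycler": 122}
best = select_best_assembly(counts)
print(best)
```

Keeping the rule in one place like this is what would make it straightforward to swap in or add other criteria later, such as QUAST metrics.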
There is now only one implementation for chromosomal assembly selection, instead of two independent ones, which makes it easier to update the criteria for selecting an assembly. For example, to incorporate QUAST outputs, or add additional tools like Merqury.
Trycycler implementation
Addresses #43, #60
Trycycler processes are now self-contained. Additional assemblers can be added more easily to generate better consensus assemblies if required.
Added more error handling for the too-few-contigs case (trycycler cluster filters out additional contigs). If trycycler correctly fails at any point, the pipeline will still continue and select either the flye or unicycler assembly for downstream processes.
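The fallback logic can be illustrated with a rough sketch. It assumes the contigs to be clustered are available as FASTA text, and it uses the 5000 nt length filter and the 2-contig minimum mentioned in the commit notes as parameters. The function names and the idea of a pre-check are hypothetical; the pipeline itself handles this through Nextflow error strategies and channel logic:

```python
from typing import List


def contig_lengths(fasta_text: str) -> List[int]:
    """Return the sequence length of each record in a FASTA string."""
    lengths: List[int] = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def enough_contigs(fasta_text: str, min_len: int = 5000, min_contigs: int = 2) -> bool:
    """True if at least `min_contigs` contigs are `min_len` nt or longer."""
    kept = [n for n in contig_lengths(fasta_text) if n >= min_len]
    return len(kept) >= min_contigs


def candidate_assemblies(trycycler_ok: bool) -> List[str]:
    """Assemblies eligible for selection; trycycler only if it completed."""
    candidates = ["flye", "unicycler"]
    if trycycler_ok:
        candidates.append("trycycler")
    return candidates


# Hypothetical input: one long contig and one short contig that gets filtered.
fasta = ">contig_1\n" + "A" * 6000 + "\n>contig_2\n" + "C" * 100 + "\n"
print(candidate_assemblies(enough_contigs(fasta)))
```

In this example only one contig survives the length filter, so the consensus assembly is skipped and selection falls back to the two de novo assemblies.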
Input/output process definitions are more explicit (i.e. specific files instead of globs) for better error handling. As a result, there are a lot more operators and Groovy in the workflow scope.