-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: processing vembrane config #330
Conversation
WalkthroughThe changes in this pull request involve updates to multiple Snakemake workflow files, focusing on version upgrades for wrappers, enhancements in sample data handling, and the introduction of new rules. The Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Workflow
participant Freebayes
participant Delly
participant Bcftools
User->>Workflow: Initiate variant calling
Workflow->>Freebayes: Execute with updated parameters
Freebayes-->>Workflow: Return variant calls
Workflow->>Delly: Execute variant calling
Delly-->>Workflow: Return variant calls
Workflow->>Bcftools: Filter invalid BCFs
Bcftools-->>Workflow: Return valid BCFs
Workflow-->>User: Provide final variant calls
Possibly related PRs
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
workflow/rules/plugins.smk (1)
9-9
: Approve the URL change with suggestions for improvementThe change to use Zenodo as the download source for REVEL scores is a good improvement in terms of reliability and accessibility. This aligns well with the PR objective of addressing the unreliable download link issue.
However, I have a couple of suggestions to further enhance this change:
- Consider parameterizing the URL or at least the version number. This would make future updates easier to manage. For example:
params: revel_version = "v1.3", zenodo_record = "7072866" shell: "curl https://zenodo.org/records/{params.zenodo_record}/files/revel-{params.revel_version}_all_chromosomes.zip -o {output} &> {log}"
- It would be beneficial to add a checksum verification step to ensure the integrity of the downloaded file. This could be implemented using a separate rule or by adding a verification step within this rule.
Would you like me to provide an example of how to implement the checksum verification?
workflow/rules/candidate_calling.smk (3)
Line range hint
39-47
: New rule addresses the Delly breakends issue effectively.The
fix_delly_calls
rule successfully filters out breakends from Delly output, addressing the issue of invalid BCFs after VEP annotation. This aligns well with the PR objectives.Consider adding error handling to check if the input file exists and is not empty before processing:
rule fix_delly_calls: input: "results/candidate-calls/{group}.delly.bcf", output: "results/candidate-calls/{group}.delly.no_bnds.bcf", log: "logs/fix_delly_calls/{group}.log", conda: "../envs/bcftools.yaml" shell: - """bcftools view -e 'INFO/SVTYPE="BND"' {input} -Ob > {output} 2> {log}""" + """ + if [ ! -s {input} ]; then + echo "Error: Input file {input} is empty or does not exist" >&2 + exit 1 + fi + bcftools view -e 'INFO/SVTYPE="BND"' {input} -Ob > {output} 2> {log} + """
Line range hint
50-62
: Improved input handling for off-target variant filtering.The
filter_offtarget_variants
rule now uses theget_fixed_candidate_calls
function, which likely includes the output from the newfix_delly_calls
rule. This ensures that only valid calls (with breakends removed) are processed, aligning with the PR objectives.For improved clarity, consider adding a comment explaining the
get_fixed_candidate_calls
function:rule filter_offtarget_variants: input: + # get_fixed_candidate_calls returns the output of fix_delly_calls for Delly, + # and the original calls for other callers calls=get_fixed_candidate_calls("bcf"), index=get_fixed_candidate_calls("bcf.csi"), regions="resources/target_regions/target_regions.bed", output: "results/candidate-calls/{group}.{caller}.filtered.bcf", params: extra="", log: "logs/filter_offtarget_variants/{group}.{caller}.log", wrapper: "v2.3.2/bio/bcftools/filter"
Line range hint
65-79
: Improved flexibility in candidate call scattering.The
scatter_candidates
rule now conditionally uses either filtered calls (if target regions are defined) or fixed calls. This change allows for more flexibility in the workflow, accommodating both targeted and whole-genome analyses, which improves the overall robustness of the workflow.To improve readability, consider using a ternary operator for the input definition:
rule scatter_candidates: input: - "results/candidate-calls/{group}.{caller}.filtered.bcf" - if config.get("target_regions", None) - else get_fixed_candidate_calls("bcf"), + lambda wildcards: ( + "results/candidate-calls/{group}.{caller}.filtered.bcf" + if config.get("target_regions", None) + else get_fixed_candidate_calls("bcf") + ).format(**wildcards), output: scatter.calling( "results/candidate-calls/{{group}}.{{caller}}.{scatteritem}.bcf" ), log: "logs/scatter-candidates/{group}.{caller}.log", conda: "../envs/rbt.yaml" shell: "rbt vcf-split {input} {output}"workflow/rules/common.smk (1)
Line range hint
464-465
: Potential bug:get_reads
function ignoresfq
parameterIn the
get_cutadapt_input
function, theget_reads
function defined inside theelse
clause ofif len(fq1) == 1
uses a hardcoded filename ending with.fq1.fastq
, regardless of thefq
argument passed to it. This means that whenget_reads
is called with'fq2'
, it will still return a path ending with.fq1.fastq
, which may not be correct for paired-end data.Apply this diff to fix the issue:
-def get_reads(fq): - return f"pipe/cutadapt/{unit.sample_name}/{unit.unit_name}.fq1.fastq{ending}" +def get_reads(fq): + return f"pipe/cutadapt/{unit.sample_name}/{unit.unit_name}.{fq}.fastq{ending}"This change ensures that the
fq
parameter is correctly used in constructing the file path.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- workflow/rules/candidate_calling.smk (1 hunks)
- workflow/rules/common.smk (1 hunks)
- workflow/rules/mapping.smk (1 hunks)
- workflow/rules/plugins.smk (1 hunks)
🔇 Additional comments (4)
workflow/rules/plugins.smk (1)
9-9
: Verify consistency of the downloaded fileWhile the change in the download URL is an improvement, it's crucial to ensure that the file downloaded from Zenodo has the exact same format and content as the one from the previous source. This is to guarantee that the subsequent rules (
process_revel_scores
andtabix_revel_scores
) continue to function correctly.To verify this, you could run the following script:
This script will help ensure that the new source provides consistent data, maintaining the integrity of the workflow.
workflow/rules/candidate_calling.smk (2)
Line range hint
1-79
: Overall, the changes effectively address the PR objectives.The modifications to the
workflow/rules/candidate_calling.smk
file successfully improve the robustness and flexibility of the workflow. The newfix_delly_calls
rule addresses the issue with Delly breakends, while the updates tofilter_offtarget_variants
andscatter_candidates
rules enhance the workflow's ability to handle different scenarios. The freebayes wrapper update is a good practice, but requires verification of compatibility.These changes align well with the PR objectives of enhancing the workflow's robustness and preventing failures related to custom annotation fields. The suggestions provided are minor and aimed at improving clarity and error handling.
23-23
: Verify compatibility with the updated freebayes wrapper.The update from v2.7.0 to v4.4.0 is a significant version jump. While this is generally good for leveraging new features and bug fixes, it's important to ensure compatibility with the rest of the workflow.
Please run the following script to check for any breaking changes or new requirements:
workflow/rules/mapping.smk (1)
86-86
: Verify the impact of updating the Picard MarkDuplicates wrapper version.The wrapper version for the
mark_duplicates
rule has been updated from v2.5.0 to v4.2.0. This is a significant version bump that may introduce new features, bug fixes, or changes in behavior.Please ensure that:
- This change is intentional and aligns with the PR objectives.
- The new version is compatible with the rest of the workflow.
- No additional changes are required in the rule or elsewhere in the workflow due to this update.
It's recommended to:
- Review the changelog or release notes for the new version to understand the changes.
- Test the workflow thoroughly with this new version to ensure compatibility and expected behavior.
To help verify the changes, you can run the following script:
This script will help identify any documentation about the version change, ensure consistent usage of the new version, and check for any related TODO or FIXME comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. If you have checked, that this also ends up putting the exon correctly into the report for your use case, I guess this is good to go.
The only thing you might want to double-check, is whether the wrapper updates are really necessary. These would trigger reruns in existing workflows updated to the resulting new workflow version, so I would only do them if necessary for this PR.
workflow/rules/candidate_calling.smk
Outdated
@@ -20,7 +20,7 @@ rule freebayes: | |||
), | |||
threads: max(workflow.cores - 1, 1) # use all available cores -1 (because of the pipe) for calling | |||
wrapper: | |||
"v2.7.0/bio/freebayes" | |||
"v4.4.0/bio/freebayes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary for anything (like a bug fix in a newer version of freebayes or the wrapper)? Because otherwise I would leave this as is, so as not to trigger rerunning of the candidate calling in workflows that update to a newer release after merging this PR...
info_prob_fields, | ||
rename_info_fields, | ||
lambda x: f"INFO['PROB_{x.upper()}']", | ||
"prob: {}".format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice cleanup, thanks!
workflow/rules/mapping.smk
Outdated
@@ -83,7 +83,7 @@ rule mark_duplicates: | |||
#https://broadinstitute.github.io/picard/faq.html | |||
mem_mb=3000, | |||
wrapper: | |||
"v2.5.0/bio/picard/markduplicates" | |||
"v4.2.0/bio/picard/markduplicates" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above: if such an update is not really necessary for anything in this pull request, I would defer updating to whenever this is really necessary. Otherwise this will trigger reruns in previously workflows where we update to a newer workflow version...
@@ -6,7 +6,7 @@ rule download_revel: | |||
conda: | |||
"../envs/curl.yaml" | |||
shell: | |||
"curl https://rothsj06.dmz.hpc.mssm.edu/revel-v1.3_all_chromosomes.zip -o {output} &> {log}" | |||
"curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, let's hope this stays more robust in the long-term and that newer versions will also end up on Zenodo!
Just some smaller fixes.
The link for downloading revel scores is broken from time to time. The scores are also hosted on zenodo which should be more reliable.
In addition I also changed parsing the exon entry as
ANN['exon].raw
as the default output would be e.gnumber / total: 2 / 5
. Calling it with.raw
attribute will output2/5
which is much more readable in a datavzrd report.Summary by CodeRabbit
Summary by CodeRabbit
New Features
Improvements
freebayes
,delly
, andPicard
, improving functionality and compatibility.Bug Fixes