-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix excluding events from the report #332
Conversation
WalkthroughThe changes made in the Changes
Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🧰 Additional context used🔇 Additional comments (2)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
workflow/rules/common.smk (7)
Line range hint
107-111
: Clarify the Extraction of 'calling_types'The line where
calling_types
is defined:calling_types = samples["calling"].str.split(",").explode().unique().tolist()This line flattens the list of calling types across all samples. While functional, consider adding a comment to explain the purpose for better readability.
Line range hint
620-622
: Compatibility of 'match' Statement with Python VersionThe
match
statement used inget_map_reads_sorting_params
requires Python 3.10 or higher. Ensure that the execution environment supports Python 3.10. If not, refactor the code using if-else statements for compatibility.Refactored code using if-else statements:
-def get_map_reads_sorting_params(wildcards, ordering=False): - match (sample_has_umis(wildcards.sample), ordering): - case (True, True): - return "queryname" - case (True, False): - return "fgbio" - case (False, True): - return "coordinate" - case (False, False): - return "samtools" +def get_map_reads_sorting_params(wildcards, ordering=False): + has_umis = sample_has_umis(wildcards.sample) + if has_umis and ordering: + return "queryname" + elif has_umis and not ordering: + return "fgbio" + elif not has_umis and ordering: + return "coordinate" + else: + return "samtools"
Line range hint
729-731
: Simplify Configuration Lookups with 'dict.get'In places where
lookup
is used to access configuration values with a default, consider using the standarddict.get
method for simplicity unlesslookup
provides additional functionality.Example:
- if lookup(dpath="maf/activate", within=config, default=False): + if config.get("maf", {}).get("activate", False):
Line range hint
243-245
: Correct Syntax Error in 'wildcard_constraints' DefinitionThe definition of
wildcard_constraints
is missing a colon.Apply this diff to fix the syntax error:
-wildcard_constraints +wildcard_constraints:
Line range hint
512-514
: Handle Missing 'datasources' in ConfigurationIn the function
get_dgidb_datasources
, if the key'datasources'
is not present inconfig["annotations"]["dgidb"]
, it returns an empty string.def get_dgidb_datasources(): if config["annotations"]["dgidb"].get("datasources", ""): return "-s {}".format(" ".join(config["annotations"]["dgidb"]["datasources"])) return ""Consider specifying a default value or handling the absence more explicitly to avoid potential KeyErrors.
Line range hint
223-225
: Ensure 'primer_panels' DataFrame is Properly InitializedThe
primer_panels
variable is conditionally initialized based on the presence of a configuration key.primer_panels = ( ( pd.read_csv( config["primers"]["trimming"]["tsv"], sep="\t", dtype={"panel": str, "fa1": str, "fa2": str}, comment="#", ) .set_index(["panel"], drop=False) .sort_index() ) if config["primers"]["trimming"].get("tsv", "") else None )Ensure that the configuration provides the expected
tsv
path, and handle cases whereprimer_panels
might beNone
to prevent AttributeError when accessed later.
Line range hint
1079-1083
: Ensure Compatibility of Regular ExpressionsIn
get_fastqc_results
, the regular expression used may need verification to ensure it matches the intended files.valid = re.compile(r"^[^/]+\.tsv$")Confirm that the pattern correctly matches the filenames you expect to process.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- workflow/rules/common.smk (1 hunks)
🧰 Additional context used
🔇 Additional comments (4)
workflow/rules/common.smk (4)
Line range hint
411-413
: Potential Unhandled Case in 'get_trimming_input' FunctionIn
get_trimming_input
, if neithercalc_consensus_reads
norremove_duplicates
is activated, the function defaults to returning a path based on the aligner.def get_trimming_input(wildcards): if is_activated("remove_duplicates"): return "results/dedup/{sample}.bam" else: aligner = "star" if get_sample_datatype(wildcards.sample) == "rna" else "bwa" return "results/mapped/{aligner}/{{sample}}.bam".format(aligner=aligner)Consider whether handling is needed for cases when neither condition is met, especially if primer trimming requires input from a different step.
Line range hint
696-706
: Validation of 'variant_caller' ConfigurationThe
variant_caller
variable is created based on the activation of certain callers and whether samples contain 'variants' in their calling configuration.variant_caller = list( filter( None, [ ( "freebayes" if is_activated("calling/freebayes") and samples["calling"].str.contains("variants").any() else None ), ( "delly" if is_activated("calling/delly") and samples["calling"].str.contains("variants").any() else None ), ], ) )Ensure that at least one variant caller is activated; otherwise,
variant_caller
will be an empty list, potentially causing issues in downstream processes.Execute the following script to check the activation status:
#!/bin/bash # Verify that at least one variant caller is activated grep 'activate' config.yaml | grep -E 'calling/freebayes|calling/delly'
Line range hint
290-292
: Ensure Correct Retrieval of Calling EventsThe
get_calling_events
function retrieves events based on thecalling_type
. Verify that this function returns the expected list of events and that it aligns with the current configuration inconfig["calling"]["fdr-control"]["events"]
.Run the following script to list events for each calling type:
Line range hint
860-862
: Check Division Operations for Zero Division ErrorsAny division operations in the code should have checks to prevent division by zero.
Run the following script to find division operations:
else fusions_groups | ||
), | ||
event=get_calling_events(calling_type), | ||
event=event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Issue with Event Parameter Usage in 'expand' Function
In the get_final_output
function, the event
parameter in the expand
call has been changed to use the variable event
directly. Ensure that event
is correctly defined in this context and that it provides the expected values. Previously, a function call like get_calling_events(calling_type)
might have provided a list of events, whereas event
here may be a single value from the loop. This could affect the expanded outputs if not handled properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)
Line range hint
137-224
: Consider refactoring for improved readability and maintainabilityThe
get_final_output
function is quite complex and handles multiple responsibilities. Consider the following suggestions to improve its structure and readability:
- Extract the logic for generating different types of outputs (e.g., reports, final calls, MAF files) into separate helper functions.
- Use list comprehensions or generator expressions instead of nested loops where possible to make the code more concise.
- Consider using a configuration-driven approach to determine which outputs to generate, reducing the need for multiple conditionals.
Example refactoring (partial):
def get_report_outputs(config, groups): if config["report"]["activate"]: return [ f"results/datavzrd-report/{group}.coverage" for group in groups ] return [] def get_final_call_outputs(config, groups, events, calling_types): if not config["report"]["activate"]: return [ f"results/final-calls/{group}.{event}.{calling_type}.fdr-controlled.bcf" for group in groups for event in events for calling_type in calling_types ] return [] def get_final_output(wildcards): final_output = [ f"results/qc/multiqc/{group}.html" for group in groups ] final_output.extend(get_report_outputs(config, groups)) final_output.extend(get_final_call_outputs(config, groups, events, calling_types)) # ... other output types ... return final_outputThis refactoring would make the main function more readable and easier to maintain, with each type of output handled by a separate, focused function.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- workflow/rules/common.smk (2 hunks)
🧰 Additional context used
🔇 Additional comments (3)
workflow/rules/common.smk (3)
166-166
: LGTM: Consistent with previous changeThis change is consistent with the modification on line 153, directly using the
event
variable. The previous comment and verification steps apply to this line as well.
Line range hint
1-1000
: Summary of changes and suggestionsThe changes in this file primarily involve simplifying the usage of the
event
variable in theget_final_output
function. While these changes appear to be improvements, it's crucial to verify that theevent
variable is correctly defined and provides the expected values in all contexts.Key points from this review:
- Verify the correct usage and scope of the
event
variable.- Consider refactoring the
get_final_output
function for improved readability and maintainability.- The overall structure and logic of the function remain intact, with only minor modifications to variable usage.
These changes should improve code simplicity without altering the core functionality. However, thorough testing is recommended to ensure that all output paths are still correctly generated after these modifications.
153-153
: Verify the correctness ofevent
variable usageThe change from a function call to directly using the
event
variable simplifies the code. However, we need to ensure thatevent
is correctly defined and provides the expected values in this context.To confirm the correct usage of the
event
variable, please run the following script:This will help us confirm that the
event
variable is properly defined and matches the expected events from the configuration.✅ Verification successful
event
Variable Usage Verified SuccessfullyThe
event
variable is correctly defined and utilized within theget_final_output
function. The events inconfig/config.yaml
align with the expected values, ensuring the simplified code functions as intended.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of 'event' variable in get_final_output function # Test: Check if 'event' is defined within the function scope rg -A 10 -B 10 'def get_final_output\(' workflow/rules/common.smk # Test: Verify the events defined in the config grep -A 5 'events:' config/config.yamlLength of output: 1338
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch!
I would suggest one small change, namely to move the get_calling_events(calling_type)
to the for event in [...]
loop, to keep its functionality in (the correct) place.
@@ -150,7 +150,7 @@ def get_final_output(wildcards): | |||
expand( | |||
"results/datavzrd-report/{batch}.{event}.{calling_type}.fdr-controlled", | |||
batch=get_report_batches(calling_type), | |||
event=get_calling_events(calling_type), | |||
event=event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove the use of this utility function here, maybe we can instead use it above, to replace this:
for event in config["calling"]["fdr-control"]["events"]:
with the function call like this:
for event in get_calling_events(calling_type):
This ensures, that the calling_type variable from the respective for-loop is also respected, fixing a potential bug.
Makes sense, just applied the change. I will test this on the workflow but can't do it now as the testing workflow needs a rerun of almost all jobs and I can only test this hopefully sometime next week. But if you think that's not very necessary, you can also merge it now or it can also wait. |
Hmm, maybe we could even add the new |
Makes sense, I just added it to one of the tests. Now we can manually check this feature on the tests but not sure how this can be achieved by checking the artifacts from the CI tests. |
Before excluding events from the report did not work as it was thought. All the events that had
False
forreport
generation in the config still had datavzrd reports, becauseget_calling_events(calling_type)
always returned all events. This fix ensures that onlyTrue
events have datavzrd reports.Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Refactor