-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: group calling jobs per patient/sample-group such that they are submitted to the same cluster/cloud nodes in order to save I/O #338
base: master
Are you sure you want to change the base?
Conversation
…submitted to the same cluster/cloud nodes in order to save I/O
WalkthroughThe changes in this pull request involve modifications to the Snakemake workflows defined in Changes
Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🔇 Additional comments (2)workflow/rules/trimming.smk (2)Line range hint The get_sra rule's output feeds into the cutadapt pipeline. Let's ensure the data flow remains compatible. #!/bin/bash
# Description: Verify the integration between get_sra and cutadapt rules
# Check the get_cutadapt_pipe_input function implementation
echo "Checking get_cutadapt_pipe_input function..."
rg -A 10 "def get_cutadapt_pipe_input"
# Look for any test cases
echo "Checking for tests..."
fd -t f "test.*py" | xargs rg "get_sra|get_cutadapt_pipe_input"
This is a major version upgrade (v2.3.2 → v5.0.2) which might include breaking changes. Let's verify the wrapper compatibility: ✅ Verification successfulBased on the information gathered from the shell scripts, I can now provide a final response: The wrapper version update appears safe to proceed The rule interface is minimal and stable:
This simple interface has remained unchanged across versions, and the rule is properly integrated with the workflow's common functions (get_sra_reads). The wrapper is only responsible for downloading and converting SRA data to FASTQ format, which is a fundamental operation unlikely to have breaking changes. 🏁 Scripts executedThe following scripts were executed for the analysis: Script: #!/bin/bash
# Description: Check wrapper compatibility and changes
# Get the wrapper repository
echo "Fetching wrapper repository..."
git clone https://github.com/snakemake/snakemake-wrappers.git
cd snakemake-wrappers
# Check changes between versions
echo "Checking changes between v2.3.2 and v5.0.2..."
git log --oneline v2.3.2..v5.0.2 -- bio/sra-tools/fasterq-dump/
# Verify current interface matches wrapper requirements
echo "Checking current wrapper implementation..."
cat bio/sra-tools/fasterq-dump/wrapper.py
# Check if there are any migration guides or breaking changes
echo "Checking for migration guides..."
git grep -l "MIGRATION" v5.0.2 -- bio/sra-tools/fasterq-dump/
Length of output: 1774 Script: #!/bin/bash
# Let's try a different approach to verify the wrapper changes
# First, let's check the current rule implementation
echo "Current rule implementation:"
rg -A 10 "rule get_sra:" workflow/rules/trimming.smk
# Check if there are any input/output/params definitions in the rule
echo -e "\nChecking rule interface:"
rg "input:|output:|params:|log:|threads:|resources:|wrapper:" -A 2 workflow/rules/trimming.smk
# Let's also check any dependencies or functions used by this rule
echo -e "\nChecking dependencies and functions:"
rg "get_sra" workflow/rules/common.smk -A 5
Length of output: 2799 Script: #!/bin/bash
# Let's check the repository for any documentation or changelog
# Check for README or documentation in the workflow directory
echo "Checking for workflow documentation:"
fd -t f "README|CHANGELOG" -E "*.fastq" -E "*.bam"
# Check the Snakefile for any version requirements or wrapper configurations
echo -e "\nChecking Snakefile for configuration:"
rg "wrapper:|configfile:|min_version" Snakefile
# Check config files for any SRA or fasterq-dump related settings
echo -e "\nChecking config files:"
fd -t f "config.yaml|config.json" | xargs cat
# Check for any existing SRA test data or test configurations
echo -e "\nChecking for test configurations:"
fd -t f "test_config.yaml|test.yaml|units.tsv" | xargs cat
Length of output: 589 Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
workflow/rules/candidate_calling.smk (1)
86-87
: Consider additional I/O optimizations for scattered processingWhile grouping helps optimize I/O, the scatter operation itself creates multiple files. Consider:
- Adjusting the scatter chunk size to balance parallelism with I/O overhead
- Ensuring downstream rules processing these scattered files maintain the same grouping
- Using local scratch space on compute nodes for intermediate files
workflow/rules/calling.smk (3)
37-38
: Effective use of Snakemake group directive for I/O optimizationThe addition of
group: "calling"
aligns well with the PR objective. This will help Snakemake schedule related jobs on the same compute node, potentially reducing I/O overhead when processing alignment properties for the same sample group.Consider documenting this I/O optimization strategy in the workflow documentation to help users understand the performance benefits and resource allocation patterns.
63-64
: Consider centralizing group name configurationThe consistent use of
group: "calling"
across rules is good for I/O optimization. However, the hardcoded string could be better managed through configuration.Consider defining the group name in the workflow configuration:
# In config.yaml or similar +calling: + group_name: "calling" # In rules group: - "calling" + config["calling"]["group_name"]This would make it easier to modify the grouping strategy if needed and prevent typos in group names.
Also applies to: 95-96
Line range hint
37-96
: Consider adding performance monitoring for I/O optimizationThe grouping strategy for I/O optimization is well-implemented. To maximize its benefits:
- Consider adding benchmarking to measure the I/O performance improvement
- Document the expected behavior and benefits in the workflow documentation
- Add logging to track when jobs are executed on the same vs. different nodes
This will help validate the optimization and provide data for future improvements.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
workflow/rules/calling.smk
(4 hunks)workflow/rules/candidate_calling.smk
(1 hunks)
🔇 Additional comments (1)
workflow/rules/calling.smk (1)
47-47
: Verify get_candidate_calls function compatibility
The simplification from lambda to direct function reference improves readability. However, we should verify that get_candidate_calls
handles all necessary cases.
✅ Verification successful
get_candidate_calls function is correctly referenced
The simplification from a lambda to the direct function reference is verified. get_candidate_calls
is properly defined and handles all necessary cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for get_candidate_calls definition and usage
rg "get_candidate_calls" --type python -A 5
Length of output: 74
Script:
#!/bin/bash
# Search for get_candidate_calls definition and usage
rg "get_candidate_calls" --type py -A 5
Length of output: 39
Script:
#!/bin/bash
# Search for get_candidate_calls definition and usage in all files
rg "get_candidate_calls" -A 5
Length of output: 940
group: | ||
"calling" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Inconsistent grouping for variant calling rules
Not all related variant calling rules are grouped under "calling"
. Please add group: "calling"
to the following rules to ensure consistent I/O optimization:
workflow/rules/annotation.smk
:gather_annotated_calls
workflow/rules/candidate_calling.smk
:fix_delly_calls
workflow/Snakefile
:only_alignment
workflow/rules/fusion_calling.smk
:sort_arriba_calls
workflow/rules/calling.smk
:varlociraptor_alignment_properties
,varlociraptor_preprocess
,varlociraptor_call
,sort_calls
workflow/rules/filtering.smk
:gather_calls
,merge_calls
workflow/rules/datavzrd.smk
:split_call_tables
,process_fusion_call_tables
,datavzrd_variants_calls
,datavzrd_fusion_calls
workflow/rules/benchmarking.smk
:gather_benchmark_calls
🔗 Analysis chain
LGTM! Effective grouping for I/O optimization
Adding this rule to the "calling" group is a good optimization that ensures related variant calling jobs run on the same node, reducing I/O overhead through improved data locality.
Let's verify that all related rules are consistently grouped:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if all related variant calling rules use the same group
# Expected: All variant calling related rules should have group: "calling"
# Search for rules and their group declarations
rg -A 1 'rule .*(_call|_preprocess|_alignment|scatter_candidates)' | rg -A 1 '(rule|group)'
Length of output: 2147
Summary by CodeRabbit
New Features
group
parameter to enhance organization and tracking of related outputs in several rules.group
declaration to thescatter_candidates
rule for improved categorization.Improvements
varlociraptor_preprocess
rule by updating thecandidates
input reference.get_sra
rule to use a newer version of thefasterq-dump
wrapper for processing SRA data.