Check if flowcell id matches for paired samples (nf-core#1664)

I noticed [this comment](https://github.com/nf-core/sarek/blob/5cc30494a6b8e7e53be64d308b582190ca7d2585/workflows/sarek/main.nf#L946) about checking the flowcell ID for paired samples while constructing GATK read groups. I was adapting the read group code for a custom pipeline and attempted a quick fix, so I thought I'd contribute it back to sarek.

> While constructing the read group from paired fastq samples, perform a check to ensure that the id is the same for (the first reads) in fastq_1 and fastq_2. Exit out with an error otherwise and report the problematic sample and file paths.

Incidentally, while researching read groups I came across the following recommendations: https://support.sentieon.com/appnotes/read_groups/. Would it be worth updating some of the fields to match these guidelines?

## PR checklist

- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
  - => Only tested this manually, but happy to add a proper test if you could give me a starting point. Is there already an existing test for samplesheet validation that I can add this to? I guess I will need to add "corrupt" fastq files to the nf-core test repo?
- [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)
- [ ] If necessary, also make a PR on the nf-core/sarek _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
- [x] Make sure your code lints (`nf-core lint`).
- [x] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [x] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
  - => will do this after submitting the PR so that I can link to it.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
  - => should I do this even for such a minor contribution?

---------

Co-authored-by: Maxime U Garcia <[email protected]>
Co-authored-by: Maxime U Garcia <[email protected]>
famosab · Oct 30, 2024 · 74db9d3
1 parent 8ea4af9
commit 74db9d3
Showing 2 changed files with 8 additions and 3 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [1653](https://github.com/nf-core/sarek/pull/1653) - Updates `sarek_subway` files with `lofreq`
 - [1660](https://github.com/nf-core/sarek/pull/1642) - Add `--length_required` for minimal reads length with `FASTP`
 - [1663](https://github.com/nf-core/sarek/pull/1663) - Massive conda modules update
+- [1664](https://github.com/nf-core/sarek/pull/1664) - Check if flowcell ID matches for read pair
 
 ### Changed
 

diff --git a/workflows/sarek/main.nf b/workflows/sarek/main.nf
@@ -944,11 +944,15 @@ workflow SAREK {
 // Add readgroup to meta and remove lane
 def addReadgroupToMeta(meta, files) {
     def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
+    def flowcell = flowcellLaneFromFastq(files[0])
+
+    // Check if flowcell ID matches
+    if ( flowcell && flowcell != flowcellLaneFromFastq(files[1]) ){
+        error("Flowcell ID does not match for paired reads of sample ${meta.id} - ${files}")
+    }
 
-    // Here we're assuming that fastq_1 and fastq_2 are from the same flowcell:
     // If we cannot read the flowcell ID from the fastq file, then we don't use it
-    def sample_lane_id = flowcellLaneFromFastq(files[0]) ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
-    // TO-DO: Would it perhaps be better to also call flowcellLaneFromFastq(files[1]) and check that we get the same flowcell-id?
+    def sample_lane_id = flowcell ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
 
     // Don't use a random element for ID, it breaks resuming
     def read_group = "\"@RG\\tID:${sample_lane_id}\\t${CN}PU:${meta.lane}\\tSM:${meta.patient}_${meta.sample}\\tLB:${meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\""
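
The condition added in this hunk can be tried in isolation. The sketch below is plain Groovy rather than Nextflow, and it is not sarek's code: `flowcellFromHeader` and `checkPairedFlowcell` are hypothetical stand-ins (the real `flowcellLaneFromFastq` is not shown in this diff), the header layout is assumed to be the CASAVA 1.8+ Illumina form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`, and an exception stands in for Nextflow's `error()`.

```groovy
// Hypothetical helper (not part of sarek): pull the flowcell ID out of one FASTQ header line.
// Assumes CASAVA 1.8+ headers: @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...
String flowcellFromHeader(String header) {
    if (!header?.startsWith('@')) {
        return null
    }
    def fields = header.split(':')
    // Anything with fewer than seven colon-separated fields is treated as "no flowcell ID"
    return fields.size() >= 7 ? fields[2] : null
}

// Hypothetical wrapper that mirrors the condition in the diff:
// compare the two mates only when the first one yields a flowcell ID.
String checkPairedFlowcell(String sampleId, String header1, String header2) {
    def flowcell = flowcellFromHeader(header1)
    if (flowcell && flowcell != flowcellFromHeader(header2)) {
        // The pipeline calls Nextflow's error(); a plain exception is used here instead
        throw new IllegalStateException("Flowcell ID does not match for paired reads of sample ${sampleId}".toString())
    }
    return flowcell
}

// Made-up header lines for illustration only
def r1 = '@A00123:45:HXXXXDSXX:1:1101:1000:1000 1:N:0:ACGT'
def r2 = '@A00123:45:HXXXXDSXX:1:1101:1000:1000 2:N:0:ACGT'
assert checkPairedFlowcell('sample1', r1, r2) == 'HXXXXDSXX'
```

Written this way, the comparison is skipped entirely when no flowcell ID can be read from the first file, and it raises an error when the first file yields an ID but the second does not. Whether both cases are intended is a question for review.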