version 0.4
* added preprocessing of raw reads
* added `--preprocessed` flag to suppress preprocessing for already preprocessed datasets
* R1/R2 read lengths are now assessed and homogenised separately (i.e. they can have different lengths as supported by Figaro)
* initial read length distribution assessment is now performed by fastqc
* fastq filenames are normalised to R1/R2 naming scheme
* read length homogenisation is now performed by bbduk
* added mapseq profiling of ASV sequences
* added two parameters (`--dada2_chimera_method`, `--dada2_chimera_min_fold_parent_over_abundance`) for dada2 chimera removal
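
For illustration (the values shown are examples only, not documented defaults), the two new parameters can be passed on the command line as:

```nextflow run zellerlab/gaga2 <other parameters> --dada2_chimera_method consensus --dada2_chimera_min_fold_parent_over_abundance 2```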

Bugfixes/Workarounds
* fixed an R script error that would generate the wrong table format when only a single sample is present
* in `dada2_analysis`, commented out the vortex code (prevalence check), as it fails for certain samples


Co-authored-by: Christian Schudoma <[email protected]>
cschu authored Nov 19, 2021
1 parent 6291939 commit 872c3f3
Showing 23 changed files with 937 additions and 411 deletions.
56 changes: 41 additions & 15 deletions README.md
@@ -1,16 +1,29 @@
# gaga2 - automated 16S amplicon analysis with Figaro/DADA2

![](docs/img/gaga2_flow.png)

## Installation instructions
`TBD`
`gaga2` requires a working `nextflow` installation (v20.4+).

Other dependencies:
* bbmap
* fastqc
* figaro
* R v4+ with dada2, devtools, tidyverse, and cowplot installed

For convenience, `gaga2` comes with a Singularity container with all dependencies installed.

```singularity pull oras://ghcr.io/zellerlab/gaga2:latest```
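
One way to use the pulled image is to pass it to `nextflow` at runtime (a sketch: the image filename below assumes `singularity`'s default naming for the pulled file, so adjust the path to wherever the image was actually stored):

```nextflow run zellerlab/gaga2 -with-singularity ./gaga2_latest.sif <parameters>```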


## Usage instructions

### Input
`gaga2` takes as input Illumina paired-end 16S amplicon sequences (e.g. sequenced on a MiSeq).

#### Prerequisites
* Read files need to be named according to the typical pattern `<prefix=sample_id>_*[12].{fastq,fq,fastq.gz,fq.gz}`
and need to be arranged in a sample-based directory structure:
* Read files need to be named according to the typical pattern `<prefix=sample_id>_R?[12].{fastq,fq,fastq.gz,fq.gz}`.
They should, but don't have to, be arranged in a sample-based directory structure:

```
<project_dir> (aka "input_dir")
@@ -26,29 +39,42 @@ and need to be arranged in a sample-based directory structure:
|____ <sample_n_reverse_reads>
```

* If input reads have been "invasively" preprocessed (as evidenced by length differences between reads),
`gaga2` will generate an empty sentinel file `<output_dir>/SKIP_FIGARO`.
This will prevent `Figaro` from executing, and the workflow should automatically advance to `dada2` processing.
A flat directory structure (with all read files in the same directory) or a deeply-branched one (with read files scattered over multiple levels) should also work.

If `gaga2` preprocesses the reads, it will automatically use `_R1/2` endings internally.
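
For illustration, a minimal layout matching the naming pattern above (sample and file names are made up):

```
input_dir/
|____ sample_1/
|     |____ sample_1_R1.fastq.gz
|     |____ sample_1_R2.fastq.gz
|____ sample_2/
      |____ sample_2_R1.fastq.gz
      |____ sample_2_R2.fastq.gz
```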

* If input reads have already been preprocessed, you can set the `--preprocessed` flag. In this case, `gaga2` will do no preprocessing at all and will instruct `dada2` to perform no trimming. Otherwise, `gaga2` will assess the read lengths for uniformity. If read lengths differ within and between samples, preprocessing with `figaro` is not possible and `dada2` will be run without trimming.

* Samples with fewer than `110` reads after `dada2` preprocessing will be discarded.

### Running gaga2
The typical command for running `gaga2` is

`gaga2.nf --input_dir <input_dir> --output_dir <output_dir> --amplicon_length <int> --left_primer <int:length_left_primer> --right_primer <int:length_right_primer>`
`gaga2` can be run directly from GitHub.

`nextflow run zellerlab/gaga2 <parameters>`

To update to a newer version, first run

`nextflow pull zellerlab/gaga2`

In addition, you should obtain a copy of `run.config` from the `gaga2` GitHub repository and modify it according to your environment.

#### Mandatory arguments
* `input_dir` is the project directory mentioned above. **Recommended: absolute path**
* `output_dir` will be created automatically. **Recommended: absolute path**
* `amplicon_length`, `left_primer`, and `right_primer` are derived from the experiment parameters
* `--input_dir` is the project directory mentioned above.
* `--output_dir` will be created automatically.
* `--amplicon_length` is derived from your experiment parameters (this is the length of the amplicon, not the read length)
* `--single_end` is only required for single-end libraries (auto-detection of the library type is in progress)

#### Optional arguments
* `--min_overlap` of read pairs is `20bp` by default
* `-w, -work-dir` should be set to `<output_dir>/work`; otherwise the work directory is automatically placed in the current directory.
* `--primers <comma-separated-list-of-primer-sequences>`, or `--left_primer` and `--right_primer`: if primer sequences are provided via `--primers`, `gaga2` will use `bbduk` to remove the primers and any upstream sequences (such as adapters) based on the primer sequences. If non-zero primer lengths are provided instead (via `--left_primer` and `--right_primer`), `figaro` will take those into account when determining the best trim positions.
* `--preprocessed` prevents any further preprocessing by `gaga2`; this flag should only be used if the read data is reliably clean.
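
Putting this together, a sketch of a full invocation (paths and numeric values are placeholders to adapt to your own data and environment):

```
nextflow run zellerlab/gaga2 -c run.config -w /path/to/output_dir/work \
    --input_dir /path/to/input_dir \
    --output_dir /path/to/output_dir \
    --amplicon_length 450 \
    --left_primer 17 --right_primer 21 \
    --min_overlap 20
```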


### Internal beta-testing instructions
* `source /g/scb2/zeller/schudoma/software/wrappers/gaga2_wrapper` **before** submitting job to cluster
* please report issues/requests/feedback in the github issue tracker
* If you want to run `gaga2` on the cluster, `nextflow` alone requires `>4GB memory` just for 'managing'.
* The old `gaga2` version can be run with `source /g/scb2/zeller/schudoma/software/wrappers/gaga2_wrapper` **before** submitting the job to the cluster
* Please report issues/requests/feedback in the GitHub issue tracker
* If you want to run `gaga2` on the cluster, note that `nextflow` alone requires `>=5GB` of memory just for 'managing'.

95 changes: 0 additions & 95 deletions R_scripts/dada2.R

This file was deleted.

121 changes: 0 additions & 121 deletions R_scripts/dada2_analysis.R

This file was deleted.

38 changes: 21 additions & 17 deletions R_scripts/dada2_analysis_paired.R
@@ -19,6 +19,9 @@ if (length(args) >= 4) {
nthreads = TRUE
}

dada2_chimera_method = args[5]  # chimera removal method passed to removeBimeraDenovo
dada2_chimera_min_fold_parent_over_abundance = as.numeric(args[6])  # minFoldParentOverAbundance for chimera detection

# get the read files and sample names
list.files(input_dir)
sample_ids = basename(list.files(input_dir))
@@ -70,7 +73,8 @@ print("ASV length")
print(table(nchar(getSequences(seqtab))))

# remove chimeras
seqtab.nochim = removeBimeraDenovo(seqtab, method="consensus", multithread=nthreads, verbose=TRUE)
#seqtab.nochim = removeBimeraDenovo(seqtab, method="consensus", multithread=nthreads, verbose=TRUE)
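# method and minFoldParentOverAbundance are now taken from the command-line arguments (args[5]/args[6] above)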
seqtab.nochim = removeBimeraDenovo(seqtab, method=dada2_chimera_method, minFoldParentOverAbundance=dada2_chimera_min_fold_parent_over_abundance, multithread=nthreads, verbose=TRUE)
n_OTU_after_removing_chimeras = dim(seqtab.nochim)[2]
print(dim(seqtab.nochim))
asv.table = t(seqtab.nochim)
@@ -103,19 +107,19 @@ write.table(track, file = "summary_table.tsv", sep="\t")

save(seqtab, seqtab.nochim, r1_error, r2_error, track, file = "result.RData")

# prevalence sanity check
pdf('dada2_figures.pdf')
temp = prop.table(asv.table, 2)
print(temp)
hist(log10(temp), 100, main='abundance')
prev = rowMeans(asv.table!=0)
hist(prev, 50, main='prevalence')
mean.ab = rowMeans(log10(temp + 1e-05))
hist(mean.ab, 50, main='log.abundance')
print("Prevalence")
print(summary(prev))
print("Mean abundance")
print(summary(mean.ab))
plot(prev, mean.ab, main='prevalence vs abundance')
plot(nchar(rownames(asv.table)), prev, main='ASV length vs prevalence')
dev.off()
## prevalence sanity check
#pdf('dada2_figures.pdf')
#temp = prop.table(asv.table, 2)
#print(temp)
#hist(log10(temp), 100, main='abundance')
#prev = rowMeans(asv.table!=0)
#hist(prev, 50, main='prevalence')
#mean.ab = rowMeans(log10(temp + 1e-05))
#hist(mean.ab, 50, main='log.abundance')
#print("Prevalence")
#print(summary(prev))
#print("Mean abundance")
#print(summary(mean.ab))
#plot(prev, mean.ab, main='prevalence vs abundance')
#plot(nchar(rownames(asv.table)), prev, main='ASV length vs prevalence')
#dev.off()