version 0.4
* added preprocessing of raw reads
* added `--preprocessed` flag to suppress preprocessing for already preprocessed datasets
* R1/R2 read lengths are now assessed and homogenised separately (i.e. they can have different lengths as supported by Figaro)
* initial read length distribution assessment is now performed by fastqc
* fastq filenames are normalised to R1/R2 naming scheme
* read length homogenisation is now performed by bbduk
* added mapseq profiling of ASV sequences
* added two parameters (`--dada2_chimera_method`, `--dada2_chimera_min_fold_parent_over_abundance`) for dada2 chimera removal
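
For illustration (the values shown are examples only, not documented defaults), the two new parameters can be passed on the command line as:

```nextflow run zellerlab/gaga2 <other parameters> --dada2_chimera_method consensus --dada2_chimera_min_fold_parent_over_abundance 2```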

Bugfixes/Workarounds
* fixed an R script error that would generate the wrong table format when only a single sample is present
* in `dada2_analysis`, commented out the vortex code (prevalence check), as it fails for certain samples


Co-authored-by: Christian Schudoma <[email protected]>
cschu authored Nov 19, 2021
1 parent 6291939 commit 872c3f3
Showing 23 changed files with 937 additions and 411 deletions.
56 changes: 41 additions & 15 deletions README.md
@@ -1,16 +1,29 @@
# gaga2 - automated 16S amplicon analysis with Figaro/DADA2

![](docs/img/gaga2_flow.png)

## Installation instructions
`TBD`
`gaga2` requires a working `nextflow` installation (v20.4+).

Other dependencies:
* bbmap
* fastqc
* figaro
* R v4+ with dada2, devtools, tidyverse, and cowplot installed

For convenience, `gaga2` comes with a Singularity container with all dependencies installed.

```singularity pull oras://ghcr.io/zellerlab/gaga2:latest```
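
One way to use the pulled image is to pass it to `nextflow` at runtime (a sketch: the image filename below assumes `singularity`'s default naming for the pulled file, so adjust the path to wherever the image was actually stored):

```nextflow run zellerlab/gaga2 -with-singularity ./gaga2_latest.sif <parameters>```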


## Usage instructions

### Input
`gaga2` takes as input Illumina paired-end 16S amplicon sequences (e.g. sequenced on a MiSeq).

#### Prerequisites
* Read files need to be named according to the typical pattern `<prefix=sample_id>_*[12].{fastq,fq,fastq.gz,fq.gz}`
and need to be arranged in a sample-based directory structure:
* Read files need to be named according to the typical pattern `<prefix=sample_id>_R?[12].{fastq,fq,fastq.gz,fq.gz}`.
They should, but don't have to, be arranged in a sample-based directory structure:

```
<project_dir> (aka "input_dir")
@@ -26,29 +39,42 @@ and need to be arranged in a sample-based directory structure:
|____ <sample_n_reverse_reads>
```

* If input reads have been "invasively" preprocessed (as evidenced by length differences between reads),
`gaga2` will generate an empty sentinel file `<output_dir>/SKIP_FIGARO`.
This will prevent `Figaro` from executing, and the workflow should automatically advance to `dada2` processing.
A flat directory structure (with all read files in the same directory) or a deeply-branched one (with read files scattered over multiple levels) should also work.

If `gaga2` preprocesses the reads, it will automatically use `_R1/2` endings internally.
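
For illustration, a minimal layout matching the naming pattern above (sample and file names are made up):

```
input_dir/
|____ sample_1/
|     |____ sample_1_R1.fastq.gz
|     |____ sample_1_R2.fastq.gz
|____ sample_2/
      |____ sample_2_R1.fastq.gz
      |____ sample_2_R2.fastq.gz
```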

* If input reads have already been preprocessed, you can set the `--preprocessed` flag. In this case, `gaga2` will do no preprocessing at all and will instruct `dada2` to perform no trimming. Otherwise, `gaga2` will assess the read lengths for uniformity. If read lengths differ within and between samples, preprocessing with `figaro` is not possible and `dada2` will be run without trimming.

* Samples with fewer than `110` reads after `dada2` preprocessing will be discarded.

### Running gaga2
The typical command for running `gaga2` is

`gaga2.nf --input_dir <input_dir> --output_dir <output_dir> --amplicon_length <int> --left_primer <int:length_left_primer> --right_primer <int:length_right_primer>`
`gaga2` can be run directly from GitHub.

`nextflow run zellerlab/gaga2 <parameters>`

To update to a newer version, first run

`nextflow pull zellerlab/gaga2`

In addition, you should obtain a copy of `run.config` from the `gaga2` GitHub repository and modify it according to your environment.

#### Mandatory arguments
* `input_dir` is the project directory mentioned above. **Recommended: absolute path**
* `output_dir` will be created automatically. **Recommended: absolute path**
* `amplicon_length`, `left_primer`, and `right_primer` are derived from the experiment parameters
* `--input_dir` is the project directory mentioned above.
* `--output_dir` will be created automatically.
* `--amplicon_length` is derived from your experiment parameters (this is the length of the amplicon, not the read length)
* `--single_end` is only required for single-end libraries (auto-detection of the library type is in progress)

#### Optional arguments
* `--min_overlap` of read pairs is `20bp` by default
* `-w, -work-dir` should be set to `<output_dir>/work`; otherwise the work directory is automatically placed in the current directory.
* `--primers <comma-separated-list-of-primer-sequences>`, or `--left_primer` and `--right_primer`: if primer sequences are provided via `--primers`, `gaga2` will use `bbduk` to remove the primers and any upstream sequences (such as adapters) based on the primer sequences. If non-zero primer lengths are provided instead (via `--left_primer` and `--right_primer`), `figaro` will take those into account when determining the best trim positions.
* `--preprocessed` prevents any further preprocessing by `gaga2`; this flag should only be used if the read data is reliably clean.
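
Putting this together, a sketch of a full invocation (paths and numeric values are placeholders to adapt to your own data and environment):

```
nextflow run zellerlab/gaga2 -c run.config -w /path/to/output_dir/work \
    --input_dir /path/to/input_dir \
    --output_dir /path/to/output_dir \
    --amplicon_length 450 \
    --left_primer 17 --right_primer 21 \
    --min_overlap 20
```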


### Internal beta-testing instructions
* `source /g/scb2/zeller/schudoma/software/wrappers/gaga2_wrapper` **before** submitting job to cluster
* please report issues/requests/feedback in the github issue tracker
* If you want to run `gaga2` on the cluster, `nextflow` alone requires `>4GB memory` just for 'managing'.
* The old `gaga2` version can be run with `source /g/scb2/zeller/schudoma/software/wrappers/gaga2_wrapper` **before** submitting the job to the cluster
* Please report issues/requests/feedback in the GitHub issue tracker
* If you want to run `gaga2` on the cluster, note that `nextflow` alone requires `>=5GB` of memory just for 'managing'.

95 changes: 0 additions & 95 deletions R_scripts/dada2.R

This file was deleted.

121 changes: 0 additions & 121 deletions R_scripts/dada2_analysis.R

This file was deleted.

38 changes: 21 additions & 17 deletions R_scripts/dada2_analysis_paired.R
@@ -19,6 +19,9 @@ if (length(args) >= 4) {
nthreads = TRUE
}

dada2_chimera_method = args[5]  # chimera removal method passed to removeBimeraDenovo
dada2_chimera_min_fold_parent_over_abundance = as.numeric(args[6])  # minFoldParentOverAbundance for chimera detection

# get the read files and sample names
list.files(input_dir)
sample_ids = basename(list.files(input_dir))
@@ -70,7 +73,8 @@ print("ASV length")
print(table(nchar(getSequences(seqtab))))

# remove chimeras
seqtab.nochim = removeBimeraDenovo(seqtab, method="consensus", multithread=nthreads, verbose=TRUE)
#seqtab.nochim = removeBimeraDenovo(seqtab, method="consensus", multithread=nthreads, verbose=TRUE)
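# method and minFoldParentOverAbundance are now taken from the command-line arguments (args[5]/args[6] above)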
seqtab.nochim = removeBimeraDenovo(seqtab, method=dada2_chimera_method, minFoldParentOverAbundance=dada2_chimera_min_fold_parent_over_abundance, multithread=nthreads, verbose=TRUE)
n_OTU_after_removing_chimeras = dim(seqtab.nochim)[2]
print(dim(seqtab.nochim))
asv.table = t(seqtab.nochim)
@@ -103,19 +107,19 @@ write.table(track, file = "summary_table.tsv", sep="\t")

save(seqtab, seqtab.nochim, r1_error, r2_error, track, file = "result.RData")

# prevalence sanity check
pdf('dada2_figures.pdf')
temp = prop.table(asv.table, 2)
print(temp)
hist(log10(temp), 100, main='abundance')
prev = rowMeans(asv.table!=0)
hist(prev, 50, main='prevalence')
mean.ab = rowMeans(log10(temp + 1e-05))
hist(mean.ab, 50, main='log.abundance')
print("Prevalence")
print(summary(prev))
print("Mean abundance")
print(summary(mean.ab))
plot(prev, mean.ab, main='prevalence vs abundance')
plot(nchar(rownames(asv.table)), prev, main='ASV length vs prevalence')
dev.off()
## prevalence sanity check
#pdf('dada2_figures.pdf')
#temp = prop.table(asv.table, 2)
#print(temp)
#hist(log10(temp), 100, main='abundance')
#prev = rowMeans(asv.table!=0)
#hist(prev, 50, main='prevalence')
#mean.ab = rowMeans(log10(temp + 1e-05))
#hist(mean.ab, 50, main='log.abundance')
#print("Prevalence")
#print(summary(prev))
#print("Mean abundance")
#print(summary(mean.ab))
#plot(prev, mean.ab, main='prevalence vs abundance')
#plot(nchar(rownames(asv.table)), prev, main='ASV length vs prevalence')
#dev.off()