TAGADA is a Nextflow pipeline that processes RNA-Seq data. It parallelizes multiple tasks to control read quality, align reads to a reference genome, assemble new transcripts into a novel annotation, and quantify genes and transcripts.
- Dependencies
- Usage
- Custom resources
- Metadata
- Merging inputs
- Workflow and results
- Novel annotation
- Funding
- Citing
To use this pipeline you will need:
- Nextflow >= 21.04.1
- Docker >= 19.03.2 or Singularity >= 3.7.3
A small dataset is provided to test this pipeline. To try it out, use this command:
nextflow run FAANG/analysis-TAGADA -profile test,docker -revision 2.1.3 --output directory
The pipeline is written in Nextflow, which provides the following default options:
Option | Example | Description | Required |
---|---|---|---|
`-profile` | `profile1,profile2,etc.` | Profile(s) to use when running the pipeline. Specify the profiles that fit your infrastructure among `singularity`, `docker`, `kubernetes`, `slurm`. | Required |
`-config` | `custom.config` | Configuration file tailored to your infrastructure and dataset. To find a configuration file for your infrastructure, browse nf-core configs. Some large datasets require more computing resources than the pipeline defaults. To specify custom resources for specific processes, see the custom resources section. | Optional |
`-revision` | `version` | Version of the pipeline to launch. | Optional |
`-work-dir` | `directory` | Work directory where all temporary files are written. | Optional |
`-resume` | | Resume the pipeline from the last completed process. | Optional |
For more Nextflow options, see Nextflow's documentation.
Option | Example | Description | Required |
---|---|---|---|
`--output` | `directory` | Output directory where all results are written. | Required |
`--reads` | `'path/to/reads/*'` | Input fastq file(s) and/or bam file(s). For single-end reads, your files must end with `.fq[.gz]`. For paired-end reads, your files must end with `_[R]{1,2}.fq[.gz]`. For aligned reads, your files must end with `.bam`. TAGADA automatically infers the read length and strandedness of the libraries, but all libraries must share the same strandedness and the same read length. If the provided path includes a wildcard character like `*`, you must enclose it in quotes to prevent Bash glob expansion, as per Nextflow's requirements. If the files are numerous, you may provide a `.txt` sheet with one path or url per line. | Required |
`--annotation` | `annotation.gtf` | Input reference annotation file or url. Be careful: this file should contain both exon and transcript rows, and should include `gene_id` and `transcript_id` in the 9th field. | Required |
`--genome` | `genome.fa` | Input genome sequence file or url. | Required |
`--index` | `directory` | Input genome index directory or url. | Optional, to skip genome indexing |
`--metadata` | `metadata.tsv` | Input tabulated metadata file or url. | Required if `--assemble-by` or `--quantify-by` is provided |
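When reads are numerous, the `.txt` sheet accepted by `--reads` can be generated rather than typed by hand. The following Python sketch lists the files of a directory whose names end with one of the documented endings and writes one path per line. The function name `write_reads_sheet` and the directory layout are assumptions made for illustration; they are not part of TAGADA.

```python
from pathlib import Path

# Endings accepted by --reads according to the option table above.
ACCEPTED_ENDINGS = (".fq", ".fq.gz", ".bam")


def write_reads_sheet(reads_dir, sheet_path):
    """Write one accepted read file path per line and return the listed paths."""
    paths = sorted(
        str(path)
        for path in Path(reads_dir).iterdir()
        if path.is_file() and path.name.endswith(ACCEPTED_ENDINGS)
    )
    # One path per line, as expected for a reads sheet.
    Path(sheet_path).write_text("".join(f"{path}\n" for path in paths))
    return paths
```

Calling `write_reads_sheet("path/to/reads", "reads.txt")` would produce a sheet that can then be passed with `--reads reads.txt`. The sketch does not check that paired-end files come in pairs.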
Option | Example | Description | Required |
---|---|---|---|
`--assemble-by` | `factor1,factor2,etc.` | Factor(s) defining groups in which transcripts are assembled. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details. | Optional |
`--quantify-by` | `factor1,factor2,etc.` | Factor(s) defining groups in which transcripts are quantified. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details. | Optional |
Option | Example | Description | Required |
---|---|---|---|
`--min-transcript-occurrence` | `2` | After transcripts assembly, rare novel transcripts that appear in few assembly groups are removed from the final novel annotation. By default, if a transcript occurs in fewer than 2 assembly groups, it is removed. If there is only one assembly group, this option defaults to `1`. | Optional |
`--min-monoexonic-occurrence` | `2` | If specified, rare novel monoexonic transcripts are filtered according to the provided threshold. Otherwise, this option takes the value of `--min-transcript-occurrence`. | Optional |
`--min-transcript-tpm` | `0.1` | After transcripts assembly, novel transcripts with low TPM values in every assembly group are removed from the final novel annotation. By default, if a transcript's TPM value is lower than 0.1 in every assembly group, it is removed. | Optional |
`--min-monoexonic-tpm` | `1` | If specified, novel monoexonic transcripts with low TPM values are filtered according to the provided threshold. Otherwise, this option takes 10 times the value of `--min-transcript-tpm`. | Optional |
`--coalesce-transcripts-with` | `tmerge` | Tool used to coalesce transcripts assemblies into a non-redundant set of transcripts for the novel annotation. Can be `tmerge` or `stringtie`. Defaults to `tmerge`. | Optional |
`--tmerge-args` | `'--endFuzz 10000'` | Custom arguments to pass to tmerge when coalescing transcripts. | Optional |
`--feelnc-filter-args` | `'--size 200'` | Custom arguments to pass to FEELnc's filter script when detecting long non-coding transcripts. | Optional |
`--feelnc-codpot-args` | `'--mode shuffle'` | Custom arguments to pass to FEELnc's coding potential script when detecting long non-coding transcripts. | Optional |
`--feelnc-classifier-args` | `'--window 10000'` | Custom arguments to pass to FEELnc's classifier script when detecting long non-coding transcripts. | Optional |
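The way the occurrence and TPM thresholds fall back on each other can be sketched in Python. This is a minimal illustration of the defaults described in the table, not TAGADA's actual filtering code: the function names, the dictionary keys, and the assumption that a transcript's occurrence equals the number of assembly groups reporting a TPM value for it are all hypothetical.

```python
def resolve_thresholds(n_groups, min_transcript_occurrence=None,
                       min_monoexonic_occurrence=None,
                       min_transcript_tpm=0.1, min_monoexonic_tpm=None):
    """Apply the documented defaults to the four filtering thresholds."""
    if min_transcript_occurrence is None:
        # Defaults to 2, or to 1 when there is a single assembly group.
        min_transcript_occurrence = 1 if n_groups == 1 else 2
    if min_monoexonic_occurrence is None:
        min_monoexonic_occurrence = min_transcript_occurrence
    if min_monoexonic_tpm is None:
        min_monoexonic_tpm = min_transcript_tpm * 10
    return {
        "transcript_occurrence": min_transcript_occurrence,
        "monoexonic_occurrence": min_monoexonic_occurrence,
        "transcript_tpm": min_transcript_tpm,
        "monoexonic_tpm": min_monoexonic_tpm,
    }


def keep_transcript(tpm_by_group, exon_count, thresholds):
    """Decide whether a novel transcript passes both filters.

    tpm_by_group maps each assembly group in which the transcript was
    assembled to its TPM value (assumed data layout).
    """
    kind = "monoexonic" if exon_count == 1 else "transcript"
    occurrence = len(tpm_by_group)
    best_tpm = max(tpm_by_group.values(), default=0.0)
    # Removed when too rare, or when TPM is below the threshold in every group.
    return (occurrence >= thresholds[f"{kind}_occurrence"]
            and best_tpm >= thresholds[f"{kind}_tpm"])
```

With three assembly groups and default options, a multi-exonic transcript found in two groups with TPM values 0.5 and 0.05 would be kept, while a monoexonic transcript with the same values would be removed because its TPM threshold defaults to 1.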
Option | Example | Description | Required |
---|---|---|---|
`--skip-assembly` | | Skip transcripts assembly with StringTie and skip all subsequent processes working with a novel annotation. | Optional |
`--skip-lnc-detection` | | Skip detection of long non-coding transcripts in the novel annotation with FEELnc. | Optional |
Option | Example | Description | Required |
---|---|---|---|
`--max-cpus` | `16` | Maximum number of CPU cores that can be used for each process. This is a limit, not the actual number of requested CPU cores. | Optional |
`--max-memory` | `64GB` | Maximum memory that can be used for each process. This is a limit, not the actual amount of allotted memory. | Optional |
`--max-time` | `24h` | Maximum time that can be spent on each process. This is a limit and has no effect on the duration of each process. | Optional |
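The difference between a limit and a request can be illustrated with a small Python sketch. It shows one plausible reading of these options, where each process request is lowered to the limit only when it exceeds it. How TAGADA applies the limits internally is not described here, so the function name and the example numbers below are assumptions.

```python
def cap_request(requested, maximum):
    """Return the requested amount, lowered to the maximum when it exceeds it."""
    return min(requested, maximum)


# Hypothetical per-process requests and the limits given on the command line.
requests = {"cpus": 8, "memory_gb": 96, "time_h": 36}
limits = {"cpus": 16, "memory_gb": 64, "time_h": 24}

capped = {key: cap_request(requests[key], limits[key]) for key in requests}
# The CPU request is below its limit and is left unchanged;
# the memory and time requests are lowered to their limits.
```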
With large datasets, some workflow processes may require more computing resources than the pipeline defaults. To customize the amount of resources allotted to specific processes, add a process scope to your configuration file. Resources provided in the configuration file override the resources options.
-config custom.config
custom.config
process {
    withName: TRIMGALORE_trim_adapters {
        cpus = 8
        memory = 18.GB
        time = 36.h
    }
    withName: STAR_align_reads {
        cpus = 16
        memory = 64.GB
        time = 2.d
    }
}
Using `--metadata`, you may provide a file describing your inputs with tab-separated factors. The first column must contain file names without file type extensions or paired-end suffixes. There are no constraints on column names or number of columns.
--reads reads.txt --metadata metadata.tsv
reads.txt
path/to/A_R1.fq
path/to/A_R2.fq
path/to/B.fq.gz
path/to/C.bam
path/to/D.fq
metadata.tsv
input tissue stage
A liver 30 days
B liver 30 days
C liver 60 days
D muscle 60 days
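Before launching the pipeline, it can help to check that every read file has a row in the metadata file. The Python sketch below derives the expected first-column value from each path and reports the names with no metadata row. The regular expression encodes one reading of the documented endings (an optional `_R1`, `_R2`, `_1` or `_2` suffix before `.fq` or `.fq.gz`, or a `.bam` ending), and all function names are hypothetical.

```python
import csv
import io
import re

# Assumed reading of the documented endings; not taken from TAGADA's code.
ENDING = re.compile(r"(?:(?:_R?[12])?\.fq(?:\.gz)?|\.bam)$")


def input_name(path):
    """Strip the directory, the file type extension and any paired-end suffix."""
    filename = path.rsplit("/", 1)[-1]
    return ENDING.sub("", filename)


def load_metadata(handle):
    """Read a tab-separated metadata file, keyed by its first column."""
    reader = csv.DictReader(handle, delimiter="\t")
    first_column = reader.fieldnames[0]
    return {row[first_column]: row for row in reader}


def missing_inputs(read_paths, metadata):
    """Return the input names that have no row in the metadata."""
    names = {input_name(path) for path in read_paths}
    return sorted(names - set(metadata))
```

With the example files above, `input_name("path/to/A_R1.fq")` gives `A` and `missing_inputs` returns an empty list. Note that under this reading a single-end file named like `sample_1.fq` would also lose its `_1` suffix.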
When using `--assemble-by` and/or `--quantify-by`, your inputs are merged into experiment groups that share common factors. With `--assemble-by`, transcripts assembly is done individually for each assembly group, and consensus transcripts are kept to generate a novel annotation. With `--quantify-by`, quantification values are given individually for each quantification group.
--assemble-by tissue --quantify-by stage
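Metadata: input | Metadata: tissue | Metadata: stage | Transcripts assembly by tissue | Annotation | Quantification by stage |
---|---|---|---|---|---|
A | liver | 30 days | A, B, C → liver | liver, muscle → novel annotation | A, B → 30 days |
B | liver | 30 days | A, B, C → liver | liver, muscle → novel annotation | A, B → 30 days |
C | liver | 60 days | A, B, C → liver | liver, muscle → novel annotation | C, D → 60 days |
D | muscle | 60 days | D → muscle | liver, muscle → novel annotation | C, D → 60 days |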
--assemble-by tissue,stage
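Metadata: input | Metadata: tissue | Metadata: stage | Transcripts assembly by tissue and stage | Annotation | Quantification by input |
---|---|---|---|---|---|
A | liver | 30 days | A, B → liver at 30 days | liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation | A |
B | liver | 30 days | A, B → liver at 30 days | liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation | B |
C | liver | 60 days | C → liver at 60 days | liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation | C |
D | muscle | 60 days | D → muscle at 60 days | liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation | D |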
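The grouping shown in both examples amounts to collecting inputs that share the same values for the chosen factors. The Python sketch below reproduces it on the example metadata. It only illustrates the grouping rule; the function name `merge_groups` is hypothetical, and the actual merging of aligned reads is done by the pipeline.

```python
from collections import defaultdict

# The example metadata from the metadata section.
METADATA = [
    {"input": "A", "tissue": "liver", "stage": "30 days"},
    {"input": "B", "tissue": "liver", "stage": "30 days"},
    {"input": "C", "tissue": "liver", "stage": "60 days"},
    {"input": "D", "tissue": "muscle", "stage": "60 days"},
]


def merge_groups(rows, factors):
    """Group input names by the values they share for the given factors."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[factor] for factor in factors)
        groups[key].append(row["input"])
    return dict(groups)


by_tissue = merge_groups(METADATA, ["tissue"])
by_stage = merge_groups(METADATA, ["stage"])
by_tissue_and_stage = merge_groups(METADATA, ["tissue", "stage"])
```

Here `by_tissue` holds two groups (A, B, C for liver and D for muscle), `by_stage` holds two groups (A, B and C, D), and `by_tissue_and_stage` holds three groups, matching the tables above.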
The pipeline executes the following processes:
- `FASTQC_control_reads`: Control reads quality with FastQC.
- `TRIMGALORE_trim_adapters`: Trim adapters with Trim Galore.
- `STAR_index_genome`: Index genome with STAR. The indexed genome is saved to `output/index`.
- `STAR_align_reads`: Align reads to the indexed genome with STAR. Aligned reads are saved to `output/alignment` in `.bam` files.
- `BEDTOOLS_compute_coverage`: Compute genome coverage with Bedtools. Coverage information is saved to `output/coverage` in `.bed` files.
- `SAMTOOLS_merge_reads`: Merge aligned reads by factors with Samtools. See the merging inputs section for details.
- `STRINGTIE_assemble_transcripts`: Assemble transcripts in each individual assembly group with StringTie.
- `TAGADA_filter_transcripts`: Filter rare transcripts that appear in few assembly groups and poorly-expressed transcripts with low TPM values.
- `STRINGTIE_coalesce_transcripts` or `TMERGE_coalesce_transcripts`: Create a novel annotation with StringTie or Tmerge. The novel annotation is saved to `output/annotation` in a `.gtf` file.
- `FEELNC_classify_transcripts`: Detect long non-coding transcripts with FEELnc. The annotation saved to `output/annotation` is updated with the results.
- `STRINGTIE_quantify_expression`: Quantify genes and transcripts with StringTie. Counts and TPM matrices are saved to `output/quantification` in `.tsv` files.
- `MULTIQC_generate_report`: Aggregate quality controls into a report with MultiQC. The report is saved to `output/control` in a `.html` file.
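After a run, the result subdirectories named in the process list can be inspected with a short Python sketch such as the one below. The subdirectory names come from the list above; the function name `list_results` is hypothetical, and the names of the files inside each subdirectory are not assumed.

```python
from pathlib import Path

# Result subdirectories named in the process list above.
RESULT_SUBDIRECTORIES = (
    "index", "alignment", "coverage",
    "annotation", "quantification", "control",
)


def list_results(output_dir):
    """Map each documented subdirectory to its sorted file names, or None if absent."""
    results = {}
    for name in RESULT_SUBDIRECTORIES:
        directory = Path(output_dir) / name
        if directory.is_dir():
            results[name] = sorted(entry.name for entry in directory.iterdir())
        else:
            # The subdirectory may be absent, for instance when a step did not run.
            results[name] = None
    return results
```

Calling `list_results("directory")`, where `directory` is the value given to `--output`, returns a dictionary that shows at a glance which results were written.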
The novel annotation contains information from StringTie, Tmerge, and FEELnc. It is provided in gtf format with exon, transcript and gene rows. Row attributes vary depending on which tool was used to coalesce transcripts.
--coalesce-transcripts-with tmerge
- `gene_id`: All rows. The Tmerge `gene_id`, starting with LOC.
- `ref_gene_id`: All rows. A comma-separated list of reference annotation `gene_id` when a Tmerge transcript is made of at least one reference transcript, otherwise a dot.
- `transcript_id`: Exon and transcript rows. The Tmerge `transcript_id`, starting with TM, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation `transcript_id` is provided.
- `tmerge_tr_id`: Exon and transcript rows. Optional. A comma-separated list of Tmerge `transcript_id` if the current `transcript_id` is from the reference annotation, to list which initial Tmerge transcripts it is made of.
- `transcript_biotype`: Exon and transcript rows. Optional. The reference annotation `transcript_biotype` of the `transcript_id`.
- `feelnc_biotype`: Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
- `contains`, `contains_count`, `3p_dists_to_3p`, `5p_dists_to_5p`, `flrpm`, `longest`, `longest_FL_supporters`, `longest_FL_supporters_count`, `mature_RNA_length`, `meta_3p_dists_to_5p`, `meta_5p_dists_to_5p`, `rpm`, `spliced`: Transcript rows. Attributes provided by Tmerge.
--coalesce-transcripts-with stringtie
- `gene_id`: All rows. The StringTie `gene_id`, starting with MSTRG.
- `ref_gene_id`: All rows. Optional. The reference annotation `gene_id`.
- `ref_gene_name`: All rows. Optional. The reference annotation `gene_name`.
- `transcript_id`: Exon and transcript rows. The StringTie `transcript_id`, starting with MSTRG, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation `transcript_id` is provided.
- `transcript_biotype`: Exon and transcript rows. Optional. The reference annotation `transcript_biotype` of the `transcript_id`.
- `feelnc_biotype`: Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
- `exon_number`: Exon rows. The StringTie `exon_number`, starting from 1 within a given transcript.
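The attributes listed above live in the 9th field of each gtf row and can be read with a few lines of Python. The sketch below parses that field into a dictionary and counts transcript rows per `feelnc_biotype`. It assumes every attribute is written as `key "value";`. The example rows are made up for illustration: their coordinates, source column and identifiers are not real TAGADA output.

```python
import re
from collections import Counter

# Assumes every attribute is written as: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the 9th gtf field into a dictionary of attribute names and values."""
    return dict(ATTRIBUTE.findall(field))


def count_feelnc_biotypes(gtf_lines):
    """Count transcript rows per feelnc_biotype, skipping unclassified transcripts."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if columns[2] != "transcript":
            continue
        attributes = parse_attributes(columns[8])
        if "feelnc_biotype" in attributes:
            counts[attributes["feelnc_biotype"]] += 1
    return counts


# Made-up rows for illustration only.
EXAMPLE_ROWS = [
    'chr1\texample\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "LOC_1"; ref_gene_id "."; transcript_id "TM_1"; feelnc_biotype "lncRNA";',
    'chr1\texample\texon\t100\t900\t.\t+\t.\t'
    'gene_id "LOC_1"; ref_gene_id "."; transcript_id "TM_1"; feelnc_biotype "lncRNA";',
    'chr1\texample\ttranscript\t2000\t5000\t.\t-\t.\t'
    'gene_id "LOC_2"; ref_gene_id "G2"; transcript_id "T2"; transcript_biotype "protein_coding";',
]
```

On these rows, `count_feelnc_biotypes(EXAMPLE_ROWS)` counts a single lncRNA transcript: the exon row is ignored and the second transcript carries no `feelnc_biotype`.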
The GENE-SWitCH project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No 817998.
This repository reflects only the listed contributors' views. Neither the European Commission nor its Agency REA is responsible for any use that may be made of the information it contains.
If you use TAGADA in a publication, please cite this:
Kurylo C, Guyomar C, Foissac S, Djebali S. TAGADA: a scalable pipeline to improve genome annotations with RNA-seq data. NAR Genomics and Bioinformatics. 2023 Dec 1;5(4):lqad089.