FAANG/analysis-TAGADA

TAGADA: Transcript And Gene Assembly, Deconvolution, Analysis

TAGADA is a Nextflow pipeline that processes RNA-Seq data. It runs multiple tasks in parallel to check read quality, align reads to a reference genome, assemble new transcripts into a novel annotation, and quantify genes and transcripts.

Dependencies

To use this pipeline you will need Nextflow and a container engine supported by the -profile option, such as Docker or Singularity.

Usage

A small dataset is provided to test this pipeline. To try it out, use this command:

nextflow run FAANG/analysis-TAGADA -profile test,docker -revision 2.1.3 --output directory

Nextflow options

The pipeline is written in Nextflow, which provides the following default options:

| Option | Example | Description | Required |
|---|---|---|---|
| -profile | profile1,profile2,etc. | Profile(s) to use when running the pipeline. Specify the profiles that fit your infrastructure among singularity, docker, kubernetes, slurm. | Required |
| -config | custom.config | Configuration file tailored to your infrastructure and dataset. To find a configuration file for your infrastructure, browse nf-core configs. Some large datasets require more computing resources than the pipeline defaults. To specify custom resources for specific processes, see the custom resources section. | Optional |
| -revision | version | Version of the pipeline to launch. | Optional |
| -work-dir | directory | Work directory where all temporary files are written. | Optional |
| -resume | | Resume the pipeline from the last completed process. | Optional |

For more Nextflow options, see Nextflow's documentation.

Input and output options

| Option | Example | Description | Required |
|---|---|---|---|
| --output | directory | Output directory where all results are written. | Required |
| --reads | 'path/to/reads/*' | Input fastq file(s) and/or bam file(s). File naming rules are listed under the table. | Required |
| --annotation | annotation.gtf | Input reference annotation file or url. This file must contain both exon and transcript rows, and must include gene_id and transcript_id in the 9th field. | Required |
| --genome | genome.fa | Input genome sequence file or url. | Required |
| --index | directory | Input genome index directory or url. | Optional, to skip genome indexing |
| --metadata | metadata.tsv | Input tabulated metadata file or url. | Required if --assemble-by or --quantify-by is provided |

File naming rules for --reads:

For single-end reads, your files must end with:
.fq[.gz]

For paired-end reads, your files must end with:
_[R]{1,2}.fq[.gz]

TAGADA automatically infers the read length and strandedness of the libraries, but all libraries must share the same strandedness and the same read length.

For aligned reads, your files must end with:
.bam

If the provided path includes a wildcard character like *, you must enclose it in quotes to prevent Bash glob expansion, as per Nextflow's requirements.

If the files are numerous, you may provide a .txt sheet with one path or url per line.
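The --annotation requirements can be checked before launching a run. The following Python sketch is not part of TAGADA; it assumes a standard 9-column tab-separated GTF and assumes that gene_id and transcript_id are expected on exon and transcript rows only. The function name check_annotation and the example rows are hypothetical.

```python
import re

REQUIRED_FEATURES = {"exon", "transcript"}
REQUIRED_KEYS = ("gene_id", "transcript_id")


def check_annotation(lines):
    """Return a list of problems found in GTF lines; an empty list means none."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 tab-separated fields")
            continue
        feature, attributes = fields[2], fields[8]
        if feature not in REQUIRED_FEATURES:
            continue
        seen.add(feature)
        for key in REQUIRED_KEYS:
            # The key must start an attribute, so ref_gene_id does not count as gene_id.
            if not re.search(rf'(^|;)\s*{key} "[^"]+"', attributes):
                problems.append(f"line {number}: {feature} row lacks {key}")
    for feature in sorted(REQUIRED_FEATURES - seen):
        problems.append(f"no {feature} rows found")
    return problems


# Hypothetical rows, for illustration only.
example = [
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(check_annotation(example))      # []
print(check_annotation(example[:1]))  # ['no exon rows found']
```

To check a real file, pass an open file handle instead of the example list, since the function only iterates over lines.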

Merge options

| Option | Example | Description | Required |
|---|---|---|---|
| --assemble-by | factor1,factor2,etc. | Factor(s) defining groups in which transcripts are assembled. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details. | Optional |
| --quantify-by | factor1,factor2,etc. | Factor(s) defining groups in which transcripts are quantified. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details. | Optional |

Assembly options

| Option | Example | Description | Required |
|---|---|---|---|
| --min-transcript-occurrence | 2 | After transcripts assembly, rare novel transcripts that appear in few assembly groups are removed from the final novel annotation. By default, a transcript that occurs in fewer than 2 assembly groups is removed. If there is only one assembly group, this option defaults to 1. | Optional |
| --min-monoexonic-occurrence | 2 | If specified, rare novel monoexonic transcripts are filtered according to the provided threshold. Otherwise, this option takes the value of --min-transcript-occurrence. | Optional |
| --min-transcript-tpm | 0.1 | After transcripts assembly, novel transcripts with low TPM values in every assembly group are removed from the final novel annotation. By default, a transcript whose TPM value is lower than 0.1 in every assembly group is removed. | Optional |
| --min-monoexonic-tpm | 1 | If specified, novel monoexonic transcripts with low TPM values are filtered according to the provided threshold. Otherwise, this option takes the value of --min-transcript-tpm * 10. | Optional |
| --coalesce-transcripts-with | tmerge | Tool used to coalesce transcripts assemblies into a non-redundant set of transcripts for the novel annotation. Can be tmerge or stringtie. Defaults to tmerge. | Optional |
| --tmerge-args | '--endFuzz 10000' | Custom arguments to pass to tmerge when coalescing transcripts. | Optional |
| --feelnc-filter-args | '--size 200' | Custom arguments to pass to FEELnc's filter script when detecting long non-coding transcripts. | Optional |
| --feelnc-codpot-args | '--mode shuffle' | Custom arguments to pass to FEELnc's coding potential script when detecting long non-coding transcripts. | Optional |
| --feelnc-classifier-args | '--window 10000' | Custom arguments to pass to FEELnc's classifier script when detecting long non-coding transcripts. | Optional |
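The way the occurrence and TPM thresholds fall back on each other can be easier to follow in code. The Python sketch below is only an illustration of the defaults described above, not the pipeline's own implementation. It assumes that a transcript "occurs" in a group when that group's assembly contains it, and that a transcript must pass both filters to be kept. The function names and the tpm_by_group layout are hypothetical.

```python
def resolve_thresholds(n_groups, min_occurrence=None, min_mono_occurrence=None,
                       min_tpm=None, min_mono_tpm=None):
    """Fill in unset thresholds following the documented defaults."""
    if min_occurrence is None:
        # Defaults to 2, or to 1 when there is a single assembly group.
        min_occurrence = 1 if n_groups == 1 else 2
    if min_mono_occurrence is None:
        min_mono_occurrence = min_occurrence
    if min_tpm is None:
        min_tpm = 0.1
    if min_mono_tpm is None:
        min_mono_tpm = min_tpm * 10
    return {
        "multiexonic": (min_occurrence, min_tpm),
        "monoexonic": (min_mono_occurrence, min_mono_tpm),
    }


def keep_transcript(tpm_by_group, monoexonic, thresholds):
    """Decide whether a novel transcript passes both filters.

    tpm_by_group maps each assembly group in which the transcript was
    assembled to its TPM value in that group (hypothetical layout).
    """
    min_occ, min_tpm = thresholds["monoexonic" if monoexonic else "multiexonic"]
    occurs_often_enough = len(tpm_by_group) >= min_occ
    expressed_somewhere = any(tpm >= min_tpm for tpm in tpm_by_group.values())
    return occurs_often_enough and expressed_somewhere


thresholds = resolve_thresholds(n_groups=3)
print(thresholds)
# {'multiexonic': (2, 0.1), 'monoexonic': (2, 1.0)}
print(keep_transcript({"liver": 0.5, "muscle": 0.05}, False, thresholds))  # True
print(keep_transcript({"liver": 0.5, "muscle": 0.05}, True, thresholds))   # False
print(keep_transcript({"liver": 5.0}, False, thresholds))                  # False
```

The second call shows the effect of the stricter monoexonic default: the same TPM values pass the 0.1 threshold but not the 1.0 threshold.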

Skip options

| Option | Example | Description | Required |
|---|---|---|---|
| --skip-assembly | | Skip transcripts assembly with StringTie and skip all subsequent processes working with a novel annotation. | Optional |
| --skip-lnc-detection | | Skip detection of long non-coding transcripts in the novel annotation with FEELnc. | Optional |

Resources options

| Option | Example | Description | Required |
|---|---|---|---|
| --max-cpus | 16 | Maximum number of CPU cores that can be used for each process. This is a limit, not the actual number of requested CPU cores. | Optional |
| --max-memory | 64GB | Maximum memory that can be used for each process. This is a limit, not the actual amount of allotted memory. | Optional |
| --max-time | 24h | Maximum time that can be spent on each process. This is a limit and has no effect on the duration of each process. | Optional |
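The difference between a limit and a request can be shown with a small Python sketch. It assumes that the limits act as a simple cap on what a process asks for by default, which is a common convention in Nextflow pipelines but is not confirmed here for TAGADA. The function name, the resource keys, and the requested values are hypothetical.

```python
def cap_resources(requested, limits):
    """Clamp each requested resource to its limit; smaller requests pass through unchanged."""
    return {name: min(value, limits.get(name, value)) for name, value in requested.items()}


# Limits matching --max-cpus 16 --max-memory 64GB --max-time 24h.
limits = {"cpus": 16, "memory_gb": 64, "time_h": 24}

# Hypothetical request made by one process.
requested = {"cpus": 4, "memory_gb": 96, "time_h": 12}

print(cap_resources(requested, limits))
# {'cpus': 4, 'memory_gb': 64, 'time_h': 12}
```

Only the memory request is reduced. The process still uses 4 cores and at most 12 hours, because the limits never raise a request.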

Custom resources

With large datasets, some workflow processes may require more computing resources than the pipeline defaults. To customize the amount of resources allotted to specific processes, add a process scope to your configuration file. Resources provided in the configuration file override the resources options.

Example configuration

-config custom.config

custom.config

process {

  withName: TRIMGALORE_trim_adapters {
    cpus = 8
    memory = 18.GB
    time = 36.h
  }

  withName: STAR_align_reads {
    cpus = 16
    memory = 64.GB
    time = 2.d
  }

}

Metadata

Using --metadata, you may provide a file describing your inputs with tab-separated factors. The first column must contain file names without file type extensions or paired-end suffixes. There are no constraints on column names or number of columns.

Example metadata

--reads reads.txt --metadata metadata.tsv

reads.txt

path/to/A_R1.fq
path/to/A_R2.fq
path/to/B.fq.gz
path/to/C.bam
path/to/D.fq

metadata.tsv

input    tissue     stage
A        liver      30 days
B        liver      30 days
C        liver      60 days
D        muscle     60 days
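To see how file names relate to the first metadata column, the following Python sketch derives an input name from each path in reads.txt and compares the result with metadata.tsv. It is an illustration only. It reads the paired-end rule as an optional R followed by 1 or 2, which is an assumption about the notation, and the names describe, PAIRED, SINGLE and ALIGNED are hypothetical.

```python
import csv
import io
import re
from pathlib import PurePosixPath

# Assumed reading of the naming rules: _R1/_R2 or _1/_2 marks paired-end mates.
PAIRED = re.compile(r"^(?P<name>.+)_R?(?P<mate>[12])\.fq(\.gz)?$")
SINGLE = re.compile(r"^(?P<name>.+)\.fq(\.gz)?$")
ALIGNED = re.compile(r"^(?P<name>.+)\.bam$")


def describe(path):
    """Return (input name, kind) for one path, or raise ValueError."""
    filename = PurePosixPath(path).name
    # Paired-end is tested first because those names also end with .fq[.gz].
    for kind, pattern in (("paired-end", PAIRED), ("single-end", SINGLE), ("aligned", ALIGNED)):
        match = pattern.match(filename)
        if match:
            return match.group("name"), kind
    raise ValueError(f"unsupported file name: {filename}")


reads = ["path/to/A_R1.fq", "path/to/A_R2.fq", "path/to/B.fq.gz", "path/to/C.bam", "path/to/D.fq"]
metadata_tsv = "\n".join([
    "input\ttissue\tstage",
    "A\tliver\t30 days",
    "B\tliver\t30 days",
    "C\tliver\t60 days",
    "D\tmuscle\t60 days",
])

kinds = {}
for path in reads:
    name, kind = describe(path)
    kinds[name] = kind
print(kinds)
# {'A': 'paired-end', 'B': 'single-end', 'C': 'aligned', 'D': 'single-end'}

rows = list(csv.reader(io.StringIO(metadata_tsv), delimiter="\t"))
listed = {row[0] for row in rows[1:]}  # first column, header skipped
missing = sorted(set(kinds) - listed)
print(missing)  # []
```

With this reading, a single-end file whose name happens to end with _1.fq would be treated as a paired-end mate, so such names are best avoided.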

Merging inputs

When using --assemble-by and/or --quantify-by, your inputs are merged into experiment groups that share common factors. With --assemble-by, transcripts assembly is done individually for each assembly group, and consensus transcripts are kept to generate a novel annotation. With --quantify-by, quantification values are given individually for each quantification group.

Merging inputs by a single factor

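--assemble-by tissue --quantify-by stage

| input | tissue | stage | Transcripts assembly by tissue | Annotation | Quantification by stage |
|---|---|---|---|---|---|
| A | liver | 30 days | liver (A, B, C) | novel annotation (liver, muscle) | 30 days (A, B) |
| B | liver | 30 days | liver (A, B, C) | novel annotation (liver, muscle) | 30 days (A, B) |
| C | liver | 60 days | liver (A, B, C) | novel annotation (liver, muscle) | 60 days (C, D) |
| D | muscle | 60 days | muscle (D) | novel annotation (liver, muscle) | 60 days (C, D) |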

Merging inputs by an intersection of factors

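--assemble-by tissue,stage

| input | tissue | stage | Transcripts assembly by tissue and stage | Annotation | Quantification by input |
|---|---|---|---|---|---|
| A | liver | 30 days | liver at 30 days (A, B) | novel annotation (liver at 30 days, liver at 60 days, muscle at 60 days) | A |
| B | liver | 30 days | liver at 30 days (A, B) | novel annotation (liver at 30 days, liver at 60 days, muscle at 60 days) | B |
| C | liver | 60 days | liver at 60 days (C) | novel annotation (liver at 30 days, liver at 60 days, muscle at 60 days) | C |
| D | muscle | 60 days | muscle at 60 days (D) | novel annotation (liver at 30 days, liver at 60 days, muscle at 60 days) | D |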
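The grouping shown in both merging examples amounts to collecting inputs that share the same values for the chosen factors. The Python sketch below reproduces the groups from the example metadata. It is a simplified illustration of the grouping logic rather than the pipeline's code, and the function name merge_groups is hypothetical.

```python
from collections import defaultdict

# Rows of the example metadata.tsv.
metadata = [
    {"input": "A", "tissue": "liver", "stage": "30 days"},
    {"input": "B", "tissue": "liver", "stage": "30 days"},
    {"input": "C", "tissue": "liver", "stage": "60 days"},
    {"input": "D", "tissue": "muscle", "stage": "60 days"},
]


def merge_groups(rows, factors):
    """Group input names by the tuple of their values for the given factors."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[factor] for factor in factors)
        groups[key].append(row["input"])
    return dict(groups)


print(merge_groups(metadata, ["tissue"]))
# {('liver',): ['A', 'B', 'C'], ('muscle',): ['D']}
print(merge_groups(metadata, ["stage"]))
# {('30 days',): ['A', 'B'], ('60 days',): ['C', 'D']}
print(merge_groups(metadata, ["tissue", "stage"]))
# {('liver', '30 days'): ['A', 'B'], ('liver', '60 days'): ['C'], ('muscle', '60 days'): ['D']}
```

Passing several factors yields one group per distinct combination of values, which is why --assemble-by tissue,stage produces three assembly groups from four inputs.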

Workflow and results

The pipeline executes the following processes:

  1. FASTQC_control_reads
    Control reads quality with FastQC.

  2. TRIMGALORE_trim_adapters
    Trim adapters with Trim Galore.

  3. STAR_index_genome
    Index genome with STAR.
    The indexed genome is saved to output/index.

  4. STAR_align_reads
    Align reads to the indexed genome with STAR.
    Aligned reads are saved to output/alignment in .bam files.

  5. BEDTOOLS_compute_coverage
    Compute genome coverage with Bedtools.
    Coverage information is saved to output/coverage in .bed files.

  6. SAMTOOLS_merge_reads
    Merge aligned reads by factors with Samtools.
    See the merging inputs section for details.

  7. STRINGTIE_assemble_transcripts
    Assemble transcripts in each individual assembly group with StringTie.

  8. TAGADA_filter_transcripts
    Filter rare transcripts that appear in few assembly groups and poorly-expressed transcripts with low TPM values.

  9. STRINGTIE_coalesce_transcripts or TMERGE_coalesce_transcripts
    Create a novel annotation with StringTie or Tmerge.
    The novel annotation is saved to output/annotation in a .gtf file.

  10. FEELNC_classify_transcripts
    Detect long non-coding transcripts with FEELnc.
    The annotation saved to output/annotation is updated with the results.

  11. STRINGTIE_quantify_expression
    Quantify genes and transcripts with StringTie.
    Counts and TPM matrices are saved to output/quantification in .tsv files.

  12. MULTIQC_generate_report
    Aggregate quality controls into a report with MultiQC.
    The report is saved to output/control in a .html file.

Novel annotation

The novel annotation contains information from StringTie, Tmerge, and FEELnc. It is provided in gtf format with exon, transcript and gene rows. Row attributes vary depending on which tool was used to coalesce transcripts.


--coalesce-transcripts-with tmerge
  • gene_id
    All rows. The Tmerge gene_id starting with LOC.

  • ref_gene_id
    All rows. A comma-separated list of reference annotation gene_id when a Tmerge transcript is made of at least one reference transcript, otherwise a dot.

  • transcript_id
    Exon and transcript rows. The Tmerge transcript_id starting with TM, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation transcript_id is provided.

  • tmerge_tr_id
    Exon and transcript rows. Optional. A comma-separated list of Tmerge transcript_id if the current transcript_id is from the reference annotation, to list which initial Tmerge transcripts it is made of.

  • transcript_biotype
    Exon and transcript rows. Optional. The reference annotation transcript_biotype of the transcript_id.

  • feelnc_biotype
    Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.

  • contains, contains_count, 3p_dists_to_3p, 5p_dists_to_5p, flrpm, longest, longest_FL_supporters, longest_FL_supporters_count, mature_RNA_length, meta_3p_dists_to_5p, meta_5p_dists_to_5p, rpm, spliced
    Transcript rows. Attributes provided by Tmerge.


--coalesce-transcripts-with stringtie
  • gene_id
    All rows. The StringTie gene_id starting with MSTRG.

  • ref_gene_id
    All rows. Optional. The reference annotation gene_id.

  • ref_gene_name
    All rows. Optional. The reference annotation gene_name.

  • transcript_id
    Exon and transcript rows. The StringTie transcript_id starting with MSTRG, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation transcript_id is provided.

  • transcript_biotype
    Exon and transcript rows. Optional. The reference annotation transcript_biotype of the transcript_id.

  • feelnc_biotype
    Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.

  • exon_number
    Exon rows. The StringTie exon_number starting from 1 within a given transcript.
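The attributes listed above can be read back from the 9th field of the novel annotation. The Python sketch below tallies feelnc_biotype over transcript rows. It assumes quoted attribute values in the usual GTF style, so unquoted values are ignored, and it labels transcripts without the attribute as "unclassified", which is a name chosen here. The function names and the example rows are hypothetical.

```python
import re
from collections import Counter

# Matches attributes written as: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF field into a dict of attribute name to value."""
    return dict(ATTRIBUTE.findall(field))


def count_biotypes(lines):
    """Count transcript rows per feelnc_biotype value."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "transcript":
            attributes = parse_attributes(fields[8])
            counts[attributes.get("feelnc_biotype", "unclassified")] += 1
    return counts


# Hypothetical rows, for illustration only.
example = [
    'chr1\ttagada\ttranscript\t1\t500\t.\t+\t.\tgene_id "LOC1"; transcript_id "TM1"; feelnc_biotype "lncRNA";',
    'chr1\ttagada\texon\t1\t500\t.\t+\t.\tgene_id "LOC1"; transcript_id "TM1"; feelnc_biotype "lncRNA";',
    'chr1\ttagada\ttranscript\t900\t2000\t.\t-\t.\tgene_id "LOC2"; transcript_id "TM2"; feelnc_biotype "mRNA";',
    'chr1\ttagada\ttranscript\t3000\t4000\t.\t+\t.\tgene_id "LOC3"; transcript_id "TM3";',
]
print(count_biotypes(example))
```

Exon rows are skipped so that each transcript is counted once, whatever its number of exons.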

Funding

The GENE-SWitCH project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No 817998.

This repository reflects only the listed contributors' views. Neither the European Commission nor its Agency REA are responsible for any use that may be made of the information it contains.

Citing

If you use TAGADA in a publication, please cite this:

Kurylo C, Guyomar C, Foissac S, Djebali S. TAGADA: a scalable pipeline to improve genome annotations with RNA-seq data. NAR Genomics and Bioinformatics. 2023 Dec 1;5(4):lqad089.