Merge pull request #47 from Arcadia-Science/ter/change-input-files

Change the input files to only take one transcriptome assembly file instead of two
Arcadia-Science · Jun 5, 2024 · 1d425ed · 1d425ed
2 parents 5d603d3 + ec0bf2e
commit 1d425ed
Show file tree

Hide file tree

Showing 7 changed files with 172 additions and 238 deletions.
diff --git a/README.md b/README.md
@@ -34,11 +34,11 @@ snakemake --software-deployment-method conda -j 1 --configfile demo/config.yml
 
 ## Input data
 
-The [peptigate pipeline](./Snakefile) requires two pairs of input files (for a total of four input files):
-* A transcriptome assembly, split into "long" and "short" contigs. 
-    * Long transcripts/contigs: A transcriptome assembly FASTA file in nucleotide format containing transcripts or contigs longer than X nucleotides (typically, 300-500nt).
-    * Short transcripts/contigs: contigs that are shorter than X nucleotides (typically, 300-500nt). Some transcriptome assemblers discard short contigs and do not include them in the final assembly. However, some provide them as an intermediate output file. These contigs may contain sORFs and so are included as an input to the peptigate pipeline. If you do not have a file that contains very short contigs, provide a path to an empty file. Short contigs can also be provided as part of the previous file. If that is the case, provide a path to an empty file for this input file.
-* Open reading frames predicted from the transcriptome in both amino acid and nucleotide format. The open reading frames in both files should have the same names. Tools like [Transdecoder](https://github.com/TransDecoder/TransDecoder) provide these files in the correct format.
+The [peptigate pipeline](./Snakefile) requires three input files:
+* A transcriptome assembly as a FASTA file in nucleotide format containing transcripts or contigs.
+* Open reading frames predicted from the transcriptome in both amino acid and nucleotide format.
+  The open reading frames in both files should have the same names before the first period in the FASTA header name.
+  Tools like [TransDecoder](https://github.com/TransDecoder/TransDecoder) provide these files in the correct format.
     * Open reading frames in amino acid format: A FASTA file of predicted open reading frames in amino acid format.
     * Open reading frames in nucleotide format: A FASTA file of predicted open reading frames in nucleotide format.
 

diff --git a/Snakefile b/Snakefile
@@ -21,8 +21,7 @@ OUTPUT_DIR = Path(config["output_dir"])
 
 ORFS_AMINO_ACIDS = Path(config["orfs_amino_acids"])
 ORFS_NUCLEOTIDES = Path(config["orfs_nucleotides"])
-CONTIGS_SHORTER = Path(config["contigs_shorter_than_r2t_minimum_length"])
-CONTIGS_LONGER = Path(config["contigs_longer_than_r2t_minimum_length"])
+CONTIGS = Path(config["contigs"])
 PLMUTILS_MODEL_DIR = Path(config["plmutils_model_dir"])
 
 ################################################################################
@@ -49,32 +48,6 @@ Note that we follow the conventions of the prokka tool for output file suffixes
 ################################################################################
 
 
-rule combine_contigs:
-    """
-    By default we assume that files provided to this pipeline are reads2transcriptome outputs.
-    Reads2transcriptome outputs two files that contain contigs.
-    The first we refer to as contigs_shorter_than_r2t_minimum (or CONTIGS_SHORTER),
-    which are transcripts output by assemblers that did not meet the r2t runs minimum contig length.
-    The second we refer to as contigs_longer_than_r2t_minimum (or CONTIGS_LONGER) and are assembled
-    transcripts that passed the isoform clustering and decontamination steps of r2t.
-    We no longer need these pools of transcripts differentiated, so we combine them in this rule. 
-    If your input transcriptome only has one file of assembled contigs, sequences only need to be
-    supplied in contigs_longer_than_r2t_minimum_length.
-    contigs_shorter_than_r2t_minimum_length can be an empty file.
-    """
-    input:
-        contigs_shorter=CONTIGS_SHORTER,
-        contigs_longer=CONTIGS_LONGER,
-    output:
-        all_contigs=OUTPUT_DIR / "sORF" / "contigs" / "all_input_contigs.fna",
-    conda:
-        "envs/seqkit.yml"
-    shell:
-        """
-        cat {input.contigs_shorter} {input.contigs_longer} > {output.all_contigs}
-        """
-
-
 rule get_coding_contig_names:
     """
     Extract amino acid contig names and remove everything after the first period, 
@@ -95,24 +68,21 @@ rule get_coding_contig_names:
 
 rule filter_contigs_to_no_predicted_ORF:
     """
-    The peptigate pipeline takes as input transcripts (provided in two files) and predicted or
-    annotated coding genes (provided in nucleotide and amino acid formats).
-    This rule removes contigs with coding genes from the full transcript file.
-    It assumes that the contig names are the same between to the two files (everything before the
-    first period) and uses an inverted grep on the sequence names in the coding file to
-    eliminate transcripts that contain protein-coding genes.
-    It keeps all other transcripts, regardless of length, to investigate the presence of an sORF
-    later in the pipeline.
-
-    We expect that read2transcriptome will often be used to used to create input files for
-    peptigate, or that transdecoder will be used to predict which transcripts contain protein-coding
-    genes (the r2t pipeline also uses transdecoder to predict open reading frames (ORFs) from
-    transcripts). By default, only ORFs that are longer than 100 amino acids are kept by
-    transdecoder. The peptigate pipeline predicts peptides that are 100 amino acids or shorter.
-    This rule eliminates transcripts that contained a transdecoder-predicted ORF.
+    The peptigate pipeline takes as input transcripts and predicted or annotated coding genes
+    (provided in nucleotide and amino acid formats). This rule removes contigs with coding genes
+    from the transcriptome assembly file (contigs). It assumes that the contig names are the same in
+    the transcriptome assembly file and the annotated coding genes file (everything before the first
+    period) and uses an inverted grep on the sequence names in the coding file to eliminate
+    transcripts that contain protein-coding genes. It keeps all other transcripts, regardless of
+    length, to investigate the presence of an sORF later in the pipeline.
+
+    We expect that transdecoder will often be used to predict which transcripts contain
+    protein-coding genes (ORF prediction). By default, only ORFs that are longer than 100 amino
+    acids are kept by transdecoder. The peptigate pipeline predicts peptides that are 100 amino
+    acids or shorter. This rule eliminates transcripts that contained a transdecoder-predicted ORF.
     """
     input:
-        fna=rules.combine_contigs.output.all_contigs,
+        fna=CONTIGS,
         names=rules.get_coding_contig_names.output.names,
     output:
         fna=OUTPUT_DIR / "sORF" / "contigs" / "contigs_with_no_annotated_orf.fna",

diff --git a/config.yml b/config.yml
@@ -14,25 +14,22 @@
 input_dir: "path/to/inputdir/"
 output_dir: "path/to/outputdir"
 
-# All input files are produced by reads2transcriptome.
-# While this is not a strict requirement, using these output files gives us the ability to look at very short contiguous sequences.
+# Input files and directories
+# - contigs: transcriptome assembly contigs.
 # - orfs_amino_acids: predicted ORFs translated into amino acids.
-#   Output by transdecoder.
-#   Used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases.
-# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Output by transdecoder.
-#   Used to compare peptide nucleotide sequences (clustering, dn/ds estimation, etc.).
-# - contigs_shorter_than_r2t_minimum_length: contigs that are shorter than X nucleotides (by default 75bp).
-#   The reads2transcriptome pipeline assembles RNA-seq reads into contigs (transcripts) using multiple assemblers and then merges those assemblies together.
-#   Before merging, very short contigs are removed (<75bp).
-#   However, reads2transcriptome outputs a FASTA file containing these transcripts, which is used as input to peptigate here.
-#   These contigs may contain sORFs and so are included as an input to the peptigate pipeline.
-# - contigs_longer_than_r2t_minimum_length: contigs longer than X nucleotides (default is 75bp).
-#   If the user did not use the reads2transcriptome file to generate their input files, this would be a transcriptome assembly FASTA file in nucleotide format containing transcripts.
-#   Note that the first step of sORF prediction combines the contigs_shorter_than_r2t_minimum_length and contigs_longer_than_r2t_minimum_length files, so there is no need to perform any pre-processing by length.
-# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs are coding or non-coding.
+#   ORFs should be predicted from the same transcriptome assembly as the "contigs" input file.
+#   ORFs should have the same name (before the first period in the name) as the contigs in the
+#   "contigs" input file. TransDecoder provides files in the proper format.
+#   This file is used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases, and to
+#   remove coding transcripts from the transcriptome assembly before sORF prediction.
+# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Should contain the same ORFs as
+#   "orfs_amino_acids" but in nucleotide format. TransDecoder also provides this file in the proper
+#   format. If this file contains short ORFs (< 300 nucleotides), they will not be reported as sORFs
+#   as they are already annotated in the input.
+# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs
+#   are coding or non-coding.
 
+contigs: "path/to/contigs.fa"
 orfs_amino_acids: "path/to/orfs.faa"
 orfs_nucleotides: "path/to/orfs.fa"
-contigs_shorter_than_r2t_minimum_length: "path/to/short_contigs.fa"
-contigs_longer_than_r2t_minimum_length: "path/to/long_contigs.fa"
 plmutils_model_dir: "inputs/models/plmutils/"
diff --git a/demo/README.md b/demo/README.md
@@ -4,9 +4,9 @@ curl -JLo Amblyomma_americanum_transcriptome_assembly_data.tar.gz https://zenodo
 tar xf Amblyomma_americanum_transcriptome_assembly_data.tar.gz
 mv transcriptome_data/* .
 head -n 200 orthofuser_final_clean.fa.transdecoder.pep > orfs_amino_acids.faa
-head -n 200 orthofuser_final_clean.fa.dammit.fasta > all_contigs.fa
+head -n 200 orthofuser_final_clean.fa.dammit.fasta > contigs.fa
 head -n 204 orthofuser_final_clean.fa.transdecoder.cds > orfs_nucleotides.fa
 ```
 
-We pulled the "short contigs" file from an internal S3 bucket.
-It contains contigs that were filtered from the Amblyomma transcriptome prior to txome merging.
+We also pulled short contigs (less than 75 bp) from an internal S3 bucket and added these contigs to the `contigs.fa` file (50 contigs).
+These are contigs that were filtered from the *Amblyomma* transcriptome prior to transcriptome merging during the [transcriptome assembly pipeline](https://github.com/Arcadia-Science/2023-amblyomma-americanum-txome-assembly/).
diff --git a/demo/config.yml b/demo/config.yml
@@ -1,39 +1,6 @@
-################################################################################
-## Input file descriptions
-################################################################################
-
-# Config parameters in this file are used as defaults by the pipeline (Snakefile).
-# To override the defaults, create a copy of this file and pass your new file to snakemake using the --configfile flag.
-# Any parameters in the new config file will overwrite the defaults listed here.
-
-###########
-# File IO #
-###########
-
-# Specify the input and output directories
 input_dir: "inputs/"
 output_dir: "outputs/demo"
-
-# All input files are produced by reads2transcriptome.
-# While this is not a strict requirement, using these output files gives us the ability to look at very short contiguous sequences.
-# TER TODO: update names of output files based on Nextflow of reads2transcriptome.
-# - orfs_amino_acids: predicted ORFs translated into amino acids.
-#   Output by transdecoder.
-#   Used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases.
-# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Output by transdecoder.
-#   Used to compare peptide nucleotide sequences (clustering, dn/ds estimation, etc.).
-# - contigs_shorter_than_r2t_minimum_length: contigs that are shorter than X nucleotides (by default 75bp).
-#   The reads2transcriptome pipeline assembles RNA-seq reads into contigs (transcripts) using multiple assemblers and then merges those assemblies together.
-#   Before merging, very short contigs are removed (<75bp).
-#   However, reads2transcriptome outputs a FASTA file containing these transcripts, which is used as input to peptigate here.
-#   These contigs may contain sORFs and so are included as an input to the peptigate pipeline.
-# - contigs_longer_than_r2t_minimum_length: contigs longer than X nucleotides (default is 75bp).
-#   If the user did not use the reads2transcriptome file to generate their input files, this would be a transcriptome assembly FASTA file in nucleotide format containing transcripts.
-#   Note that the first step of sORF prediction combines the contigs_shorter_than_r2t_minimum_length and contigs_longer_than_r2t_minimum_length files, so there is no need to perform any pre-processing by length.
-# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs are coding or non-coding.
-
+contigs: "demo/contigs.fa"
 orfs_amino_acids: "demo/orfs_amino_acids.faa"
 orfs_nucleotides: "demo/orfs_nucleotides.fa"
-contigs_shorter_than_r2t_minimum_length: "demo/contigs_shorter_than_r2t_minimum_length.fa"
-contigs_longer_than_r2t_minimum_length: "demo/contigs_longer_than_r2t_minimum_length.fa"
 plmutils_model_dir: "inputs/models/plmutils/"