Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the input files to only take one transcriptome assembly file instead of two #47

Merged
merged 5 commits into from
Jun 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,11 @@ snakemake --software-deployment-method conda -j 1 --configfile demo/config.yml

## Input data

The [peptigate pipeline](./Snakefile) requires two pairs of input files (for a total of four input files):
* A transcriptome assembly, split into "long" and "short" contigs.
* Long transcripts/contigs: A transcriptome assembly FASTA file in nucleotide format containing transcripts or contigs longer than X nucleotides (typically, 300-500nt).
* Short transcripts/contigs: contigs that are shorter than X nucleotides (typically, 300-500nt). Some transcriptome assemblers discard short contigs and do not include them in the final assembly. However, some provide them as an intermediate output file. These contigs may contain sORFs and so are included as an input to the peptigate pipeline. If you do not have a file that contains very short contigs, provide a path to an empty file. Short contigs can also be provided as part of the previous file. If that is the case, provide a path to an empty file for this input file.
* Open reading frames predicted from the transcriptome in both amino acid and nucleotide format. The open reading frames in both files should have the same names. Tools like [Transdecoder](https://github.com/TransDecoder/TransDecoder) provide these files in the correct format.
The [peptigate pipeline](./Snakefile) requires three input files:
* A transcriptome assembly as a FASTA file in nucleotide format containing transcripts or contigs.
* Open reading frames predicted from the transcriptome in both amino acid and nucleotide format.
The open reading frames in both files should have the same names before the first period in the FASTA header name.
Tools like [TransDecoder](https://github.com/TransDecoder/TransDecoder) provide these files in the correct format.
* Open reading frames in amino acid format: A FASTA file of predicted open reading frames in amino acid format.
* Open reading frames in nucleotide format: A FASTA file of predicted open reading frames in nucleotide format.

Expand Down
58 changes: 14 additions & 44 deletions Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,7 @@ OUTPUT_DIR = Path(config["output_dir"])

ORFS_AMINO_ACIDS = Path(config["orfs_amino_acids"])
ORFS_NUCLEOTIDES = Path(config["orfs_nucleotides"])
CONTIGS_SHORTER = Path(config["contigs_shorter_than_r2t_minimum_length"])
CONTIGS_LONGER = Path(config["contigs_longer_than_r2t_minimum_length"])
CONTIGS = Path(config["contigs"])
PLMUTILS_MODEL_DIR = Path(config["plmutils_model_dir"])

################################################################################
Expand All @@ -49,32 +48,6 @@ Note that we follow the conventions of the prokka tool for output file suffixes
################################################################################


rule combine_contigs:
"""
By default we assume that files provided to this pipeline are reads2transcriptome outputs.
Reads2transcriptome outputs two files that contain contigs.
The first we refer to as contigs_shorter_than_r2t_minimum (or CONTIGS_SHORTER),
which are transcripts output by assemblers that did not meet the r2t runs minimum contig length.
The second we refer to as contigs_longer_than_r2t_minimum (or CONTIGS_LONGER) and are assembled
transcripts that passed the isoform clustering and decontamination steps of r2t.
We no longer need these pools of transcripts differentiated, so we combine them in this rule.
If your input transcriptome only has one file of assembled contigs, sequences only need to be
supplied in contigs_longer_than_r2t_minimum_length.
contigs_shorter_than_r2t_minimum_length can be an empty file.
"""
input:
contigs_shorter=CONTIGS_SHORTER,
contigs_longer=CONTIGS_LONGER,
output:
all_contigs=OUTPUT_DIR / "sORF" / "contigs" / "all_input_contigs.fna",
conda:
"envs/seqkit.yml"
shell:
"""
cat {input.contigs_shorter} {input.contigs_longer} > {output.all_contigs}
"""


rule get_coding_contig_names:
"""
Extract amino acid contig names and remove everything after the first period,
Expand All @@ -95,24 +68,21 @@ rule get_coding_contig_names:

rule filter_contigs_to_no_predicted_ORF:
"""
The peptigate pipeline takes as input transcripts (provided in two files) and predicted or
annotated coding genes (provided in nucleotide and amino acid formats).
This rule removes contigs with coding genes from the full transcript file.
It assumes that the contig names are the same between to the two files (everything before the
first period) and uses an inverted grep on the sequence names in the coding file to
eliminate transcripts that contain protein-coding genes.
It keeps all other transcripts, regardless of length, to investigate the presence of an sORF
later in the pipeline.

We expect that read2transcriptome will often be used to used to create input files for
peptigate, or that transdecoder will be used to predict which transcripts contain protein-coding
genes (the r2t pipeline also uses transdecoder to predict open reading frames (ORFs) from
transcripts). By default, only ORFs that are longer than 100 amino acids are kept by
transdecoder. The peptigate pipeline predicts peptides that are 100 amino acids or shorter.
This rule eliminates transcripts that contained a transdecoder-predicted ORF.
The peptigate pipeline takes as input transcripts and predicted or annotated coding genes
(provided in nucleotide and amino acid formats). This rule removes contigs with coding genes
from the transcriptome assembly file (contigs). It assumes that the contig names are the same in
the transcriptome assembly file and the annotated coding genes file (everything before the first
period) and uses an inverted grep on the sequence names in the coding file to eliminate
transcripts that contain protein-coding genes. It keeps all other transcripts, regardless of
length, to investigate the presence of an sORF later in the pipeline.

We expect that transdecoder will often be used to predict which transcripts contain
protein-coding genes (ORF prediction). By default, only ORFs that are longer than 100 amino
acids are kept by transdecoder. The peptigate pipeline predicts peptides that are 100 amino
acids or shorter. This rule eliminates transcripts that contained a transdecoder-predicted ORF.
"""
input:
fna=rules.combine_contigs.output.all_contigs,
fna=CONTIGS,
names=rules.get_coding_contig_names.output.names,
output:
fna=OUTPUT_DIR / "sORF" / "contigs" / "contigs_with_no_annotated_orf.fna",
Expand Down
31 changes: 14 additions & 17 deletions config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,25 +14,22 @@
input_dir: "path/to/inputdir/"
output_dir: "path/to/outputdir"

# All input files are produced by reads2transcriptome.
# While this is not a strict requirement, using these output files gives us the ability to look at very short contiguous sequences.
# Input files and directories
# - contigs: transcriptome assembly contigs.
# - orfs_amino_acids: predicted ORFs translated into amino acids.
# Output by transdecoder.
# Used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases.
# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Output by transdecoder.
# Used to compare peptide nucleotide sequences (clustering, dn/ds estimation, etc.).
# - contigs_shorter_than_r2t_minimum_length: contigs that are shorter than X nucleotides (by default 75bp).
# The reads2transcriptome pipeline assembles RNA-seq reads into contigs (transcripts) using multiple assemblers and then merges those assemblies together.
# Before merging, very short contigs are removed (<75bp).
# However, reads2transcriptome outputs a FASTA file containing these transcripts, which is used as input to peptigate here.
# These contigs may contain sORFs and so are included as an input to the peptigate pipeline.
# - contigs_longer_than_r2t_minimum_length: contigs longer than X nucleotides (default is 75bp).
# If the user did not use the reads2transcriptome file to generate their input files, this would be a transcriptome assembly FASTA file in nucleotide format containing transcripts.
# Note that the first step of sORF prediction combines the contigs_shorter_than_r2t_minimum_length and contigs_longer_than_r2t_minimum_length files, so there is no need to perform any pre-processing by length.
# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs are coding or non-coding.
# ORFs should be predicted from the same transcriptome assembly as the "contigs" input file.
# ORFs should have the same name (before the first period in the name) as the contigs in the
# "contigs" input file. TransDecoder provides files in the proper format.
# This file is used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases, and to
# remove coding transcripts from the transcriptome assembly before sORF prediction.
# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Should contain the same ORFs as
# "orfs_amino_acids" but in nucleotide format. TransDecoder also provides this file in the proper
# format. If this file contains short ORFs (< 300 nucleotides), they will not be reported as sORFs
# as they are already annotated in the input.
# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs
# are coding or non-coding.

contigs: "path/to/contigs.fa"
orfs_amino_acids: "path/to/orfs.faa"
orfs_nucleotides: "path/to/orfs.fa"
contigs_shorter_than_r2t_minimum_length: "path/to/short_contigs.fa"
contigs_longer_than_r2t_minimum_length: "path/to/long_contigs.fa"
plmutils_model_dir: "inputs/models/plmutils/"
6 changes: 3 additions & 3 deletions demo/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,9 @@ curl -JLo Amblyomma_americanum_transcriptome_assembly_data.tar.gz https://zenodo
tar xf Amblyomma_americanum_transcriptome_assembly_data.tar.gz
mv transcriptome_data/* .
head -n 200 orthofuser_final_clean.fa.transdecoder.pep > orfs_amino_acids.faa
head -n 200 orthofuser_final_clean.fa.dammit.fasta > all_contigs.fa
head -n 200 orthofuser_final_clean.fa.dammit.fasta > contigs.fa
head -n 204 orthofuser_final_clean.fa.transdecoder.cds > orfs_nucleotides.fa
```

We pulled the "short contigs" file from an internal S3 bucket.
It contains contigs that were filtered from the Amblyomma transcriptome prior to txome merging.
We also pulled short contigs (less than 75 bp) from an internal S3 bucket and added these contigs to the `contigs.fa` file (50 contigs).
These are contigs that were filtered from the *Amblyomma* transcriptome prior to transcriptome merging during the [transcriptome assembly pipeline](https://github.com/Arcadia-Science/2023-amblyomma-americanum-txome-assembly/).
35 changes: 1 addition & 34 deletions demo/config.yml
Original file line number Diff line number Diff line change
@@ -1,39 +1,6 @@
################################################################################
## Input file descriptions
################################################################################

# Config parameters in this file are used as defaults by the pipeline (Snakefile).
# To override the defaults, create a copy of this file and pass your new file to snakemake using the --configfile flag.
# Any parameters in the new config file will overwrite the defaults listed here.

###########
# File IO #
###########

# Specify the input and output directories
input_dir: "inputs/"
output_dir: "outputs/demo"

# All input files are produced by reads2transcriptome.
# While this is not a strict requirement, using these output files gives us the ability to look at very short contiguous sequences.
# TER TODO: update names of output files based on Nextflow of reads2transcriptome.
# - orfs_amino_acids: predicted ORFs translated into amino acids.
# Output by transdecoder.
# Used for cleavage peptide prediction and annotation of nonribosomal peptide synthetases.
# - orfs_nucleotides: predicted ORFs as nucleotide sequences. Output by transdecoder.
# Used to compare peptide nucleotide sequences (clustering, dn/ds estimation, etc.).
# - contigs_shorter_than_r2t_minimum_length: contigs that are shorter than X nucleotides (by default 75bp).
# The reads2transcriptome pipeline assembles RNA-seq reads into contigs (transcripts) using multiple assemblers and then merges those assemblies together.
# Before merging, very short contigs are removed (<75bp).
# However, reads2transcriptome outputs a FASTA file containing these transcripts, which is used as input to peptigate here.
# These contigs may contain sORFs and so are included as an input to the peptigate pipeline.
# - contigs_longer_than_r2t_minimum_length: contigs longer than X nucleotides (default is 75bp).
# If the user did not use the reads2transcriptome file to generate their input files, this would be a transcriptome assembly FASTA file in nucleotide format containing transcripts.
# Note that the first step of sORF prediction combines the contigs_shorter_than_r2t_minimum_length and contigs_longer_than_r2t_minimum_length files, so there is no need to perform any pre-processing by length.
# - plmutils_model_dir: path to the directory for the plmutils model that will predict whether sORFs are coding or non-coding.

contigs: "demo/contigs.fa"
orfs_amino_acids: "demo/orfs_amino_acids.faa"
orfs_nucleotides: "demo/orfs_nucleotides.fa"
contigs_shorter_than_r2t_minimum_length: "demo/contigs_shorter_than_r2t_minimum_length.fa"
contigs_longer_than_r2t_minimum_length: "demo/contigs_longer_than_r2t_minimum_length.fa"
plmutils_model_dir: "inputs/models/plmutils/"
Loading
Loading