The APAtrap R package does the following three steps:
- Refines annotated 3'UTR and identifies novel 3' UTRs and 3' UTR extensions
- Aims to identify all potential APA (alternative polyadenylation) sites
- Detects genes with differential APA site usage between conditions by leveraging the resolution of RNA-seq data
The paper is titled APAtrap: identification and quantification of
alternative polyadenylation sites from RNA-seq data
The application download is free to download
and user manual was used as a referrence
to create the nextflow pipeline flow of this module
Required files are to be specified in the input samplesheet_example_files.csv
. Each row in the sample sheet has four
columns:
- condition: name of the condition (e.g Control)
- sample: unique name of the sample for logs (e.g control_replicate1)
- bam: BAM input file for the sample
- bai: BAI index file for sample's bam input
It is important to name samples of the same condition with the exact condition name under the condition column in the sample sheet since samples are grouped per condition to be processed in the differential step. In addition, sample names should be unique.
To run APAtrap with test data provided for APAeval, check the path to APAtrap with pwd
and replace
the path_to
in samplesheet_example_files.csv with the path from the pwd
command.
When using your own data and input file instead of the provided test data and samplesheet_example_files.csv
, make sure to include in the
input file you are using the absolute path to the file, with the four column names following the column
names in samplesheet_example_files.csv
.
This workflow uses docker containers. To run with docker, make sure that docker is installed and running
(e.g. to ensure docker is running, run the command docker --help
and a help message should be printed).
Additionally, make sure that line 49 in Apatrap/nextflow.config file docker.enabled=true
is uncommented while line
51 singularity.enabled=true
is commented out.
To run with singularity, comment out line 49 in Apatrap/nextflow.config file docker.enabled=true
and make sure that line
51 singularity.enabled=true
is uncommented.
Parameters used to run APAtrap are specified in conf/modules.config file. Parameters relevant to the workflow itself are:
run_identification
- set to true to obtain identification challenge output. Specifying any other value will throw an error.run_quantification
- set to true to obtain quantification challenge output. Specifying any other value will throw an error.run_relative_usage_quantification
- set to true to obtain relative usage quantification challenge output. Specifying any other value will throw an error.run_differential
- set to true to obtain differential challenge output. Specifying any other value will throw an error.output_dir
- name of the folder that the final output files are going to be in, located under Apatrap/results/apatrap/identification_out_suffix
- suffix of the output file(s) for the current run ending with .bed when running identification, the prefix will be the different sample names obtained from the sample column in the sample sheetquantification_out_suffix
- suffix of the output file(s) for the current run ending with .bed when running quantification, the prefix will be the different sample names obtained from the sample column in the sample sheetrelative_usage_quantification_out_suffix
- suffix of the output file(s) for the current run ending with .bed when running relative usage quantification, the prefix will be the different sample names obtained from the sample column in the sample sheetdifferential_out
- name of the output file for the current run ending with .tsv when running differentialgenome_file
- absolute path from the APAtrap folder to the input GTF annotation file can be obtained by replacingpath_to
with the path to APAtrap, and if using your own genome file, make sure to use the absolute path to your genome file
- Set the 'run_differential' parameter in conf/modules.config to true
- Change 'differential_out' parameter in conf/modules.config to the desired file name that ends with '.tsv'
- Ensure the sample sheet contains exactly two distinct conditions in the condition column
- Set the 'run_identification' and/or 'run_quantification' and/or 'run_relative_usage_quantification' parameters in conf/modules.config to true to run identification and/or quantification and/or relative usage quantificationworkflows
- Change 'identification_out_suffix' and/or 'quantification_out_suffix' and/or 'relative_usage_quantification_out_suffix' parameters in conf/modules.config to the desired file suffix that ends with '.bed'
- In the sample sheet, at least one row is required to be provided
We have seen that APAtrap uses a lot of memory when running big datasets. As such, the processes use process_big_mem
to allocate a generous amount of memory and run time. Depending on your resources availability, this can be tuned in conf/base.config.
Because of the amount of memory and time needed for APAtrap to run, we have to split samples by chromosome so that the workflow could complete in a reasonable amount of time. Even then, each process required around 20G. When tuning workflow resources, this should be kept in mind in case the workflow stops from memory or run time related errors.
Once parameters have been set in conf/modules.config file, run the nextflow pipeline with
nextflow main.nf --input samplesheet_example_files.csv
.
Note that it's recommended to delete the existing results
folder to remove results from the previous run, if any.
This is to avoid polluting the next run with the results from the previous run when we are using the same sample name(s),
which would likely result in an error.
When using the default output_dir
parameter value in conf/modules.config, APAtrap run results in files of the outputs
of the challenges located under APAtrap/results/apatrap/challenges_outputs folder.
For identification, quantification, and relative usage quantification outputs, the files have sample names as prefixes to differentiate the different runs.
The differential output file will stay as the name specified in modules.config file.
- When running differential, as clarified with APAtrap author, Dr. Congting Ye, predictAPA step requires that the input files are placed
by group and in the same order of values specified by parameter -n. So for example, if we have two replicates for each condition, we'll then do
predictAPA -i condition1_replicate1.bedgraph condition1_replicate2.bedgraph condition2_replicate1.bedgraph condition2_replicate2.bedgraph -g 2 -n 2 2 -u hg19.utr.bed -o output.txt
. This means that we have to loop through the sample files by condition, where each condition has a folder in the results folder. As such, the results folder needs to be clean of files and folders from the previous run as to not pollute the predictAPA step of the next run. This is the reason we require the results folder to be deleted before running the workflow.
If you have any question or comment about APAtrap, please contact with Dr. Congting Ye ([email protected]).