tRFtarget-pipeline is designed to find RNA-RNA interaction sites between transfer RNA-derived fragments (tRFs) and target RNAs. It utilizes RNAhybrid and IntaRNA to provide binding sites predicted via two different mechanisms, evaluate the consensus of these two predictions, and output fully-featured binding sites in a unified structured format together with enhanced ASCII RNA-RNA interaction illustrations (shown in the example below). It can also be used to find target genes for other small RNAs such as miRNAs.
All binding sites in the tRFtarget database (http://trftarget.net/) were predicted using tRFtarget-pipeline v0.3.0.
- 2.1 Command
- 2.2 Input
- 2.3 Output
- 2.4 Binding sites in CSV files
- 2.5 All Options
- 3.1 Enclosed Package version (after version 0.3.0)
- 3.2 Calling RNAhybrid & post-processing
- 3.3 Calling IntaRNA & post-processing
An online service for tRFtarget-pipeline is available at http://trftarget.net/online_targets.
For local usage, we provide Docker and Singularity images for immediate use. NOTE: The tRFtarget-pipeline image currently supports Linux operating systems (such as Ubuntu) and macOS on Intel CPUs. Apple Silicon Macs (M1, M2, M3, etc.) are not supported.
# For Docker
docker pull az7jh2/trftarget:0.3.2
# For Singularity
singularity build trftarget-0.3.2.sif docker://az7jh2/trftarget:0.3.2
To test the installation (this should print the version of tRFtarget-pipeline):
# For Docker
docker run -it --rm az7jh2/trftarget:0.3.2 tRFtarget -v
# For Singularity
singularity exec trftarget-0.3.2.sif tRFtarget -v
The command to run tRFtarget-pipeline with default settings is:
# For Docker
docker run -it --rm -v <path>:/data az7jh2/trftarget:0.3.2 tRFtarget -q <query_fasta_file_name> -t <target_fasta_file_name> -n 1 --e_rnahybrid -15 --e_intarna 0 -b 1 -s 6
# For Singularity
singularity exec -B <path>:/data trftarget-0.3.2.sif tRFtarget -q <query_fasta_file_name> -t <target_fasta_file_name> -n 1 --e_rnahybrid -15 --e_intarna 0 -b 1 -s 6
<path> is the valid absolute path of the folder on the host machine to be mounted into the Docker/Singularity container for data exchange (readlink -f can be used to get the absolute path of a folder).
<query_fasta_file_name> and <target_fasta_file_name> are the file names (without path) of the FASTA files of query small RNAs and target RNAs, respectively. Both files must be located in the <path> folder.
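As a minimal sketch, the Docker invocation above could be assembled from Python. The helper name `build_docker_command` is hypothetical and not part of tRFtarget-pipeline; the image tag and flags are copied from the command shown above, and the function only builds the argument list without running anything.

```python
from pathlib import Path

IMAGE = "az7jh2/trftarget:0.3.2"  # image tag taken from the command above


def build_docker_command(data_dir, query_fasta, target_fasta=None, n_cores=1):
    """Build (but do not run) the docker argument list for tRFtarget.

    Hypothetical helper: data_dir is the host folder mounted at /data,
    and the FASTA arguments are bare file names inside that folder.
    """
    host_dir = Path(data_dir).resolve()  # absolute path, similar to `readlink -f`
    for name in (query_fasta, target_fasta):
        # The pipeline expects file names without any directory component.
        if name is not None and Path(name).name != name:
            raise ValueError(f"expected a file name without path, got {name!r}")
    cmd = ["docker", "run", "-it", "--rm", "-v", f"{host_dir}:/data", IMAGE,
           "tRFtarget", "-q", query_fasta]
    if target_fasta is not None:
        cmd += ["-t", target_fasta]
    # Remaining options use the default values from the command above.
    cmd += ["-n", str(n_cores), "--e_rnahybrid", "-15", "--e_intarna", "0",
            "-b", "1", "-s", "6"]
    return cmd


cmd = build_docker_command(".", "query.fa", "target.fa")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` from an interactive terminal. Resolving the folder first avoids mounting a relative path by mistake.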
- A FASTA file of query small RNAs. Compressed files (such as .gz) are currently not supported.
- A FASTA file of target RNAs (optional). Compressed files (such as .gz) are currently not supported. If not provided, 100,218 protein-coding transcript sequences (GRCh38.p13) are used as target RNAs instead.
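Because compressed input is not supported, it can help to check a FASTA file before starting a long run. The following is a sketch only: `count_fasta_records` is a hypothetical helper, not part of the pipeline, and the sequence IDs and sequences in the demo are placeholders.

```python
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def count_fasta_records(path):
    """Count FASTA records, refusing gzip-compressed input."""
    with open(path, "rb") as handle:
        if handle.read(2) == GZIP_MAGIC:
            raise ValueError(f"{path} looks gzip-compressed; decompress it first")
    with open(path, "r") as handle:
        # Each FASTA record starts with a header line beginning with ">".
        n_records = sum(1 for line in handle if line.startswith(">"))
    if n_records == 0:
        raise ValueError(f"{path} contains no FASTA header lines")
    return n_records


# Demo with a made-up query file (IDs and sequences are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "query.fa")
    with open(demo_path, "w") as handle:
        handle.write(">demo-tRF-1\nGCATTGGTGGTTCAGTGG\n>demo-tRF-2\nTCCCTGGTGGTCTAGTGG\n")
    n_demo = count_fasta_records(demo_path)
    print(n_demo)
```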
The output of tRFtarget-pipeline consists of 6 CSV files located in the <path> folder:
- trfs_info.csv: tRF ID, sequence and sequence length.
- transcripts_info.csv: transcript ID, sequence and length.
- rnahybrid_results.csv: processed tRF-RNA interactions predicted by RNAhybrid.
- intarna_results.csv: processed tRF-RNA interactions predicted by IntaRNA.
- consensus_results.csv: consensus binding sites between RNAhybrid and IntaRNA predictions. For the definition of consensus, please refer to 3.4 Consensus evaluation.
- tRF_level_consensus_stats.csv: a summary of the numbers of binding sites predicted by RNAhybrid and IntaRNA, as well as the number of consensus binding sites. It also includes the percentage of consensus binding sites in the RNAhybrid and IntaRNA predictions, respectively.
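After a run, the presence of the six files can be checked with a few lines of Python. This is a sketch; `missing_outputs` is a hypothetical helper, and the file names are the ones listed above.

```python
from pathlib import Path
import tempfile

# File names as listed in the output description above.
EXPECTED_OUTPUTS = (
    "trfs_info.csv",
    "transcripts_info.csv",
    "rnahybrid_results.csv",
    "intarna_results.csv",
    "consensus_results.csv",
    "tRF_level_consensus_stats.csv",
)


def missing_outputs(data_dir):
    """Return the expected output files that are absent from data_dir."""
    folder = Path(data_dir)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


# Demo on a temporary folder that contains only one of the six files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "trfs_info.csv").touch()
    still_missing = missing_outputs(tmp)
print(still_missing)
```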
The CSV files containing predicted binding sites (rnahybrid_results.csv, intarna_results.csv and consensus_results.csv) share a unified format. The 14 columns are described below:
Column | Description |
---|---|
tRF_ID | ID of the query sequence, corresponding to the sequence ID in the FASTA file of query small RNAs. |
Transcript_ID | ID of the target sequence, corresponding to the sequence ID in the FASTA file of target RNAs. |
Demo | An ASCII RNA-RNA interaction illustration. |
Start_tRF | Start index of the RNA hybrid in the query sequence. Positions are counted from the 5' end, starting at 1. |
End_tRF | End index of the RNA hybrid in the query sequence. Positions are counted from the 5' end, starting at 1. |
Start_Target | Start index of the RNA hybrid in the target sequence. Positions are counted from the 5' end, starting at 1. |
End_Target | End index of the RNA hybrid in the target sequence. Positions are counted from the 5' end, starting at 1. |
MFE | Calculated free energy of the binding site. |
HybridDP | VRNA dot-bracket notation for the RNA hybrid (interaction sites only). |
SubseqDP | Hybrid subsequences compatible with HybridDP. |
Max_Hit_DP | Hybrid subsequences in the maximum complementary region. Please refer to 3.5 Definition of Maximum Complementary Length (MCL) for a detailed description of the maximum complementary region and maximum complementary length. |
Max_Hit_Len | Sequence length of the maximum complementary region. Please refer to 3.5 Definition of Maximum Complementary Length (MCL) for a detailed description of the maximum complementary region and maximum complementary length. |
Tool | Tool used to predict the binding site (RNAhybrid or IntaRNA). |
Consensus | Indicates whether the entry is a consensus entry (1) or a non-consensus entry (0). |
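The following sketch shows one way to load such a file with Python's standard `csv` module and keep only consensus entries. It assumes the CSV header row uses exactly the column names above; the two sample rows are invented placeholders, not real pipeline output.

```python
import csv
import io

COLUMNS = ["tRF_ID", "Transcript_ID", "Demo", "Start_tRF", "End_tRF",
           "Start_Target", "End_Target", "MFE", "HybridDP", "SubseqDP",
           "Max_Hit_DP", "Max_Hit_Len", "Tool", "Consensus"]
INT_COLUMNS = ["Start_tRF", "End_tRF", "Start_Target", "End_Target",
               "Max_Hit_Len", "Consensus"]


def load_sites(handle):
    """Read binding sites and convert numeric columns from text."""
    sites = []
    for row in csv.DictReader(handle):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        row["MFE"] = float(row["MFE"])
        sites.append(row)
    return sites


# Build a tiny in-memory CSV with placeholder values (not real predictions).
base = dict.fromkeys(COLUMNS, "(omitted)")
base.update({"tRF_ID": "demo-tRF", "Transcript_ID": "demo-transcript",
             "Start_tRF": 1, "End_tRF": 18, "Start_Target": 100,
             "End_Target": 120, "MFE": -21.5, "Max_Hit_Len": 7})
rows = [dict(base, Tool="RNAhybrid", Consensus=1),
        dict(base, Tool="IntaRNA", Consensus=0)]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)

sites = load_sites(buffer)
consensus_sites = [s for s in sites if s["Consensus"] == 1]
print(len(sites), len(consensus_sites))
```

For a real run, `load_sites` would be called on an opened output file, for example with `open(path, newline="")`. The Demo column may span several lines inside a quoted field, which the `csv` module handles when the file is opened that way.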
Option | Description |
---|---|
-q or --query | FASTA file of query small RNAs. Required. |
-t or --target | FASTA file of target RNAs. If not provided, 100,218 human protein-coding transcript sequences (GRCh38.p13) are used as target RNAs. |
-n or --n_cores | Number of CPU cores used for parallel computing. Default value is 1 (no parallel computing). |
--e_rnahybrid | Free energy threshold for RNAhybrid, used for the RNAhybrid -e option. Default value is -15. |
--e_intarna | Free energy threshold for IntaRNA, used for the IntaRNA --outMaxE option. Default value is 0. |
-b or --suboptimal | Number of interaction sites reported for each target RNA, used for the RNAhybrid -b option and the IntaRNA -n option. Default value is 1. |
-s or --seed_len | For RNAhybrid, the threshold of maximum complementary length: interactions with a maximum complementary length below this value are filtered out. For IntaRNA, the number of base pairs required within the seed sequence, used for the IntaRNA --seedBP option. Default value is 6. |
As an example, take 1 tRF (tRF-1001) and the default target RNAs (100,218 protein-coding transcript sequences), with all options left at their default values (no parallel computing).
Elapsed time for whole pipeline: 48.26 hours
- running RNAhybrid: 0.72 hours
- running IntaRNA: 47.50 hours
- consensus target predictions: 0.02 hours
File sizes of the output CSV files:
- rnahybrid_results.csv: 60 MB, including 90,398 target site entries
- intarna_results.csv: 58 MB, including 100,141 target site entries
- consensus_results.csv: 28 MB, including 22,492 RNAhybrid entries and 22,492 IntaRNA entries
It is recommended to turn on parallel computing by specifying the -n or --n_cores option, which will significantly reduce the running time of IntaRNA.
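The figures above can be turned into proportions with a short calculation. This sketch only reuses the numbers reported in this example; the percentage definition (consensus entries divided by all entries of a tool) is an assumption about how tRF_level_consensus_stats.csv reports them.

```python
# Elapsed times (hours) and entry counts copied from the example above.
total_hours = 48.26
intarna_hours = 47.50
rnahybrid_entries = 90_398
intarna_entries = 100_141
consensus_entries = 22_492  # per tool, as reported for consensus_results.csv

# Share of the total running time spent in IntaRNA.
intarna_share = 100 * intarna_hours / total_hours

# Assumed definition: consensus entries as a percentage of each tool's entries.
pct_rnahybrid = 100 * consensus_entries / rnahybrid_entries
pct_intarna = 100 * consensus_entries / intarna_entries

print(f"IntaRNA share of runtime: {intarna_share:.1f}%")
print(f"Consensus among RNAhybrid sites: {pct_rnahybrid:.1f}%")
print(f"Consensus among IntaRNA sites: {pct_intarna:.1f}%")
```

IntaRNA accounts for roughly 98% of the running time in this example, which is why parallelizing it has the largest effect.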
- RNAhybrid: 2.1.2
- IntaRNA: 3.3.1 with Vienna RNA 2.5.1 and boost 1.74.0
Generally speaking, the two prediction tools are tuned to provide binding sites based on different prediction mechanisms:
- RNAhybrid: minimum free energy
- IntaRNA: minimum free energy + seed match + accessibility feature
The command options for calling RNAhybrid:
-b <suboptimal> -e <e_rnahybrid> -m 150000 -n 70 -s 3utr_human
<suboptimal> is the value of the -b or --suboptimal option of tRFtarget-pipeline, while <e_rnahybrid> is the value of the --e_rnahybrid option.
Post-processing procedures of RNAhybrid outputs include:
- filter out binding sites with Maximum Complementary Length (MCL) < <seed_len>, where <seed_len> is the value of the -s or --seed_len option.
- parse unstructured RNAhybrid output into the unified output format of tRFtarget-pipeline.
The command options for calling IntaRNA:
--mode=H -n <suboptimal> --seedBP=<seed_len> --outMaxE=<e_intarna> --outOverlap=Q --outMode=C
<suboptimal> is the value of the -b or --suboptimal option of tRFtarget-pipeline, <seed_len> is the value of the -s or --seed_len option, and <e_intarna> is the value of the --e_intarna option.
Post-processing procedures of IntaRNA outputs include:
- filter out potential duplicated entries (see 3.6 Definition of duplicated entries in IntaRNA output)
- parse IntaRNA output CSV into the unified output format of tRFtarget-pipeline.
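The mapping from tRFtarget-pipeline options to the option strings of both tools, as listed in 3.2 and 3.3, can be sketched as two small functions. The function names are hypothetical, and the lists deliberately contain only the options documented here; the arguments that pass the query and target sequences to each tool are not shown in this document and are left out.

```python
def rnahybrid_options(suboptimal=1, e_rnahybrid=-15):
    """Option list for RNAhybrid, following the template in 3.2."""
    return ["-b", str(suboptimal), "-e", str(e_rnahybrid),
            "-m", "150000", "-n", "70", "-s", "3utr_human"]


def intarna_options(suboptimal=1, seed_len=6, e_intarna=0):
    """Option list for IntaRNA, following the template in 3.3."""
    return ["--mode=H", "-n", str(suboptimal), f"--seedBP={seed_len}",
            f"--outMaxE={e_intarna}", "--outOverlap=Q", "--outMode=C"]


# Defaults correspond to the pipeline defaults (-b 1, -s 6, -15 and 0).
print(" ".join(rnahybrid_options()))
print(" ".join(intarna_options()))
```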
One RNAhybrid-predicted and one IntaRNA-predicted binding site are defined as consensus binding sites if they have a similar interaction structure. The two interactions need to be at the same location on the target RNA, allowing an offset of up to 2 bases at the beginning and/or end of the interaction area.
For example, one RNAhybrid and one IntaRNA predicted binding site between tRF-1001 and transcript ARF1-213 are defined as consensus (see figure below). The start positions of the two predictions are identical, while the end positions differ by 1 base (the RNAhybrid prediction ends at the 1683rd base while the IntaRNA prediction ends at the 1682nd base).
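The location rule can be sketched as a comparison of target coordinates. This is an interpretation rather than the pipeline's actual code: it assumes that "same location" means the Start_Target and End_Target values each differ by at most 2 bases, and it does not model any further check of interaction structure. The start position 1660 in the demo is a made-up value; only the end positions 1683 and 1682 come from the example above.

```python
def is_consensus(site_a, site_b, max_offset=2):
    """Return True if two sites bind the same target location.

    Assumed rule: same tRF and transcript, and both the start and end
    positions on the target differ by at most max_offset bases.
    """
    same_pair = (site_a["tRF_ID"] == site_b["tRF_ID"]
                 and site_a["Transcript_ID"] == site_b["Transcript_ID"])
    start_ok = abs(site_a["Start_Target"] - site_b["Start_Target"]) <= max_offset
    end_ok = abs(site_a["End_Target"] - site_b["End_Target"]) <= max_offset
    return same_pair and start_ok and end_ok


# Start_Target is a placeholder; the end positions follow the ARF1-213 example.
rnahybrid_site = {"tRF_ID": "tRF-1001", "Transcript_ID": "ARF1-213",
                  "Start_Target": 1660, "End_Target": 1683}
intarna_site = {"tRF_ID": "tRF-1001", "Transcript_ID": "ARF1-213",
                "Start_Target": 1660, "End_Target": 1682}
shifted_site = dict(intarna_site, End_Target=1679)  # 4 bases away from 1683

print(is_consensus(rnahybrid_site, intarna_site))
print(is_consensus(rnahybrid_site, shifted_site))
```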
tRFtarget-pipeline proposes Maximum Complementary Length (MCL) as a metric of binding site stability in addition to free energy. It is close to the definition of seed length in the seed matching rule for miRNA targets, but does not require that the interaction occur in the so-called "seed region" of the tRF, since whether tRFs have a seed region is still debated.
MCL is defined as the length of the longest run of consecutive complementary bases. For example, the figure below shows a binding site between tRF-5013b and transcript ARF-201 predicted by RNAhybrid. The interaction area has 3 distinct stretches of complementary sequence, highlighted by green lines, with lengths 6, 5, and 7, respectively. Therefore, the MCL of this interaction is 7 bases.
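A minimal sketch of the MCL computation follows. It assumes the interaction is available as a list of (query position, target position) base pairs and that a run continues only while the query position increases by 1 and the target position decreases by 1 (antiparallel pairing with no bulges). Both the input representation and the function name are illustrative choices, not the pipeline's internal format, and the coordinates in the demo are invented to reproduce stretches of 6, 5, and 7.

```python
def max_complementary_length(pairs):
    """Length of the longest uninterrupted stack of base pairs.

    pairs: iterable of (query_pos, target_pos) tuples, 1-based.
    """
    best = 0
    run = 0
    prev = None
    for q, t in sorted(pairs):
        # A run continues only if both strands advance by exactly one base.
        if prev is not None and q == prev[0] + 1 and t == prev[1] - 1:
            run += 1
        else:
            run = 1
        best = max(best, run)
        prev = (q, t)
    return best


def stack(q_start, t_start, length):
    """Helper for the demo: a perfect stack of `length` base pairs."""
    return [(q_start + k, t_start - k) for k in range(length)]


# Three stretches of 6, 5 and 7 pairs separated by unpaired bases.
demo_pairs = stack(1, 40, 6) + stack(8, 33, 5) + stack(15, 26, 7)
mcl = max_complementary_length(demo_pairs)
print(mcl)
```

With the default -s 6, a site like this one (MCL of 7) would pass the RNAhybrid post-processing filter, which discards sites whose MCL is below the threshold.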
IntaRNA may return similar binding sites between the same tRF and target RNA, and among these similar entries only the one with the lowest free energy will be kept.
For example, all 5 binding sites between tRF-3001a and transcript DCTPP1-201 predicted by IntaRNA are shown in the figure below. These 5 binding sites have similar interaction structures and are located at nearly identical positions on the transcript, so only the uppermost entry with the lowest free energy (-20.36 kcal/mol) is kept, while the other 4 entries are treated as duplicated entries and discarded.
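The filtering idea can be sketched as a greedy pass over entries sorted by free energy. The similarity criterion here is an assumption: this document does not state the exact rule, so the sketch treats two entries as similar when they share the tRF and transcript and their target start and end positions each differ by at most a hypothetical `max_shift` bases. Apart from the -20.36 kcal/mol value, all coordinates and energies in the demo are placeholders.

```python
def drop_duplicates(sites, max_shift=5):
    """Keep the lowest-energy entry among similar binding sites.

    max_shift is a hypothetical tolerance, not a documented parameter.
    """
    kept = []
    # Visit the most stable (lowest MFE) entries first so they win.
    for site in sorted(sites, key=lambda s: s["MFE"]):
        similar = any(
            k["tRF_ID"] == site["tRF_ID"]
            and k["Transcript_ID"] == site["Transcript_ID"]
            and abs(k["Start_Target"] - site["Start_Target"]) <= max_shift
            and abs(k["End_Target"] - site["End_Target"]) <= max_shift
            for k in kept
        )
        if not similar:
            kept.append(site)
    return kept


# Five near-identical sites; positions and four of the energies are invented.
energies = [-19.1, -20.36, -18.2, -19.8, -18.7]
demo_sites = [
    {"tRF_ID": "tRF-3001a", "Transcript_ID": "DCTPP1-201",
     "Start_Target": 300 + i, "End_Target": 320 + i, "MFE": mfe}
    for i, mfe in enumerate(energies)
]
unique_sites = drop_duplicates(demo_sites)
print(len(unique_sites), unique_sites[0]["MFE"])
```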
If you use tRFtarget-pipeline, please cite:
Ningshan Li, Nayang Shan, Lingeng Lu, Zuoheng Wang. tRFtarget: a database for transfer RNA-derived fragment targets. Nucleic Acids Research, 2020, gkaa831. https://doi.org/10.1093/nar/gkaa831
Users are welcome to send feedback, suggestions or comments related to the tRFtarget database through our GitHub repository tRFtarget.
To report issues with using tRFtarget-pipeline, please use the GitHub repository tRFtarget-pipeline.