LncRAnalyzer

Pipeline for identification of lncRNAs and Novel Protein Coding Transcripts (NPCTs)

Introduction

LncRAnalyzer identifies lncRNAs and Novel Protein Coding Transcripts (NPCTs) from large numbers of RNA-seq datasets. The pipeline covers genome-guided assembly, merging of annotations, annotation comparison, class code selection, and final retrieval of transcripts in FASTA format. The putative lncRNAs and NPCTs are then tested for coding potential with CPC2, CPAT, PLEK (time-consuming), LGC, and RNAsamba, and the final lncRNAs and NPCTs are selected based on these coding potentials. If liftover files are available for the organism and a related species, conservation analysis is also performed with Slncky. The FEELnc plugin is integrated to detect mRNA spliced and intergenic lncRNAs in RNA-seq samples. For NPCTs, TransDecoder followed by PfamScan can be used to retrieve protein family annotations. The pipeline runs in a conda environment.
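As an illustration of how calls from several coding-potential tools might be combined, here is a minimal Python sketch. The exact selection rule used by LncRAnalyzer is not described in this README, so the unanimous-vote rule, the function name classify_transcript, and the transcript IDs below are all assumptions made for the example.

```python
from typing import Dict


def classify_transcript(calls: Dict[str, bool]) -> str:
    """Combine per-tool coding calls into one label.

    `calls` maps a tool name to True when that tool predicts "coding".
    Assumed rule (not taken from the pipeline): a transcript is an NPCT
    only if every tool says coding, an lncRNA only if every tool says
    non-coding, and "ambiguous" otherwise.
    """
    if not calls:
        raise ValueError("no tool calls supplied")
    votes = list(calls.values())
    if all(votes):
        return "NPCT"
    if not any(votes):
        return "lncRNA"
    return "ambiguous"


# Hypothetical per-transcript calls, for illustration only.
predictions = {
    "TCONS_0001": {"CPC2": False, "CPAT": False, "LGC": False, "RNAsamba": False},
    "TCONS_0002": {"CPC2": True, "CPAT": True, "LGC": True, "RNAsamba": True},
    "TCONS_0003": {"CPC2": True, "CPAT": False, "LGC": False, "RNAsamba": True},
}

labels = {tid: classify_transcript(calls) for tid, calls in predictions.items()}
for tid, label in labels.items():
    print(tid, label)
```

A stricter or looser rule (for example a majority vote) would only change the body of classify_transcript; the per-tool dictionary layout stays the same.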

Implementation

To execute the steps in the pipeline, download the latest release of LncRAnalyzer to your local system with the following command

git clone https://github.com/nikhilshinde0909/LncRAnalyzer.git

Download and install the latest release of Mambaforge from GitHub [https://github.com/conda-forge/miniforge]; it is used to install the required software and tools.
Once Mambaforge is installed, install the required software by updating the base environment from the LncRAnalyzer.yml file as follows

mamba env update --file LncRAnalyzer.yml

Create a conda environment for FEELnc with the following command

mamba create -n FEELnc -c bioconda feelnc

Create the environment for CPC2, CPAT, and Slncky from the environment file

mamba env create -f cpc2-cpat-slncky.yml

Create the conda environment for RNAsamba

mamba env create -f rnasamba.yml

Run the bash script named "add_paths_for_tools.sh" to add the paths of the conda environments and software to the tools.groovy file

chmod +x add_paths_for_tools.sh && bash add_paths_for_tools.sh

Prepare your inputs and data.txt in the working directory

mkdir data
Working directory
├── data
│   ├── SRR975551_1.fastq.gz
│   ├── SRR975552_1.fastq.gz
│   ├── (and other fastq.gz files)
│   ├── SRR975551_2.fastq.gz
│   ├── SRR975552_2.fastq.gz
│   ├── (and other fastq.gz files)
│   ├── hg38.rRNA.fasta
│   ├── hg38.genome.fasta
│   ├── hg38.annotation.gtf
│   └── (and other files)
└── data.txt

Copy your RNA-seq reads (*.fastq.gz), rRNA sequences (*.fa), reference genome (*.fa), related species reference genome (*.fa), annotations (*.gtf), and liftover files into the data directory. Then create data.txt in the same directory from data_template.txt, and add to it the paths for the raw fastq.gz files, rRNA sequences, reference genome, related species reference genome, annotations, and liftover files.
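Before filling in data.txt, it can help to confirm that every sample in the data directory has both mates. The following is a minimal Python sketch, assuming paired-end reads named with the _1.fastq.gz and _2.fastq.gz suffixes shown in the directory layout; the function name find_read_pairs is hypothetical and not part of the pipeline.

```python
from pathlib import Path
from typing import Dict, List, Tuple

R1_SUFFIX = "_1.fastq.gz"  # assumed naming for first mates
R2_SUFFIX = "_2.fastq.gz"  # assumed naming for second mates


def find_read_pairs(data_dir: str) -> Tuple[Dict[str, Tuple[Path, Path]], List[str]]:
    """Return (pairs, unpaired) for the FASTQ files in data_dir.

    pairs maps a sample name to its (R1, R2) paths; unpaired lists
    sample names for which only one of the two mates was found.
    """
    root = Path(data_dir)
    r1 = {p.name[: -len(R1_SUFFIX)]: p for p in root.glob("*" + R1_SUFFIX)}
    r2 = {p.name[: -len(R2_SUFFIX)]: p for p in root.glob("*" + R2_SUFFIX)}
    pairs = {s: (r1[s], r2[s]) for s in sorted(r1.keys() & r2.keys())}
    unpaired = sorted(r1.keys() ^ r2.keys())
    return pairs, unpaired


if __name__ == "__main__" and Path("data").is_dir():
    found, missing = find_read_pairs("data")
    for sample, (first, second) in found.items():
        print(sample, first.name, second.name)
    for sample in missing:
        print("missing mate for", sample)
```

Any sample reported as missing a mate should be fixed or removed before its paths are added to data.txt.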
If you don't have a reference genome, annotations, or rRNA sequences, you can download them with the scripts provided with the pipeline as follows

python check_ensembl.py org_name
eg. python find_species_in_ensembl.py Sorghum
> sbicolor
python ensembl.py org_name_in_ensembl
eg. python download_datasets_ensembl.py sbicolor
> Ensembl version 56 <- download the datasets

Similarly, if you don't have liftover files for conservation analysis, you can generate them through genome alignments of the reference and query species genomes as follows

python Liftover.py <threads> <genome> <org_name> <genome_related_species> <rel_sp_name> <params_distance>
eg.
python Liftover.py 16 Sorghum_bicolor.dna.toplevel.fa Sbicolor Zea_mays.dna.toplevel.fa Zmays near

We also provide an additional script that takes an Ensembl GTF and produces BED files to run Slncky as follows

python ensembl_gtf2bed.py <ensembl_gtf> <output_prefix>
eg.
python ensembl_gtf2bed.py Sorghum_bicolor.58.gtf Sorghum_bicolor

This will produce protein-coding, non-coding, miRNA, and snoRNA BED files for Slncky.

The pipeline is now ready for execution. Run the following command to execute the steps for lncRNA and NPCT analysis

bpipe run -n ${threads} ~/Path_to_LncRAnalyzer/Main.groovy data/data.txt

Note: If the pipeline reports a "core-dumped" error for PfamScan, replace your existing hmmer installation with hmmer=3.1b1 using the script in the utils directory as follows

bash install_hmmer3.1.sh

Thanks for using LncRAnalyzer !!

Performance

The performance of coding potential prediction using CPAT, CPC2, LGC, RNAsamba, and FEELnc was estimated with 50 RNA-Seq accessions of sorghum cultivar PR22 from past studies [https://doi.org/10.1186/s12864-019-5734-x]
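The README does not state which metrics were used for this estimate. As a generic sketch of how a coding-potential classifier could be scored against known labels, the Python below computes sensitivity, specificity, and accuracy, treating "coding" as the positive class. The function name and the example labels are hypothetical and are not results from the pipeline.

```python
from typing import Dict, Sequence


def coding_potential_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Score predictions against known labels (True means coding)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty class) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # coding transcripts called coding
        "specificity": ratio(tn, tn + fp),  # non-coding transcripts called non-coding
        "accuracy": ratio(tp + tn, len(truth)),
    }


# Made-up labels for illustration only.
known = [True, True, False, False, True]
called = [True, False, False, False, True]
print(coding_potential_metrics(known, called))
```

The same function can be applied to each tool's calls in turn to compare tools on one labelled transcript set.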
