Strand and WGD aware syntenic gene identification

xiaodli/quota_Anchor

 
 

quota_Anchor( Proustian monstrosity)

Here are the scripts and documents to conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave. We currently provide three visualization methods for syntenic results.

Installation

You can install this software in a separate conda environment with the following command. This is a beta version, so we have not uploaded it to bioconda yet.

conda install xiaodli::quota_anchor

Usage

Help info

quota_Anchor -h
usage: quota_Anchor [-h] [-v] {pre_col,col,get_chr_length,dotplot,circle,line_2,line_proali} ...

Conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave:
options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

gene collinearity analysis:
  {pre_col,col,get_chr_length,dotplot,circle,line_2,line_proali}
    pre_col             Get longest protein file from gffread result and input file for gene collinearity analysis
    col                 Get gene collinearity result file
    get_chr_length      Get chromosome length and name info from fai file
    dotplot             Collinearity result visualization
    circle              Collinearity result visualization
    line_2              Collinearity result visualization
    line_proali         Anchors file from AnchorWave proali visualization

Example1

Here is an example of identifying syntenic genes between maize and sorghum. The maize lineage has undergone a whole genome duplication (WGD) since its divergence from sorghum, but subsequent chromosomal fusions left the two species with the same chromosome number (n = 10). AnchorWave can therefore allow up to two collinear paths for each sorghum anchor but only one collinear path for each maize anchor.

Using several folders rather than a single one keeps the workflow clearer

The working directory structure is as follows. We may provide an option to create these directories later, but for now you need to create them yourself.

├── length_file
│   └── get_length.conf
├── raw_data
│   ├── Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3
│   ├── Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa
│   ├── Zm-B73-REFERENCE-NAM-5.0.fa
│   └── Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
└── sb_zm
    └── config_file
        ├── circle.conf
        ├── collinearity.conf
        ├── dotplot.conf
        ├── line.conf
        └── pre_collinearity.conf
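
Since the layout has to be created by hand for now, here is a minimal Python sketch that builds it. The function name make_workdir is hypothetical and not part of quota_Anchor; it only creates the folders and empty config files listed in the tree above, and leaves the raw_data folder empty for the downloads described in the next step.

```python
from pathlib import Path

# Config files taken from the directory tree above.
CONFIG_FILES = [
    "length_file/get_length.conf",
    "sb_zm/config_file/circle.conf",
    "sb_zm/config_file/collinearity.conf",
    "sb_zm/config_file/dotplot.conf",
    "sb_zm/config_file/line.conf",
    "sb_zm/config_file/pre_collinearity.conf",
]


def make_workdir(root: Path) -> list[Path]:
    """Create the folders and empty config files under `root`."""
    (root / "raw_data").mkdir(parents=True, exist_ok=True)
    created = []
    for rel in CONFIG_FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # existing files keep their content
        created.append(path)
    return created
```

Calling make_workdir(Path(".")) from the project root should give the same layout as the tree, minus the genome and annotation files.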

Genome and annotation data preparation

Put the genome and GFF files into the raw_data directory (a sibling of sb_zm, as shown in the tree above).

wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3.gz
gunzip *gz

Modify the config file and run pre_collinearity (maize vs sorghum)

This stage consists of four steps, all implemented in "quota_Anchor pre_col":

  1. Extract and translate protein sequences from genome sequences and annotations
  2. Identify and extract the longest protein sequence encoded by each gene
  3. Align the protein sequences using DIAMOND or BLASTp
  4. Put the gene strand information and the blast result into a single file
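
To make step 2 concrete, here is a minimal sketch of picking the longest protein per gene. It is not the quota_Anchor implementation: the function names, the transcript-to-gene mapping passed in as a dictionary, and the tie-breaking rule (first transcript seen wins) are all assumptions made for illustration.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record_id: sequence}; the id is the first word of the header."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def longest_per_gene(
    proteins: dict[str, str], transcript_to_gene: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {gene: (transcript_id, sequence)}, keeping the longest protein of each gene."""
    best: dict[str, tuple[str, str]] = {}
    for transcript_id, seq in proteins.items():
        gene = transcript_to_gene.get(transcript_id)
        if gene is None:
            continue  # transcript missing from the (hypothetical) mapping
        length = len(seq.rstrip("*"))  # do not count a trailing stop symbol
        if gene not in best or length > len(best[gene][1].rstrip("*")):
            best[gene] = (transcript_id, seq)
    return best
```

In practice the mapping would come from the GFF file; how quota_Anchor derives it is not shown here.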

| Header | Parameter | Description |
| --- | --- | --- |
| [gffread] | ref_genome_seq | Path to your reference FASTA file |
| | ref_gff_file | Path to your reference GFF file |
| | output_ref_pep_seq | Filename for the reference protein sequence file that the software will generate |
| | query_genome_seq | Path to your query FASTA file |
| | query_gff_file | Path to your query GFF file |
| | output_query_pep_seq | Filename for the query protein sequence file that the software will generate |
| | use_S_parameter | Whether to pass -S to gffread together with -y, which writes '*' instead of '.' for stop codons in the translation (changing it is not recommended; default: True) |
| [longest_pep] | out_ref_longest_pep_name | Filename for the reference longest protein sequence file that the software will generate |
| | out_query_longest_pep_name | Filename for the query longest protein sequence file that the software will generate |
| [align] | align | Protein sequence aligner to use (please type: diamond or blastp) |
| [diamond] | database_name | Path/name of the diamond-blastp database that the software will generate |
| | output_blast_result | Path of the diamond-blastp result that the software will generate |
| | max_target_seqs | Maximum number of target sequences to report alignments for (diamond-blastp) |
| | evalue | Maximum e-value to report alignments |
| [blastp] | database_name | Path/name of the blastp database that the software will generate |
| | dtype | Database data type (please type: prot) |
| | output_blast_result | Path of the blastp result that the software will generate |
| | evalue | Maximum e-value to report alignments |
| | max_target_seqs | Maximum number of target sequences to report alignments for (blastp) |
| | thread | Number of CPU threads |
| | outfmt | Please type 6 (BLAST tabular) |
| [combineBlastAndStrand] | out_file | Filename for the combined BLAST result and gene strand table that the software will generate |
| | bitscore | Minimum BLAST bitscore used for filtering (default: 100) |
| | align_length | Minimum BLAST alignment length used for filtering (default: 0) |
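
The bitscore and align_length filters can be pictured with a short sketch over BLAST tabular (outfmt 6) rows, where column 4 is the alignment length and column 12 is the bit score. The function name filter_blast_hits is hypothetical, and treating both thresholds as inclusive is an assumption; the actual comparison used by quota_Anchor may differ.

```python
from typing import Iterable, Iterator


def filter_blast_hits(
    lines: Iterable[str],
    min_bitscore: float = 100.0,
    min_align_length: int = 0,
) -> Iterator[list[str]]:
    """Yield the fields of outfmt 6 rows that pass both thresholds."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        align_length = int(fields[3])  # column 4: alignment length
        bitscore = float(fields[11])   # column 12: bit score
        # Inclusive thresholds are an assumption of this sketch.
        if bitscore >= min_bitscore and align_length >= min_align_length:
            yield fields
```

With the defaults shown in the table, a hit with bitscore 350 is kept and a hit with bitscore 60.5 is dropped, regardless of alignment length.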

Put the following information into the sb_zm/config_file/pre_collinearity.conf file.

[gffread]
ref_genome_seq = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa
ref_gff_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3
output_ref_pep_seq = sb.p.fa
query_genome_seq = ../raw_data/Zm-B73-REFERENCE-NAM-5.0.fa
query_gff_file = ../raw_data/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
output_query_pep_seq = zm.p.fa
# The next line is the description of the S parameter of gffread(https://github.com/gpertea/gffread), you need to set True in general.
# -S    for -y option, use '*' instead of '.' as stop codon translation
use_S_parameter = True

[longest_pep]
out_ref_longest_pep_name = sorghum.protein.fa
out_query_longest_pep_name = maize.protein.fa

[align]
align=  diamond

[diamond]
# use ref protein seq construct database
database_name = sorghum.db
output_blast_result = sorghum.maize.diamond.blastp
max_target_seqs = 20
evalue = 1e-10

[blastp]
database_name = sorghum.blastp.db
dtype = prot
output_blast_result = sorghum.maize.blastp
evalue = 1e-10
max_target_seqs = 20
thread = 6
outfmt = 6

[combineBlastAndStrand]
out_file = sb_zm.table
bitscore = 100 
align_length = 0  

You can run this command in the sb_zm directory.

quota_Anchor pre_col -c ./config_file/pre_collinearity.conf
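
Before launching a long run, it may help to check that the config file has the sections and keys listed in the table above. The sketch below uses Python's standard configparser module; the function check_pre_col_config and the list of required keys are assumptions drawn from this README, not a validation routine shipped with quota_Anchor.

```python
import configparser

# Sections and keys taken from the parameter table in this README.
REQUIRED = {
    "gffread": [
        "ref_genome_seq", "ref_gff_file", "output_ref_pep_seq",
        "query_genome_seq", "query_gff_file", "output_query_pep_seq",
    ],
    "longest_pep": ["out_ref_longest_pep_name", "out_query_longest_pep_name"],
    "align": ["align"],
    "combineBlastAndStrand": ["out_file", "bitscore", "align_length"],
}


def check_pre_col_config(text: str) -> list[str]:
    """Return a list of human-readable problems found in the config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            if not parser.get(section, key, fallback="").strip():
                problems.append(f"missing or empty {section}.{key}")
    aligner = parser.get("align", "align", fallback="").strip()
    if aligner:
        if aligner not in ("diamond", "blastp"):
            problems.append("align must be diamond or blastp")
        elif not parser.has_section(aligner):
            # The chosen aligner needs its own section.
            problems.append(f"missing section [{aligner}]")
    return problems
```

An empty list means the file has the expected shape; it does not guarantee that the paths exist.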

Collinearity analysis (maize vs sorghum)

| Header | Parameter | Description |
| --- | --- | --- |
| [AnchorWave] | R | Maximum number of occurrences of a reference gene in the collinearity file |
| | Q | Maximum number of occurrences of a query gene in the collinearity file |
| | maximum_gap_size | Maximum gap size for a chain |
| | collapse_blast_matches | Whether to collapse BLAST matches in the input file (default: 0). Options: 0, do not collapse BLAST matches; 1 or any other integer, collapse them |
| | overlap_window | When collapse_blast_matches (-m) is set to 1 or any other integer, the maximum distance (overlap window) allowed between two homologous gene pairs before they are considered for deletion (default: 5). Ignored when collapse_blast_matches is 0 |
| | input_file_name | Path to the output file of the quota_Anchor pre_col command (use the value of combineBlastAndStrand.out_file) |
| | output_coll_name | Filename for the syntenic result that the software will generate |

Put the following information into the sb_zm/config_file/collinearity.conf file and run the collinearity analysis.

[AnchorWave]
# The R value indicates the maximum number of occurrences of a gene in the collinearity file, and Q means the same as R.
# For maize and sorghum, maize has undergone an additional whole-genome duplication compared to sorghum.
# If sorghum is used as a reference, you can set R to 2 and Q to 1.
R = 2
Q = 1
maximum_gap_size = 25
collapse_blast_matches = 0
overlap_window = 5
input_file_name = sb_zm.table
output_coll_name = sb_zm.table.collinearity

Run the collinearity analysis in the sb_zm directory:

quota_Anchor col -c ./config_file/collinearity.conf
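
With R = 2 and Q = 1, each sorghum (reference) gene should appear at most twice and each maize (query) gene at most once in the result. The sketch below checks that property on a list of gene pairs. It is an illustration only: the function name quota_violations is hypothetical, and the pairs are supplied in memory because the gene column names of the collinearity file are not described here.

```python
from collections import Counter
from typing import Iterable


def quota_violations(
    pairs: Iterable[tuple[str, str]], r: int, q: int
) -> tuple[dict[str, int], dict[str, int]]:
    """Return (reference genes seen more than r times, query genes seen more than q times)."""
    pairs = list(pairs)
    ref_counts = Counter(ref for ref, _ in pairs)
    query_counts = Counter(query for _, query in pairs)
    over_ref = {gene: n for gene, n in ref_counts.items() if n > r}
    over_query = {gene: n for gene, n in query_counts.items() if n > q}
    return over_ref, over_query
```

Two empty dictionaries mean every gene stays within its quota.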

Visualizing with R code

The sb_zm.table file can be visualized with the following R code:

library(ggplot2)
changetoM <- function ( position ){
  position=position/1000000;
  paste(position, "M", sep="")
}
data = read.table("sb_zm.table")
# Gene pairs on the same strand are labeled "+", pairs on opposite strands "-"
data$strand = ifelse(data$V6 == data$V12, "+", "-")

data = data[which(data$V8 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")),]
data = data[which(data$V2 %in% c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10")),]
data$V8 = factor(data$V8, levels=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
data$V2 = factor(data$V2, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10"))

plot = ggplot(data=data, aes(x=V10, y=V4))+geom_point(size=0.5, aes(color=strand))+facet_grid(V2~V8, scales="free", space="free" )+ theme_grey(base_size = 30) +
    labs(x="sorghum", y="maize")+scale_x_continuous(labels=changetoM) + scale_y_continuous(labels=changetoM) +
    theme(axis.line = element_blank(),
          panel.background = element_blank(),
          panel.border = element_rect(fill=NA,color="black", linewidth=0.5, linetype="solid"),
          axis.text.y = element_text( colour = "black"),
          legend.position='none',
          axis.text.x = element_text(angle=300, hjust=0, vjust=1, colour = "black") )
png("sorghum.maize.table.png" , width=2000, height=1500)
plot
dev.off()

The sb_zm.table.collinearity file can be visualized with the following R code:

library(ggplot2)
changetoM <- function ( position ){
  position=position/1000000;
  paste(position, "M", sep="")
}

data = read.table("sb_zm.table.collinearity", header=T)
data = data[which(data$refChr %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")),]
data = data[which(data$queryChr %in% c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10")),]
data$refChr = factor(data$refChr, levels=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
data$queryChr = factor(data$queryChr, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10"))

plot = ggplot(data=data, aes(x=queryStart, y=referenceStart))+geom_point(size=0.5, aes(color=strand))+facet_grid(refChr~queryChr, scales="free", space="free" )+ 
  theme_grey(base_size = 30) +
  labs(x="maize", y="sorghum")+scale_x_continuous(labels=changetoM) + scale_y_continuous(labels=changetoM) +
  theme(axis.line = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(fill=NA,color="black", linewidth=0.5, linetype="solid"),
        axis.text.y = element_text( colour = "black"),
        legend.position='none',
        axis.text.x = element_text(angle=300, hjust=0, vjust=1, colour = "black") )

png("sorghum.maize.colinearity.png" , width=2000, height=1500)
plot
dev.off()

Get chromosome length info

| Header | Parameter | Description |
| --- | --- | --- |
| [length] | fai_file | species1.fai, species2.fai, species3.fai, species4.fai |
| | gff_file | species1.gff3, species2.gff3, species3.gff3, species4.gff3 |
| | select_fai_chr_startswith | number,CHR,chr,Chr:number,CHR,chr,Chr:number,CHR,chr,Chr:number,CHR,chr,Chr |
| | length_file | This module will output species1.txt, species2.txt, species3.txt, species4.txt |

Put the following information into the length_file/get_length.conf file

# The fai files are generated while running quota_Anchor pre_col.
# The length information is derived from the fai file and the raw GFF file.
# An example length file for maize is shown below.

#chr     length  total_gene
#chr1    308452471       5892
#chr2    243675191       4751
#chr3    238017767       4103
#chr4    250330460       4093
#chr5    226353449       4485
#chr6    181357234       3412
#chr7    185808916       3070
#chr8    182411202       3536
#chr9    163004744       2988
#chr10   152435371       2705
# select_fai_chr_startswith parameter: 
# number: software selects chromosome name starting with number then count the chromosome length.
# chr: software selects chromosome name starting with chr string then count the chromosome length.
# Chr: software selects chromosome name starting with Chr string then count the chromosome length.
[length]
fai_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.fai, ../raw_data/Zm-B73-REFERENCE-NAM-5.0.fa.fai
gff_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3, ../raw_data/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
# By default, the first column of the lines starting with chr or Chr or CHR in the fai file are extracted for plotting.
select_fai_chr_startswith = number,CHR,chr,Chr:number,CHR,chr,Chr
length_file = sb_length.txt, zm_length.txt

Run this command in the length_file directory:

quota_Anchor get_chr_length -c get_length.conf
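
The select_fai_chr_startswith setting and the length file columns can be illustrated with a short sketch. It relies on the standard fai layout (sequence name in column 1, length in column 2) and the GFF3 layout (sequence name in column 1, feature type in column 3). The function names are hypothetical, and counting only features of type "gene" is an assumption; the tool itself may count genes differently.

```python
def chromosome_lengths(
    fai_text: str,
    prefixes: tuple[str, ...] = ("CHR", "chr", "Chr"),
    allow_number: bool = True,
) -> dict[str, int]:
    """Return {name: length} for sequences whose name starts with a digit or a listed prefix."""
    lengths: dict[str, int] = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        starts_with_number = allow_number and name[0].isdigit()
        if starts_with_number or name.startswith(prefixes):
            lengths[name] = int(length)
    return lengths


def genes_per_chromosome(gff_text: str) -> dict[str, int]:
    """Count GFF3 features of type 'gene' per sequence name."""
    counts: dict[str, int] = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] == "gene":
            counts[cols[0]] = counts.get(cols[0], 0) + 1
    return counts
```

Joining the two dictionaries on the chromosome name gives the three columns (chr, length, total_gene) of the example length file.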

Visualizing with quota_Anchor

| Header | Parameter | Description |
| --- | --- | --- |
| [circle] | collinearity | Gene collinearity file |
| | ref_length | Reference species length file |
| | query_length | Query species length file |
| | ref_prefix | Reference species chromosome prefix (two letters work best) |
| | query_prefix | Query species chromosome prefix (two letters work best) |
| | font_size | Text font size (to change other parameters, edit circle.py under /you/conda/path/envs/envs_name/lib/python3.12/site-packages/quota_anchor/) |
| | savefig | File name of the saved figure |

Put the following information into the sb_zm/config_file/circle.conf file

# support two species
[circle]
collinearity = sb_zm.table.collinearity
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
ref_prefix = sb-
query_prefix = zm-
font_size = 7
savefig = sb_zm.circle.png

Run this command in the sb_zm directory:

quota_Anchor circle -c ./config_file/circle.conf

Visualizing with quota_Anchor

| Header | Parameter | Description |
| --- | --- | --- |
| [line] | collinearity | Gene collinearity files, e.g. species1_species2.collinearity, species2_species3.collinearity, species3_species4.collinearity |
| | length_file | Species length files, e.g. species1_length.txt, species2_length.txt, species3_length.txt, species4_length.txt |
| | prefix | Species prefixes, e.g. species1, species2, species3, species4 |
| | remove_chromosome_prefix | Chromosome prefixes to remove in the result plot (default: chr,CHR,Chr) |
| | text_font_size | Text font size in the figure |
| | savefig | File name of the saved figure |

Put the following information into the sb_zm/config_file/line.conf file

#Figure from bottom to top (ref:species1, query:species2, ref:species2,query:species3, ref:species3,query:species4 ....)
[line]
collinearity = sb_zm.table.collinearity
length_file = ../length_file/sb_length.txt, ../length_file/zm_length.txt
prefix = Sorghum, Maize
remove_chromosome_prefix = chr,CHR,Chr
text_font_size = 8
savefig = sb_zm.line.png

Run this command in the sb_zm directory (the subcommand listed in the help output is line_2):

quota_Anchor line_2 -c ./config_file/line.conf

Visualizing with quota_Anchor

Put the following information into the sb_zm/config_file/dotplot.conf file

# width=1500 and height=1200 work well for maize (query) vs sorghum_bicolor (ref)
[dotplot]
input_file = sb_zm.table
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
type = order
query_name = Maize
ref_name = Sorghum bicolor
plotnine_figure_width=1500 
plotnine_figure_height=1200
filename= sb_zm.order.table.png

Run this command in the sb_zm directory:

quota_Anchor dotplot -c ./config_file/dotplot.conf

Visualizing with quota_Anchor

To plot the collinearity result instead of the raw table, put the following information into the sb_zm/config_file/dotplot.conf file

# width=1500 and height=1200 work well for maize (query) vs sorghum_bicolor (ref)
[dotplot]
input_file = sb_zm.table.collinearity
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
type = order
query_name = Maize
ref_name = Sorghum bicolor
plotnine_figure_width=1500 
plotnine_figure_height=1200
filename= sb_zm.table.collinearity.png

Run this command in the sb_zm directory:

quota_Anchor dotplot -c ./config_file/dotplot.conf

Visualizing with quota_Anchor

For a plot covering more than two species, put the following information into the line.conf file

# Figure from bottom to top (ref:species1, query:species2, ref:species2,query:species3, ref:species3,query:species4)
[line]
collinearity = oryza.sorghum.table.collinearity, sorghum.maize.table.collinearity, zm_sv/maize.setaria.table.collinearity
length_file = os_length.txt, sb_length.txt, zm_length.txt, sv_length.txt
prefix = Oryza sativa, Sorghum bicolor, Maize, Setaria viridis
remove_chromosome_prefix = chr,CHR,Chr
text_font_size = 7
savefig = os_sb_zm_sv.line.png

Then run:

quota_Anchor line_2 -c line.conf
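
In the multi-species config, the example lists one collinearity file per adjacent species pair (three files for four species), ordered from the bottom of the figure to the top. The sketch below checks that relationship and pairs each collinearity file with its reference and query species. The function name adjacent_pairs is hypothetical, and the "one fewer collinearity file than species" rule is inferred from the example, not taken from the tool's source.

```python
def adjacent_pairs(
    collinearity: str, length_file: str, prefix: str
) -> list[tuple[str, str, str]]:
    """Return [(collinearity_file, ref_species, query_species), ...] from comma-separated values."""

    def split(value: str) -> list[str]:
        return [item.strip() for item in value.split(",") if item.strip()]

    coll, lengths, names = split(collinearity), split(length_file), split(prefix)
    if len(lengths) != len(names):
        raise ValueError("length_file and prefix need the same number of entries")
    if len(coll) != len(names) - 1:
        raise ValueError("expected one collinearity file per adjacent species pair")
    # Species i is the reference and species i + 1 the query of file i.
    return [(coll[i], names[i], names[i + 1]) for i in range(len(coll))]
```

For the two-species example earlier in this README, the result is a single entry with Sorghum as the reference and Maize as the query.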

Example2

To only identify and extract the longest protein sequence encoded by each gene, put the following information into the pre_collinearity.conf file

[gffread]
genome_seq = species1.fa, species2.fa, species3.fa, species4.fa
gff_file = species1.gff3, species2.gff3, species3.gff3, species4.gff3
out_pep_seq = species1.p.fa, species2.p.fa, species3.p.fa, species4.p.fa
# The next line is the description of the S parameter of gffread(https://github.com/gpertea/gffread), you need to set True in general.
# -S    for -y option, use '*' instead of '.' as stop codon translation
use_S_parameter = True

[longest_pep]
out_longest_pep_name = species1.protein.fa, species2.protein.fa, species3.protein.fa, species4.protein.fa
thread = 4

Then run:

quota_Anchor pre_col -c pre_collinearity.conf -only_longest_pep
