quota_Anchor( Proustian monstrosity)

This repository provides the scripts and documentation for strand- and WGD-aware identification of syntenic genes between a pair of genomes, using the longest path algorithm implemented in AnchorWave. Three visualization methods are currently provided for syntenic results.

Installation

You can install this software into a separate conda environment with the following command. This is a beta version, so it has not been uploaded to bioconda yet.

conda install xiaodli::quota_anchor

Usage

Help info

quota_Anchor -h

usage: quota_Anchor [-h] [-v] {pre_col,col,get_chr_length,dotplot,circle,line_2,line_proali} ...

Conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave:
options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

gene collinearity analysis:
  {pre_col,col,get_chr_length,dotplot,circle,line_2,line_proali}
    pre_col             Get longest protein file from gffread result and input file for gene collinearity analysis
    col                 Get gene collinearity result file
    get_chr_length      Get chromosome length and name info from fai file
    dotplot             Collinearity result visualization
    circle              Collinearity result visualization
    line_2              Collinearity result visualization
    line_proali         Anchors file from AnchorWave proali visualization

Example1

Here is an example of identifying syntenic genes between maize and sorghum. The maize lineage has undergone a whole genome duplication (WGD) since its divergence from sorghum, but subsequent chromosomal fusions left the two species with the same chromosome number (n = 10). AnchorWave can therefore allow up to two collinear paths for each sorghum anchor but only one collinear path for each maize anchor.

Using several folders rather than a single one may be clearer

The working directory structure is as follows. An option to create it may be added later; for now you need to create it yourself.

├── length_file
│   └── get_length.conf
├── raw_data
│   ├── Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3
│   ├── Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa
│   ├── Zm-B73-REFERENCE-NAM-5.0.fa
│   └── Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
└── sb_zm
    └── config_file
        ├── circle.conf
        ├── collinearity.conf
        ├── dotplot.conf
        ├── line.conf
        └── pre_collinearity.conf
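
Since the layout has to be created by hand for now, a short Python sketch like the one below could set it up. The helper name `make_layout` and the `LAYOUT` mapping are illustrative and not part of quota_Anchor; the folder and file names are the ones shown in the tree above, and the config files are created empty.

```python
from pathlib import Path

# Folders and (empty) config files taken from the tree above.
# raw_data is left empty here; the genome and GFF files are downloaded later.
LAYOUT = {
    "length_file": ["get_length.conf"],
    "raw_data": [],
    "sb_zm/config_file": [
        "circle.conf",
        "collinearity.conf",
        "dotplot.conf",
        "line.conf",
        "pre_collinearity.conf",
    ],
}


def make_layout(root):
    """Create the example folders and empty config files under `root`."""
    root = Path(root)
    created = []
    for folder, files in LAYOUT.items():
        directory = root / folder
        directory.mkdir(parents=True, exist_ok=True)
        for name in files:
            # touch() creates the file if missing and keeps existing content.
            (directory / name).touch(exist_ok=True)
        created.append(directory)
    return created
```

Calling `make_layout(".")` in an empty working directory would produce the structure above; running it again is harmless because existing folders and files are left in place.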

Genome and annotation data preparation

Put the genome and GFF files into the raw_data directory.

wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3.gz
gunzip *gz

Modify the config file and run pre_collinearity (maize vs sorghum)

This includes four steps (implemented in "quota_Anchor pre_col"):

Extract and translate protein sequences from genome sequences and annotations
Identify and extract the longest protein sequence encoded by each gene
Align the protein sequences using DIAMOND or BLASTp
Put the gene strand information and the blast result into a single file
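
The second step (keeping the longest protein per gene) can be pictured with the minimal sketch below. It is not quota_Anchor's implementation: the function name and the transcript-to-gene mapping (which in practice would be derived from the GFF file) are assumptions for illustration, and a trailing '*' stop codon is not counted toward the length.

```python
def longest_protein_per_gene(proteins, transcript_to_gene):
    """Pick the longest protein for each gene.

    proteins: dict mapping transcript ID -> protein sequence
    transcript_to_gene: dict mapping transcript ID -> gene ID (hypothetical input)
    Returns a dict mapping gene ID -> (transcript ID, protein sequence).
    """
    best = {}
    for transcript_id, sequence in proteins.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            continue  # transcript without a known parent gene is skipped
        length = len(sequence.rstrip("*"))  # ignore a trailing stop codon
        current = best.get(gene_id)
        # On ties the first transcript seen is kept.
        if current is None or length > len(current[1].rstrip("*")):
            best[gene_id] = (transcript_id, sequence)
    return best
```

For example, with two transcripts of one gene whose proteins are `MKV*` and `MKVLLA*`, only the second would be kept for that gene.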

Header	Parameter	Description
[gffread]	ref_genome_seq	Path to the reference genome FASTA file
	ref_gff_file	Path to the reference GFF file
	output_ref_pep_seq	Filename for the reference protein sequence file that the software will generate
	query_genome_seq	Path to the query genome FASTA file
	query_gff_file	Path to the query GFF file
	output_query_pep_seq	Filename for the query protein sequence file that the software will generate
	use_S_parameter	Whether to pass -S to gffread when using its -y option, i.e. use '*' instead of '.' as the stop codon in translations (modifying this is not recommended; default: True)
[longest_pep]	out_ref_longest_pep_name	Filename for the reference longest protein sequence file that the software will generate
	out_query_longest_pep_name	Filename for the query longest protein sequence file that the software will generate
[align]	align	Protein sequence aligner to use (type diamond or blastp)
[diamond]	database_name	Path/name of the diamond-blastp database that the software will generate
	output_blast_result	Path of the diamond-blastp result file that the software will generate
	max_target_seqs	Maximum number of target sequences to report alignments for
	evalue	Maximum e-value to report alignments
[blastp]	database_name	Path/name of the blastp database that the software will generate
	dtype	Database data type (type prot)
	output_blast_result	Path of the blastp result file that the software will generate
	evalue	Maximum e-value to report alignments
	max_target_seqs	Maximum number of target sequences to report alignments for
	thread	Number of CPU threads
	outfmt	Output format (type 6, BLAST tabular)
[combineBlastAndStrand]	out_file	Filename for the combined file of blast results and gene strand information that the software will generate
	bitscore	Minimum blast bitscore for a hit to be kept (default: 100)
	align_length	Minimum blast alignment length for a hit to be kept (default: 0)
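
To make the bitscore and align_length filters concrete, here is a small sketch that applies them to BLAST tabular (outfmt 6) lines, where the alignment length is the 4th column and the bitscore is the 12th. The function name is hypothetical, and treating both thresholds as inclusive minimums is an assumption rather than a statement about how quota_Anchor filters internally.

```python
def filter_blast_hits(lines, min_bitscore=100.0, min_align_length=0):
    """Keep outfmt 6 rows meeting both minimum thresholds (assumed inclusive)."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column outfmt 6 row
        align_length = int(fields[3])   # column 4: alignment length
        bitscore = float(fields[11])    # column 12: bit score
        if bitscore >= min_bitscore and align_length >= min_align_length:
            kept.append(fields)
    return kept


# Two made-up hits for illustration; only the first passes bitscore >= 100.
example_hits = [
    "Zm_gene1\tSb_gene1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350.0",
    "Zm_gene2\tSb_gene2\t40.0\t50\t30\t2\t1\t50\t1\t50\t1e-11\t60.5",
]
```

With the defaults from the table, `filter_blast_hits(example_hits)` keeps the first hit only.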

Put the following information into the sb_zm/config_file/pre_collinearity.conf file.

[gffread]
ref_genome_seq = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa
ref_gff_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3
output_ref_pep_seq = sb.p.fa
query_genome_seq = ../raw_data/Zm-B73-REFERENCE-NAM-5.0.fa
query_gff_file = ../raw_data/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
output_query_pep_seq = zm.p.fa
# The next line is the description of the S parameter of gffread(https://github.com/gpertea/gffread), you need to set True in general.
# -S    for -y option, use '*' instead of '.' as stop codon translation
use_S_parameter = True

[longest_pep]
out_ref_longest_pep_name = sorghum.protein.fa
out_query_longest_pep_name = maize.protein.fa

[align]
align = diamond

[diamond]
# use ref protein seq construct database
database_name = sorghum.db
output_blast_result = sorghum.maize.diamond.blastp
max_target_seqs = 20
evalue = 1e-10

[blastp]
database_name = sorghum.blastp.db
dtype = prot
output_blast_result = sorghum.maize.blastp
evalue = 1e-10
max_target_seqs = 20
thread = 6
outfmt = 6

[combineBlastAndStrand]
out_file = sb_zm.table
bitscore = 100 
align_length = 0

You can run this command in the sb_zm directory.

quota_Anchor pre_col -c ./config_file/pre_collinearity.conf
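
Before launching a long run it can help to check that the config file parses and that the chosen aligner section is complete. The sketch below uses Python's standard configparser module on an abridged copy of the example config; whether quota_Anchor itself reads the file this way is an assumption, and the function name is hypothetical.

```python
import configparser


def summarize_pre_col_config(text):
    """Parse a pre_collinearity config and return the settings that will be used."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    aligner = parser.get("align", "align").strip()
    if aligner not in ("diamond", "blastp"):
        raise ValueError(f"align must be 'diamond' or 'blastp', got {aligner!r}")
    return {
        "aligner": aligner,
        # Only the section matching the chosen aligner is consulted here.
        "database_name": parser.get(aligner, "database_name"),
        "evalue": parser.getfloat(aligner, "evalue"),
        "max_target_seqs": parser.getint(aligner, "max_target_seqs"),
        "bitscore": parser.getfloat("combineBlastAndStrand", "bitscore"),
        "align_length": parser.getint("combineBlastAndStrand", "align_length"),
    }


# Abridged copy of the example config above.
EXAMPLE_CONFIG = """
[align]
align = diamond

[diamond]
# use ref protein seq construct database
database_name = sorghum.db
output_blast_result = sorghum.maize.diamond.blastp
max_target_seqs = 20
evalue = 1e-10

[combineBlastAndStrand]
out_file = sb_zm.table
bitscore = 100
align_length = 0
"""
```

`summarize_pre_col_config(EXAMPLE_CONFIG)` returns a dictionary of the parsed values, and a missing section or option raises a configparser error that names what is absent.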

Collinearity analysis(maize vs sorghum)

Header	Parameter	Description
[AnchorWave]	R	Maximum number of occurrences of a reference gene in the collinearity file
	Q	Maximum number of occurrences of a query gene in the collinearity file
	maximum_gap_size	Maximum gap size for a chain
	collapse_blast_matches	Whether to collapse blast matches in the input file (default: 0). Options: 0, don't collapse blast matches; 1 or any other integer, collapse them.
	overlap_window	When -m is set to collapse blast matches (1 or any other integer), the maximum distance (overlap window) allowed between two homologous gene pairs before they are considered for deletion (default: 5). This parameter is ignored if -m is set to 0.
	input_file_name	Path of the output file of the quota_Anchor pre_col module (the value of out_file in [combineBlastAndStrand])
	output_coll_name	Filename for the syntenic result file that the software will generate

Put the following information into the sb_zm/config_file/collinearity.conf file and run the collinearity analysis.

[AnchorWave]
# The R value indicates the maximum number of occurrences of a gene in the collinearity file, and Q means the same as R.
# For maize and sorghum, maize has undergone an additional whole-genome duplication compared to sorghum.
# If sorghum is used as a reference, you can set R to 2 and Q to 1.
R = 2
Q = 1
maximum_gap_size = 25
collapse_blast_matches = 0
overlap_window = 5
input_file_name = sb_zm.table
output_coll_name = sb_zm.table.collinearity

quota_Anchor col -c ./config_file/collinearity.conf
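
The meaning of R and Q can be checked on a result with a few lines of Python. The sketch below works on a plain list of (reference gene, query gene) pairs, which is a hypothetical in-memory form: it does not assume any particular column layout of the collinearity file, and the function name is illustrative.

```python
from collections import Counter


def quota_violations(pairs, r=2, q=1):
    """Return genes that occur more often than the quota allows.

    pairs: iterable of (reference_gene, query_gene) tuples
    r: maximum occurrences allowed for a reference gene
    q: maximum occurrences allowed for a query gene
    """
    pairs = list(pairs)
    ref_counts = Counter(ref for ref, _ in pairs)
    query_counts = Counter(query for _, query in pairs)
    over_ref = {gene: n for gene, n in ref_counts.items() if n > r}
    over_query = {gene: n for gene, n in query_counts.items() if n > q}
    return over_ref, over_query


# Made-up gene names: one sorghum gene paired with two maize genes,
# which is allowed with R = 2 and Q = 1.
example_pairs = [("Sb1", "Zm1"), ("Sb1", "Zm2"), ("Sb2", "Zm3")]
```

With R = 2 and Q = 1, both returned dictionaries are empty for `example_pairs`; a third pairing of `Sb1` would be reported as exceeding the reference quota.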

Visualizing by R code

The sb_zm.table file can be visualized via the following R code:

library(ggplot2)
changetoM <- function ( position ){
  position=position/1000000;
  paste(position, "M", sep="")
}
data = read.table("sb_zm.table")
# Label a gene pair "+" when both genes are on the same strand, "-" otherwise
data$strand = ifelse(data$V6 == data$V12, "+", "-")

data = data[which(data$V8 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")),]
data = data[which(data$V2 %in% c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10")),]
data$V8 = factor(data$V8, levels=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
data$V2 = factor(data$V2, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10"))

plot = ggplot(data=data, aes(x=V10, y=V4))+geom_point(size=0.5, aes(color=strand))+facet_grid(V2~V8, scales="free", space="free" )+ theme_grey(base_size = 30) +
    labs(x="sorghum", y="maize")+scale_x_continuous(labels=changetoM) + scale_y_continuous(labels=changetoM) +
    theme(axis.line = element_blank(),
          panel.background = element_blank(),
          panel.border = element_rect(fill=NA,color="black", linewidth=0.5, linetype="solid"),
          axis.text.y = element_text( colour = "black"),
          legend.position='none',
          axis.text.x = element_text(angle=300, hjust=0, vjust=1, colour = "black") )
png("sorghum.maize.table.png" , width=2000, height=1500)
plot
dev.off()

The sb_zm.table.collinearity file can be visualized via the following R code:

library(ggplot2)
changetoM <- function ( position ){
  position=position/1000000;
  paste(position, "M", sep="")
}

data = read.table("sb_zm.table.collinearity", header=T)
data = data[which(data$refChr %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")),]
data = data[which(data$queryChr %in% c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10")),]
data$refChr = factor(data$refChr, levels=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
data$queryChr = factor(data$queryChr, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10"))

plot = ggplot(data=data, aes(x=queryStart, y=referenceStart))+geom_point(size=0.5, aes(color=strand))+facet_grid(refChr~queryChr, scales="free", space="free" )+ 
  theme_grey(base_size = 30) +
  labs(x="maize", y="sorghum")+scale_x_continuous(labels=changetoM) + scale_y_continuous(labels=changetoM) +
  theme(axis.line = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(fill=NA,color="black", linewidth=0.5, linetype="solid"),
        axis.text.y = element_text( colour = "black"),
        legend.position='none',
        axis.text.x = element_text(angle=300, hjust=0, vjust=1, colour = "black") )

png("sorghum.maize.colinearity.png" , width=2000, height=1500)
plot
dev.off()

Get chromosome length info

[length]	fai_file	species1.fai, species2.fai, species3.fai, species4.fai
	gff_file	species1.gff3, species2.gff3, species3.gff3, species4.gff3
	select_fai_chr_startswith	number,CHR,chr,Chr:number,CHR,chr,Chr:number,CHR,chr,Chr:number,CHR,chr,Chr
	length_file	This module will output species1.txt, species2.txt, species3.txt, species4.txt

Put the following information into the length_file/get_length.conf file and run the command from the length_file directory.

# Running quota_Anchor pre_col produces the fai files.
# The length information is obtained from the fai file and the raw GFF file.
# An example maize length file is shown below.

#chr     length  total_gene
#chr1    308452471       5892
#chr2    243675191       4751
#chr3    238017767       4103
#chr4    250330460       4093
#chr5    226353449       4485
#chr6    181357234       3412
#chr7    185808916       3070
#chr8    182411202       3536
#chr9    163004744       2988
#chr10   152435371       2705
# select_fai_chr_startswith parameter:
# number: select chromosomes whose names start with a number, then report their lengths.
# chr: select chromosomes whose names start with the string chr, then report their lengths.
# Chr: select chromosomes whose names start with the string Chr, then report their lengths.
[length]
fai_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.fai, ../raw_data/Zm-B73-REFERENCE-NAM-5.0.fa.fai
gff_file = ../raw_data/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3, ../raw_data/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3
# By default, the first column of the lines starting with chr or Chr or CHR in the fai file are extracted for plotting.
select_fai_chr_startswith = number,CHR,chr,Chr:number,CHR,chr,Chr
length_file = sb_length.txt, zm_length.txt

quota_Anchor get_chr_length -c get_length.conf
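
To illustrate what selecting chromosomes by name prefix means, the sketch below reads lines in the samtools fai layout (sequence name in the first column, length in the second) and keeps only names that start with one of the given prefixes or with a digit. It is a simplified stand-in, not the get_chr_length implementation: the function name is hypothetical and the gene counting from the GFF file is left out.

```python
def chromosome_lengths(fai_lines, prefixes=("chr", "Chr", "CHR"), allow_numeric=True):
    """Return {sequence name: length} for names matching the selection rules."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        starts_with_prefix = name.startswith(tuple(prefixes))
        starts_with_digit = allow_numeric and name[:1].isdigit()
        if starts_with_prefix or starts_with_digit:
            lengths[name] = length
    return lengths


# The chr1 length is the maize value listed above; the remaining columns
# (offset, line bases, line width) and the other two rows are made up.
example_fai = [
    "chr1\t308452471\t6\t80\t81\n",
    "scaf_21\t1000\t400\t80\t81\n",
    "1\t500\t10\t60\t61\n",
]
```

Here `chromosome_lengths(example_fai)` keeps `chr1` and `1` and drops the scaffold, which is why unplaced scaffolds do not show up in the length files.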

Visualizing by quota_Anchor

[circle]	collinearity	Gene collinearity file
	ref_length	Reference Species length file
	query_length	Query Species length file
	ref_prefix	Reference Species chromosome prefix(two letters are better)
	query_prefix	Query Species chromosome prefix(two letters are better)
	font_size	Text font size (to change other plot parameters, edit circle.py under /you/conda/path/envs/envs_name/lib/python3.12/site-packages/quota_anchor/)
	savefig	Specify the file name to save

Put the following information into the sb_zm/config_file/circle.conf file

# support two species
[circle]
collinearity = sb_zm.table.collinearity
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
ref_prefix = sb-
query_prefix = zm-
font_size = 7
savefig = sb_zm.circle.png

quota_Anchor circle -c ./config_file/circle.conf

Visualizing by quota_Anchor

[line]	collinearity	Gene collinearity file, e.g. species1_species2.collinearity, species2_species3.collinearity, species3_species4.collinearity
	length_file	Species length file, e.g. species1_length.txt, species2_length.txt, species3_length.txt, species4_length.txt
	prefix	Species prefix, e.g. species1, species2, species3, species4
	remove_chromosome_prefix	Remove chromosome prefix in the result plot (default: chr,CHR,Chr)
	text_font_size	Adjust the text font size in the picture
	savefig	Specify the file name to save

Put the following information into the sb_zm/config_file/line.conf file

#Figure from bottom to top (ref:species1, query:species2, ref:species2,query:species3, ref:species3,query:species4 ....)
[line]
collinearity = sb_zm.table.collinearity
length_file = ../length_file/sb_length.txt, ../length_file/zm_length.txt
prefix = Sorghum, Maize
remove_chromosome_prefix = chr,CHR,Chr
text_font_size = 8
savefig = sb_zm.line.png

quota_Anchor line_2 -c ./config_file/line.conf

Visualizing by quota_Anchor

Put the following information into the sb_zm/config_file/dotplot.conf file

# width=1500 and height=1200 work well for maize (query) vs sorghum_bicolor (ref)
[dotplot]
input_file = sb_zm.table
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
type = order
query_name = Maize
ref_name = Sorghum bicolor
plotnine_figure_width=1500 
plotnine_figure_height=1200
filename= sb_zm.order.table.png

quota_Anchor dotplot -c ./config_file/dotplot.conf

Visualizing by quota_Anchor

Put the following information into the sb_zm/config_file/dotplot.conf file

# width=1500 and height=1200 work well for maize (query) vs sorghum_bicolor (ref)
[dotplot]
input_file = sb_zm.table.collinearity
ref_length = ../length_file/sb_length.txt
query_length = ../length_file/zm_length.txt
type = order
query_name = Maize
ref_name = Sorghum bicolor
plotnine_figure_width=1500 
plotnine_figure_height=1200
filename= sb_zm.table.collinearity.png

quota_Anchor dotplot -c ./config_file/dotplot.conf

Visualizing by quota_Anchor

Put the following information into the line.conf file

# Figure from bottom to top (ref:species1, query:species2, ref:species2,query:species3, ref:species3,query:species4)
[line]
collinearity = oryza.sorghum.table.collinearity, sorghum.maize.table.collinearity, zm_sv/maize.setaria.table.collinearity
length_file = os_length.txt, sb_length.txt, zm_length.txt, sv_length.txt
prefix = Oryza sativa, Sorghum bicolor, Maize, Setaria viridis
remove_chromosome_prefix = chr,CHR,Chr
text_font_size = 7
savefig = os_sb_zm_sv.line.png

quota_Anchor line_2 -c line.conf
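
In the multi-species config the three comma-separated lists have to line up: judging from the example, N species need N length files, N prefixes and N-1 collinearity files, one per adjacent (ref, query) pair from bottom to top. The sketch below checks that relationship before plotting. The N-1 rule is inferred from the example rather than from the quota_Anchor source, and the function names are hypothetical.

```python
def split_list(value):
    """Split a comma-separated config value into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def pair_line_inputs(collinearity, length_file, prefix):
    """Return (ref prefix, query prefix, collinearity file) for each adjacent pair."""
    coll_files = split_list(collinearity)
    length_files = split_list(length_file)
    prefixes = split_list(prefix)
    if len(length_files) != len(prefixes):
        raise ValueError("length_file and prefix need the same number of entries")
    if len(coll_files) != len(prefixes) - 1:
        raise ValueError("expected one collinearity file per adjacent species pair")
    # Species i is the reference and species i + 1 the query of file i.
    return list(zip(prefixes[:-1], prefixes[1:], coll_files))


pairs = pair_line_inputs(
    "oryza.sorghum.table.collinearity, sorghum.maize.table.collinearity, "
    "zm_sv/maize.setaria.table.collinearity",
    "os_length.txt, sb_length.txt, zm_length.txt, sv_length.txt",
    "Oryza sativa, Sorghum bicolor, Maize, Setaria viridis",
)
```

Each returned tuple names the reference species, the query species and the collinearity file that links them, which makes a misordered or missing file easy to spot.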

Example2

To only identify and extract the longest protein sequence encoded by each gene, put the following information into the pre_collinearity.conf file and run pre_col with the -only_longest_pep flag.

[gffread]
genome_seq = species1.fa, species2.fa, species3.fa, species4.fa
gff_file = species1.gff3, species2.gff3, species3.gff3, species4.gff3
out_pep_seq = species1.p.fa, species2.p.fa, species3.p.fa, species4.p.fa
# The next line is the description of the S parameter of gffread(https://github.com/gpertea/gffread), you need to set True in general.
# -S    for -y option, use '*' instead of '.' as stop codon translation
use_S_parameter = True

[longest_pep]
out_longest_pep_name = species1.protein.fa, species2.protein.fa, species3.protein.fa, species4.protein.fa
thread = 4

quota_Anchor pre_col -c pre_collinearity.conf -only_longest_pep
