I'm working on scaffolding a duplicate-purged HiCanu assembly of a shark species. It was assembled from ~45x PacBio HiFi reads (median length: 13,710 bp; mean length: 13,590 bp; max. length: 62,066 bp). Here are some assembly stats: 3,054 contigs; largest contig 37,942,190 bp; total length 4,155,925,466 bp; L50 186; L90 906.
SALSA placed the 3,054 contigs into 1,873 scaffolds with the following stats: largest scaffold is 166,764,294 bp, total length is 4,156,643,966 bp, L50 is 19, L90 is 317.
BUSCO scores for the scaffolded assembly (without polishing) are decent (92.4% complete; 4.0% fragmented; 3.6% missing), but I would like to improve on these and, more importantly, get the assembly closer to chromosome level, if that is possible with the data I have.
Any advice would be much appreciated!
Here is my script:
#SBATCH -J scaff
#SBATCH -o scaff.out
#SBATCH -e scaff.err
#SBATCH -n 64
#SBATCH -p normal
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --time=96:00:00
source /home/dswift/.bashrc
conda activate scaffold
# choose preliminary assembly to scaffold and assign
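# copy to scaffolding directory, rename, and assign
# (assumes $ASSEMBLY has already been set to the path of the assembly FASTA)
cp "$ASSEMBLY" prelim_assembly.fa
PRELIM=prelim_assembly.fa
# assign prelim prefix
PREFIX=$(basename "$PRELIM" | cut -d. -f1)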
# index assembly ; this step is done when prepping files for SALSA so can be skipped if you've already done it
samtools faidx $PRELIM
bwa index -a bwtsw $PRELIM
# assign forward and reverse hi-c reads and trim
zcat $HIC_F | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R1_trim.fastq.gz
zcat $HIC_R | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R2_trim.fastq.gz
# align trimmed forward and reverse hi-c reads to preliminary assembly independently
bwa mem -t 64 $PRELIM ../hic/hic_R1_trim.fastq.gz | samtools view -@ 64 -Sb - > $PREFIX"_1.bam"
bwa mem -t 64 $PRELIM ../hic/hic_R2_trim.fastq.gz | samtools view -@ 64 -Sb - > $PREFIX"_2.bam"
# filter bam files
samtools view -h $PREFIX"_1.bam" -@ 64 | perl ~/bin/ | samtools view -@ 64 -Sb - > $PREFIX"_filt_1.bam"
samtools view -h $PREFIX"_2.bam" -@ 64 | perl ~/bin/ | samtools view -@ 64 -Sb - > $PREFIX"_filt_2.bam"
# pair filtered bam files and sort
perl ~/bin/ $PREFIX"_filt_1.bam" $PREFIX"_filt_2.bam" samtools 10 | samtools view -bS -t $PRELIM".fai" - | samtools sort -@ 64 -o temp.bam -
# add read groups to bam
java -Xmx4G -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar AddOrReplaceReadGroups INPUT=temp.bam OUTPUT=paired.bam ID=$PRELIM LB=$PRELIM SM=$ASSEMBLY PL=ILLUMINA PU=none
# discard PCR duplicates
java -Xmx60G -XX:-UseGCOverheadLimit -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar MarkDuplicates INPUT=paired.bam OUTPUT=align.bam METRICS_FILE=metrics.alignment.txt TMP_DIR=temp/ ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE
# index align.bam
samtools index -@ 64 align.bam
# produce stats
perl ~/bin/ align.bam > bamstats.txt
samtools flagstat -@ 64 align.bam > flagstats.txt
# convert bam to bed format required by SALSA2 and sort by read name
bamToBed -i align.bam > align.bed
sort -k 4 align.bed > tmp && mv tmp align.bed
# change conda env
conda deactivate
conda activate salsa2_new
# SALSA2 -a prelim_assembly.fa -l prelim_assembly.fa.fai -b align.bed -e GATC,GACTC,GAGTC,GATTC,GAATC -o ctau_canu_purged_scaffold_r3 -m yes
I think your assembly is close to chromosome scale, at least judging by N50/L50. According to this: the species has 43 diploid chromosomes, so having half the genome in 19 scaffolds is on par with that. An assembly from another shark species ( supports this. BUSCO scores wouldn't really be affected by scaffolding, since no new sequence is added to the assembly and every join is bridged by a gap, so a gene that is only partially assembled will most likely remain partial.
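If it helps to re-check the L50 figure after each scaffolding round, here is a minimal sketch that computes L50 and N50 from a samtools .fai index. It only assumes that column 2 of the .fai holds the sequence length; the function name l50_n50 and the file name in the usage line are made up for illustration.

```shell
#!/usr/bin/env bash
# Print "L50 N50" for a FASTA index (.fai).
# Assumption: column 2 of the .fai is the sequence length, as written by `samtools faidx`.
l50_n50() {
    local fai="$1"
    # Sort lengths from longest to shortest, then walk down the list until
    # the running sum reaches half of the total assembly length.
    cut -f2 "$fai" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run >= half) { print i, len[i]; exit }
            }
        }'
}

# Hypothetical usage: index the scaffolded FASTA first, then
#   l50_n50 scaffolds.fa.fai
```

The first number printed is the count of scaffolds holding half the assembly (L50) and the second is the length of the shortest of those scaffolds (N50), so the first number is the one to compare against the expected chromosome count.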
