Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Similar number of scaffolds to contigs #182

Open
dgs108 opened this issue Jul 17, 2024 · 2 comments
Open

Similar number of scaffolds to contigs #182

dgs108 opened this issue Jul 17, 2024 · 2 comments

Comments

@dgs108
Copy link

dgs108 commented Jul 17, 2024

I'm working on scaffolding a HiCanu assembly for a shark species that has had duplicates purged. It was assembled with ~45x PacBio Hifi (Median length: 13,710 bp; Mean length: 13,590 bp; Max. length: 62,066 bp). Here are some assembly stats: 3,054 contigs, largest contig is 37,942,190 bp, total length is 4,155,925,466 bp, L50 is 186, L90 is 906.

I have scaffolded using the most recent version of SALSA2 after following the Arima preparation pipeline (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/Arima_Mapping_UserGuide_A160156_v03.pdf) with 559 million paired-end 150 bp Hi-C reads produced in a single library prep.

SALSA placed the 3,054 contigs into 1,873 scaffolds with the following stats: largest scaffold is 166,764,294 bp, total length is 4,156,643,966 bp, L50 is 19, L90 is 317.

BUSCO scores for the scaffolded assembly (without polishing) are decent (92.4% complete; 4.0% fragmented; 3.6% missing) but I would like to improve on these and (more importantly) get the assembly closer to chromosome level, if possible with the data I have.

Any advice would be much appreciated!

Here is my script:

#!/bin/bash
#SBATCH -J scaff
#SBATCH -o scaff.out
#SBATCH -e scaff.err
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -p normal
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --time=96:00:00

source /home/dswift/.bashrc
conda activate scaffold

# choose preliminary assembly to scaffold and assign

ASSEMBLY_PATH="$WORK/Workspace/sand_tiger/genome/assemblies/canu/purge_dups/purged.fa"

ASSEMBLY=$(basename "$ASSEMBLY_PATH")

# copy to scaffolding directory, rename, and assign

cp -v $ASSEMBLY_PATH .

mv $ASSEMBLY prelim_assembly.fa

PRELIM="prelim_assembly.fa"

# assign prelim prefix

PREFIX=$(basename $PRELIM | cut -d. -f1)

# index assembly ; this step is done when prepping files for SALSA so can be skipped if you've already done it

samtools faidx $PRELIM

bwa index $PRELIM -a bwtsw

# assign forward and reverse hi-c reads and trim

HIC_F="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R1_001.fastq.gz"

HIC_R="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R2_001.fastq.gz"

zcat $HIC_F | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R1_trim.fastq.gz

zcat $HIC_R | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R2_trim.fastq.gz

# align trimmed forward and reverse hi-c reads to preliminary assembly independently 

bwa mem -t 64 $PRELIM ../hic/hic_R1_trim.fastq.gz | samtools view -@ 64 -Sb - > $PREFIX"_1.bam"

bwa mem -t 64 $PRELIM ../hic/hic_R2_trim.fastq.gz | samtools view -@ 64 -Sb - > $PREFIX"_2.bam"

# filter bam files

samtools view -h $PREFIX"_1.bam" -@ 64 | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -Sb - > $PREFIX"_filt_1.bam"

samtools view -h $PREFIX"_2.bam" -@ 64 | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -Sb - > $PREFIX"_filt_2.bam"

# pair filtered bam files and sort

perl ~/bin/two_read_bam_combiner.pl $PREFIX"_filt_1.bam" $PREFIX"_filt_2.bam" samtools 10 | samtools view -bS -t $PRELIM".fai" - | samtools sort -@ 64 -o temp.bam -

# add read groups to bam

java -Xmx4G -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar AddOrReplaceReadGroups INPUT=temp.bam OUTPUT=paired.bam ID=$PRELIM LB=$PRELIM SM=$ASSEMBLY PL=ILLUMINA PU=none

# discard PCR duplicates

java -Xmx60G -XX:-UseGCOverheadLimit -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar MarkDuplicates INPUT=paired.bam OUTPUT=align.bam METRICS_FILE=metrics.alignment.txt TMP_DIR=temp/ ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE

# index alignment.bam

samtools index align.bam -@ 64

# produce stats

perl ~/bin/get_stats.pl align.bam > bamstats.txt

samtools flagstat align.bam -@ 64 > flagstats.txt

# convert bam to bed format required by SALSA2 and sort by read name

bamToBed -i align.bam > align.bed

sort -k 4 align.bed > tmp && mv tmp align.bed

# change conda env

conda deactivate

conda activate salsa2_new

# SALSA2

run_pipeline.py -a prelim_assembly.fa -l prelim_assembly.fa.fai -b align.bed -e GATC,GACTC,GAGTC,GATTC,GAATC -o ctau_canu_purged_scaffold_r3 -m yes

@skoren
Copy link
Member

skoren commented Jul 17, 2024

I think your assembly is close to chromosome scale, at least at the N50 level. According to this: https://www.genomesize.com/result_species.php?id=1665 the species has 43 diploid chromosomes so having half the genome in 19 is on par with that. An assembly from another shark species (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037974335.1/) supports this. The BUSCO wouldn't really be affected by scaffolding since no new sequence is added to the assembly and any joins have gaps so a gene that is partial will still likely be partial.

I'd suggest also trying YAHS (https://github.com/c-zhou/yahs) and relying on curation (e.g. https://github.com/BGAcademy23/manual-curation) to finalize the assembly.

@dgs108
Copy link
Author

dgs108 commented Jul 17, 2024

Thanks for the quick response! It is the sand tiger shark and its genome is a fair bit larger than that: https://www.genomesize.com/results.php?page=1

Based on the other genomes I have scaffolded, I expected <1500 scaffolds given the number of contigs. I will look into those other tools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants