I'm working on scaffolding a HiCanu assembly for a shark species that has had duplicates purged. It was assembled from ~45x PacBio HiFi reads (median length: 13,710 bp; mean length: 13,590 bp; max length: 62,066 bp). Contig-level stats: 3,054 contigs; largest contig 37,942,190 bp; total length 4,155,925,466 bp; L50 186; L90 906.
I scaffolded with the most recent version of SALSA2 after following the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/Arima_Mapping_UserGuide_A160156_v03.pdf), using 559 million paired-end 150 bp Hi-C reads produced in a single library prep.
SALSA placed the 3,054 contigs into 1,873 scaffolds with the following stats: largest scaffold 166,764,294 bp; total length 4,156,643,966 bp; L50 19; L90 317.
BUSCO scores for the scaffolded assembly (without polishing) are decent (92.4% complete; 4.0% fragmented; 3.6% missing), but I would like to improve on these and, more importantly, get the assembly closer to chromosome level, if that's possible with the data I have.
Any advice would be much appreciated!
Here is my script:
#!/bin/bash
#SBATCH -J scaff
#SBATCH -o scaff.out
#SBATCH -e scaff.err
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -p normal
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --time=96:00:00
source /home/dswift/.bashrc
conda activate scaffold
# choose preliminary assembly to scaffold and assign
ASSEMBLY_PATH="$WORK/Workspace/sand_tiger/genome/assemblies/canu/purge_dups/purged.fa"
ASSEMBLY=$(basename "$ASSEMBLY_PATH")
# copy to scaffolding directory, rename, and assign
cp -v "$ASSEMBLY_PATH" .
mv "$ASSEMBLY" prelim_assembly.fa
PRELIM="prelim_assembly.fa"
# assign prelim prefix
PREFIX=$(basename "$PRELIM" | cut -d. -f1)
# index assembly; this step is done when prepping files for SALSA so can be skipped if you've already done it
samtools faidx "$PRELIM"
bwa index -a bwtsw "$PRELIM"
# assign forward and reverse Hi-C reads and trim the first 5 bp
HIC_F="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R1_001.fastq.gz"
HIC_R="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R2_001.fastq.gz"
# NR%2==0 hits the sequence and quality lines of each 4-line FASTQ record,
# so both are trimmed by the same 5 bp and stay in sync
zcat "$HIC_F" | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R1_trim.fastq.gz
zcat "$HIC_R" | awk '{ if(NR%2==0) {print substr($1,6)} else {print}}' | gzip > ../hic/hic_R2_trim.fastq.gz
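# optional sanity check: R1 and R2 must keep identical read counts after trimming,
# otherwise the read-pairing step downstream will mis-pair reads
# zcat ../hic/hic_R1_trim.fastq.gz | wc -l
# zcat ../hic/hic_R2_trim.fastq.gz | wc -l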
# align trimmed forward and reverse Hi-C reads to the preliminary assembly
# independently (single-end mode, per the Arima pipeline)
bwa mem -t 64 "$PRELIM" ../hic/hic_R1_trim.fastq.gz | samtools view -@ 64 -b - > "${PREFIX}_1.bam"
bwa mem -t 64 "$PRELIM" ../hic/hic_R2_trim.fastq.gz | samtools view -@ 64 -b - > "${PREFIX}_2.bam"
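# optional: verify both single-end BAMs are intact before filtering
# (samtools quickcheck exits non-zero on truncated files)
# samtools quickcheck "${PREFIX}_1.bam" "${PREFIX}_2.bam" && echo "BAMs OK"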
# filter bam files
samtools view -h -@ 64 "${PREFIX}_1.bam" | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -b - > "${PREFIX}_filt_1.bam"
samtools view -h -@ 64 "${PREFIX}_2.bam" | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -b - > "${PREFIX}_filt_2.bam"
# pair filtered bam files (the trailing 10 is the MAPQ cutoff applied to both mates) and sort by coordinate
perl ~/bin/two_read_bam_combiner.pl "${PREFIX}_filt_1.bam" "${PREFIX}_filt_2.bam" samtools 10 | samtools view -b -t "${PRELIM}.fai" - | samtools sort -@ 64 -o temp.bam -
# add read groups to bam
java -Xmx4G -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar AddOrReplaceReadGroups INPUT=temp.bam OUTPUT=paired.bam ID=$PRELIM LB=$PRELIM SM=$ASSEMBLY PL=ILLUMINA PU=none
# discard PCR duplicates
java -Xmx60G -XX:-UseGCOverheadLimit -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar MarkDuplicates INPUT=paired.bam OUTPUT=align.bam METRICS_FILE=metrics.alignment.txt TMP_DIR=temp/ ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE
# index align.bam
samtools index -@ 64 align.bam
# produce stats
perl ~/bin/get_stats.pl align.bam > bamstats.txt
samtools flagstat align.bam -@ 64 > flagstats.txt
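# optional: pull PERCENT_DUPLICATION out of the MarkDuplicates metrics file
# (assumes the standard Picard metrics layout, where it is the 9th column of the
# row following the LIBRARY header line); a very high duplication rate would mean
# the Hi-C library holds limited extra signal for further scaffolding rounds
# awk -F'\t' 'f {print "duplication rate:", $9; exit} /^LIBRARY/ {f=1}' metrics.alignment.txt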
# convert bam to bed format required by SALSA2 and sort by read name
bamToBed -i align.bam > align.bed
sort -k 4 align.bed > tmp && mv tmp align.bed
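# note (assumes GNU sort): align.bed for a ~4 Gb genome can be very large, so it
# may be worth giving sort an explicit buffer size, tmp dir, and thread count, e.g.:
# sort -k 4 -S 32G --parallel=16 -T ./sort_tmp align.bed > tmp && mv tmp align.bed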
# change conda env
conda deactivate
conda activate salsa2_new
# SALSA2 (-m yes enables Hi-C-based misassembly correction before scaffolding)
run_pipeline.py -a "$PRELIM" -l "$PRELIM".fai -b align.bed -e GATC,GACTC,GAGTC,GATTC,GAATC -o ctau_canu_purged_scaffold_r3 -m yes
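# quick post-run QC (a sketch: scaffolds_FINAL.fasta is SALSA2's default output
# name inside the -o directory; adjust if your version names it differently)
samtools faidx ctau_canu_purged_scaffold_r3/scaffolds_FINAL.fasta
sort -k2,2nr ctau_canu_purged_scaffold_r3/scaffolds_FINAL.fasta.fai \
  | awk '{len[NR]=$2; tot+=$2} END {for(i=1;i<=NR;i++){c+=len[i]; if(c>=tot/2){print "N50="len[i], "L50="i; exit}}}'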
I think your assembly is already close to chromosome scale, at least at the N50 level. According to https://www.genomesize.com/result_species.php?id=1665, the species has a diploid chromosome number of 43, so having half the genome in 19 scaffolds is on par with that. An assembly from another shark species (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037974335.1/) supports this. BUSCO wouldn't really be affected by scaffolding: no new sequence is added to the assembly, and any joins contain gaps, so a gene that was partial will still likely be partial.
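If you want a quick sanity check along those lines, here is a minimal sketch (assuming the scaffolds_FINAL.fasta.fai index from your SALSA output directory; the scaffold count of 21 is just the approximate haploid number implied by 2n = 43) that reports how much of the assembly sits in the largest 21 scaffolds:

sort -k2,2nr scaffolds_FINAL.fasta.fai | awk 'NR<=21 {top+=$2} {tot+=$2} END {printf "largest 21 scaffolds hold %.1f%% of the assembly\n", 100*top/tot}'

If that fraction is already high, most of the remaining scaffolds are likely small unplaced fragments rather than missing chromosome arms.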