Trio binning - high error rates and significant assembly size differences #709

BrianSmart opened this issue Oct 1, 2024 · 0 comments


Hello!

I recently ran hifiasm in trio-binning mode on sunflower samples. We used two lines that are thought to be largely isogenic beyond an anthocyanin/male fertility locus, and crossed them to get a "heterozygous" line. We did Illumina short read "parental" sequencing on the two parents (~20x coverage) and PacBio Revio long read sequencing on the heterozygous line (~50x coverage). I then used the following hifiasm command:
hifiasm -o Sunflower_1_Hetero_Red_trioBinning.asm -t 128 -1 Sunflower_2_Homo_Red.yak -2 Sunflower_3_Homo_Green.yak ../MutagenesisPacBioLongReadsMerged.fastq.gz

The resulting hap1 and hap2 p_ctg outputs were then used for assembly. Gfastats, BUSCO and Merqury show the following:
Haplotype 1:
Total length: 718.08 Mb
Scaffold N50: 1.12 Mb
Largest scaffold: 32.20 Mb
Number of scaffolds: 4,129
GC content: 39.72%
BUSCO: 23.2% complete
Merqury: 25.5% complete, QV score 58.2

Haplotype 2:
Total length: 2.97 Gb
Scaffold N50: 92.20 Mb
Largest scaffold: 202.28 Mb
Number of scaffolds: 727
GC content: 38.70%
BUSCO: 95.7% complete
Merqury: 96.5% complete, QV score 63.7

My main concerns about these results before proceeding with publication are the size difference between the haplotypes and the high switch, hamming, and error rates (listed in that order):
Trio Hap1: 11.84% 14.89% 18.02%
Trio Hap2: 21.74% 26.82% 27.82%
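
For readers less familiar with these metrics, here is a minimal sketch of how switch and Hamming error rates are commonly defined over parent-specific markers along a contig. It is a simplified illustration, not the implementation of yak or any other tool; the function name `switch_and_hamming` and the 'p'/'m' labels are hypothetical.

```python
from typing import Sequence, Tuple


def switch_and_hamming(labels: Sequence[str]) -> Tuple[float, float]:
    """Return (switch_rate, hamming_rate) for one contig.

    labels: parental origin of each marker in contig order,
            'p' for paternal-specific, 'm' for maternal-specific
            (hypothetical encoding chosen for this sketch).
    """
    if len(labels) < 2:
        raise ValueError("need at least two markers to count switches")

    # Switch error: fraction of adjacent marker pairs whose parent differs.
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    switch_rate = switches / (len(labels) - 1)

    # Hamming error: fraction of markers from the minority parent.
    n_p = sum(1 for x in labels if x == "p")
    n_m = len(labels) - n_p
    hamming_rate = min(n_p, n_m) / len(labels)

    return switch_rate, hamming_rate


# A contig that changes parent once in the middle: few switches,
# but half of its markers come from the "wrong" parent.
one_switch = switch_and_hamming("pppppmmmmm")

# A contig with a single stray marker: same switch count, low Hamming error.
one_stray = switch_and_hamming("pppppppppm")
```

The two toy contigs have the same number of switches but very different Hamming rates, which is why the two numbers are usually reported together.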

Is this level of size difference between haplotypes normal for highly similar parents?
How can I determine if the high error rates are due to similarity or actual errors?
Are there any additional analyses you'd recommend to validate these assemblies?

The main reason I'm not too worried is that these two haplotypes should be largely identical, so size differences and high error rates might be expected. Perhaps the high error rates simply reflect the high sequence similarity?

Thanks for this fantastic program! The resulting scaffolds from YaHS using the OmniC data look fantastic regardless of these concerns.


For reference, the genomescope2 summary for the HiFi reads is:
GenomeScope version 2.0
input file = meryl_hifi_kmers_k21_histogram.txt
output directory = .
p = 2
k = 21
property                  min                 max
Homozygous (aa)           89.6889%            93.8683%
Heterozygous (ab)         6.13174%            10.3111%
Genome Haploid Length     1,538,516,725 bp    1,543,677,918 bp
Genome Repeat Length      1,217,987,977 bp    1,222,073,907 bp
Genome Unique Length      320,528,748 bp      321,604,011 bp
Model Fit                 28.0299%            82.9%
Read Error Rate           0.105375%           0.105375%
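
As a quick sanity check on the sizes, the two haplotype totals can be compared against the GenomeScope haploid length estimate. This is only a rough arithmetic sketch: the function name `size_summary` is hypothetical, the inputs are the rounded totals reported above, and the lower GenomeScope bound is used as the haploid estimate.

```python
def size_summary(hap1_bp: float, hap2_bp: float, haploid_est_bp: float) -> dict:
    """Compare two haplotype assembly sizes with a haploid size estimate."""
    combined = hap1_bp + hap2_bp
    return {
        # How much larger hap2 is than hap1.
        "hap2_to_hap1": hap2_bp / hap1_bp,
        # Total sequence assembled across both haplotypes.
        "combined_bp": combined,
        # Combined size relative to twice the haploid estimate.
        "combined_to_diploid_est": combined / (2 * haploid_est_bp),
    }


summary = size_summary(
    hap1_bp=718.08e6,              # Haplotype 1 total length (718.08 Mb)
    hap2_bp=2.97e9,                # Haplotype 2 total length (2.97 Gb)
    haploid_est_bp=1_538_516_725,  # GenomeScope haploid length, lower bound
)

for name, value in summary.items():
    print(f"{name}: {value:,.3f}")
```

Ratios like these do not say which haplotype is wrong, but they make it easier to see how far each assembly sits from the k-mer-based estimate.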
