Assembled genome contains variants not supported by reads #387
Comments
Hi, this is very likely related to the repetitive nature of these regions. Flye uses secondary alignments to error-correct repetitive regions, but in this case the reads from a different repeat copy might have outvoted the true reads. If you have a BAM alignment with secondary reads and enable them in IGV, you should be able to see reads supporting the alternative nucleotide. But otherwise this is indeed an issue. We have recently updated the alignment selection logic, which I hope should fix this issue. Would you mind trying the latest version from the
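To make the "outvoting" scenario above concrete, here is a toy sketch of a per-position majority vote. It is not Flye's actual polishing code; the function name `consensus_base` and the read counts are made up for illustration. It only shows how a handful of correct primary alignments can lose the vote once secondary alignments from another repeat copy are added to the pileup.

```python
from collections import Counter


def consensus_base(bases):
    """Return the most frequent base in a pileup column (simple majority vote)."""
    counts = Counter(bases)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical pileup at one position of the true repeat copy.
    primary = ["G"] * 5    # reads that really originate from this copy
    secondary = ["A"] * 8  # secondary alignments from a diverged repeat copy

    # With primary alignments only, the true base wins.
    print(consensus_base(primary))
    # Once secondary alignments are included, the other copy's base outvotes it.
    print(consensus_base(primary + secondary))
```

In a case like this, the alternative base would only show up in IGV when secondary alignments are displayed, which is why the comment above suggests enabling them.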
Hi @fenderglass, thank you for your reply. We used version 2.8.2-b1689 for the assembly. Just for clarity, do you mean using 2.8.3 instead, or a develop branch? Thanks.
You need to get and compile the latest code from GitHub,
I can confirm that some variants disappear in the assemblies generated with the latest version (including the one shown as an example above), but it is only a small fraction, and most other false positives remain. Is there a way to avoid the use of secondary reads here? Would there be any drawbacks in doing so? We usually remove these from read-to-reference alignments. Thanks
@elcortegano with the latest code, Flye should only use lower-confidence alignments if the coverage of "reliable" alignments (that is, non-secondary, non-supplementary, MAPQ>30) is low. Do you use a MAPQ filter for your alignments? In repetitive regions, some reads that map ambiguously to multiple repeat copies will be arbitrarily assigned to one copy as "primary", so it is also necessary to filter out alignments with low MAPQ (e.g. <10, <20 etc.; Flye uses a conservative <30 cutoff). Could you post a couple of examples with an IGV view of the problematic regions? Preferably zoomed out (e.g. a region of several kb), with the coverage track and secondary alignments shown. This should tell us if there is any issue in the alignment selection logic.
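As a rough illustration of the "reliable alignment" criteria mentioned above (non-secondary, non-supplementary, MAPQ>30), the sketch below filters plain-text SAM records using the standard FLAG bits from the SAM specification. This is an assumption-laden example rather than Flye's own selection code: the function names `is_reliable` and `filter_reliable` are hypothetical, and excluding unmapped reads is an extra choice made here.

```python
# FLAG bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_reliable(sam_line, min_mapq=30):
    """True if a SAM record is a mapped primary alignment with MAPQ > min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq > min_mapq


def filter_reliable(sam_lines, min_mapq=30):
    """Yield only the reliable alignment records, skipping '@' header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        if is_reliable(line, min_mapq):
            yield line
```

For BAM input, something like `samtools view -b -q 31 -F 0x904 in.bam` should express a similar filter (`in.bam` is a placeholder name; `-q 31` keeps MAPQ of at least 31, and `-F 0x904` drops unmapped, secondary and supplementary records), but please check it against your samtools version.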
We do not use a MAPQ filter. Alignments for the reads are done with the recommended PacBio tool
Next, alignments are sorted with
Snapshots at different zoom levels are below. The top line is the assembly with version 2.8.3, and below are reads mapped with
Is that ok? Do you have any tool preference for generating these alignments including secondary reads? There are other examples in non-repeated regions. Should I submit one of these as well? Thanks
@elcortegano yes, according to the documentation, pbmm2 discards all supplementary alignments. I suggest doing the same analysis with minimap2 with
Well, I guess one last thing to check is whether showing secondary (and supplementary) alignments is enabled in your IGV settings. Check both the "alignment" and "third gen" tabs. It seems suspicious to me that there are no secondary alignments within repetitive regions, provided that the copies are similar enough. Otherwise I'm out of ideas. If you want, you can share the assembly / reads and I can take a look as well.
IGV is set to allow secondary and supplementary alignments. For our purposes, we will use variant callers on the reads, and use calls from the assembly as one step to confirm these calls, so these false positives are not a big deal for us. Hopefully this is something that will eventually be fixed in future versions. Thanks for all the feedback and support.
We ran into the same issue using Flye; it introduces many false-positive SNV calls in our contig-based method.
Hi, we have noted that assemblies generated with Flye very frequently introduce variants that are not supported by the reads, particularly in highly repetitive regions (but in others as well). One example is shown below, where no reads support the G -> A variant.
The first grey line is the assembly obtained with Flye, and the second one is an assembly obtained with a different assembler. The tracks below them are the reads.
Using assembly-to-reference variant calling with paftools and reviewing calls against the assemblies and the reads, we observe a false positive rate that exceeds 80%. These false-positive calls are not made by variant callers based on read-to-reference alignments. We wonder why this could happen, and whether it can be solved in some way.
Our reads come from PacBio HiFi. Genomes were assembled with:
Postprocessing was limited to purge_dups to remove duplicates, following https://github.com/dfguan/purge_dups. Scaffolds were also broken into contigs. No polishing (e.g. Arrow or Pilon) was used for our data.