INDEL harmonized coordinate inconsistency #388
Labels: data issue, documentation, question
Hi PGS Catalog team, we love this resource but we wanted to bring to your attention an issue with INDEL harmonization:
My team and I noticed that the ENSEMBL-harmonized coordinates provided for INDELs in the PGS Catalog are systematically shifted relative to the coordinates assigned to the same variants in our genetic data files. We have GRCh38-based variant calls from both sequencing experiments and microarray genotyping. VCFs from either source report INDELs that are almost always 1 bp off from the harmonized PGS Catalog coordinate. We suspect that this results from differing conventions for INDEL reporting between the VCF format (which reports insertions relative to the base immediately before the inserted bases) and the ENSEMBL reference (which reports insertions relative to the start of the actual insertion). Alternatively, it may be due to differing conventions between the ENSEMBL curated reference (one-based coordinates) and the UCSC curated reference (zero-based coordinates): https://useast.ensembl.org/Help/Faq?id=286#:~:text=Ensembl%20uses%20a%20one%2Dbased,genome%20housed%20at%20the%20GRC.
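To make the first suspected cause concrete, here is a minimal sketch of how dropping the VCF anchor base moves an INDEL by one position. The function name `strip_anchor_base`, the use of "-" for an empty allele, and the toy coordinates are all illustrative assumptions; we do not know whether the harmonization pipeline does anything like this.

```python
def strip_anchor_base(pos, ref, alt):
    """Convert an anchored (VCF-style) indel to first-affected-base coordinates.

    Hypothetical helper for illustration only. Same-length alleles (e.g. SNVs)
    are returned unchanged.
    """
    is_indel = len(ref) != len(alt)
    if is_indel and ref and alt and ref[0] == alt[0]:
        # Drop the shared anchor base; the event now starts one base later.
        # "-" stands in for an empty allele (an illustrative choice).
        return pos + 1, ref[1:] or "-", alt[1:] or "-"
    return pos, ref, alt


# Toy coordinates, not taken from any real file.
print(strip_anchor_base(100, "T", "TAAAT"))  # insertion
print(strip_anchor_base(100, "CTAAG", "C"))  # deletion
print(strip_anchor_base(100, "A", "G"))      # SNV, unchanged
```

Under this assumption only INDELs move, and always by exactly one base, which is the pattern we observe.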
For example:
PGS Catalog coordinate file entry
rs34295433 from PGS000662 (PGS000662_hmPOS_GRCh38.txt.gz)
ENSEMBL record (matching):
Chromosome 1:183063313-183063315 (forward strand) | VCF: 1 183063313 rs34295433 T TAAAT,TAAGT
dbSNP record (not matching):
VCF record (aligned to the UCSC-based GRCh38 reference provided by the GATK toolkit) (not matching):
Since most PGS software (e.g. PLINK, pgsc_calc) matches genotype data to PGS coordinate files via CHROM/POS (we are not sure how else one could go about it, other than the discouraged rsID matching), we noticed a systematic failure to match INDELs when using harmonized data. This discrepancy does not seem to be flagged anywhere in the PGS Catalog documentation. There also doesn't seem to be any guidance in the pgsc_calc documentation, which does not account for a potential mismatch in coordinate systems. For example, this variant is reported as unmatched:
PGS000662_hmPOS_GRCh38,1,183063313,CTAAG,C,0.041323726,,,,,,,,,,,,,,unmatched,my_dataset
I dug through some of the matching source code and did not find any pre-processing that might be accounting for this.
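As a diagnostic, the kind of check we have in mind can be sketched as follows: match on CHROM/POS exactly, and for INDELs additionally try a one-base offset in either direction. This is not how pgsc_calc or PLINK match variants; `find_offset_matches`, the row layout, and the toy data are hypothetical, and a real matcher would also need to compare alleles.

```python
def is_indel(allele_a, allele_b):
    """Treat any pair of alleles with different lengths as an indel."""
    return len(allele_a) != len(allele_b)


def find_offset_matches(score_rows, target_sites, offsets=(0, -1, 1)):
    """Return {row_index: offset} for the first offset at which a row matches.

    score_rows:   (chrom, pos, effect_allele, other_allele) tuples; both
                  alleles are assumed to be present.
    target_sites: set of (chrom, pos) pairs present in the genotype data.
    Non-indel rows are only tried at offset 0 (exact position).
    """
    matches = {}
    for index, (chrom, pos, effect_allele, other_allele) in enumerate(score_rows):
        candidates = offsets if is_indel(effect_allele, other_allele) else (0,)
        for offset in candidates:
            if (chrom, pos + offset) in target_sites:
                matches[index] = offset
                break
    return matches


# Toy data, not taken from any real scoring file or VCF.
score_rows = [
    ("1", 1000, "CTAAG", "C"),  # indel, present in the target one base earlier
    ("1", 2000, "A", "G"),      # SNV, present at the same position
    ("1", 3000, "T", "TA"),     # indel, absent from the target
]
target_sites = {("1", 999), ("1", 2000)}
offset_matches = find_offset_matches(score_rows, target_sites)
print(offset_matches)
```

If most INDELs in a scoring file match only at a non-zero offset while SNVs match at offset 0, that points to a coordinate-convention mismatch rather than genuinely missing variants.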
Since pgsc_calc automatically uses harmonized data when original data on the correct reference genome is not available, the risk of dropping all INDEL variants would be worth advertising.
We're curious about the decision to use the ENSEMBL style for harmonization: it seems that a substantial number of PGS coordinate files are submitted by the original authors in non-ENSEMBL format. Particularly odd are the GRCh37 -> GRCh37 harmonized files, which seem to shift primarily INDEL coordinates and no others:
e.g. from PGS000662_hmPOS_GRCh37.txt.gz:
We were wondering whether the PGS Catalog team is aware of this issue, and whether you have any advice on how best to correct for it. Presumably the options available to the user would be:
pgsc_calc).
Ideally, we would like an additional harmonization field that uses our (commonly used) standard.
Thanks!