Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allele flip seems not work #188

Open
htqqdd opened this issue Jul 31, 2024 · 3 comments
Open

Allele flip seems not work #188

htqqdd opened this issue Jul 31, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@htqqdd
Copy link

htqqdd commented Jul 31, 2024

1. Bug description

The FRQ and BETA is flipped as expected, but A1 and A2 keeps original

Console output

::NOTE::

  • Formatted results will be saved to tempdir() by default.
  • This means all formatted summary stats will be deleted upon ending the R session.
  • To keep formatted summary stats, change save_path ( e.g. save_path=file.path('./formatted',basename(path)) ), or make sure to copy files elsewhere after processing ( e.g. file.copy(save_path, './formatted/' ).

Formatted summary statistics will be saved to ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz
Infer Effect Column
First line of summary statistics file:
SNP CHR BP A1 A2 FRQ BETA SE P
Allele columns are ambiguous, attempting to infer direction
Standardising column headers.
First line of summary statistics file:
SNP CHR BP A1 A2 FRQ BETA SE P
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 10 seconds.
Effect/frq column(s) relate to A1 in the inputted sumstats
Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome
Standardising column headers.
First line of summary statistics file:
SNP CHR BP A2 A1 FRQ BETA SE P
Summary statistics report:

  • 1 rows
  • 1 unique variants
  • 0 genome-wide significant variants (P<5e-8)
  • 1 chromosomes
    Checking for multi-GWAS.
    Checking for multiple RSIDs on one row.
    Checking SNP RSIDs.
    Checking for merged allele column.
    Checking A1 is uppercase
    Checking A2 is uppercase
    Checking for incorrect base-pair positions
    Ensuring all SNPs are on the reference genome.
    Loading SNPlocs data.
    Loading reference genome data.
    Preprocessing RSIDs.
    Validating RSIDs of 1 SNPs using BSgenome::snpsById...
    BSgenome::snpsById done in 8 seconds.
    Checking for correct direction of A1 (reference) and A2 (alternative allele).
    There are 1 SNPs where A1 doesn't match the reference genome.
    These will be flipped with their effect columns.
    Reordering so first three column headers are SNP, CHR and BP in this order.
    Reordering so the fourth and fifth columns are A1 and A2.
    Checking for missing data.
    Checking for duplicate columns.
    Checking for duplicate SNPs from SNP ID.
    Checking for SNPs with duplicated base-pair positions.
    INFO column not available. Skipping INFO score filtering step.
    Filtering SNPs, ensuring SE>0.
    Ensuring all SNPs have N<5 std dev above mean.
    Checking for bi-allelic SNPs.
    1 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
    The FRQ column was mapped from one of the following from the inputted summary statistics file:
    FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
    As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
    set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
    Sorting coordinates with 'data.table'.
    Writing in tabular format ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz
    Summary statistics report:
  • 1 rows (100% of original 1 rows)
  • 1 unique variants
  • 0 genome-wide significant variants (P<5e-8)
  • 1 chromosomes
    Done munging in 1.427 minutes.
    Successfully finished preparing sumstats file, preview:
    Reading header.
    SNP CHR BP A1 A2 FRQ BETA SE P
    1: rs1182172 7 2838469 G A 0.638431 0.0453284 0.00988289 4.5063e-06
    Returning data directly.

Expected behaviour

I think when flip the FRQ and BETA, the A1 and A2 should be exchanged right?

2. Reproducible example

Here is my example gwas data:
image

And here is the formmated data
image

3. Session info

R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] metafor_3.8-1 metadat_1.2-0 Matrix_1.6-1.1 TwoSampleMR_0.6.6 biomaRt_2.54.0
[6] MungeSumstats_1.6.0 dplyr_1.0.10 data.table_1.14.6

loaded via a namespace (and not attached):
[1] nlme_3.1-160 bitops_1.0-7
[3] matrixStats_0.63.0 fs_1.5.2
[5] bit64_4.0.5 filelock_1.0.2
[7] progress_1.2.2 httr_1.4.4
[9] GenomeInfoDb_1.34.3 googleAuthR_2.0.0
[11] tools_4.2.2 utf8_1.2.2
[13] R6_2.5.1 DBI_1.1.3
[15] BiocGenerics_0.44.0 withr_2.5.0
[17] mnormt_2.1.1 tidyselect_1.2.0
[19] prettyunits_1.1.1 bit_4.0.5
[21] curl_4.3.3 compiler_4.2.2
[23] cli_3.6.3 Biobase_2.58.0
[25] xml2_1.3.3 DelayedArray_0.24.0
[27] rtracklayer_1.65.0 psych_2.2.9
[29] rappdirs_0.3.3 stringr_1.5.1
[31] digest_0.6.30 Rsamtools_2.14.0
[33] R.utils_2.12.2 XVector_0.38.0
[35] pkgconfig_2.0.3 MatrixGenerics_1.10.0
[37] dbplyr_2.2.1 fastmap_1.1.0
[39] BSgenome_1.66.3 rlang_1.1.3
[41] rstudioapi_0.14 RSQLite_2.2.18
[43] BiocIO_1.8.0 generics_0.1.3
[45] jsonlite_1.8.3 BiocParallel_1.32.1
[47] R.oo_1.25.0 VariantAnnotation_1.44.0
[49] RCurl_1.98-1.9 magrittr_2.0.3
[51] GenomeInfoDbData_1.2.9 Rcpp_1.0.11
[53] S4Vectors_0.36.2 fansi_1.0.3
[55] lifecycle_1.0.3 R.methodsS3_1.8.2
[57] stringi_1.7.8 yaml_2.3.6
[59] mathjaxr_1.6-0 SummarizedExperiment_1.28.0
[61] zlibbioc_1.44.0 plyr_1.8.8
[63] BiocFileCache_2.6.0 grid_4.2.2
[65] blob_1.2.3 crayon_1.5.2
[67] lattice_0.20-45 Biostrings_2.66.0
[69] GenomicFeatures_1.50.2 hms_1.1.2
[71] KEGGREST_1.38.0 pillar_1.8.1
[73] GenomicRanges_1.50.1 rjson_0.2.21
[75] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.23
[77] codetools_0.2-18 stats4_4.2.2
[79] XML_3.99-0.12 glue_1.6.2
[81] BiocManager_1.30.19 png_0.1-7
[83] vctrs_0.6.5 purrr_1.0.2
[85] tidyr_1.3.1 assertthat_0.2.1
[87] cachem_1.0.6 restfulr_0.0.15
[89] gargle_1.2.1 tibble_3.1.8
[91] GenomicAlignments_1.34.0 AnnotationDbi_1.60.0
[93] memoise_2.0.1 IRanges_2.32.0
[95] ellipsis_0.3.2

@htqqdd htqqdd added the bug Something isn't working label Jul 31, 2024
@Al-Murphy
Copy link
Owner

Hi, It looks like you're using a very old version of MSS (v1.6.0), could you try and update to the current release of v1.12.0 and let me know if this sorts your issue first?

Thanks,
Alan.

@htqqdd
Copy link
Author

htqqdd commented Jul 31, 2024

Thank you for replying so quickly. I apologize that I seem to be providing the wrong Session information. I have tried different versions of MungeSumstats and I found that this issue happened in the latest version: v1.13.2 and when I reinstalled old version 1.6.0 the issue does not seem to exist.
I've checkced the example SNP in dbSNP and it seems the ref allele should be G, so perhaps nothing should be flipped, I hope this helps you find the problem in v1.13.2.
Thank you agiain for this helpful package.

@Al-Murphy
Copy link
Owner

Hey,

So I was able to replicate your issue and have found the error. See below for a detailed explanation of what was going on. This is the data before running MSS.

         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <num>   <num> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

First thing to note is that A1 and A2 naming is ambiguous as we don't know which of A1 and A2 the effect values (like beta) relate to. This is not consistent across studies and varies massively. Because of this, MSS first checks to see if the effect columns relate to A1 or A2. Note MSS needs to know this info as it assumes all effect values relate to A2 column, making it the effect allele, so we need to correct for this. This is checked in infer_effect_column(). There are three checks made to infer which allele the effect/frequency information relates to:

  1. Check if ambiguous naming conventions are used (i.e. allele 1 and 2 or equivalent). If not exit, otherwise continue to next checks. This can be checked by using the mapping file and splitting A1/A2 mappings by those that contain 1 or 2 (ambiguous) or doesn't contain 1 or 2 e.g. effect, tested (unambiguous so fine for MSS to handle as is).
  2. Look for effect column/frequency column where the A1/A2 explicitly mentioned, if found then we know the direction and should update A1/A2 naming so A2 is the effect column. We can look for such columns by getting every combination of A1/A2 naming and effect/frq naming.
  3. If not found in 2, a final check should be against the reference genome, whichever of A1 and A2 has more of a match with the reference genome should be taken as not the effect allele. There is an assumption in this but is still better than guessing the ambiguous allele naming.

So for this data, MSS found ambiguous naming based on check 1:

Allele columns are ambiguous, attempting to infer direction

It then looked for allele specific effect info but couldn't find any based on check 2. So it then goes to check 3 which is the final, 'last chance' check which checks against the reference genome to see which of A1 or A2 has more matches to the reference genome. There is the assumption here that the minor allele will be the effect allele which may not necessarily be the case which is why this is a final check.

MSS prints a few messages about this step:

Effect/frq column(s) relate to A1 in the inputted sumstats
Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome

So it would appear it found that A2 i.e. A for rs1182172 was on the reference genome - meaning it is the major allele. So this means MSS infers that the effect columns relate to A1, the minor allele. However, MSS always has effect info relating to A2 rather than A1 so it swaps the A1 and A2 alleles. We can see this in the output of infer_effect_column() when I run the format_sumstats() code function by function:

         SNP   CHR      BP     A2     A1      FRQ       BETA         SE          P
      <char> <num>   <num> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

So you can see that at this stage, A2 and A1 naming has been flipped but the effect columns stay the same. MSS relates these effect columns to A2 allele, G.

However finding the effect columns relating to A1 was a bug - it actually found it relates to A2. I have now implemented a fix and when we rerun the data with MSS we get the formatted output as:

         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <int>   <int> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

This is available in MSS v1.13.3 and release v1.12.1. Thank you for bringing this to my attention. Please do try the new version out and make sure it works on your end too?

Code:

ss <- data.table("SNP"="rs1182172","CHR"=7,"BP"=2838469,"A1"="G", "A2"="A", 
                 "FRQ"=0.361569,"BETA"=-0.0453284, "SE"=0.00988289, 
                 "P"=4.5063e-06)

formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants