-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allele flip seems not work #188
Comments
Hi, It looks like you're using a very old version of MSS (v1.6.0), could you try and update to the current release of v1.12.0 and let me know if this sorts your issue first? Thanks, |
Thank you for replying so quickly. I apologize that I seem to be providing the wrong Session information. I have tried different versions of MungeSumstats and I found that this issue happened in the latest version: v1.13.2 and when I reinstalled old version 1.6.0 the issue does not seem to exist. |
Hey, So I was able to replicate your issue and have found the error. See below for a detailed explanation of what was going on. This is the data before running MSS.
First thing to note is that A1 and A2 naming is ambiguous as we don't know which of A1 and A2 the effect values (like beta) relate to. This is not consistent across studies and varies massively. Because of this, MSS first checks to see if the effect columns relate to A1 or A2. Note MSS needs to know this info as it assumes all effect values relate to A2 column, making it the effect allele, so we need to correct for this. This is checked in
So for this data, MSS found ambiguous naming based on check 1:
It then looked for allele specific effect info but couldn't find any based on check 2. So it then goes to check 3 which is the final, 'last chance' check which checks against the reference genome to see which of A1 or A2 has more matches to the reference genome. There is the assumption here that the minor allele will be the effect allele which may not necessarily be the case which is why this is a final check. MSS prints a few messages about this step:
So it would appear it found that A2 i.e. A for rs1182172 was on the reference genome - meaning it is the major allele. So this means MSS infers that the effect columns relate to A1, the minor allele. However, MSS always has effect info relating to A2 rather than A1 so it swaps the A1 and A2 alleles. We can see this in the output of
So you can see that at this stage, A2 and A1 naming has been flipped but the effect columns stay the same. MSS relates these effect columns to A2 allele, G. However finding the effect columns relating to A1 was a bug - it actually found it relates to A2. I have now implemented a fix and when we rerun the data with MSS we get the formatted output as:
This is available in MSS v1.13.3 and release v1.12.1. Thank you for bringing this to my attention. Please do try the new version out and make sure it works on your end too? Code:
|
1. Bug description
The FRQ and BETA is flipped as expected, but A1 and A2 keeps original
Console output
::NOTE::
tempdir()
by default.save_path
( e.g.save_path=file.path('./formatted',basename(path))
), or make sure to copy files elsewhere after processing ( e.g.file.copy(save_path, './formatted/' )
.Formatted summary statistics will be saved to ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz
Infer Effect Column
First line of summary statistics file:
SNP CHR BP A1 A2 FRQ BETA SE P
Allele columns are ambiguous, attempting to infer direction
Standardising column headers.
First line of summary statistics file:
SNP CHR BP A1 A2 FRQ BETA SE P
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 10 seconds.
Effect/frq column(s) relate to A1 in the inputted sumstats
Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome
Standardising column headers.
First line of summary statistics file:
SNP CHR BP A2 A1 FRQ BETA SE P
Summary statistics report:
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 8 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 1 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
1 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz
Summary statistics report:
Done munging in 1.427 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 FRQ BETA SE P
1: rs1182172 7 2838469 G A 0.638431 0.0453284 0.00988289 4.5063e-06
Returning data directly.
Expected behaviour
I think when flip the FRQ and BETA, the A1 and A2 should be exchanged right?
2. Reproducible example
Here is my example gwas data:
And here is the formmated data
3. Session info
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] metafor_3.8-1 metadat_1.2-0 Matrix_1.6-1.1 TwoSampleMR_0.6.6 biomaRt_2.54.0
[6] MungeSumstats_1.6.0 dplyr_1.0.10 data.table_1.14.6
loaded via a namespace (and not attached):
[1] nlme_3.1-160 bitops_1.0-7
[3] matrixStats_0.63.0 fs_1.5.2
[5] bit64_4.0.5 filelock_1.0.2
[7] progress_1.2.2 httr_1.4.4
[9] GenomeInfoDb_1.34.3 googleAuthR_2.0.0
[11] tools_4.2.2 utf8_1.2.2
[13] R6_2.5.1 DBI_1.1.3
[15] BiocGenerics_0.44.0 withr_2.5.0
[17] mnormt_2.1.1 tidyselect_1.2.0
[19] prettyunits_1.1.1 bit_4.0.5
[21] curl_4.3.3 compiler_4.2.2
[23] cli_3.6.3 Biobase_2.58.0
[25] xml2_1.3.3 DelayedArray_0.24.0
[27] rtracklayer_1.65.0 psych_2.2.9
[29] rappdirs_0.3.3 stringr_1.5.1
[31] digest_0.6.30 Rsamtools_2.14.0
[33] R.utils_2.12.2 XVector_0.38.0
[35] pkgconfig_2.0.3 MatrixGenerics_1.10.0
[37] dbplyr_2.2.1 fastmap_1.1.0
[39] BSgenome_1.66.3 rlang_1.1.3
[41] rstudioapi_0.14 RSQLite_2.2.18
[43] BiocIO_1.8.0 generics_0.1.3
[45] jsonlite_1.8.3 BiocParallel_1.32.1
[47] R.oo_1.25.0 VariantAnnotation_1.44.0
[49] RCurl_1.98-1.9 magrittr_2.0.3
[51] GenomeInfoDbData_1.2.9 Rcpp_1.0.11
[53] S4Vectors_0.36.2 fansi_1.0.3
[55] lifecycle_1.0.3 R.methodsS3_1.8.2
[57] stringi_1.7.8 yaml_2.3.6
[59] mathjaxr_1.6-0 SummarizedExperiment_1.28.0
[61] zlibbioc_1.44.0 plyr_1.8.8
[63] BiocFileCache_2.6.0 grid_4.2.2
[65] blob_1.2.3 crayon_1.5.2
[67] lattice_0.20-45 Biostrings_2.66.0
[69] GenomicFeatures_1.50.2 hms_1.1.2
[71] KEGGREST_1.38.0 pillar_1.8.1
[73] GenomicRanges_1.50.1 rjson_0.2.21
[75] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.23
[77] codetools_0.2-18 stats4_4.2.2
[79] XML_3.99-0.12 glue_1.6.2
[81] BiocManager_1.30.19 png_0.1-7
[83] vctrs_0.6.5 purrr_1.0.2
[85] tidyr_1.3.1 assertthat_0.2.1
[87] cachem_1.0.6 restfulr_0.0.15
[89] gargle_1.2.1 tibble_3.1.8
[91] GenomicAlignments_1.34.0 AnnotationDbi_1.60.0
[93] memoise_2.0.1 IRanges_2.32.0
[95] ellipsis_0.3.2
The text was updated successfully, but these errors were encountered: