drop miss dat select cols
Al-Murphy committed Apr 24, 2024
1 parent 588a898 commit bf83eaa
Showing 12 changed files with 75 additions and 14 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: MungeSumstats
Type: Package
Title: Standardise summary statistics from GWAS
Version: 1.11.8
Version: 1.11.9
Authors@R:
c(person(given = "Alan",
family = "Murphy",
7 changes: 7 additions & 0 deletions NEWS.md
@@ -1,3 +1,10 @@
## CHANGES IN VERSION 1.11.9

### New features
* Can now control which columns are checked for missing data (`drop_na_cols` in
`format_sumstats()`). By default, SNP, effect columns and P/N columns are
checked. Set to `NULL` to check all columns, or supply specific column names.

## CHANGES IN VERSION 1.11.7

### Bug fix
11 changes: 11 additions & 0 deletions R/check_miss_data.R
@@ -27,6 +27,17 @@ check_miss_data <- function(sumstats_dt, path, log_folder_ind, check_save_out,
        c(drop_na_cols)[drop_na_cols %in% names(sumstats_dt)]
    incl_cols <-
        c(drop_na_cols_in_sumstats)[!drop_na_cols_in_sumstats %in% ignore_cols]
    if(length(incl_cols)<1){
        msg <- paste0(
            "WARNING: None of the inputted columns:\n",
            paste(drop_na_cols,collapse=" "),"\n",
            "To be checked for missing data were found in the sumstats. Sumstats",
            " columns:\n",
            paste(names(sumstats_dt),collapse=" "),"\n",
            "This check will not be run."
        )
        message(msg)
    }
} else {
    incl_cols <- names(sumstats_dt)[!names(sumstats_dt) %in% ignore_cols]
}
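
The column-selection logic in this hunk can be sketched in isolation as follows. This is a minimal base R illustration, not the package's implementation: `select_na_check_cols()` and `drop_missing_rows()` are hypothetical helpers, `ignore_cols` is passed as an argument because its definition lies outside the hunk, and a plain `data.frame` stands in for the `sumstats_dt` data.table.

```r
# Hypothetical helper: decide which columns are checked for missing values.
# Mirrors the hunk above; `ignore_cols` is an assumed argument here.
select_na_check_cols <- function(col_names, drop_na_cols = NULL,
                                 ignore_cols = character(0)) {
  if (!is.null(drop_na_cols)) {
    # Keep only the requested columns that exist in the data
    present <- drop_na_cols[drop_na_cols %in% col_names]
    incl_cols <- present[!present %in% ignore_cols]
    if (length(incl_cols) < 1) {
      message("WARNING: none of the requested columns were found; ",
              "the missing-data check will not be run.")
    }
  } else {
    # NULL means every column (except ignored ones) is checked
    incl_cols <- col_names[!col_names %in% ignore_cols]
  }
  incl_cols
}

# Hypothetical helper: drop rows with NA in any of the selected columns.
drop_missing_rows <- function(df, cols) {
  if (length(cols) < 1) {
    return(df)  # nothing to check, so nothing is dropped
  }
  df[stats::complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
}

# Toy data: `extra` is entirely NA, as in the package's test
sumstats <- data.frame(
  SNP = c("rs1", "rs2", NA),
  P = c(0.01, NA, 0.5),
  extra = c(NA, NA, NA)
)

cols <- select_na_check_cols(names(sumstats), drop_na_cols = c("SNP", "P"))
kept <- drop_missing_rows(sumstats, cols)
```

With `drop_na_cols = c("SNP", "P")` the all-`NA` `extra` column is not consulted, so only rows with a missing SNP or P value are removed. Passing `NULL` would instead check `extra` as well and drop every row of this toy dataset.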
16 changes: 10 additions & 6 deletions R/format_sumstats.R
@@ -169,6 +169,12 @@
#' dropped? These can not be checked against a reference dataset and will have
#' the same RS ID and position as SNPs which can affect downstream analysis.
#' Default is False.
#' @param drop_na_cols A character vector of column names to be checked for
#' missing values. Rows with missing values in any of these columns (if present
#' in the dataset) will be dropped. If `NULL`, all columns will be checked for
#' missing values. Default columns are SNP, chromosome, position, allele 1,
#' allele 2, effect columns (frequency, beta, Z-score, standard error, log odds,
#' signed sumstats, odds ratio), p value and N columns.
#' @param dbSNP version of dbSNP to be used for imputation (144 or 155).
#' @param check_dups whether to check for duplicates - if formatting QTL
#' datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
@@ -221,11 +227,6 @@
#' give is incorrect you can supply your own mapping file. Must be a 2 column
#' dataframe with column names "Uncorrected" and "Corrected". See
#' data(sumstatsColHeaders) for default mapping and necessary format.
#' @param drop_na_cols A character vector of column names to be checked for missing values.
#' Rows with missing values in any of these columns (if present in the dataset) will be dropped. If `NULL`,
#' all columns will be checked for missing values. Default columns are SNP,
#' chromosome, position, allele 1, allele2, frequency, beta, standard error, p
#' value and N columns.
#'
#' @importFrom data.table fread
#' @importFrom data.table fwrite
@@ -272,6 +273,10 @@ format_sumstats <- function(path,
frq_is_maf = TRUE,
indels = TRUE,
drop_indels = FALSE,
drop_na_cols = c("SNP", "CHR", "BP", "A1", "A2",
"FRQ", "BETA", "Z", "OR",
"LOG_ODDS", "SIGNED_SUMSTAT", "SE",
"P", "N"),
dbSNP = 155,
check_dups = TRUE,
sort_coordinates = TRUE,
@@ -289,7 +294,6 @@ format_sumstats <- function(path,
imputation_ind = FALSE,
force_new = FALSE,
mapping_file = sumstatsColHeaders,
drop_na_cols = c("SNP", "CHR", "BP", "A1", "A2", "FRQ", "BETA", "SE", "P", "N"),
#deprecated parameters
rmv_chrPrefix = NULL
) {
1 change: 1 addition & 0 deletions README.Rmd
@@ -142,5 +142,6 @@ development:
* [Jonathan Griffiths](https://github.com/jonathangriffiths)
* [Kitty Murphy](https://github.com/KittyMurphy)
* [Mykhaylo Malakhov](https://github.com/MykMal)
* [Alasdair Warwick](https://github.com/rmgpanw)

# References
7 changes: 4 additions & 3 deletions README.md
@@ -4,19 +4,19 @@
<i>Authors</i>: Alan Murphy, Brian Schilder and Nathan Skene
</h5>
<h5>
<i>Updated</i>: Jan-15-2024
<i>Updated</i>: Apr-24-2024
</h5>

<!-- Readme.md is generated from Readme.Rmd. Please edit that file -->
<!-- badges: start -->

[![](https://img.shields.io/badge/release%20version-1.10.1-black.svg)](https://www.bioconductor.org/packages/MungeSumstats)
[![](https://img.shields.io/badge/devel%20version-1.11.3-black.svg)](https://github.com/neurogenomics/MungeSumstats)
[![](https://img.shields.io/badge/devel%20version-1.11.9-black.svg)](https://github.com/neurogenomics/MungeSumstats)
[![R build
status](https://github.com/neurogenomics/MungeSumstats/workflows/rworkflows/badge.svg)](https://github.com/neurogenomics/MungeSumstats/actions)
[![](https://img.shields.io/github/last-commit/neurogenomics/MungeSumstats.svg)](https://github.com/neurogenomics/MungeSumstats/commits/master)
[![](https://codecov.io/gh/neurogenomics/MungeSumstats/branch/master/graph/badge.svg)](https://codecov.io/gh/neurogenomics/MungeSumstats)
[![](https://img.shields.io/badge/download-11379/total-blue.svg)](https://bioconductor.org/packages/stats/bioc/MungeSumstats)
[![](https://img.shields.io/badge/download-15314/total-blue.svg)](https://bioconductor.org/packages/stats/bioc/MungeSumstats)
[![License:
Artistic-2.0](https://img.shields.io/badge/license-Artistic--2.0-blue.svg)](https://cran.r-project.org/web/licenses/Artistic-2.0)
[![](https://img.shields.io/badge/doi-https://doi.org/10.1093/bioinformatics/btab665-blue.svg)](https://doi.org/https://doi.org/10.1093/bioinformatics/btab665)
@@ -150,6 +150,7 @@ We would like to acknowledge all those who have contributed to
- [Jonathan Griffiths](https://github.com/jonathangriffiths)
- [Kitty Murphy](https://github.com/KittyMurphy)
- [Mykhaylo Malakhov](https://github.com/MykMal)
- [Alasdair Warwick](https://github.com/rmgpanw)

# References

10 changes: 9 additions & 1 deletion man/check_miss_data.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 9 additions & 0 deletions man/format_sumstats.Rd
6 changes: 6 additions & 0 deletions man/import_sumstats.Rd
8 changes: 8 additions & 0 deletions man/validate_parameters.Rd
6 changes: 3 additions & 3 deletions tests/testthat/test-missing_data.R
@@ -59,13 +59,13 @@ test_that("Handle missing data", {
dbSNP=144
)
reformatted_lines <- readLines(reformatted)
expect_equal(reformatted_lines, org_lines)
testthat::expect_equal(reformatted_lines, org_lines)

# set `drop_na_cols` to `NULL`
miss_extra_col <- miss
miss_extra_col$extra <- NA

expect_error(MungeSumstats::format_sumstats(
testthat::expect_error(MungeSumstats::format_sumstats(
miss_extra_col,
ref_genome = "GRCh37",
on_ref_genome = FALSE,
@@ -87,7 +87,7 @@ test_that("Handle missing data", {
allele_flip_check = FALSE,
sort_coordinates = FALSE,
dbSNP = 144,
drop_na_cols = c("CHR", "POS")
drop_na_cols = c("CHRA", "APOS")
)

reformatted_extra_col_lines <- readLines(reformatted_extra_col)
6 changes: 6 additions & 0 deletions vignettes/MungeSumstats.Rmd
@@ -406,6 +406,12 @@ conducted by *MungeSumstats* are:
dropped? These can not be checked against a reference dataset and will have
the same RS ID and position as SNPs which can affect downstream analysis.
Default is False.
- **drop_na_cols** A character vector of column names to be checked for
missing values. Rows with missing values in any of these columns (if present
in the dataset) will be dropped. If `NULL`, all columns will be checked for
missing values. Default columns are SNP, chromosome, position, allele 1,
allele 2, effect columns (frequency, beta, Z-score, standard error,
log odds, signed sumstats, odds ratio), p value and N columns.
- **dbSNP** The dbSNP version to use as a reference - defaults to the most
recent version available (155). Note that with the 9x more SNPs in dbSNP
155 vs 144, run times will increase.
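
The three ways of setting `drop_na_cols` described in the list above could be exercised as in the sketch below. It is illustrative only: `format_with_na_cols()` is a hypothetical wrapper, `"my_gwas.tsv"` is a placeholder path, and `ref_genome = "GRCh37"` is borrowed from the package's test in this commit. Actually calling the wrapper requires MungeSumstats and its reference data packages to be installed, so the calls are left commented out.

```r
# Default value of `drop_na_cols`, copied from the format_sumstats() signature
default_drop_na_cols <- c("SNP", "CHR", "BP", "A1", "A2",
                          "FRQ", "BETA", "Z", "OR",
                          "LOG_ODDS", "SIGNED_SUMSTAT", "SE",
                          "P", "N")

# Hypothetical wrapper: format a file, checking only the chosen columns
# for missing values. Defining it does not require MungeSumstats to be loaded.
format_with_na_cols <- function(path, drop_na_cols = default_drop_na_cols) {
  MungeSumstats::format_sumstats(
    path = path,
    ref_genome = "GRCh37",
    drop_na_cols = drop_na_cols
  )
}

# Not run here (placeholder path; needs MungeSumstats and reference data):
# format_with_na_cols("my_gwas.tsv")                                # default columns
# format_with_na_cols("my_gwas.tsv", drop_na_cols = NULL)           # check every column
# format_with_na_cols("my_gwas.tsv", drop_na_cols = c("SNP", "P"))  # custom subset
```

Restricting the check to a few columns is useful when a file carries sparsely populated annotation columns that should not cause otherwise valid rows to be dropped.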