Skip to content

Commit

Permalink
v0.4.1
Browse files Browse the repository at this point in the history
  • Loading branch information
Zilong-Li committed Mar 23, 2024
1 parent 6500b3f commit e40e687
Show file tree
Hide file tree
Showing 5 changed files with 84 additions and 47 deletions.
18 changes: 10 additions & 8 deletions R/vcf-compare.R
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@
#' @param out output prefix for saving objects into RDS file
#'
#' @param vartype restrict to specific type of variants. supports "snps","indels", "sv", "multisnps","multiallelics"
#' @param ids character vector. restrict to sites with ID in the given vector. default NULL won't filter any sites.
#' @param ids character vector. restrict to sites with ID in the given vector.
#' @param qual logical. restrict to variants with QUAL > qual.
#'
#' @param pass logical. restrict to variants with FILTER = "PASS".
Expand All @@ -41,14 +41,12 @@
#' library('vcfppR')
#' test <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
#' truth <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
#' suppressWarnings(
#' res <- vcfcomp(test, truth, stats = "f1", format = c("GT", "GT"))
#' )
#' res <- vcfcomp(test, truth, stats = "f1", format = c("GT", "GT"))
#' str(res[!is.na(res)])
#' @export
vcfcomp <- function(test, truth, region = "", samples = "-", names = NULL,
format = c("DS", "GT"), stats = "r2", bins = NULL, af = NULL, out = NULL,
vartype = "snps", ids = NULL, qual = 0, pass = TRUE) {
vartype = "snps", ids = NULL, qual = 0, pass = FALSE) {
if(is.null(bins)){
bins <- sort(unique(c(
c(0, 0.01 / 1000, 0.02 / 1000, 0.05 / 1000),
Expand All @@ -63,14 +61,14 @@ vcfcomp <- function(test, truth, region = "", samples = "-", names = NULL,
vartype = vartype, ids = NULL, qual = qual, pass = pass)
if(!is.null(names) & is.vector(names)) d1$samples <- names
samples <- paste0(d1$samples, collapse = ",")
d2 <- tryCatch( { readRDS(truth) }, error = function(e) {
d2 <- tryCatch( { suppressWarnings(readRDS(truth)) }, error = function(e) {
vcftable(test, region = region, samples = samples, setid = TRUE, info = FALSE, format = format[2],
vartype = vartype, ids = NULL, qual = qual, pass = pass)
} )
sites <- intersect(d1$id, d2$id)
## chr pos ref alt af
if(!is.null(af)){
af <- tryCatch( { readRDS(af) }, error = function(e) {
af <- tryCatch( { suppressWarnings(readRDS(af)) }, error = function(e) {
af <- read.table(af, header = TRUE)
af$id <- paste0(af[,"chr"], "_", af[,"pos"], "_", af[,"ref"], "_", af[,"alt"])
subset(af, select = c(id, af))
Expand All @@ -91,7 +89,11 @@ vcfcomp <- function(test, truth, region = "", samples = "-", names = NULL,
rownames(ds) <- sites
res <- NULL
if(stats=="r2") {
af <- af[match(sites, af[,"id"]), "af"]
if(is.null(af)){
af <- rowMeans(gt, na.rm = TRUE) / 2
} else {
af <- af[match(sites, af[,"id"]), "af"]
}
names(af) <- sites
res <- r2_by_freq(bins, af, gt, ds, which_snps = sites, flip = FALSE)
}
Expand Down
44 changes: 31 additions & 13 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -31,20 +31,10 @@ You can install the development version of vcfppR with the system dependencies o

``` r
## install.package("vcfppR") ## from CRAN
devtools::install_github("Zilong-Li/vcfppR")
remotes::install_github("Zilong-Li/vcfppR")
```

## API

There are two classes i.e. ***vcfreader*** and ***vcfwriter*** offering the full R-bindings of vcfpp.h. Check out the examples in the [tests](tests/testthat) folder or refer to the manual.

```r
?vcfppR::vcfreader
```

In addtion to two classes for R-bindings of vcfpp.h, there are some useful functions in the package, e.g. ***vcftable***, ***vcfsummary*** and ***vcfpopgen***. In the following examples, we use the URL link as filename, which can be directly fed to vcfppR, and the performance will depend on your connection to the servers.

## Read VCF as tabular data in R
## vcftable: read VCF as tabular data in R

This example shows how to read only SNP variants with genotypes in the VCF/BCF into R list:
```r
Expand All @@ -69,8 +59,24 @@ vcffile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_25
res <- vcftable(vcffile, "chr21:1-5100000", vartype = "indels", format = "DP")
str(res)
```
## vcfcomp: compare two VCF files and report concordance statistics

This example shows how to calculate genotype correlation (r2) between two VCF files

```r
vcffile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "r2", format = c('GT','GT'))
as.data.frame(res)
```
This example shows how to calculate genotype concordance (F1-score) between two VCF files

```r
vcffile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "f1", format = c('GT','GT'))
str(res)
```

## Variants summarization
## vcfsummary: variants characterization

This example shows how to summarize small variants discovered by GATK. The data is from the 1000 Genome Project.

Expand Down Expand Up @@ -101,3 +107,15 @@ allsvs <- sv$summary[-1]
bar <- barplot(allsvs, ylim = c(0, 1.1*max(allsvs)),
main = "Variant Counts on chr20 (all SVs)")
```


## API

There are two classes i.e. ***vcfreader*** and ***vcfwriter*** offering the full R-bindings of vcfpp.h. Check out the examples in the [tests](tests/testthat) folder or refer to the manual.

```r
?vcfppR::vcfreader
```

In addtion to two classes for R-bindings of vcfpp.h, there are some useful functions in the package, e.g. ***vcftable***, ***vcfsummary*** and ***vcfpopgen***. In the following examples, we use the URL link as filename, which can be directly fed to vcfppR, and the performance will depend on your connection to the servers.

58 changes: 39 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,26 +24,10 @@ dependencies of `libcurl`.

``` r
## install.package("vcfppR") ## from CRAN
devtools::install_github("Zilong-Li/vcfppR")
remotes::install_github("Zilong-Li/vcfppR")
```

## API

There are two classes i.e. ***vcfreader*** and ***vcfwriter*** offering
the full R-bindings of vcfpp.h. Check out the examples in the
[tests](tests/testthat) folder or refer to the manual.

``` r
?vcfppR::vcfreader
```

In addtion to two classes for R-bindings of vcfpp.h, there are some
useful functions in the package, e.g. ***vcftable***, ***vcfsummary***
and ***vcfpopgen***. In the following examples, we use the URL link as
filename, which can be directly fed to vcfppR, and the performance will
depend on your connection to the servers.

## Read VCF as tabular data in R
## vcftable: read VCF as tabular data in R

This example shows how to read only SNP variants with genotypes in the
VCF/BCF into R list:
Expand Down Expand Up @@ -73,7 +57,27 @@ res <- vcftable(vcffile, "chr21:1-5100000", vartype = "indels", format = "DP")
str(res)
```

## Variants summarization
## vcfcomp: compare two VCF files and report concordance statistics

This example shows how to calculate genotype correlation (r2) between
two VCF files

``` r
vcffile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "r2", format = c('GT','GT'))
as.data.frame(res)
```

This example shows how to calculate genotype concordance (F1-score)
between two VCF files

``` r
vcffile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "f1", format = c('GT','GT'))
str(res)
```

## vcfsummary: variants characterization

This example shows how to summarize small variants discovered by GATK.
The data is from the 1000 Genome Project.
Expand Down Expand Up @@ -106,3 +110,19 @@ allsvs <- sv$summary[-1]
bar <- barplot(allsvs, ylim = c(0, 1.1*max(allsvs)),
main = "Variant Counts on chr20 (all SVs)")
```

## API

There are two classes i.e. ***vcfreader*** and ***vcfwriter*** offering
the full R-bindings of vcfpp.h. Check out the examples in the
[tests](tests/testthat) folder or refer to the manual.

``` r
?vcfppR::vcfreader
```

In addtion to two classes for R-bindings of vcfpp.h, there are some
useful functions in the package, e.g. ***vcftable***, ***vcfsummary***
and ***vcfpopgen***. In the following examples, we use the URL link as
filename, which can be directly fed to vcfppR, and the performance will
depend on your connection to the servers.
3 changes: 1 addition & 2 deletions cran-comments.md
Original file line number Diff line number Diff line change
@@ -1,2 +1 @@
Fix issue for upcoming Rtools on windows

address X-CRAN-Comment: Archived on 2024-03-15 as issues were not corrected in time.
8 changes: 3 additions & 5 deletions man/vcfcomp.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

0 comments on commit e40e687

Please sign in to comment.