-
Notifications
You must be signed in to change notification settings - Fork 4
/
README.Rmd
320 lines (261 loc) · 13.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
output: github_document
bibliography: "`r here::here('vignettes', 'dciferpkg.bib')`"
csl: "`r here::here('vignettes', 'vancouver-brackets.csl')`"
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
knitr::opts_chunk$set(dpi=300, fig.width=7)
)
```
# Dcifer: Genetic Distance Between Polyclonal Infections
Package `dcifer` (Distance for Complex Infections: Fast Estimation of Relatedness) calculates genetic distance between polyclonal infections by estimating relatedness from biallelic and multiallelic data @gerlovina2022dcifer. In addition to estimates, the package provides a likelihood function and statistical inference. Functions for reading and reformatting data, performing preparatory steps, and visualizing the results are also included. We will illustrate the analysis process using microhaplotype data from two health facilities in Maputo and Inhambane provinces of Mozambique @tessema2022sensitive.
![](man/figures/aqua10f.png)<!-- -->
You can install the released version of dcifer from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("dcifer")
```
and the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("EPPIcenter/dcifer")
# or, to include a vignette:
remotes::install_github("EPPIcenter/dcifer", build_vignettes = TRUE)
```
```{r setup, warning = FALSE}
library(dcifer)
help(package = "dcifer")
vignette(package = "dcifer", "vignetteDcifer")
pardef <- par(no.readonly = TRUE)
```
## Read and reformat data, estimate COI and population allele frequencies
Dcifer relatedness estimation requires sample data in a specific format as well as estimates of complexity of infection (COI) and population allele frequencies. Here is an example of some preparation steps using Mozambique dataset. First, read in original data stored in a `.csv` file in a long format.
| sampleID | locus | allele | clinic | province |
| ---------- | ----- | ------ | ------ | -------- |
| 8025874217 | t1 | HB3.0 | Inhassoro | Inhambane |
| 8025874217 | t1 | D10--D6--FCR3--V1-S.0 | Inhassoro | Inhambane |
| 8025874217 | t20 | U659.0 | Inhassoro | Inhambane |
The name of the file is the first argument to the function that reads and reformats the data:
```{r}
sfile <- system.file("extdata", "MozParagon.csv", package = "dcifer")
dsmp <- readDat(sfile, svar = "sampleID", lvar = "locus", avar = "allele")
str(dsmp, list.len = 2)
```
Or, to reformat an R data frame (`dlong`) containing the same dataset:
```{r, eval = FALSE}
dsmp <- formatDat(dlong, svar = "sampleID", lvar = "locus", avar = "allele")
```
```{r}
# optionally, extract location information
meta <- unique(read.csv(sfile)[c("sampleID", "province")])
meta <- meta[match(names(dsmp), meta$sampleID), ] # order samples as in dsmp
```
Next, estimate COI for all the samples - here we use naive estimation, first ranking loci of a sample by the number of detected alleles, and then using a locus with a prescribed rank (`lrank`) to determine COI:
```{r}
lrank <- 2
coi <- getCOI(dsmp, lrank = lrank)
```
Finally, estimate population allele frequencies, adjusting for COI:
```{r}
afreq <- calcAfreq(dsmp, coi, tol = 1e-5)
str(afreq, list.len = 2)
```
In some situations, population allele frequencies might be estimated from a different (e.g. larger) dataset and provided separately (e.g. in a `.csv` file or R data frame):
| locus | allele | freq |
| ----- | ------ | ---- |
| t1 | D10--D6--FCR3--V1-S.0 | 0.44377920 |
| t1 | HB3.0 | 0.27713450 |
| t1 | t1.0 | 0.09375438 |
In that case, after the frequencies are read in, they need to be checked against the the existing data object to make sure that all the loci and alleles are in the same order. Function `matchAfreq` performs the checking and rearranges sample data to conform to the provided allele frequencies. For that procedure, loci and alleles in both lists (`dsmp` and `afreq`) have to be named; otherwise the names are not required, and the order of loci and alleles is assumed to be the same for sample data and allele frequencies. If `afreq` contains "extra" alleles that are not listed in `dsmp`, these alleles are added to `dsmp`. The opposite situation (alleles listed and present in `dsmp` but not listed in `afreq`) will result in an error.
```{r}
afile <- system.file("extdata", "MozAfreq.csv", package = "dcifer")
afreq2 <- readAfreq(afile, lvar = "locus", avar = "allele", fvar = "freq")
```
```{r, eval = FALSE}
# alternatively, if allele frequencies are provided in an R data frame (aflong):
afreq2 <- formatAfreq(aflong, lvar = "locus", avar = "allele", fvar = "freq")
```
```{r}
dsmp2 <- matchAfreq(dsmp, afreq2)
```
## Estimate relatedness
As a first step in relatedness estimation, we set $M = 1$ (only one pair of strains between two infections can be related) and test the hypothesis that infections are unrelated ($H_0\!: r = 0$). This is done with `ibdDat` function. Then we explore significantly related pairs in more detail.
For the purposes of demonstration, we first estimate relatedness and display results using `dsmp` as created previously where samples are not sorted by location.
```{r}
dres0 <- ibdDat(dsmp, coi, afreq, pval = TRUE, confint = TRUE, rnull = 0,
alpha = 0.05, nr = 1e3)
```
Look at the results for a single pair of samples:
```{r}
dres0[17, 10, ]
```
Alternatively, we may want to first sort samples by clinic or geographic location:
```{r}
provinces <- c("Maputo", "Inhambane")
nsite <- table(meta$province)[provinces]
ord <- order(factor(meta$province, levels = provinces))
dsmp <- dsmp[ord]
coi <- coi[ ord]
```
```{r, eval = FALSE}
dres <- ibdDat(dsmp, coi, afreq, nr = 1e3)
```
## Visualize the results
When pairwise relatedness is calculated within a single dataset, `ibdDat` returns triangular matrices. For plotting, we can make them symmetric. Then significantly related pairs can be outlined in either or both triangles.
Display the results for unsorted samples, with sample ID's written on the margins. Label colors correspond to locations (health facilities):
```{r, fig.width = 6, fig.height = 6}
par(mar = c(3, 3, 1, 1))
alpha <- 0.05 # significance level
dmat <- dres0[, , "estimate"]
# create symmetric matrix
dmat[upper.tri(dmat)] <- t(dmat)[upper.tri(t(dmat))]
# determine significant, reverse columns for upper triangle
isig <- which(dres0[, , "p_value"] <= alpha, arr.ind = TRUE)[, 2:1]
plotRel(dmat, isig = isig, draw_diag = TRUE, lwd_diag = 0.5, idlab = TRUE,
col_id = c(3:4)[factor(meta$province)])
```
For the sorted samples:
```{r, fig.width = 6, fig.height = 6}
par(mar = c(1, 3, 3, 1))
nsmp <- length(dsmp)
atsep <- cumsum(nsite)[-length(nsite)]
isig <- which(dres[, , "p_value"] <= alpha, arr.ind = TRUE)
dmat <- dres[, , "estimate"]
dmat[upper.tri(dmat)] <- t(dmat)[upper.tri(t(dmat))]
col_id <- rep(c("plum4", "lightblue4"), nsite)
plotRel(dmat, isig = rbind(isig, isig[, 2:1]), draw_diag = TRUE, alpha = alpha,
idlab = TRUE, side_id = c(2, 3), col_id = col_id, srt_id = c(25, 65))
abline(v = atsep, h = atsep, col = "gray45", lty = 5)
```
For these symmetric distance measures, one of the triangles can be used to display other relevant information, such as p-values, geographic distance, or a number of non-missing loci between two samples. For that, use `add = TRUE`.
```{r, fig.height = 3.85}
par(mfrow = c(1, 2), mar = c(1, 0, 1, 0.2))
plotRel(dres, draw_diag = TRUE, alpha = alpha)
mtext("p-values", 3, 0.2)
# p-values for upper triangle
pmat <- matrix(NA, length(dsmp), length(dsmp))
pmat[upper.tri(pmat)] <- t(log(dres[, , "p_value"]))[upper.tri(pmat)]
pmat[pmat == -Inf] <- min(pmat[is.finite(pmat)])
plotRel(pmat, rlim = NULL, draw_diag = TRUE, col = hcl.colors(101, "Red-Purple"),
sig = FALSE, add = TRUE, col_diag = "white", border_diag = "gray45")
abline(v = atsep, h = atsep, col = "gray45", lty = 5)
# number of non-missing loci for upper triangle
dmiss <- lapply(dsmp, function(lst) sapply(lst, function(v) all(!v)))
nmat <- matrix(NA, nsmp, nsmp)
for (jsmp in 2:nsmp) {
for (ismp in 1:(jsmp - 1)) {
nmat[ismp, jsmp] <- sum(!dmiss[[ismp]] & !dmiss[[jsmp]])
}
}
nrng <- range(nmat, na.rm = TRUE)
par(mar = c(1, 0.2, 1, 0))
plotRel(dres, draw_diag = TRUE, alpha = alpha)
mtext("number of loci", 3, 0.2)
coln <- hcl.colors(diff(nrng)*2.4, "Purple-Blue", rev = TRUE)[1:(diff(nrng) + 1)]
plotRel(nmat, rlim = NA, col = coln, add = TRUE,
draw_diag = TRUE, col_diag = "gray45", border_diag = "white")
par(pardef)
```
For reference, add a colorbar legend to the plot. It can be placed beside the main plot:
```{r, fig.width = 6.7, fig.height = 6}
layout(matrix(1:2, 1), width = c(7, 1))
par(mar = c(1, 1, 2, 1))
plotRel(dmat, draw_diag = TRUE, isig = rbind(isig, isig[, 2:1]))
atclin <- cumsum(nsite) - nsite/2
abline(v = atsep, h = atsep, col = "gray45", lty = 5)
mtext(provinces, side = 3, at = atclin, line = 0.2)
mtext(provinces, side = 2, at = atclin, line = 0.2)
par(mar = c(1, 0, 2, 3))
plotColorbar()
par(pardef)
```
The colorbar can also be located in the empty space left by the triangular matrix. In the example below, we specify horizontal colorbar and provide custom tick mark locations:
```{r, fig.width = 6.1, fig.height = 5.7}
# horizontal colorbar
par(mar = c(1, 1, 1, 3))
border_sig <- "darkviolet"
plotRel(dres, draw_diag = TRUE, border_diag = border_sig, alpha = alpha,
border_sig = border_sig, lwd_sig = 2)
legend(32, 20, pch = 0, col = border_sig, pt.lwd = 2, pt.cex = 1.4,
box.col = "gray", legend = expression("significant at" ~~ alpha == 0.05))
text(1:nsmp + 0.3, 1:nsmp - 0.5, labels = names(dsmp), col = col_id, adj = 0,
cex = 0.55, xpd = TRUE)
par(fig = c(0.25, 1, 0.81, 0.92), mar = c(1, 1, 1, 1), new = TRUE)
plotColorbar(at = c(0.2, 0.4, 0.6, 0.8), horiz = TRUE)
ncol <- 301
lines(c(0, ncol, ncol, 0, 0), c(0, 0, 1, 1, 0), col = "gray")
par(pardef)
```
## Further analysis of related samples
Examine pairs that are determined to be significantly related at the significance level $\alpha$ more closely by allowing multiple pairs of strains to be related between two infections. Using `ibdEstM`, we also estimate the number of positively related pairs $M'$ of strains and compare results yielded by a constrained model assuming $r_1 = \dotso = r_M$ (faster method) and without the constraint. In addition, we look at the estimates $\hat{r}_{total}$ of overall relatedness.
```{r, eval = FALSE}
# First, create a grid of r values to evaluate over
revals <- mapply(generateReval, 1:5, nr = c(1e3, 1e2, 32, 16, 12))
```
```{r}
sig1 <- sig2 <- vector("list", nrow(isig))
for (i in 1:nrow(isig)) {
sig1[[i]] <- ibdEstM(dsmp[isig[i, ]], coi[isig[i, ]], afreq, Mmax = 5,
equalr = FALSE, reval = revals)
}
for (i in 1:nrow(isig)) {
sig2[[i]] <- ibdEstM(dsmp[isig[i, ]], coi[isig[i, ]], afreq, equalr = TRUE)
}
M1 <- sapply(sig1, function(r) sum(r > 0))
M2 <- sapply(sig2, length)
rtotal1 <- sapply(sig1, sum)
rtotal2 <- sapply(sig2, sum)
cor(M1, M2)
cor(rtotal1, rtotal2)
```
Create a list of significant pairs:
```{r}
samples <- names(dsmp)
sig <- as.data.frame(isig, row.names = FALSE)
sig[c("id1", "id2")] <- list(samples[isig[, 1]], samples[isig[, 2]])
sig[c("M1", "M2")] <- list(M1, M2)
sig[c("rtotal1", "rtotal2")] <- list(round(rtotal1, 3), round(rtotal2, 3))
head(sig)
```
For full control, we can use the function `ibdPair` for any two samples and explore the outputs:
```{r}
i <- 18
pair <- dsmp[isig[i, ]]
coii <- coi[isig[i, ]]
res1 <- ibdPair(pair, coii, afreq, M = M1[i], pval = TRUE, equalr = FALSE,
reval = revals[[M1[i]]])
res2 <- ibdPair(pair, coii, afreq, M = M2[i], pval = TRUE, equalr = TRUE,
confreg = TRUE, llik = TRUE, reval = revals[[1]])
res1$rhat # estimate with equalr = FALSE
rep(res2$rhat, M2[i]) # estimate with equalr = TRUE
```
When `llik = TRUE`, output includes log-likelihood (for a single parameter if we use `equalr = TRUE`), which provides the basis for statistical inference:
```{r, fig.width = 6, fig.height = 4.5}
CI <- range(res2$confreg)
llikCI <- max(res2$llik) - qchisq(1 - alpha, df = 1)/2
llrng <- range(res2$llik[is.finite(res2$llik)])
yCI <- llrng + diff(llrng)*0.25
yLR <- (res2$llik[1] + max(res2$llik))/2
cols <- c("purple", "cadetblue3", "gray60")
par(mar = c(3, 2.5, 2, 0.1), mgp = c(1.5, 0.3, 0))
plot(revals[[1]], res2$llik, type = "l", xlab = "r", ylab = "log-likelihood",
yaxt = "n")
abline(v = res2$rhat, lty = 1, col = cols[1])
abline(h = c(max(res2$llik), llikCI, res2$llik[[1]]), lty = 2, col = cols[2])
abline(v = CI, col = cols[1], lty = 5)
arrows(CI[1], yCI, CI[2], yCI, angle = 20, length = 0.1, code = 3,
col = cols[3])
arrows(0.15, res2$llik[1], 0.15, max(res2$llik), angle = 20, length = 0.1,
code = 3, col = cols[3])
text(0.165, yLR, adj = 0, "0.5 LR statistic", col = cols[3])
text(mean(CI), yCI + 0.05*diff(llrng), "confidence interval", col = cols[3])
text(res2$rhat - 0.025, yLR, "MLE", col = cols[3], srt = 90)
par(pardef)
```
<p align="center">
<img src="man/figures/logo.svg" alt="" width=160>
</p>