forked from poldrack/psych10-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08b-ResamplingInR.Rmd
156 lines (115 loc) · 3.51 KB
/
08b-ResamplingInR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
output:
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
pdf_document: default
html_document: default
---
# Resampling and simulation in R
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
library(knitr)
set.seed(123456) # set random seed to exactly replicate results
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
# load the NHANES data library
library(NHANES)
# drop duplicated IDs within the NHANES dataset
NHANES <- NHANES %>%
dplyr::distinct(ID,.keep_all=TRUE)
NHANES_adult <- NHANES %>%
drop_na(Height) %>%
subset(Age>=18)
```
## Generating random samples (Section \@ref(generating-random-numbers))
Here we will generate random samples from a number of different distributions and plot their histograms.
```{r fig.height=8, fig.width=8}
nsamples <- 10000
nhistbins <- 100
# uniform distribution
p1 <-
tibble(
x = runif(nsamples)
) %>%
ggplot((aes(x))) +
geom_histogram(bins = nhistbins) +
labs(title = "Uniform")
# normal distribution
p2 <-
tibble(
x = rnorm(nsamples)
) %>%
ggplot(aes(x)) +
geom_histogram(bins = nhistbins) +
labs(title = "Normal")
# Chi-squared distribution
p3 <-
tibble(
x = rnorm(nsamples)
) %>%
ggplot(aes(x)) +
geom_histogram(bins = nhistbins) +
labs(title = "Normal")
# Chi-squared distribution
p3 <-
tibble(
x = rchisq(nsamples, df=1)
) %>%
ggplot(aes(x)) +
geom_histogram(bins = nhistbins) +
labs(title = "Chi-squared")
# Poisson distribution
p4 <-
tibble(
x = rbinom(nsamples, 20, 0.25)
) %>%
ggplot(aes(x)) +
geom_histogram(bins = nhistbins) +
labs(title = "Binomial (p=0.25, 20 trials)")
plot_grid(p1, p2, p3, p4, ncol = 2)
```
## Simulating the maximum finishing time
Let's simulate 150 samples, collecting the maximum value from each sample, and then plotting the distribution of maxima.
```{r fig.width=4, fig.height=4, out.width="50%"}
# sample maximum value 5000 times and compute 99th percentile
nRuns <- 5000
sampSize <- 150
sampleMax <- function(sampSize = 150) {
samp <- rnorm(sampSize, mean = 5, sd = 1)
return(tibble(max=max(samp)))
}
input_df <- tibble(id=seq(nRuns)) %>%
group_by(id)
maxTime <- input_df %>% do(sampleMax())
cutoff <- quantile(maxTime$max, 0.99)
ggplot(maxTime,aes(max)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = cutoff, color = "red")
```
## The bootstrap
The bootstrap is useful for creating confidence intervals in cases where we don't have a parametric distribution. One example is for the median; let's look at how that works. We will start by implementing it by hand, to see more closely how it works. We will start by collecting a sample of individuals from the NHANES dataset, and the using the bootstrap to obtain confidence intervals on the median for the Height variable.
```{r echo=FALSE}
nRuns <- 2500
sampleSize <- 64
# take a sample
heightSample <-
NHANES_adult %>%
sample_n(sampleSize)
# create a function to collect a sample with replacement
bootMedianHeight <- function(df) {
bootSample <- sample_n(df, dim(df)[1], replace = TRUE)
return(tibble(medianHeight=median(bootSample$Height)))
}
input_df <- tibble(id=seq(nRuns)) %>%
group_by(id)
bootMeans <- do(input_df, bootMedianHeight(heightSample))
bootCI <- tibble(`Lower CI limit`=quantile(bootMeans$medianHeight,.025),
Median=median(heightSample$Height),
`Upper CI limit`=quantile(bootMeans$medianHeight,.975)
)
kable(bootCI)
```