08b-ResamplingInR.Rmd

---
output:
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  pdf_document: default
  html_document: default
---

# Resampling and simulation in R

```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
library(knitr)

set.seed(123456) # set random seed to exactly replicate results
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)

# load the NHANES data library
library(NHANES)

# drop duplicated IDs within the NHANES dataset
NHANES <- NHANES %>% 
  dplyr::distinct(ID,.keep_all=TRUE)

NHANES_adult <- NHANES %>%
  drop_na(Height) %>%
  subset(Age>=18)


```

## Generating random samples (Section \@ref(generating-random-numbers))

Here we will generate random samples from a number of different distributions and plot their histograms.

```{r fig.height=8, fig.width=8}

nsamples <- 10000
nhistbins <- 100

# uniform distribution

p1 <-
  tibble(
    x = runif(nsamples)
  ) %>% 
  ggplot((aes(x))) +
  geom_histogram(bins = nhistbins) + 
  labs(title = "Uniform")

# normal distribution
p2 <-
  tibble(
    x = rnorm(nsamples)
  ) %>% 
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Normal")

# Chi-squared distribution
p3 <-
  tibble(
    x = rnorm(nsamples)
  ) %>% 
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Normal")

# Chi-squared distribution
p3 <-
  tibble(
    x = rchisq(nsamples, df=1)
  ) %>% 
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Chi-squared")

# Poisson distribution
p4 <-
  tibble(
    x = rbinom(nsamples, 20, 0.25)
  ) %>% 
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Binomial (p=0.25, 20 trials)")


plot_grid(p1, p2, p3, p4, ncol = 2)

```

## Simulating the maximum finishing time

Let's simulate 150 samples, collecting the maximum value from each sample, and then plotting the distribution of maxima.

```{r fig.width=4, fig.height=4, out.width="50%"}
# sample maximum value 5000 times and compute 99th percentile
nRuns <- 5000
sampSize <- 150

sampleMax <- function(sampSize = 150) {
  samp <- rnorm(sampSize, mean = 5, sd = 1)
  return(tibble(max=max(samp)))
}

input_df <- tibble(id=seq(nRuns)) %>%
  group_by(id)

maxTime <- input_df %>% do(sampleMax())

cutoff <- quantile(maxTime$max, 0.99)


ggplot(maxTime,aes(max)) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = cutoff, color = "red")

```

## The bootstrap

The bootstrap is useful for creating confidence intervals in cases where we don't have a parametric distribution. One example is for the median; let's look at how that works. We will start by implementing it by hand, to see more closely how it works.  We will start by collecting a sample of individuals from the NHANES dataset, and the using the bootstrap to obtain confidence intervals on the median for the Height variable.

```{r echo=FALSE}

nRuns <- 2500
sampleSize <- 64

# take a sample
heightSample <- 
  NHANES_adult %>%
  sample_n(sampleSize)

# create a function to collect a sample with replacement
bootMedianHeight <- function(df) {
  bootSample <- sample_n(df, dim(df)[1], replace = TRUE)
  return(tibble(medianHeight=median(bootSample$Height)))
}

input_df <- tibble(id=seq(nRuns)) %>%
  group_by(id)

bootMeans <- do(input_df, bootMedianHeight(heightSample))

bootCI <- tibble(`Lower CI limit`=quantile(bootMeans$medianHeight,.025),
            Median=median(heightSample$Height),
            `Upper CI limit`=quantile(bootMeans$medianHeight,.975)
            )
kable(bootCI)
```