A survey of 204 Seattle-area commuters was conducted to examine the duration of time that commuters delay their work-to-home trips in an effort to avoid peak period traffic congestion. Of the 204 commuters surveyed, 96 indicated that they sometimes delayed their work-to-home trip to avoid traffic congestion. These commuters provided their average time delay. Thus, each commuter has a completed delay duration so that neither left nor right censoring is present in the data.
Your tasks:
- Plot the Kaplan-Meier estimate of the duration of time that commuters delay their work-to-home trips;
- Determine the significant factors that affect the duration of commuters’ delay using a Cox model;
- Examine the work-to-home departure delay using exponential, Weibull, and log-logistic proportional-hazards models.
library(readxl)
library(skimr)
library(survival)
library(coxme)
library(survminer)
library(ggplot2)
data.delay <- read_excel("Data/ExerciseHBDM.xlsx")
data.delay <- data.frame(data.delay)
head(data.delay) #variable names are missing
## id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1000 0 0 0 1 2 1 1 1 2 0 5 1 12 1 1.4 37070 1024 1295 5610
## 2 1001 30 1 1 1 2 1 2 0 2 0 5 0 14 1 1.8 24410 6430 3206 1997
## 3 1002 0 0 0 1 5 1 4 1 1 0 3 0 5 1 1.3 24410 6430 3206 1997
## 4 1003 0 0 0 1 5 1 4 1 1 0 2 0 5 1 1.3 24410 6430 3206 1997
## 5 1004 0 0 0 2 2 1 2 1 1 0 3 0 3 0 1.2 5079 1024 899 4092
## 6 1005 38 1 3 1 2 1 7 1 2 0 5 0 14 1 1.8 24410 6430 3206 1997
names(data.delay) <- c("id","minutes","activity","number_of_times","mode","route","congested","age",
"gender","number_cars","number_children","income","flexible","distance",
"LOSD","rate_of_travel","population","retail","service","size")
df <- data.frame(data.delay, row.names = 1) #make id (1st variable) as row name
str(df)
## 'data.frame': 204 obs. of 19 variables:
## $ minutes : num 0 30 0 0 0 38 0 0 45 0 ...
## $ activity : num 0 1 0 0 0 1 0 0 1 0 ...
## $ number_of_times: num 0 1 0 0 0 3 0 0 0 0 ...
## $ mode : num 1 1 1 1 2 1 2 1 2 1 ...
## $ route : num 2 2 5 5 2 2 2 1 2 5 ...
## $ congested : num 1 1 1 1 1 1 1 0 1 0 ...
## $ age : num 1 2 4 4 2 7 2 7 3 6 ...
## $ gender : num 1 0 1 1 1 1 1 1 1 1 ...
## $ number_cars : num 2 2 1 1 1 2 2 2 2 3 ...
## $ number_children: num 0 0 0 0 0 0 0 0 0 1 ...
## $ income : num 5 5 3 2 3 5 6 4 4 5 ...
## $ flexible : num 1 0 0 0 0 0 0 0 0 0 ...
## $ distance : num 12 14 5 5 3 14 11 13 22 6 ...
## $ LOSD : num 1 1 1 1 0 1 1 0 1 0 ...
## $ rate_of_travel : chr "1.4" "1.8" "1.3" "1.3" ...
## $ population : num 37070 24410 24410 24410 5079 ...
## $ retail : num 1024 6430 6430 6430 1024 ...
## $ service : num 1295 3206 3206 3206 899 ...
## $ size : num 5610 1997 1997 1997 4092 ...
skim(df)
Name | df |
Number of rows | 204 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Data summary
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
rate_of_travel | 0 | 1 | 3 | 3 | 0 | 15 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
minutes | 0 | 1 | 24.14 | 36.27 | 0 | 0 | 0.0 | 30.00 | 240 | ▇▁▁▁▁ |
activity | 0 | 1 | 0.78 | 0.97 | 0 | 0 | 0.0 | 1.00 | 3 | ▇▃▁▂▁ |
number_of_times | 0 | 1 | 0.86 | 1.31 | 0 | 0 | 0.0 | 2.00 | 5 | ▇▂▁▁▁ |
mode | 0 | 1 | 2.08 | 1.49 | 1 | 1 | 1.0 | 4.00 | 5 | ▇▂▁▂▂ |
route | 0 | 1 | 3.49 | 1.46 | 1 | 2 | 3.5 | 5.00 | 5 | ▂▆▂▂▇ |
congested | 0 | 1 | 0.63 | 0.48 | 0 | 0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
age | 0 | 1 | 2.83 | 1.69 | 1 | 2 | 2.0 | 4.00 | 7 | ▇▃▂▁▁ |
gender | 0 | 1 | 0.64 | 0.48 | 0 | 0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
number_cars | 0 | 1 | 1.80 | 1.04 | 0 | 1 | 2.0 | 2.00 | 7 | ▇▇▂▁▁ |
number_children | 0 | 1 | 0.70 | 1.05 | 0 | 0 | 0.0 | 1.00 | 5 | ▇▂▁▁▁ |
income | 0 | 1 | 2.78 | 1.56 | 1 | 1 | 3.0 | 4.00 | 6 | ▇▃▂▂▁ |
flexible | 0 | 1 | 0.62 | 0.49 | 0 | 0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
distance | 0 | 1 | 7.15 | 4.81 | 1 | 4 | 6.0 | 10.00 | 25 | ▇▆▃▁▁ |
LOSD | 0 | 1 | 0.65 | 0.48 | 0 | 0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
population | 0 | 1 | 25367.32 | 10232.19 | 1303 | 23026 | 24410.0 | 34895.00 | 37070 | ▃▁▁▇▇ |
retail | 0 | 1 | 4604.66 | 4334.42 | 616 | 1866 | 3906.0 | 3966.00 | 16523 | ▆▇▁▁▂ |
service | 0 | 1 | 9733.06 | 10552.66 | 595 | 1606 | 10582.0 | 10582.00 | 38607 | ▇▇▁▁▂ |
size | 0 | 1 | 3087.51 | 1598.03 | 475 | 2472 | 2753.0 | 4447.75 | 5653 | ▃▇▇▁▆ |
df <- df[order(df$minutes),] #sort by time
plot(df$minutes, type="h") #high-density vertical lines
data.delay2 <-subset(df, minutes>0)
df.survfit = survfit(Surv(minutes) ~ 1, data= data.delay2)
summary(df.survfit)
## Call: survfit(formula = Surv(minutes) ~ 1, data = data.delay2)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 4 96 1 0.9896 0.0104 0.96948 1.0000
## 10 95 4 0.9479 0.0227 0.90450 0.9934
## 15 91 7 0.8750 0.0338 0.81128 0.9437
## 20 84 1 0.8646 0.0349 0.79878 0.9358
## 24 83 1 0.8542 0.0360 0.78640 0.9278
## 25 82 1 0.8438 0.0371 0.77416 0.9196
## 30 81 31 0.5208 0.0510 0.42990 0.6310
## 38 50 1 0.5104 0.0510 0.41961 0.6209
## 40 49 3 0.4792 0.0510 0.38897 0.5903
## 45 46 10 0.3750 0.0494 0.28965 0.4855
## 53 36 1 0.3646 0.0491 0.27997 0.4748
## 60 35 16 0.1979 0.0407 0.13231 0.2961
## 80 19 1 0.1875 0.0398 0.12364 0.2843
## 90 18 7 0.1146 0.0325 0.06571 0.1998
## 100 11 1 0.1042 0.0312 0.05794 0.1873
## 105 10 1 0.0937 0.0297 0.05033 0.1746
## 120 9 6 0.0312 0.0178 0.01026 0.0952
## 130 3 1 0.0208 0.0146 0.00529 0.0821
## 150 2 1 0.0104 0.0104 0.00148 0.0732
## 240 1 1 0.0000 NaN NA NA
Note: The functions
survfit()
andSurv()
create a life table survival object.
# option 1
plot(
df.survfit,
xlab = "Time (minutes)",
ylab = "Survival
probability",
conf.int = TRUE
)
#option 2
ggsurvplot(
df.survfit,
xlab = "Time (minutes)",
xlim =
range(0:250) ,
conf.int = TRUE,
pallete = "red",
ggtheme =
theme_minimal()
)
Note: It is the most widely applied nonparametric method in survival analysis.The Kaplan–Meier method provides useful estimates of survival probabilities and a graphical presentation of the survival distribution.
The Kaplan–Meier method assumes that censoring is independent of survival times. If this is false, the Kaplan–Meier method is inappropriate.
The Cox proportional-hazards model is semiparametric method that
produces estimated hazard ratios (sometimes called rate ratios or risk
ratios).
Let’s perform a Cox Proportional Hazard Model Estimate estimate the
model for Duration of Commuter Work-To-Home Delay to Avoid Congestion
result.cox <- coxph(Surv(minutes) ~ gender + rate_of_travel + distance + population, data= data.delay2)
summary(result.cox)
## Call:
## coxph(formula = Surv(minutes) ~ gender + rate_of_travel + distance +
## population, data = data.delay2)
##
## n= 96, number of events= 96
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gender 1.617e-01 1.176e+00 2.608e-01 0.620 0.5352
## rate_of_travel1.3 -3.625e-01 6.960e-01 1.280e+00 -0.283 0.7771
## rate_of_travel1.4 1.233e+00 3.430e+00 1.251e+00 0.985 0.3245
## rate_of_travel1.5 -5.127e-01 5.989e-01 1.186e+00 -0.432 0.6656
## rate_of_travel1.6 3.660e-02 1.037e+00 1.147e+00 0.032 0.9746
## rate_of_travel1.7 1.018e+00 2.767e+00 1.341e+00 0.759 0.4478
## rate_of_travel1.8 -8.239e-01 4.387e-01 1.093e+00 -0.754 0.4510
## rate_of_travel1.9 -9.901e-01 3.716e-01 1.142e+00 -0.867 0.3860
## rate_of_travel2.0 -9.167e-01 3.998e-01 1.115e+00 -0.822 0.4112
## rate_of_travel2.1 -1.468e+00 2.303e-01 1.086e+00 -1.353 0.1762
## rate_of_travel2.2 -1.138e+00 3.204e-01 1.119e+00 -1.017 0.3089
## rate_of_travel2.3 -1.853e+00 1.568e-01 1.109e+00 -1.670 0.0949 .
## rate_of_travel2.4 -1.515e+00 2.197e-01 1.117e+00 -1.357 0.1748
## rate_of_travel2.5 -1.510e+00 2.208e-01 1.114e+00 -1.355 0.1753
## distance -7.486e-02 9.279e-01 3.403e-02 -2.200 0.0278 *
## population -2.157e-05 1.000e+00 1.319e-05 -1.635 0.1020
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gender 1.1755 0.8507 0.70512 1.9597
## rate_of_travel1.3 0.6960 1.4369 0.05662 8.5547
## rate_of_travel1.4 3.4301 0.2915 0.29538 39.8330
## rate_of_travel1.5 0.5989 1.6698 0.05858 6.1226
## rate_of_travel1.6 1.0373 0.9641 0.10950 9.8255
## rate_of_travel1.7 2.7666 0.3615 0.19994 38.2812
## rate_of_travel1.8 0.4387 2.2793 0.05150 3.7373
## rate_of_travel1.9 0.3716 2.6914 0.03962 3.4847
## rate_of_travel2.0 0.3998 2.5010 0.04492 3.5593
## rate_of_travel2.1 0.2303 4.3425 0.02743 1.9335
## rate_of_travel2.2 0.3204 3.1208 0.03578 2.8698
## rate_of_travel2.3 0.1568 6.3758 0.01784 1.3792
## rate_of_travel2.4 0.2197 4.5514 0.02462 1.9612
## rate_of_travel2.5 0.2208 4.5287 0.02486 1.9616
## distance 0.9279 1.0777 0.86801 0.9919
## population 1.0000 1.0000 0.99995 1.0000
##
## Concordance= 0.677 (se = 0.036 )
## Likelihood ratio test= 30.65 on 16 df, p=0.01
## Wald test = 32.27 on 16 df, p=0.009
## Score (logrank) test = 36 on 16 df, p=0.003
Note: Regression coefficients are on a log scale.
Test the proportional hazards assumption on the basis of partial
residuals. Type of residual known as Schoenfeld residuals.
It includes an interaction between the covariate and a function of time
(or distance). Log time is often used but it could be any function. For
each covariate, the function cox.zph()
correlates the corresponding
set of scaled Schoenfeld residuals with time, to test for independence
between residuals and time. Additionally, it performs a global test for
the model as a whole.
A plot that shows a non-random pattern against time is evidence of
violation of the PH assumption.
test.ph <- cox.zph(result.cox)
par(mfrow=c(2,2))
plot(test.ph)
ggcoxzph(test.ph)
Note: If significant then the assumption is violated.
In principle, the Schoenfeld residuals are independent of time.
ggsurvplot(
survfit(result.cox),
data = data.delay2,
palette = "#2E9FDF",
ggtheme = theme_minimal()
)
ggsurvplot(
survfit(result.cox),
data = data.delay2,
conf.int = TRUE,
palette = c("#FF9E29", "#86AA00"),
risk.table = TRUE,
risk.table.col = "strata",
fun = "event"
)
result.cox$loglik
## [1] -345.3794 -330.0545
LLi = result.cox$loglik[1] # Initial log-likelihood
LLf = result.cox$loglik[2] # Final log-likelihood
1- (LLf/LLi) # pseudo R-suare
## [1] 0.0443713
Parametric Model Estimates of the Duration of Commuter Work-ToHome Delay to Avoid Congestion
- Exponential
survreg(
Surv(minutes) ~ gender + rate_of_travel + distance + population,
data = data.delay2,
dist = "exponential" )
## Call:
## survreg(formula = Surv(minutes) ~ gender + rate_of_travel + distance +
## population, data = data.delay2, dist = "exponential")
##
## Coefficients:
## (Intercept) gender rate_of_travel1.3 rate_of_travel1.4 rate_of_travel1.5 rate_of_travel1.6
## 2.879512e+00 -1.108216e-01 2.286504e-01 -6.243518e-01 2.089118e-01 -7.778191e-02
## rate_of_travel1.7 rate_of_travel1.8 rate_of_travel1.9 rate_of_travel2.0 rate_of_travel2.1 rate_of_travel2.2
## -5.431004e-01 3.753339e-01 4.918786e-01 4.186932e-01 6.440768e-01 5.006781e-01
## rate_of_travel2.3 rate_of_travel2.4 rate_of_travel2.5 distance population
## 9.390733e-01 7.377887e-01 8.025575e-01 4.070892e-02 1.261692e-05
##
## Scale fixed at 1
##
## Loglik(model)= -467.8 Loglik(intercept only)= -474
## Chisq= 12.5 on 16 degrees of freedom, p= 0.709
## n= 96
- Weibull
survreg(
Surv(minutes) ~ gender + rate_of_travel + distance + population,
data = data.delay2,
dist = "weibull" )
## Call:
## survreg(formula = Surv(minutes) ~ gender + rate_of_travel + distance +
## population, data = data.delay2, dist = "weibull")
##
## Coefficients:
## (Intercept) gender rate_of_travel1.3 rate_of_travel1.4 rate_of_travel1.5 rate_of_travel1.6
## 2.895928e+00 -9.903521e-02 2.181794e-01 -6.093365e-01 2.762764e-01 -3.648657e-03
## rate_of_travel1.7 rate_of_travel1.8 rate_of_travel1.9 rate_of_travel2.0 rate_of_travel2.1 rate_of_travel2.2
## -4.933829e-01 4.904227e-01 5.528334e-01 5.234497e-01 8.567097e-01 6.509833e-01
## rate_of_travel2.3 rate_of_travel2.4 rate_of_travel2.5 distance population
## 1.210978e+00 8.503809e-01 8.828783e-01 3.939630e-02 1.222171e-05
##
## Scale= 0.5153148
##
## Loglik(model)= -442.3 Loglik(intercept only)= -461.8
## Chisq= 38.9 on 16 degrees of freedom, p= 0.00112
## n= 96
- Log-logistic
survreg(
Surv(minutes) ~ gender + rate_of_travel + distance + population,
data = data.delay2,
dist = "loglogistic" )
## Call:
## survreg(formula = Surv(minutes) ~ gender + rate_of_travel + distance +
## population, data = data.delay2, dist = "loglogistic")
##
## Coefficients:
## (Intercept) gender rate_of_travel1.3 rate_of_travel1.4 rate_of_travel1.5 rate_of_travel1.6
## 2.892950e+00 -1.442251e-01 2.314555e-01 -6.226149e-01 1.475325e-01 -1.335428e-01
## rate_of_travel1.7 rate_of_travel1.8 rate_of_travel1.9 rate_of_travel2.0 rate_of_travel2.1 rate_of_travel2.2
## -5.913971e-01 2.521084e-01 4.608343e-01 2.887788e-01 3.701638e-01 3.211386e-01
## rate_of_travel2.3 rate_of_travel2.4 rate_of_travel2.5 distance population
## 5.197213e-01 6.084442e-01 7.333592e-01 4.171315e-02 1.217428e-05
##
## Scale= 0.3326078
##
## Loglik(model)= -442.2 Loglik(intercept only)= -456.9
## Chisq= 29.45 on 16 degrees of freedom, p= 0.0211
## n= 96
Note: The argument dist has several options to describe the parametric model used (“weibull”, “exponential”, “gaussian”, “logistic”, “lognormal”, or “loglogistic”. See more with
?survreg