-
Notifications
You must be signed in to change notification settings - Fork 1
/
6-GLM.qmd
177 lines (122 loc) · 6.18 KB
/
6-GLM.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
output: html_document
editor_options:
chunk_output_type: console
---
```{r, include=FALSE}
set.seed(42)
```
# Generalized linear models {#sec-glm}
```{r}
library(EcoData)
library(effects)
library(DHARMa)
str(titanic) #from EcoData package
titanic$Fpclass = as.factor(titanic$pclass)
m = lm(survived~age+sex, data = titanic)
summary(m)
par(mfrow = c(2, 2))
plot(m)
par(mfrow = c(1,1))
```
We can see that there is something seriously wrong with the residuals! One assumption for the lm is that the residuals and the response variable are normally distributed:
```{r}
hist(titanic$survived)
table(titanic$survived)
```
Our response consists of 0s and 1s, so it is a binomial distribution. For binomial data we cannot use a lm because we need to restrict our values to be in the range of \[0,1\] (our 'lm' needs to predict probabilites).
The idea of generalized linear models is that we can use link (and inverse link functions) to transform the expected values into the range of our response.
- All GLMs have a linear term to specify the desired dependency on explanatory variables (we already know this) $y = a x + b x_2 + c$
- a link function to scale the linear term to the correct range of values for the distribution that was chosen(this is new)
- and a probability distribution that describes the model from which the dependent variable is generated(this is also new -- so far: normal = gaussian)
LM and GLMS -- the three models you should know:
| Type | R call | Properties |
|------------------------|------------------------|------------------------|
| Linear regression (continuous data) | `lm(y ~ x)` or `glm(y~x, family = gaussian())` | Identity (no) link, normal distribution (family = gaussian) no inverse link |
| Logistic regression (1/0 or k out of n data) | `glm(y ~x, family = binomial)` | binomial distribution, logit link: $log(p/(1-p))$ |
| inverse link: $exp(y)/(1 + exp(y))$ | | |
| Poisson regression (Count data) | `glm(y ~x, family = poisson)` | Poisson distribution, log link: $log(lambda)$ |
| inverse link: $exp(y)$ | | |
Model diagnostic (residual checks) cannot be done anymore by plotting the model. This can be only done for a lm! For glms we have to use the DHARMa package!
For more details on glms, see the chapter about glms in the [advanced regression book.](https://theoreticalecology.github.io/AdvancedRegressionModels/4A-GLMs.html)
## Logistic Regression {#sec-logistic}
For the binomial model we can use the logit link:
$$
logit(y) = a_0 +a_1*x
$$
And with the inverse link:
$$
y = \frac{1}{1+e^{-(a_0 + a_1) }}
$$
You can interpret the glm outputs basically like lm outputs.
**BUT:** To get absolute response values, you have to transform the output with the inverse link function. For the logit, e.g. an intercept of 0 means a predicted value of 0.5. Different overall statistics: no R^2^ instead Pseudo R^2^ = 1 - Residual deviance / Null deviance(deviance is based on the likelihood):
```{r}
# logistic regression with categorical predictor
m1 = glm(survived ~ sex, data = titanic, family = "binomial")
summary(m1)
# 2 groups: sexmale = difference of mean for male from mean for female
# intercept = linear term for female:
0.98
# but: this has to be transformed back to original scale before being interpreted!!!
# survival probability for females
plogis(0.98)
# applies inverse logit function
# linear term for male
0.98 - 2.43
# survival probability
plogis(0.98 - 2.43)
# predicted linear term
table(predict(m1))
# predicted response
table(predict(m1, type = "response"))
plot(allEffects(m1))
# more predictors
m2 = glm(survived ~ sex + age, titanic, family = binomial)
summary(m2)
# Calculate Pseudo R2: 1 - Residual deviance / Null deviance
1 - 1101.3/1414.6 # Pseudo R2 of model
# Anova
anova(m2, test = "Chisq")
plot(allEffects(m2))
```
Residual check:
```{r}
# Model diagnostics
# do not use the plot(model) residual checks
# use DHARMa package
library(DHARMa)
res = simulateResiduals(m2)
plot(res)
```
## Poisson Regression {#sec-poisson}
Poisson regression is used for count data. Properties of count data are: no negative values, only integers, y \~ Poisson(lambda); where lambda = mean = variance, log link function (lambda must be positive).
Example:
```{r}
head(birdfeeding)
plot(feeding ~ attractiveness, birdfeeding)
fit = glm(feeding ~ attractiveness, birdfeeding, family = "poisson")
summary(fit)
# feeding for a bird with attractiveness 3
# linear term
1.47 + 0.148 * 3
# pieces of food, using inverse of the link function, log --> exp
exp(1.47 + 0.148 * 3)
plot(allEffects(fit))
# checking residuals
res = simulateResiduals(fit)
plot(res, quantreg = F)
# the warning is because of a recent change in DHARMa
# qgam requires more data points
```
Normal versus Poisson distribution:
- N(mean, sd)This means that fitting a normal distribution estimates a parameter for the variance (sd)
- Poisson(lambda)lambda = mean = varianceThis means that a Poisson regression does not fit a separate parameter for the variance
So in the glm always assume that the variance is the mean, which is a strong assumption. In reality it can often occur that the variance is greater than expected (Overdispersion) or smaller than expected (Underdispersion). (this cannot happen for the lm because there we estimate a variance parameter (residual variance)). Overdispersion can have a HUGE influence on the MLEs and particularly on the p-values!
We can use the DHARMa package to check for Over or Underdispersion:
```{r}
# test for overdispersion
testDispersion(fit)
# Dispersion test is necessary for all poisson or binomial models with k/n
# if positive, you can chose family = quasi-poisson or quasi-binomial
# or use negative binomial distribution instead of poisson
```