-
Notifications
You must be signed in to change notification settings - Fork 2
/
03-model-methods.Rmd
217 lines (174 loc) · 9.16 KB
/
03-model-methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
# Model methods {#methods}
## Interpolation
Models that can be estimated in the presence of missing values can often be used to interpolate the unknown values. Often interpolated values can be taken from model's fitted values, and some models may support more sophisticated interpolation methods.
The [forecast package](https://github.com/robjhyndman/forecast/) provides the `na.interp` function for interpolating time series data, which uses linear interpolation for non-seasonal data, and STL decomposition for seasonal data.
Tidy time series tools should allow users to interpolate missing values using any appropriate model.
For example, the `tsibbledata::olympic_running` dataset contains Olympic men's 400m track final winning times. The winning times for the 1916, 1940 and 1944 Olympics are missing from the dataset due to the World Wars.
```{r mens400, echo = FALSE}
library(tsibbledata)
library(tidyverse)
olympic_running %>%
ggplot(aes(x=Year, y = Time, colour = Sex)) +
geom_line() +
geom_point(size = 1) +
facet_wrap(~ Length, scales = "free_y", nrow = 2) +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom", legend.title = element_blank()) +
ylab("Running time (seconds)")
```
We could then interpolate these missing values using the fitted values from a linear model with a trend:
```{r mens400-interpolated, eval = FALSE}
olympic_running %>%
model(lm = TSLM(Time ~ trend())) %>%
interpolate(olympic_running)
```
```{r mens400-interpolated-display, echo = FALSE}
olympic_complete <- olympic_running %>%
model(lm = TSLM(Time ~ trend())) %>%
interpolate(olympic_running)
olympic_complete
olympic_running %>%
ggplot(aes(x=Year, y = Time, colour = Sex)) +
geom_line(aes(linetype = "Interpolated"), data = olympic_complete) +
geom_line(aes(linetype = "Actual")) +
geom_point(size = 1) +
facet_wrap(~ Length, scales = "free_y", nrow = 2) +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom", legend.title = element_blank()) +
ylab("Running time (seconds)")
```
## Re-estimation
https://github.com/tidyverts/fable/issues/43
### refit()
The refitting a model allows the same model to be applied to a new dataset. This is similar to the `model` argument available in most modelling functions from the [forecast package](https://github.com/robjhyndman/forecast/).
The refitted model should maintain the same structure and coefficients of the original model, with fitted information updated to reflect the model's behaviour on the new dataset. It should also be possible to allow re-estimation of parameters using the `reestimate` argument, which keeps the selected model terms but updates the model coefficients/parameters.
It is expected that a refit method uses a fitted model and replacement data to return a mable.
For the ETS model for `mdeaths` estimated above:
```{r ets-mdeaths}
library(fable)
ets_fit <- as_tsibble(mdeaths) %>%
model(ETS(value))
```
We may be interested in using the same model with the same coefficients to estimate the `fdeaths` series:
```{r ets-refit}
refit(ets_fit, as_tsibble(fdeaths))
```
### stream()
Streaming data into a model allows a model to be extended to accomodate new, future data. Like `refit`, `stream` should allow re-estimation of the model parameters. As this can be a costly operation for some models, in most cases updating the parameters should not occur. However it is recommended that the model parameters are updated on a regular basis.
Suppose we are estimating electricity demand data (`tsibbledata::aus_elec`), and after fitting a model to the existing data, a new set of data from the next month becomes available.
```{r stream-tsplot, echo = FALSE}
library(fasster)
library(lubridate)
elec_tr <- tsibbledata::aus_elec %>%
filter(
State == "Victoria",
Time < ymd("2014-12-01"), Time >= ymd("2014-09-01")
)
elec_stream <- tsibbledata::aus_elec %>%
filter(State == "Victoria", Time >= ymd("2014-12-01"))
ggplot(NULL, aes(x = Time, y = Demand)) +
geom_line(aes(colour = "Existing (elec_tr)"), data = elec_tr) +
geom_line(aes(colour = "New (elec_stream)"), data = elec_stream) +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
guides(colour = guide_legend(NULL, direction = "horizontal")) +
theme(legend.position="bottom")
```
A (minimal) model for the electricity demand above can be estimated using [fasster](https://github.com/tidyverts/fasster).
```{r stream-fit, cache=TRUE}
fit <- elec_tr %>%
model(fasster = fasster(Demand ~ Holiday %S% (poly(1) + trig(10))))
```
```{r stream-fit-plot, echo = FALSE}
ggplot(NULL, aes(x = Time, y = Demand)) +
geom_line(data = elec_tr %>% mutate(facet = "Actual")) +
geom_line(aes(y = .fitted), data = fitted(fit) %>% mutate(facet = "Fitted")) +
facet_grid(facet ~ .) +
theme_minimal()
```
To extend these fitted values to include December's electricity data, we can use the `stream` functionality:
```{r stream}
fit <- fit %>%
stream(elec_stream)
```
```{r stream-plot, echo = FALSE}
ggplot(NULL, aes(x = Time, y = Demand)) +
geom_line(data = elec_tr %>% mutate(facet = "Actual")) +
geom_line(data = elec_stream %>% mutate(facet = "Actual")) +
geom_line(aes(y = .fitted), data = fitted(fit) %>% mutate(facet = "Fitted")) +
facet_grid(facet ~ .) +
theme_minimal()
```
## Simulation
Much like the [tidymodels opinion](https://tidymodels.github.io/model-implementation-principles/model-predictions.html#input-data) toward `predict`, `generate` should not default to an archived version of the training set. This allows models to be used for simulating new data sets, which is especially relevant for time series as often future paths beyond the training set are simulated.
The generate method for a fable model should accept these arguments (names chosen for consistency with `tidymodels`):
* object: The model itself
* new_data: The data used for simulation
* ~~times~~: The number of simulated series (handled by fablelite)
* ~~seed~~: Random generator initialisation (handled by fablelite)
The `new_data` dataset extends existing `stats::simulate` functionality by allowing the simulation to accept a new time index for simulating beyond the sample (`.idx`), and allows the simulation to work with a new set of exogenous regressors (say `x1` and `x2`).
It is expected that the innovations (`.innov`) for the simulation are randomly generated for each repition number (`rep`), which can be achieved using the `times` argument. However, users should also be able to provide a set of pre-generated innovations (`.innov`) for each repition (`.rep`). If these columns are provided in the `new_data`, then this data will be passed directly to the simulation method (without generating new numbers over `times` replications).
```{r sim-newdata, echo=FALSE}
library(tsibble)
tsibble(
.rep = rep(1:3, each = 3),
.idx = rep(yearmonth("2017") + lubridate::month(0:2), 3),
.innov = rnorm(9),
x1 = rnorm(9, 2, 2), x2 = rnorm(9,-2),
index = .idx, key = id(.rep))
```
For the end user, creating simulations would work like this:
```{r sim-example, eval = FALSE, cache = TRUE}
library(fable)
library(tsibbledata)
UKLungDeaths %>%
model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>%
generate(UKLungDeaths, times = 5)
```
```{r sim-example-eval, echo = FALSE, cache = TRUE}
library(fable)
library(tsibbledata)
sim1 <- UKLungDeaths %>%
model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>%
generate(UKLungDeaths, times = 5)
sim1
library(ggplot2)
ggplot(UKLungDeaths, aes(x = index, y = mdeaths)) +
geom_line(colour = "blue") +
geom_line(aes(y = .sim, group = .rep), data = sim1, alpha = 0.2) +
theme_minimal()
```
Or, to generate data beyond the sample:
```{r sim-future, eval = FALSE}
library(lubridate)
UKLungDeaths %>%
filter(year(index) <= 1978) %>%
model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>%
generate(
UKLungDeaths %>% filter(year(index) > 1978),
times = 5
)
```
```{r sim-future-eval, echo=FALSE}
library(lubridate)
sim2 <- UKLungDeaths %>%
filter(year(index) <= 1978) %>%
model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>%
generate(
UKLungDeaths %>% filter(year(index) > 1978),
times = 5
)
sim2
library(ggplot2)
UKLungDeaths %>%
filter(year(index) <= 1978) %>%
ggplot(aes(x = index, y = mdeaths)) +
geom_line(colour = "blue") +
geom_line(aes(y = .sim, group = .rep), data = sim2, alpha = 0.2) +
theme_minimal()
```
## Visualisation
Different plots are appropriate for visualising each type of model. For example, a plot of an ARIMA model may show the AR and/or MA roots from the model on a unit circle. A linear model has several common plots, including plots showing "Residuals vs Fitted" values, normality via a Q-Q plot, and measures of leverage. These model plots are further extended by the [visreg package](http://pbreheny.github.io/visreg/) to show the affects of terms on the model's response. Some models currently have no model-specific plots, such as ETS, which defaults to showing a components plot using the estimated states.
Visualising these models poses a substantial challenge for consistency across models, and is made more difficult as batch modelling becomes commonplace.