03-model-methods.Rmd

# Model methods {#methods}

## Interpolation

Models that can be estimated in the presence of missing values can often be used to interpolate the unknown values. Often interpolated values can be taken from model's fitted values, and some models may support more sophisticated interpolation methods.

The [forecast package](https://github.com/robjhyndman/forecast/) provides the `na.interp` function for interpolating time series data, which uses linear interpolation for non-seasonal data, and STL decomposition for seasonal data. 

Tidy time series tools should allow users to interpolate missing values using any appropriate model. 

For example, the `tsibbledata::olympic_running` dataset contains Olympic men's 400m track final winning times. The winning times for the 1916, 1940 and 1944 Olympics are missing from the dataset due to the World Wars.

```{r mens400, echo = FALSE}
library(tsibbledata)
library(tidyverse)
olympic_running %>%
  ggplot(aes(x=Year, y = Time, colour = Sex)) +
  geom_line() +
  geom_point(size = 1) +
  facet_wrap(~ Length, scales = "free_y", nrow = 2) + 
  theme_minimal() + 
  scale_color_brewer(palette = "Dark2") + 
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Running time (seconds)")
```

We could then interpolate these missing values using the fitted values from a linear model with a trend:

```{r mens400-interpolated, eval = FALSE}
olympic_running %>% 
  model(lm = TSLM(Time ~ trend())) %>% 
  interpolate(olympic_running)
```
```{r mens400-interpolated-display, echo = FALSE}
olympic_complete <- olympic_running %>% 
  model(lm = TSLM(Time ~ trend())) %>% 
  interpolate(olympic_running)
olympic_complete

olympic_running %>%
  ggplot(aes(x=Year, y = Time, colour = Sex)) +
  geom_line(aes(linetype = "Interpolated"), data = olympic_complete) +
  geom_line(aes(linetype = "Actual")) +
  geom_point(size = 1) +
  facet_wrap(~ Length, scales = "free_y", nrow = 2) + 
  theme_minimal() + 
  scale_color_brewer(palette = "Dark2") + 
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Running time (seconds)")
```


## Re-estimation
https://github.com/tidyverts/fable/issues/43

### refit()
The refitting a model allows the same model to be applied to a new dataset. This is similar to the `model` argument available in most modelling functions from the [forecast package](https://github.com/robjhyndman/forecast/).

The refitted model should maintain the same structure and coefficients of the original model, with fitted information updated to reflect the model's behaviour on the new dataset. It should also be possible to allow re-estimation of parameters using the `reestimate` argument, which keeps the selected model terms but updates the model coefficients/parameters.

It is expected that a refit method uses a fitted model and replacement data to return a mable.

For the ETS model for `mdeaths` estimated above:
```{r ets-mdeaths}
library(fable)
ets_fit <- as_tsibble(mdeaths) %>% 
  model(ETS(value))
```

We may be interested in using the same model with the same coefficients to estimate the `fdeaths` series:
```{r ets-refit}
refit(ets_fit, as_tsibble(fdeaths))
```


### stream()

Streaming data into a model allows a model to be extended to accomodate new, future data. Like `refit`, `stream` should allow re-estimation of the model parameters. As this can be a costly operation for some models, in most cases updating the parameters should not occur. However it is recommended that the model parameters are updated on a regular basis.

Suppose we are estimating electricity demand data (`tsibbledata::aus_elec`), and after fitting a model to the existing data, a new set of data from the next month becomes available.

```{r stream-tsplot, echo = FALSE}
library(fasster)
library(lubridate)
elec_tr <- tsibbledata::aus_elec %>% 
  filter(
    State == "Victoria",
    Time < ymd("2014-12-01"), Time >= ymd("2014-09-01")
  )
elec_stream <- tsibbledata::aus_elec %>% 
  filter(State == "Victoria", Time >= ymd("2014-12-01"))
ggplot(NULL, aes(x = Time, y = Demand)) + 
  geom_line(aes(colour = "Existing (elec_tr)"), data = elec_tr) + 
  geom_line(aes(colour = "New (elec_stream)"), data = elec_stream) + 
  theme_minimal() + 
  scale_color_brewer(palette = "Dark2") + 
  guides(colour = guide_legend(NULL, direction = "horizontal")) + 
  theme(legend.position="bottom")
```

A (minimal) model for the electricity demand above can be estimated using [fasster](https://github.com/tidyverts/fasster).

```{r stream-fit, cache=TRUE}
fit <- elec_tr %>% 
  model(fasster = fasster(Demand ~ Holiday %S% (poly(1) + trig(10))))
```
```{r stream-fit-plot, echo = FALSE}
ggplot(NULL, aes(x = Time, y = Demand)) + 
  geom_line(data = elec_tr %>% mutate(facet = "Actual")) +
  geom_line(aes(y = .fitted), data = fitted(fit) %>% mutate(facet = "Fitted")) + 
  facet_grid(facet ~ .) +
  theme_minimal()
```

To extend these fitted values to include December's electricity data, we can use the `stream` functionality:

```{r stream}
fit <- fit %>% 
  stream(elec_stream)
```
```{r stream-plot, echo = FALSE}
ggplot(NULL, aes(x = Time, y = Demand)) + 
  geom_line(data = elec_tr %>% mutate(facet = "Actual")) +
  geom_line(data = elec_stream %>% mutate(facet = "Actual")) +
  geom_line(aes(y = .fitted), data = fitted(fit) %>% mutate(facet = "Fitted")) + 
  facet_grid(facet ~ .) +
  theme_minimal()
```


## Simulation

Much like the [tidymodels opinion](https://tidymodels.github.io/model-implementation-principles/model-predictions.html#input-data) toward `predict`, `generate` should not default to an archived version of the training set. This allows models to be used for simulating new data sets, which is especially relevant for time series as often future paths beyond the training set are simulated.

The generate method for a fable model should accept these arguments (names chosen for consistency with `tidymodels`):

* object: The model itself
* new_data: The data used for simulation
* ~~times~~: The number of simulated series (handled by fablelite)
* ~~seed~~: Random generator initialisation (handled by fablelite)

The `new_data` dataset extends existing `stats::simulate` functionality by allowing the simulation to accept a new time index for simulating beyond the sample (`.idx`), and allows the simulation to work with a new set of exogenous regressors (say `x1` and `x2`). 

It is expected that the innovations (`.innov`) for the simulation are randomly generated for each repition number (`rep`), which can be achieved using the `times` argument.  However, users should also be able to provide a set of pre-generated innovations (`.innov`) for each repition (`.rep`). If these columns are provided in the `new_data`, then this data will be passed directly to the simulation method (without generating new numbers over `times` replications).

```{r sim-newdata, echo=FALSE}
library(tsibble)
tsibble(
  .rep = rep(1:3, each = 3),
  .idx = rep(yearmonth("2017") + lubridate::month(0:2), 3),
  .innov = rnorm(9),
  x1 = rnorm(9, 2, 2), x2 = rnorm(9,-2), 
  index = .idx, key = id(.rep))
```

For the end user, creating simulations would work like this:

```{r sim-example, eval = FALSE, cache = TRUE}
library(fable)
library(tsibbledata)
UKLungDeaths %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(UKLungDeaths, times = 5)
```

```{r sim-example-eval, echo = FALSE, cache = TRUE}
library(fable)
library(tsibbledata)
sim1 <- UKLungDeaths %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(UKLungDeaths, times = 5)
sim1
library(ggplot2)
ggplot(UKLungDeaths, aes(x = index, y = mdeaths)) + 
  geom_line(colour = "blue") + 
  geom_line(aes(y = .sim, group = .rep), data = sim1, alpha = 0.2) + 
  theme_minimal()
```


Or, to generate data beyond the sample:

```{r sim-future, eval = FALSE}
library(lubridate)
UKLungDeaths %>% 
  filter(year(index) <= 1978) %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(
    UKLungDeaths %>% filter(year(index) > 1978),
    times = 5
  )
```
```{r sim-future-eval, echo=FALSE}
library(lubridate)
sim2 <- UKLungDeaths %>% 
  filter(year(index) <= 1978) %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(
    UKLungDeaths %>% filter(year(index) > 1978),
    times = 5
  )
sim2
library(ggplot2)
UKLungDeaths %>% 
  filter(year(index) <= 1978) %>% 
  ggplot(aes(x = index, y = mdeaths)) + 
  geom_line(colour = "blue") + 
  geom_line(aes(y = .sim, group = .rep), data = sim2, alpha = 0.2) + 
  theme_minimal()
```


## Visualisation

Different plots are appropriate for visualising each type of model. For example, a plot of an ARIMA model may show the AR and/or MA roots from the model on a unit circle. A linear model has several common plots, including plots showing "Residuals vs Fitted" values, normality via a Q-Q plot, and measures of leverage. These model plots are further extended by the [visreg package](http://pbreheny.github.io/visreg/) to show the affects of terms on the model's response. Some models currently have no model-specific plots, such as ETS, which defaults to showing a components plot using the estimated states.

Visualising these models poses a substantial challenge for consistency across models, and is made more difficult as batch modelling becomes commonplace.