Forecasting with missing value during inference but available when training #1133

runyournode · 2024-08-30T14:03:17Z


Hello there,

First, let me thank you for the available open-source code and the extensive documentation 🥇.

I am just getting started with time-series forecasting, and I think I am not starting with the easiest task 😄.
So let me lay out the problem and how I plan to handle it, along with my early concerns.

Any insight is welcome before I deep-dive into implementing it.
I am not asking you to do my job, but if you think my strategy is very poor and likely to fail, I'd better revise it 😄. Also, if this is not the place for such a discussion, please excuse me and delete my post.

Thank you for your guidance!

Context

System A records some IT network metrics every second: $(ds, feat_0, feat_1, \ldots, feat_n, y)$. Features may be static or dynamic; $y$ is just another feature, but it is the one I ultimately want to forecast.

System A df would look like:

System `A` data:

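| $ds$ | $feat0$ | $feat1$ | ... | $featn$ | $y$ |
| --- | --- | --- | --- | --- | --- |
| 0 | $feat0_0$ | $feat1_0$ | ... | $featn_0$ | $y_0$ |
| 1 | $feat0_1$ | $feat1_1$ | ... | $featn_1$ | $y_1$ |
| 2 | $feat0_2$ | $feat1_2$ | ... | $featn_2$ | $y_2$ |
| ... | ... | ... | ... | ... | ... |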

A irregularly streams these data to system B.

System B df would for instance look like:

System `B` data:

| $ds$ | $feat0$ | $feat1$ | ... | $featn$ | $y$ |
| --- | --- | --- | --- | --- | --- |
| 1 | $feat0_1$ | $feat1_1$ | ... | $featn_1$ | $y_1$ |
| 2 | $feat0_2$ | $feat1_2$ | ... | $featn_2$ | $y_2$ |
| 5 | $feat0_5$ | $feat1_5$ | ... | $featn_5$ | $y_5$ |
| 7 | $feat0_7$ | $feat1_7$ | ... | $featn_7$ | $y_7$ |
| 8 | $feat0_8$ | $feat1_8$ | ... | $featn_8$ | $y_8$ |
| 12 | $feat0_{12}$ | $feat1_{12}$ | ... | $featn_{12}$ | $y_{12}$ |
| ... | ... | ... | ... | ... | ... |

I do not know the policy that determines whether A streams a given row to B, but I suspect that the streaming state is correlated with the value of $y$.

During the training phase, I can request both dataframes, so the merged data would be:

Available data for training

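| $ds$ | $feat0$ | $feat1$ | ... | $featn$ | $y$ | isStreamed |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | $feat0_0$ | $feat1_0$ | ... | $featn_0$ | $y_0$ | False |
| 1 | $feat0_1$ | $feat1_1$ | ... | $featn_1$ | $y_1$ | True |
| 2 | $feat0_2$ | $feat1_2$ | ... | $featn_2$ | $y_2$ | True |
| ... | ... | ... | ... | ... | ... | ... |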
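
A minimal sketch of how such a training frame could be built with pandas, assuming both systems share the same `ds` timestamps. The toy values and the names `df_a`, `df_b`, and `train_df` are made up for illustration; only `ds`, `y`, and `isStreamed` come from the tables above.

```python
import pandas as pd

# Hypothetical toy data: A holds every second, B only the seconds that were streamed.
df_a = pd.DataFrame({
    "ds": [0, 1, 2, 3, 4, 5],
    "feat0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
df_b = df_a[df_a["ds"].isin([1, 2, 5])].reset_index(drop=True)

# A row of A is flagged as streamed when its timestamp also appears in B.
train_df = df_a.copy()
train_df["isStreamed"] = train_df["ds"].isin(df_b["ds"])

print(train_df)
```

Flagging on `ds` alone assumes that a streamed row carries the same values in A and in B.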

I have no other time series, but I will have several samples of this time series, recorded in different situations and at different times.

Aim

My aim is to forecast (in real time) the next values of $y$ from the data available at B.

Real-time forecasting from:

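| $ds$ | $feat0$ | $feat1$ | ... | $featn$ | $y$ | isStreamed |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | - | - | ... | - | - | False |
| 1 | $feat0_1$ | $feat1_1$ | ... | $featn_1$ | $y_1$ | True |
| 2 | $feat0_2$ | $feat1_2$ | ... | $featn_2$ | $y_2$ | True |
| 3 | - | - | ... | - | - | False |
| ... | ... | ... | ... | ... | ... | ... |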

My Strategy

By definition, I cannot use future exogenous features, but I have access to some past exogenous features and some past $y$ values. I expect no trend (and maybe no seasonality) in my data. I will first try neuralforecast univariate models that can handle historical exogenous features, but I will also try multivariate models.

What do you think would be the smartest way to leverage the feature / $y$ data that is available during training but unavailable when forecasting?

I can think of:

When training:

Take the df from A and create a masked copy of it by replacing the un-streamed feature values with interpolated values, or with the last streamed value when extrapolating. I now have 2 time series (A and masked A) that I can train my model on.
I can perform some data augmentation by repeating the process on different time windows sampled from the original df. Interpolation and extrapolation would then not occur at the same time steps.
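
The masking and windowing steps above could be sketched as follows. This is only an illustration under assumptions: the helpers `make_masked_copy` and `sample_window` are hypothetical names, the data is a toy example, and linear interpolation is one possible choice among many.

```python
import numpy as np
import pandas as pd

def make_masked_copy(df, value_cols, streamed_col="isStreamed"):
    """Hide un-streamed values, then refill them from the streamed ones."""
    masked = df.copy()
    hidden = ~masked[streamed_col].to_numpy(dtype=bool)
    masked.loc[hidden, value_cols] = np.nan
    # Interpolate gaps between two streamed rows, then carry the last
    # streamed value forward into trailing gaps (the extrapolation case).
    masked[value_cols] = (
        masked[value_cols].interpolate(method="linear", limit_area="inside").ffill()
    )
    return masked

def sample_window(df, length, rng):
    """Draw a random contiguous window (hypothetical augmentation helper)."""
    start = int(rng.integers(0, len(df) - length + 1))
    return df.iloc[start:start + length].reset_index(drop=True)

# Toy version of the merged training frame (values are made up).
df_a = pd.DataFrame({
    "ds": range(7),
    "feat0": [0.0, 1.0, 2.0, 9.0, 4.0, 5.0, 8.0],
    "y": [10.0, 11.0, 12.0, 20.0, 14.0, 15.0, 30.0],
    "isStreamed": [False, True, True, False, False, True, False],
})

masked_a = make_masked_copy(df_a, ["feat0"])   # y is left untouched
rng = np.random.default_rng(0)
window = sample_window(df_a, length=4, rng=rng)
masked_window = make_masked_copy(window, ["feat0"])
print(masked_a)
```

Rows before the first streamed row stay `NaN` in this sketch, since there is nothing earlier to extrapolate from; they would need to be dropped or filled some other way before training.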

When forecasting:

I can fill the missing feature values of B with the same inter/extra-polation strategy. Missing $y$ could also be filled by past inferences from the same model.
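
At forecasting time, the same filling idea could look like the sketch below, assuming B's rows are first put back on a regular 1-second grid. The function `regularize_stream` and its `now` parameter are hypothetical; filling missing $y$ with the model's own past predictions is not covered here.

```python
import pandas as pd

def regularize_stream(df_b, now, value_cols):
    """Put B's irregular rows on a regular 1-second grid ending at `now`."""
    grid = pd.RangeIndex(int(df_b["ds"].min()), now + 1)
    out = df_b.set_index("ds").reindex(grid)
    # Remember which seconds were really streamed before filling anything.
    out["isStreamed"] = out.index.isin(df_b["ds"])
    # Past gaps between streamed rows are interpolated; the gap after the
    # latest streamed row can only be filled with the last known value.
    out[value_cols] = out[value_cols].interpolate(limit_area="inside").ffill()
    return out.rename_axis("ds").reset_index()

# Toy stream: only seconds 1, 2 and 5 reached B, and the current time is 7.
df_b = pd.DataFrame({"ds": [1, 2, 5], "y": [11.0, 12.0, 15.0]})
filled = regularize_stream(df_b, now=7, value_cols=["y"])
print(filled)
```

Keeping `isStreamed` as a column lets the same flag be passed to the model during both training and forecasting, if that turns out to be useful.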

What concerns me:

I did not tell the model that there is an ideal case with 100% accurate data (A) and a degraded case (masked A) that is closer to the data seen when forecasting.
$y$ values in masked A are 100% accurate, but $y$ values in B would come from interpolation or forecasting. Do you think I can easily code and train my model by filling the un-streamed $y$ values in masked A with the forecasts of the model itself?
How would most models react to missing data / irregular sampling?
