[QUESTION] Trouble creating the right dataset for historical backtest with updating covariates #2575

DataScientistET opened this issue Oct 28, 2024 · 0 comments

I have data at 1-hour granularity. I would like to create a model that forecasts the next 7 days of values, producing a new forecast every 24 hours. My future covariates are updated once a day.

Production example:
Current time: 2024-01-01 23:00:00
Prediction horizon (7 days): 2024-01-02 00:00:00 to 2024-01-08 23:00:00
Target values: the latest target I have is at 2024-01-01 23:00:00
Future covariates: available until 2024-01-08 23:00:00
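
For context, the forecast window arithmetic from the example above (a small illustration only; the variable names are not from my code):

import pandas as pd

# Forecasts are issued once a day, each covering the next 168 hourly steps (7 days).
current_time = pd.Timestamp("2024-01-01 23:00:00")
horizon = pd.date_range(
    start=current_time + pd.Timedelta(hours=1),  # 2024-01-02 00:00:00
    periods=7 * 24,                              # 168 hourly steps
    freq="h",
)
print(horizon[0], horizon[-1])  # 2024-01-02 00:00:00  2024-01-08 23:00:00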

I would like to predict each day's values using only the target lags from the previous day and the future covariates from the same day. Hence, I define my model as follows:

from darts.models import LightGBMModel

lgbm_model = LightGBMModel(
    lags=list(range(-24, 0)),                   # the last 24 hourly target values
    lags_future_covariates=list(range(0, 24)),  # the 24 future covariate values of the forecast day
    output_chunk_length=24,                     # predict one full day per chunk
    n_jobs=-1,
    random_state=42,
    multi_models=True,                          # one sub-model per hour of the output chunk
    verbose=0,
)
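
For reproducibility, the snippets below assume hourly target and future-covariate TimeSeries roughly along these lines (target_series, future_cov_series, start_date, split_date and end_date are not defined in my excerpt, so this setup is purely illustrative; best_guess_df and future_cov come from my own data and are not reconstructed here):

import numpy as np
import pandas as pd
from darts import TimeSeries

# Hypothetical stand-ins for the series used below.
start_date = pd.Timestamp("2023-01-01 00:00:00")
end_date = pd.Timestamp("2024-03-01 23:00:00")
split_date = pd.Timestamp("2024-01-01 00:00:00")

times = pd.date_range(start_date, end_date, freq="h")
rng = np.random.default_rng(42)

# Hourly target and a single future covariate (e.g. a day-ahead forecast feature).
target_series = TimeSeries.from_times_and_values(times, rng.normal(size=len(times)))
future_cov_series = TimeSeries.from_times_and_values(times, rng.normal(size=len(times)))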

When trying to create the historical backtest, this is the implementation I came up with:

from datetime import timedelta
from dateutil.relativedelta import relativedelta

# Walk forward one day at a time, retraining and forecasting the next 7 days (168 hours).
for date in pd.date_range(split_date, end_date - timedelta(days=7)):
    print(f"As of {date}, predicting from {date + relativedelta(hours=1)} to {date + relativedelta(days=7)}")

    # Train on everything observed up to (and including) the current date.
    target_series_train = target_series[start_date:date]
    future_cov_train = future_cov_series[start_date:date]

    lgbm_model.fit(
        series=target_series_train,
        future_covariates=future_cov_train,
    )

    # best_guess_df holds the covariate forecasts issued each day (definition not shown above).
    # Predict from the last 24 observed target values plus the covariates issued for this date.
    data_df_test_sample = best_guess_df[best_guess_df.forecast_date == date.date()]
    target_series_test = target_series_train[-24:]
    future_cov_series_test = TimeSeries.from_dataframe(data_df_test_sample[future_cov])
    forecast_results = lgbm_model.predict(
        n=168,
        series=target_series_test,
        future_covariates=future_cov_series_test,
    ).pd_dataframe()
    break  # only the first iteration, for inspection
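
For reference, Darts' built-in historical_forecasts could perform a similar walk-forward backtest (a sketch only; it retrains every 24 hours but uses a single consolidated future covariate series rather than the daily re-issued best_guess_df, so it does not fully capture my setup):

# Sketch of Darts' built-in walk-forward backtest.
hist_forecasts = lgbm_model.historical_forecasts(
    series=target_series,
    future_covariates=future_cov_series,
    start=split_date,
    forecast_horizon=168,    # 7 days of hourly steps
    stride=24,               # issue a new forecast every 24 hours
    retrain=True,            # refit on all data available at each step
    last_points_only=False,  # keep the full 168-step forecast of each iteration
    verbose=True,
)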

I then tried printing out the training set using:

from darts.utils.data.tabularization import create_lagged_training_data

# target_lags / future_cov_lags mirror the model definition above:
# target_lags = list(range(-24, 0)), future_cov_lags = list(range(0, 24))
lagged_training_data = create_lagged_training_data(
    target_series=target_series_train,
    past_covariates=None,
    future_covariates=future_cov_train,
    output_chunk_shift=0,
    lags=target_lags,
    lags_past_covariates=None,
    lags_future_covariates=future_cov_lags,
    output_chunk_length=24,
    multi_models=True,
    uses_static_covariates=False,
)
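
To inspect what was generated, the returned tuple can be unpacked roughly like this (a sketch; the exact number of returned elements varies across Darts versions, hence the starred unpacking):

# The first returned elements are the feature matrix X, the labels y, and the
# time indexes of the generated samples; the starred slot absorbs any extra
# return values.
X, y, times, *_ = lagged_training_data

print(X.shape)       # one row per training sample, one column per lagged feature
print(y.shape)       # with multi_models=True, one label column per output step
print(times[0][:5])  # timestamps of the first few generated samples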

From the training set, it appears that because the target lags are defined as [-24, ..., -1], the lagged target values used from the second hour of the prediction day onwards do not exist in production. For example, when predicting the target at 5 am, target_lag-1 is the value at 4 am, which is not available in production; the latest target value I have is from 23:00 on the previous day. How would I define my model so that it always uses the 24 target values from the day before to predict all hours of the prediction day?

Example dataset: [image in original issue]

What I want the training set to be (assuming output_chunk_length = 2): [image in original issue]
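
Since the screenshot is not reproduced here, this is a small illustration of the training set I have in mind, with output_chunk_length = 2 and only 3 target lags for readability (column names and values are illustrative only): all lags are taken relative to the start of each 2-hour block, and each block contributes exactly one training row.

import numpy as np
import pandas as pd

# Toy hourly series to illustrate the desired layout.
idx = pd.date_range("2024-01-01 00:00:00", periods=12, freq="h")
target = pd.Series(np.arange(len(idx), dtype=float), index=idx)

block = 2   # output_chunk_length in this toy example
n_lags = 3  # stands in for the 24 target lags

rows = []
for start in range(n_lags, len(idx) - block + 1, block):  # one sample per block start
    rows.append(
        {
            "forecast_start": idx[start],
            "target_lag-3": target.iloc[start - 3],
            "target_lag-2": target.iloc[start - 2],
            "target_lag-1": target.iloc[start - 1],
            "y_step_0": target.iloc[start],
            "y_step_1": target.iloc[start + 1],
        }
    )

print(pd.DataFrame(rows))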
