[QUESTION] Trouble creating the right dataset for historical backtest with updating covariates #2575

DataScientistET opened this issue Oct 28, 2024 · 0 comments

I have data at 1-hour granularity. I would like to create a model that forecasts the next 7 days of values, producing a new forecast every 24 hours. My future covariates are updated once a day.

Production example:
Current time: 2024-01-01 23:00:00
Prediction horizon (7 days): 2024-01-02 00:00:00 to 2024-01-08 23:00:00
Target values: the latest target I have is at 2024-01-01 23:00:00
Future covariates: available until 2024-01-08 23:00:00
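
For context, the forecast window arithmetic from the example above (a small illustration only; the variable names are not from my code):

import pandas as pd

# Forecasts are issued once a day, each covering the next 168 hourly steps (7 days).
current_time = pd.Timestamp("2024-01-01 23:00:00")
horizon = pd.date_range(
    start=current_time + pd.Timedelta(hours=1),  # 2024-01-02 00:00:00
    periods=7 * 24,                              # 168 hourly steps
    freq="h",
)
print(horizon[0], horizon[-1])  # 2024-01-02 00:00:00  2024-01-08 23:00:00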

I would like to predict each day's values using only the target lags from the previous day and the future covariates from the same day. Hence, I define my model as follows:

from darts.models import LightGBMModel

lgbm_model = LightGBMModel(
    lags=list(range(-24, 0)),                   # the last 24 hourly target values
    lags_future_covariates=list(range(0, 24)),  # the 24 future covariate values of the forecast day
    output_chunk_length=24,                     # predict one full day per chunk
    n_jobs=-1,
    random_state=42,
    multi_models=True,                          # one sub-model per hour of the output chunk
    verbose=0,
)
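
For reproducibility, the snippets below assume hourly target and future-covariate TimeSeries roughly along these lines (target_series, future_cov_series, start_date, split_date and end_date are not defined in my excerpt, so this setup is purely illustrative; best_guess_df and future_cov come from my own data and are not reconstructed here):

import numpy as np
import pandas as pd
from darts import TimeSeries

# Hypothetical stand-ins for the series used below.
start_date = pd.Timestamp("2023-01-01 00:00:00")
end_date = pd.Timestamp("2024-03-01 23:00:00")
split_date = pd.Timestamp("2024-01-01 00:00:00")

times = pd.date_range(start_date, end_date, freq="h")
rng = np.random.default_rng(42)

# Hourly target and a single future covariate (e.g. a day-ahead forecast feature).
target_series = TimeSeries.from_times_and_values(times, rng.normal(size=len(times)))
future_cov_series = TimeSeries.from_times_and_values(times, rng.normal(size=len(times)))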

When trying to create the historical backtest, this is the implementation I came up with:

from datetime import timedelta
from dateutil.relativedelta import relativedelta

# Walk forward one day at a time, retraining and forecasting the next 7 days (168 hours).
for date in pd.date_range(split_date, end_date - timedelta(days=7)):
    print(f"As of {date}, predicting from {date + relativedelta(hours=1)} to {date + relativedelta(days=7)}")

    # Train on everything observed up to (and including) the current date.
    target_series_train = target_series[start_date:date]
    future_cov_train = future_cov_series[start_date:date]

    lgbm_model.fit(
        series=target_series_train,
        future_covariates=future_cov_train,
    )

    # best_guess_df holds the covariate forecasts issued each day (definition not shown above).
    # Predict from the last 24 observed target values plus the covariates issued for this date.
    data_df_test_sample = best_guess_df[best_guess_df.forecast_date == date.date()]
    target_series_test = target_series_train[-24:]
    future_cov_series_test = TimeSeries.from_dataframe(data_df_test_sample[future_cov])
    forecast_results = lgbm_model.predict(
        n=168,
        series=target_series_test,
        future_covariates=future_cov_series_test,
    ).pd_dataframe()
    break  # only the first iteration, for inspection
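
For reference, Darts' built-in historical_forecasts could perform a similar walk-forward backtest (a sketch only; it retrains every 24 hours but uses a single consolidated future covariate series rather than the daily re-issued best_guess_df, so it does not fully capture my setup):

# Sketch of Darts' built-in walk-forward backtest.
hist_forecasts = lgbm_model.historical_forecasts(
    series=target_series,
    future_covariates=future_cov_series,
    start=split_date,
    forecast_horizon=168,    # 7 days of hourly steps
    stride=24,               # issue a new forecast every 24 hours
    retrain=True,            # refit on all data available at each step
    last_points_only=False,  # keep the full 168-step forecast of each iteration
    verbose=True,
)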

I then tried printing out the training set using:

from darts.utils.data.tabularization import create_lagged_training_data

# target_lags / future_cov_lags mirror the model definition above:
# target_lags = list(range(-24, 0)), future_cov_lags = list(range(0, 24))
lagged_training_data = create_lagged_training_data(
    target_series=target_series_train,
    past_covariates=None,
    future_covariates=future_cov_train,
    output_chunk_shift=0,
    lags=target_lags,
    lags_past_covariates=None,
    lags_future_covariates=future_cov_lags,
    output_chunk_length=24,
    multi_models=True,
    uses_static_covariates=False,
)
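
To inspect what was generated, the returned tuple can be unpacked roughly like this (a sketch; the exact number of returned elements varies across Darts versions, hence the starred unpacking):

# The first returned elements are the feature matrix X, the labels y, and the
# time indexes of the generated samples; the starred slot absorbs any extra
# return values.
X, y, times, *_ = lagged_training_data

print(X.shape)       # one row per training sample, one column per lagged feature
print(y.shape)       # with multi_models=True, one label column per output step
print(times[0][:5])  # timestamps of the first few generated samples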

From the training set, it appears that because the target lags are defined as [-24, ..., -1], the lagged target values used from the second hour of the prediction day onwards do not exist in production. For example, when predicting the target at 5 am, target_lag-1 is the value at 4 am, which is not available in production; the latest target value I have is from 23:00 on the previous day. How would I define my model so that it always uses the 24 target values from the day before to predict all hours of the prediction day?

Example dataset: [image in original issue]

What I want the training set to be (assuming output_chunk_length = 2): [image in original issue]
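
Since the screenshot is not reproduced here, this is a small illustration of the training set I have in mind, with output_chunk_length = 2 and only 3 target lags for readability (column names and values are illustrative only): all lags are taken relative to the start of each 2-hour block, and each block contributes exactly one training row.

import numpy as np
import pandas as pd

# Toy hourly series to illustrate the desired layout.
idx = pd.date_range("2024-01-01 00:00:00", periods=12, freq="h")
target = pd.Series(np.arange(len(idx), dtype=float), index=idx)

block = 2   # output_chunk_length in this toy example
n_lags = 3  # stands in for the 24 target lags

rows = []
for start in range(n_lags, len(idx) - block + 1, block):  # one sample per block start
    rows.append(
        {
            "forecast_start": idx[start],
            "target_lag-3": target.iloc[start - 3],
            "target_lag-2": target.iloc[start - 2],
            "target_lag-1": target.iloc[start - 1],
            "y_step_0": target.iloc[start],
            "y_step_1": target.iloc[start + 1],
        }
    )

print(pd.DataFrame(rows))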
