diff --git a/docs/ml_training.md b/docs/ml_training.md
index cd909b2ec..91afd30f5 100644
--- a/docs/ml_training.md
+++ b/docs/ml_training.md
@@ -16,6 +16,8 @@ Understand how machine learning models can be trained from within Flyte, with an
   - Word embedding and topic modelling on lee background corpus with Gensim
 * - {doc}`Forecast Sales Using Rossmann Store Sales `
   - Forecast sales data with data-parallel distributed training using Horovod on Spark.
+* - {doc}`Time Series Modeling `
+  - Train models to forecast time series data.
 ```
 
 ```{toctree}
@@ -28,4 +30,5 @@ auto_examples/house_price_prediction/index
 auto_examples/mnist_classifier/index
 auto_examples/nlp_processing/index
 auto_examples/forecasting_sales/index
+auto_examples/time_series_modeling/index
 ```
diff --git a/docs/tutorials.md b/docs/tutorials.md
index 559a71554..b2421cdc4 100644
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -38,6 +38,8 @@ Train machine learning models from using your framework of choice.
   - Word embedding and topic modelling on lee background corpus with Gensim
 * - {doc}`Sales Forecasting `
   - Use the Rossmann Store data to forecast sales with distributed training using Horovod on Spark.
+* - {doc}`Time Series Modeling `
+  - Train models to forecast time series data.
 ```
 
 ## 🛠 Feature Engineering
diff --git a/examples/time_series_modeling/Dockerfile b/examples/time_series_modeling/Dockerfile
new file mode 100644
index 000000000..65449f624
--- /dev/null
+++ b/examples/time_series_modeling/Dockerfile
@@ -0,0 +1,28 @@
+FROM python:3.8-slim-buster
+LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks
+
+WORKDIR /root
+ENV VENV /opt/venv
+ENV LANG C.UTF-8
+ENV LC_ALL C.UTF-8
+ENV PYTHONPATH /root
+
+# Install system-level dependencies needed to build and run the Python packages
+RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg build-essential curl
+
+# Virtual environment
+RUN python3 -m venv ${VENV}
+ENV PATH="${VENV}/bin:$PATH"
+
+# Install Python dependencies
+COPY requirements.in /root
+RUN pip install -r /root/requirements.in
+RUN pip freeze
+
+# Copy the actual code
+COPY . /root
+
+# This tag is supplied by the build script and will be used to determine the version
+# when registering tasks, workflows, and launch plans
+ARG tag
+ENV FLYTE_INTERNAL_IMAGE $tag
diff --git a/examples/time_series_modeling/README.md b/examples/time_series_modeling/README.md
new file mode 100644
index 000000000..9dd6ecc25
--- /dev/null
+++ b/examples/time_series_modeling/README.md
@@ -0,0 +1,45 @@
+(time_series_modeling)=
+
+# Time Series Modeling
+
+```{eval-rst}
+.. tags:: Advanced, MachineLearning
+```
+
+Time series data is fundamentally different from Independent and Identically
+Distributed (IID) data, which is commonly used in many machine learning tasks.
+Here are a few key differences:
+
+1. **Temporal Dependency**: In time series data, observations are ordered
+   chronologically and exhibit temporal dependencies. Each data point is related
+   to its past and future values. This sequential nature is crucial for
+   forecasting and trend analysis. In contrast, IID data assumes that each
+   observation is independent of others.
+2. **Non-stationarity**: Time series often display trends, seasonality, or cyclic
+   patterns that evolve over time. This non-stationarity means that statistical
+   properties like mean and variance can change, making analysis more complex. IID
+   data, by definition, maintains constant statistical properties.
+3. **Autocorrelation**: Time series data frequently shows autocorrelation, where
+   an observation is correlated with its own past values. This structure is
+   central to many time series models, whereas IID data by definition has none.
+4. **Importance of Order**: The sequence of observations in time series data is
+   critical and cannot be shuffled without losing information. In IID data, the
+   order of observations is assumed to be irrelevant.
+5. **Focus on Forecasting**: Time series analysis often aims to predict
+   future values based on historical patterns, whereas many machine learning
+   tasks with IID data focus on classification or regression without a
+   temporal component.
+6. **Specific Modeling Techniques**: Time series data requires specialized
+   modeling techniques, such as ARIMA, Prophet, or RNNs, that can capture
+   temporal dynamics. These models are not typically used with IID data.
+
+Understanding these differences is crucial for selecting appropriate analysis
+methods and interpreting results in time series modeling tasks.
+
+Below are examples demonstrating how to use Flyte to train time series models.
+
+## Examples
+
+```{auto-examples-toc}
+neural_prophet
+```
diff --git a/examples/time_series_modeling/requirements.in b/examples/time_series_modeling/requirements.in
new file mode 100644
index 000000000..4b5ecf623
--- /dev/null
+++ b/examples/time_series_modeling/requirements.in
@@ -0,0 +1,4 @@
+flytekit>=1.7.0
+wheel
+matplotlib
+flytekitplugins-deck-standard
diff --git a/examples/time_series_modeling/time_series_modeling/__init__.py b/examples/time_series_modeling/time_series_modeling/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/time_series_modeling/time_series_modeling/neural_prophet.py b/examples/time_series_modeling/time_series_modeling/neural_prophet.py
new file mode 100644
index 000000000..7f1426bbd
--- /dev/null
+++ b/examples/time_series_modeling/time_series_modeling/neural_prophet.py
@@ -0,0 +1,150 @@
+# %% [markdown]
+# # Train a NeuralProphet Model
+#
+# This script demonstrates how to train a model for time series forecasting
+# using the [NeuralProphet](https://neuralprophet.com/) library.
+
+# %% [markdown]
+# ## Imports and Setup
+#
+# First, we import the libraries needed to run the training workflow.
+
+import pandas as pd
+from flytekit import Deck, ImageSpec, current_context, task, workflow
+from flytekit.types.file import FlyteFile
+
+# %% [markdown]
+# ## Define an ImageSpec
+#
+# For reproducibility, we create an `ImageSpec` object with the packages
+# required by our tasks.
+
+image = ImageSpec(
+    name="neuralprophet",
+    packages=[
+        "neuralprophet",
+        "matplotlib",
+        "ipython",
+        "pandas",
+        "pyarrow",
+    ],
+    # This registry is for a local Flyte demo cluster. Replace this with your
+    # own registry, e.g. `docker.io//`
+    registry="localhost:30000",
+)
+
+# %% [markdown]
+# ## Data Loading Task
+#
+# This task loads the time series data from the specified URL. In this case,
+# we use a hard-coded URL for a sample dataset published by the NeuralProphet project.
+
+URL = "https://github.com/ourownstory/neuralprophet-data/raw/main/kaggle-energy/datasets/tutorial01.csv"
+
+
+@task(container_image=image)
+def load_data() -> pd.DataFrame:
+    return pd.read_csv(URL)
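+
+
+# %% [markdown]
+# ## Optional: Validate the Data
+#
+# NeuralProphet, like Prophet, expects the input dataframe to contain a `ds`
+# (datestamp) column and a `y` (value) column, which this sample dataset
+# already provides. As a minimal sketch (this `validate_data` task is an
+# illustration added here, not wired into the workflow below), the schema
+# can be checked up front:
+
+
+@task(container_image=image)
+def validate_data(df: pd.DataFrame) -> pd.DataFrame:
+    # Fail fast if the columns NeuralProphet expects are missing
+    missing = {"ds", "y"} - set(df.columns)
+    if missing:
+        raise ValueError(f"dataframe is missing required columns: {missing}")
+    return df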
+
+
+# %% [markdown]
+# ## Model Training Task
+#
+# This task trains the NeuralProphet model on the loaded data. We fit the
+# model at an hourly frequency (`freq="H"`) for ten epochs.
+
+
+@task(container_image=image)
+def train_model(df: pd.DataFrame) -> FlyteFile:
+    from neuralprophet import NeuralProphet, save
+
+    working_dir = current_context().working_directory
+    model = NeuralProphet()
+    model.fit(df, freq="H", epochs=10)
+
+    # Serialize the trained model into the task's working directory
+    model_fp = f"{working_dir}/model.np"
+    save(model, model_fp)
+    return FlyteFile(model_fp)
+
+
+# %% [markdown]
+# ## Forecasting Task
+#
+# This task loads the trained model, makes predictions, and visualizes the
+# results using a Flyte Deck.
+
+
+@task(
+    container_image=image,
+    enable_deck=True,
+)
+def make_forecast(df: pd.DataFrame, model_file: FlyteFile) -> pd.DataFrame:
+    from neuralprophet import load
+
+    model_file.download()
+    model = load(model_file.path)
+
+    # Create a new dataframe extending 365 periods into the future for the
+    # forecast; n_historic_predictions=True also includes the historic data
+    df_future = model.make_future_dataframe(
+        df,
+        n_historic_predictions=True,
+        periods=365,
+    )
+
+    # Predict the future
+    forecast = model.predict(df_future)
+
+    # Render the forecast plot in a Flyte Deck
+    fig = model.plot(forecast)
+    Deck("Forecast", fig.to_html())
+
+    return forecast
+
+
+# %% [markdown]
+# ## Main Workflow
+#
+# Finally, this workflow orchestrates the entire process: loading the data,
+# training the model, and making a forecast.
+
+
+@workflow
+def main() -> pd.DataFrame:
+    df = load_data()
+    model_file = train_model(df)
+    forecast = make_forecast(df, model_file)
+    return forecast
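+
+
+# %% [markdown]
+# ## Running the Example
+#
+# As a convenience for local testing (this `__main__` hook is an addition
+# for illustration, not part of the original example), running this module
+# directly executes the workflow with flytekit's local runtime:
+
+
+if __name__ == "__main__":
+    # Runs the workflow locally; on a cluster you would register and launch it
+    print(main())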