diff --git a/book/chapters/Predictions.ipynb b/book/chapters/Predictions.ipynb index 5467ae9..42e6c5f 100644 --- a/book/chapters/Predictions.ipynb +++ b/book/chapters/Predictions.ipynb @@ -18,21 +18,32 @@ "Finally, we are at the final Chapter where we see the end-product of the model created. As we venture into this critical phase, the model_predict script emerges, guiding the way toward understanding and anticipating the future of snow water equivalent (SWE) through the ExtraTree model. This chapter delves into the intricacies of this script, unraveling the processes that transforms raw, unprocessed data into precise predictions that illuminate the path forward." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.1 Data Loading and Preprocessing" + ] + }, { "cell_type": "markdown", "metadata": { "id": "sFDziC_Df-lP" }, "source": [ - "**Preparing for Prediction:**\n", - "This begins with loading and pre-processing of data\n", - "\n", - "Loading Data: The script starts by ingesting data from a CSV file, bringing into the fold the vast array of variable" + "### 9.1.1 Loading the Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The prediction process begins with loading of data from a CSV file. This data includes a vast array of variables that are essential for making accurate SWE predictions." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "metadata": { "id": "LkzaZTCtgJ5R" }, @@ -53,12 +64,14 @@ "id": "zUwlJCu6ge2k" }, "source": [ - "Pre-processing: Next, the data undergoes a transformation. Dates are converted, irrelevant columns are discarded, and the data is reshaped to match the model's expectations." + "### 9.1.2 Preprocessing the Data\n", + "Once loaded, the data undergoes a transformation process to ensure it aligns with the model's requirements. This step includes converting dates, renaming columns for consistency, and selecting relevant features.
\n", + "Prepprocessing is crucial to ensure that the data is in the correct format for the model, which directly impacts the accurancy of the prefictions.\n" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 8, "metadata": { "id": "wXQ-4JBdgjdG" }, @@ -93,18 +106,26 @@ " return data\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.2 Model Loading and Prediction" + ] + }, { "cell_type": "markdown", "metadata": { "id": "qZg6If3Lhecf" }, "source": [ - "**Loading Model**: The script retrieves the ExtraTree model and starts the process of making Predictions." + "### 9.2.1 Loading the Model\n", + "The script retrieves the pre-trained ExtraTree model, which is used to generate predictions based on the processed data." ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 7, "metadata": { "id": "d7EsL7aVhltv" }, @@ -129,12 +150,14 @@ "id": "NRBPyBVxhu3r" }, "source": [ - "**predict_swe:** Before prediction can commence, predict_swe undertakes the crucial task of preparing the input data." + "### 9.2.2 Predicting SWE\n", + "\n", + "The `predict_swe` function prepares the input data and generates predictions using the loaded model." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": { "id": "Uv3nAmw-h3EO" }, @@ -161,79 +184,26 @@ "id": "PYeFXD4xiJpi" }, "source": [ - "It fills missing values with a designated placeholder (-999), a common practice to ensure machine learning algorithms, can process the data without encountering errors due to missing values. This step reflects a balance between data integrity and computational requirements, enabling the model to make predictions even in the absence of complete information.\n", + "- It fills missing values with a designated placeholder (-999), a common practice to ensure machine learning algorithms, can process the data without encountering errors due to missing values. This step reflects a balance between data integrity and computational requirements, enabling the model to make predictions even in the absence of complete information.\n", "\n", - "At the core of predict_swe is the model's predict() method invocation. This step is where the machine learning model, trained on historical data, applies its learned patterns to the new, unseen data. The decision to drop geographical identifiers (lat, lon) before prediction underscores a focus on the environmental and temporal factors influencing SWE, aligning the model's inputs with its training regime.\n", + "- At the core of predict_swe is the model's `predict()` method invocation. This step is where the machine learning model, trained on historical data, applies its learned patterns to the new, unseen data. The decision to drop geographical identifiers (lat, lon) before prediction underscores a focus on the environmental and temporal factors influencing SWE, aligning the model's inputs with its training regime.\n", "\n", - "The function concludes by appending the model's predictions back to the original dataset as a new column, predicted_swe. This enrichment transforms the dataset from a static snapshot of past and present conditions into a dynamic forecast of future snow water equivalents. This step is critical for stakeholders relying on accurate SWE predictions." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Zmbl6QxXiVIE" - }, - "source": [ - "**Merge data:** merge_data meticulously combines the predicted SWE values with the original dataset. 
It employs conditional logic to adjust predictions based on specific criteria, such as nullifying predictions in the absence of key environmental data. This approach underscores a commitment to precision, ensuring that the predictions reflect a nuanced understanding of the environmental context." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "id": "aa_il7YriZlE" - }, - "outputs": [], - "source": [ - "def merge_data(original_data, predicted_data):\n", - " \"\"\"\n", - " Merge predicted SWE data with the original data.\n", - " Args: original_data (pd.DataFrame): Original input data.\n", - " predicted_data (pd.DataFrame): Dataframe with predicted SWE values.\n", - " Returns: pd.DataFrame: Merged dataframe.\n", - " \"\"\"\n", - " if \"date\" not in predicted_data:\n", - " predicted_data[\"date\"] = test_start_date\n", - " new_data_extracted = predicted_data[[\"date\", \"lat\", \"lon\", \"predicted_swe\"]]\n", - " print(\"original_data.columns: \", original_data.columns)\n", - " print(\"new_data_extracted.columns: \", new_data_extracted.columns)\n", - " print(\"new prediction statistics: \", new_data_extracted[\"predicted_swe\"].describe())\n", - " merged_df = original_data.merge(new_data_extracted, on=['date', 'lat', 'lon'], how='left')\n", - " merged_df.loc[merged_df['fsca'] == 237, 'predicted_swe'] = 0\n", - " merged_df.loc[merged_df['fsca'] == 239, 'predicted_swe'] = 0\n", - " merged_df.loc[merged_df['cumulative_fsca'] == 0, 'predicted_swe'] = 0\n", - " merged_df.loc[merged_df['air_temperature_tmmx'].isnull(), 'predicted_swe'] = 0\n", - " return merged_df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ER1Ej6OPioDb" - }, - "source": [ - "**This function's Technical execution**\n", - "\n", - "Merging datasets based on date, latitude, and longitude—exemplifies the complex use of data science. It ensures that each predicted SWE value is accurately aligned with its corresponding geographical and temporal marker, preserving the integrity and utility of the predictions. This process not only highlights the technical sophistication of the SnowCast project but also its dedication to delivering reliable and actionable insights." + "- The function concludes by appending the model's predictions back to the original dataset as a new column, `predicted_swe`. This enrichment transforms the dataset from a static snapshot of past and present conditions into a dynamic forecast of future snow water equivalents. This step is critical for stakeholders relying on accurate SWE predictions." ] }, { "cell_type": "markdown", - "metadata": { - "id": "PmUzvNEYi9rl" - }, + "metadata": {}, "source": [ - "**Predict Function**\n", + "### 9.2.3 Predict Function\n", "\n", - "The predict function stands as the conductor, orchestrating the entire predictive process from start to finish. It starts by loading the pre-trained model, which embodies the project's strength of making predictions by preserving and leveraging the accumulated knowledge encapsulated within the model's parameters." + "The predict function is what manages the entire prediction process from start to finish. It starts by loading the pre-trained model, which embodies the project's strength of making predictions by preserving and leveraging the accumulated knowledge encapsulated within the model's parameters." 
] }, { "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "ny9tjabpjJ4q" - }, + "execution_count": 5, + "metadata": {}, "outputs": [], "source": [ "def predict():\n", @@ -276,13 +246,75 @@ " print(f\"Copied to {latest_output_path}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Following model loading, the function navigates the data landscape, loading new data for prediction and preprocessing it to align with the model's requirements. This step is critical, as it transforms raw data into a format that the model can interpret, ensuring the accuracy and relevance of the predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.3 Post-Processing and Merging Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 9.3.1 Merging Predicted Data\n", + "\n" + ] + }, { "cell_type": "markdown", "metadata": { - "id": "d5DBPYh8jhDx" + "id": "Zmbl6QxXiVIE" }, "source": [ - "Following model loading, the function navigates the data landscape, loading new data for prediction and preprocessing it to align with the model's requirements. This step is critical, as it transforms raw data into a format that the model can interpret, ensuring the accuracy and relevance of the predictions." + "`merge_data` meticulously combines the predicted SWE values with the original dataset. It employs conditional logic to adjust predictions based on specific criteria, such as nullifying predictions in the absence of key environmental data. This approach underscores a commitment to precision, ensuring that the predictions reflect a nuanced understanding of the environmental context." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "aa_il7YriZlE" + }, + "outputs": [], + "source": [ + "def merge_data(original_data, predicted_data):\n", + " \"\"\"\n", + " Merge predicted SWE data with the original data.\n", + " Args: original_data (pd.DataFrame): Original input data.\n", + " predicted_data (pd.DataFrame): Dataframe with predicted SWE values.\n", + " Returns: pd.DataFrame: Merged dataframe.\n", + " \"\"\"\n", + " if \"date\" not in predicted_data:\n", + " predicted_data[\"date\"] = test_start_date\n", + " new_data_extracted = predicted_data[[\"date\", \"lat\", \"lon\", \"predicted_swe\"]]\n", + " print(\"original_data.columns: \", original_data.columns)\n", + " print(\"new_data_extracted.columns: \", new_data_extracted.columns)\n", + " print(\"new prediction statistics: \", new_data_extracted[\"predicted_swe\"].describe())\n", + " merged_df = original_data.merge(new_data_extracted, on=['date', 'lat', 'lon'], how='left')\n", + " merged_df.loc[merged_df['fsca'] == 237, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['fsca'] == 239, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['cumulative_fsca'] == 0, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['air_temperature_tmmx'].isnull(), 'predicted_swe'] = 0\n", + " return merged_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ER1Ej6OPioDb" + }, + "source": [ + "### 9.3.2 Technical Execution of the function\n", + "\n", + "Merging datasets based on date, latitude, and longitude—exemplifies the complex use of data science. It ensures that each predicted SWE value is accurately aligned with its corresponding geographical and temporal marker, preserving the integrity and utility of the predictions. 
This process not only highlights the technical sophistication of the SnowCast project but also its dedication to delivering reliable and actionable insights." ] }, { @@ -291,9 +323,9 @@ "id": "slezgjgmjY6-" }, "source": [ - "**Delivering the Prediction**\n", + "## 9.4 Delivering Predictions\n", "\n", - "In its final act, the predict function executes predict_swe, merges the predictions with the original data, and saves the enriched dataset. The choice of a dynamically generated filename for saving predictions demonstrates an understanding of operational requirements, ensuring that each prediction cycle is uniquely identifiable.\n", + "Finally, the predict function executes predict_swe, merges the predictions with the original data, and saves the enriched dataset. The choice of a dynamically generated filename for saving predictions demonstrates an understanding of operational requirements, ensuring that each prediction cycle is uniquely identifiable.\n", "\n", "![](../img/Pred_Delivery.png)" ] }, { @@ -304,7 +336,7 @@ "id": "u61bKY8mj0eE" }, "source": [ - "# Results" + "## 9.5 Results" ] }, { @@ -313,7 +345,7 @@ "id": "oRHveVzRj91l" }, "source": [ - "This is the whole process of how the predictions are converted into Images." + "### 9.5.1 Converting the Predictions into Images" ] }, { @@ -322,21 +354,21 @@ "id": "tHsM6EIRlQdQ" }, "source": [ - "**Convert result to image:**\n", + "These are the functions used in the process of converting predictions into images:\n", "\n", - "**convert csvs to images simple:** This Is the function that takes the raw data and converts them into Geographical images.\n", + "- **convert csvs to images simple:** This is the function that takes the raw data and converts it into geographical images.\n", "\n", - "**Data Loading:** This begins by ingesting the CSV containing SWE predictions, ensuring every data point is primed for visualization.\n", + "- **Data Loading:** This begins by ingesting the CSV containing SWE predictions, ensuring every data point is primed for visualization.\n", "\n", - "**Custom Colormap Creation:** It employs a custom colormap, crafted to represent various ranges of SWE, providing an intuitive visual understanding of snow coverage.\n", + "- **Custom Colormap Creation:** It employs a custom colormap, crafted to represent various ranges of SWE, providing an intuitive visual understanding of snow coverage.\n", "\n", - "**Geospatial Plotting:** This utilizes the geographical coordinates within the data to accurately place each prediction on the map, ensuring a realistic representation of SWE distribution.\n", + "- **Geospatial Plotting:** This utilizes the geographical coordinates within the data to accurately place each prediction on the map, ensuring a realistic representation of SWE distribution.\n", "\n", - "**Merge data:** The merge_data function combines the predicted SWE values with their corresponding geographical markers.\n", + "- **Merge data:** The merge_data function combines the predicted SWE values with their corresponding geographical markers.\n", "\n", - "**Conditional Adjustments**: Conditional adjustment refines the predicted values based on specific criteria, ensuring the visual representation aligns with realistic expectations of SWE.\n", + "- **Conditional Adjustments**: Conditional adjustment refines the predicted values based on specific criteria, ensuring the visual representation aligns with realistic expectations of SWE.\n", "\n", - "**Spatial Accuracy:** This aligns predictions with their exact geographical locations,
ensuring that the visual output is as informative as it is accurate." + "- **Spatial Accuracy:** This aligns predictions with their exact geographical locations, ensuring that the visual output is as informative as it is accurate." ] }, { @@ -352,12 +384,15 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 3, "metadata": { "id": "qMeBEobUoJcW" }, "outputs": [], "source": [ + "import matplotlib.colors as mcolors\n", + "\n", + "\n", "colors = [\n", " (0.8627, 0.8627, 0.8627), # #DCDCDC - 0 - 1\n", " (0.8627, 1.0000, 1.0000), # #DCFFFF - 1 - 2\n", @@ -389,18 +424,18 @@ "id": "-gFem_1DlsPd" }, "source": [ - "**Convert csv to geotiff:** This function mainly helps in converting images to geographically accurate maps.\n", + "- **Convert csv to geotiff:** This function mainly helps in converting images to geographically accurate maps.\n", "\n", - "**Rasterization:** It transforms the CSV data into a raster format, suitable for creating detailed geospatial maps.\n", + "- **Rasterization:** It transforms the CSV data into a raster format, suitable for creating detailed geospatial maps.\n", "\n", - "**Resolution and Coverage:** This carefully defines the resolution and geographical extent of the output map, ensuring that it captures the full scope of the predictions.\n", + "- **Resolution and Coverage:** This carefully defines the resolution and geographical extent of the output map, ensuring that it captures the full scope of the predictions.\n", "\n", - "**Geospatial Alignment:** Geospatial Alignment utilizes rasterio and geopandas libraries to ensure that each pixel in the output map accurately represents the predicted SWE values at specific geographical coordinates." + "- **Geospatial Alignment:** Geospatial Alignment utilizes rasterio and geopandas libraries to ensure that each pixel in the output map accurately represents the predicted SWE values at specific geographical coordinates." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 10, "metadata": { "id": "7h9706LXVba2" }, @@ -508,10 +543,10 @@ "id": "fMErFn7ul35G" }, "source": [ - "**Deploy images to website:**\n", + "### 9.5.2 Deploy images to website\n", "This is the process that helps in Deploying the visual insights\n", "\n", - "**copy files to right folder** --\n", + "**1. copy files to right folder** --\n", "\n", "Function: Bridging Computational Outputs with Public Access At the heart of our deployment strategy lies the copy_files_to_right_folder function.\n", "This function acts as the bridge, transferring the visual and data outputs of SnowCast from the secure confines of its computational environment to a publicly accessible web directory.\n", @@ -539,7 +574,7 @@ "\n", "\n", "\n", - "**create mapserver map config: Crafts interactive Maps**\n", + "**2. create mapserver map config: Crafts interactive Maps**\n", "\n", "The magic of SnowCast is not just in its predictions but in how these predictions are presented. The create_mapserver_map_config function crafts a MapServer configuration for each GeoTIFF prediction file, transforming static data into interactive, exploratory maps.\n", "\n", @@ -569,7 +604,7 @@ "id": "6V64pCx4mT1b" }, "source": [ - "**refresh available date list: Refreshing the Forecast**\n", + "**3. refresh available date list: Refreshing the Forecast**\n", "\n", "The refresh_available_date_list function ensures that the SnowCast portal remains current, reflecting the latest predictions and analyses. 
By dynamically updating the available date list with new predictions, it guarantees that users have access to the most recent insights.\n", "\n", @@ -597,9 +632,9 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python (base)", + "display_name": "Python 3", "language": "python", - "name": "base" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -611,7 +646,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.11.4" } }, "nbformat": 4, diff --git a/book/chapters/model_training.ipynb b/book/chapters/model_training.ipynb index c586c46..2564cf1 100644 --- a/book/chapters/model_training.ipynb +++ b/book/chapters/model_training.ipynb @@ -2,16 +2,414 @@ "cells": [ { "cell_type": "markdown", - "source": [ - "# Model Training\n", - "\n", - "Detailed description of the model training process\n", - "Selection of parameters and training datasets\n" - ], + "id": "d1597c52f583ca0c", "metadata": { "collapsed": false }, - "id": "d1597c52f583ca0c" + "source": [ + "# Model Training\n", + "\n", + "In the field of Snow Water Equivalent (SWE) prediction, training models that accurately represent the complexities of environmental data is a critical task. This chapter delves into the intricacies of model training, focusing on the foundational BaseHole class, its extensions, and the specific machine learning models that utilize this structure.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "eb399ad0", + "metadata": {}, + "source": [ + "## 7.1 Base Hole Class\n", + "\n", + "\n", + "### 7.1.1 Overview\n", + "The BaseHole class is a meticulously crafted blueprint for building SWE predictors. It encapsulates the core processes of data handling, model training, and evaluation, ensuring that common functionalities are standardized and reusable. By designing BaseHole as an extendable class, specific predictor classes can inherit and customize its methods, allowing for flexibility in model creation while maintaining a consistent structure across different models.\n", + "\n", + "**Key Attributes**:\n", + "* all_ready_file: A path to the CSV file containing pre-processed data ready for training.\n", + "* classifier: The machine learning model used for prediction.\n", + "* holename: The name of the wormhole class, which is derived from the class name itself.\n", + "* train_x, train_y: Training input and target data, respectively.\n", + "* test_x, test_y: Testing input and target data, respectively.\n", + "* test_y_results: The predicted results on the test data.\n", + "* save_file: Path to save the trained model.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9c13e761", + "metadata": {}, + "source": [ + "### 7.1.2 Core Functions\n", + "\n", + "**Preprocessing**: The model begins with preprocessing, a critical phase where raw data is transformed into a refined form suitable for training. The BaseHole class adeptly navigates this phase, loading data, cleaning it, and splitting it into training and testing sets. 
This preparatory step ensures that the models are fed data that is both digestible and informative, setting the stage for accurate predictions.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "7d9d9918", + "metadata": {}, + "outputs": [], + "source": [ + "def preprocessing(self):\n", + " '''\n", + " Preprocesses the data for training and testing.\n", + "\n", + " Returns:\n", + " None\n", + " '''\n", + " all_ready_pd = pd.read_csv(self.all_ready_file, header=0, index_col=0)\n", + " print(\"all columns: \", all_ready_pd.columns)\n", + " all_ready_pd = all_ready_pd[all_cols]\n", + " all_ready_pd = all_ready_pd.dropna()\n", + " train, test = train_test_split(all_ready_pd, test_size=0.2)\n", + " self.train_x, self.train_y = train[input_columns].to_numpy().astype('float'), train[['swe_value']].to_numpy().astype('float')\n", + " self.test_x, self.test_y = test[input_columns].to_numpy().astype('float'), test[['swe_value']].to_numpy().astype('float')" + ] + }, + { + "cell_type": "markdown", + "id": "5be695e3", + "metadata": {}, + "source": [ + "**Train**: The train function is responsible for training the machine learning model using the preprocessed data. This function prepares the model to make accurate predictions by learning patterns from the training data." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "27917c05", + "metadata": {}, + "outputs": [], + "source": [ + "def train(self):\n", + " '''\n", + " Trains the machine learning model.\n", + "\n", + " Returns:\n", + " None\n", + " '''\n", + " self.classifier.fit(self.train_x, self.train_y)" + ] + }, + { + "cell_type": "markdown", + "id": "44327d10", + "metadata": {}, + "source": [ + "**Test**: The test function evaluates the model's performance on a separate testing dataset, allowing for the assessment of its predictive accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "56bb3695", + "metadata": {}, + "outputs": [], + "source": [ + "def test(self):\n", + " '''\n", + " Tests the machine learning model on the testing data.\n", + "\n", + " Returns:\n", + " numpy.ndarray: The predicted results on the testing data.\n", + " '''\n", + " self.test_y_results = self.classifier.predict(self.test_x)\n", + " return self.test_y_results" + ] + }, + { + "cell_type": "markdown", + "id": "0e899771", + "metadata": {}, + "source": [ + "**Predict**: The predict function leverages the trained model to make predictions on new, unseen data, providing valuable insights into potential outcomes." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "8552bca5", + "metadata": {}, + "outputs": [], + "source": [ + "def predict(self, input_x):\n", + " '''\n", + " Makes predictions using the trained model on new input data.\n", + "\n", + " Args:\n", + " input_x (numpy.ndarray): The input data for prediction.\n", + "\n", + " Returns:\n", + " numpy.ndarray: The predicted results.\n", + " '''\n", + " return self.classifier.predict(input_x)" + ] + }, + { + "cell_type": "markdown", + "id": "c68cb61a", + "metadata": {}, + "source": [ + "More functions in this class which are being overridden in other classes:\n", + "\n", + " - **Evaluate**: The evaluate function, designed to be overridden, is where the performance metrics of the model are calculated and analyzed. 
This function is crucial for understanding the model's strengths and weaknesses.\n", + "\n", + " - **Get Model**: The get_model function, another overridable method, is responsible for returning the specific machine learning model object that will be used for training and prediction.\n", + "\n", + " - **Post-processing**: The post_processing function handles the final steps after model predictions are made, such as generating visualizations, analyzing feature importance, and saving results." + ] + }, + { + "cell_type": "markdown", + "id": "932903ce", + "metadata": {}, + "source": [ + "## 7.2 ETHole Class\n", + "\n", + "The ETHole class is designed to leverage the power of the Extra Trees Regressor, an ensemble learning method. This class is a specialized extension of the RandomForestHole class, inheriting its structure while introducing specific adaptations(model).\n", + "\n", + "**Why Extra Trees Regressor?**\n", + "\n", + " - The Extra Trees Regressor stands out because of its robustness in handling varied data distributions and its ability to capture intricate patterns without overfitting. Unlike traditional decision trees, which split the data by selecting the best feature thresholds, Extra Trees introduces additional randomness by selecting thresholds at random. This randomness helps in reducing variance, making the model less prone to overfitting, especially in high-dimensional spaces like environmental data.\n", + "\n", + "### 7.2.1 Custom Features\n", + "\n", + "To maximize the predictive power of the Extra Trees model, the ETHole class introduces several custom features that tailor the training process to the specific needs of SWE prediction.\n", + "\n", + " - **Custom Loss Function:** The custom_loss function in the ETHole class is a specialized loss function that penalizes errors differently based on the true value of SWE. In typical regression tasks, the goal is to minimize the average error across all predictions. However, in SWE prediction it’s crucial to be more accurate in certain ranges, such as when SWE values are high, as these may correspond to critical environmental conditions.\n", + " ```\n", + " \n", + " ```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "4757a183", + "metadata": {}, + "outputs": [], + "source": [ + "def custom_loss(y_true, y_pred):\n", + " errors = np.abs(y_true - y_pred)\n", + " return np.where(y_true > 10, 2 * errors, errors)" + ] + }, + { + "cell_type": "markdown", + "id": "bba6ee1e", + "metadata": {}, + "source": [ + " - **Sample Weights:** Sample weights adjust the importance of each data point during the training process. 
The create_sample_weights method generates weights based on the SWE values, giving more importance to higher values, ensuring that the model focuses more on accurately predicting these critical instances.\n", + " ```\n", "\n", " ```\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "eb31413b", "metadata": {}, "outputs": [], "source": [ "def create_sample_weights(self, X, y, scale_factor, columns):\n", " return (y - np.min(y)) / (np.max(y) - np.min(y)) * scale_factor" ] }, { "cell_type": "markdown", "id": "e57053a9", "metadata": {}, "source": [ "### 7.2.2 Training and Evaluation\n", "\n", "```\n", "\n", "```\n", "\n", "**Model Creation:**\n", "The get_model() method in this class overrides the base method to return an instance of `ExtraTreesRegressor`.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0358cded", "metadata": {}, "outputs": [], "source": [ "def get_model(self):\n", " \"\"\"\n", " Returns the Extra Trees Regressor model with specified hyperparameters.\n", "\n", " Returns:\n", " ExtraTreesRegressor: The Extra Trees Regressor model.\n", " \"\"\"\n", "# return ExtraTreesRegressor(n_estimators=200, \n", "# max_depth=None,\n", "# random_state=42, \n", "# min_samples_split=2,\n", "# min_samples_leaf=1,\n", "# n_jobs=5\n", "# )\n", " return ExtraTreesRegressor(n_jobs=-1, random_state=123)" ] }, { "cell_type": "markdown", "id": "c0a60dcc", "metadata": {}, "source": [ "**Train Method:** The train method in the ETHole class is designed to take full advantage of the Extra Trees model's capabilities. By incorporating sample weights, the model becomes more attuned to the nuances of the data, particularly in ranges that are more impactful in the real world." ] }, { "cell_type": "code", "execution_count": 5, "id": "20355b27", "metadata": {}, "outputs": [], "source": [ "def train(self):\n", " self.classifier.fit(self.train_x, self.train_y)\n", " predictions = self.classifier.predict(self.train_x)\n", " errors = np.abs(self.train_y - predictions)\n", " weights = compute_sample_weight('balanced', errors)\n", " self.classifier.fit(self.train_x, self.train_y, sample_weight=weights)" ] }, { "cell_type": "markdown", "id": "4a53cf19", "metadata": {}, "source": [ "The training process is carried out in two main phases:\n", " - **Initial Training:** The model is first trained on the entire training dataset without any sample weights.\n", "\n", " - **Weighted Training:** After the initial training, the model's predictions are compared with actual values, and sample weights are computed based on the errors. The model is then retrained using these weights, making it more sensitive to critical prediction errors." ] }, { "cell_type": "markdown", "id": "6b7ca9c0", "metadata": {}, "source": [ "### 7.2.3 Post-Processing\n", "After training and making predictions, the post_processing method plays a key role in analyzing the model's performance.
One of the primary tasks is to assess feature importance, which helps in understanding which input features (e.g., temperature, precipitation) were most influential in the model’s predictions.\n", + "\n", + "\n", + "```\n", + "\n", + "\n", + "```\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "95e89ce6", + "metadata": {}, + "outputs": [], + "source": [ + "def post_processing(self, chosen_columns=None):\n", + " feature_importances = self.classifier.feature_importances_\n", + " feature_names = self.feature_names\n", + " sorted_indices = np.argsort(feature_importances)[::-1]\n", + " sorted_importances = feature_importances[sorted_indices]\n", + " sorted_feature_names = feature_names[sorted_indices]\n", + "\n", + " plt.figure(figsize=(10, 6))\n", + " plt.bar(range(len(feature_names)), sorted_importances, tick_label=sorted_feature_names)\n", + " plt.xticks(rotation=90)\n", + " plt.xlabel('Feature')\n", + " plt.ylabel('Feature Importance')\n", + " plt.title('Feature Importance Plot (ET model)')\n", + " plt.tight_layout()\n", + " if chosen_columns == None:\n", + " feature_png = f'{work_dir}/testing_output/et-model-feature-importance-latest.png'\n", + " else:\n", + " feature_png = f'{work_dir}/testing_output/et-model-feature-importance-{len(chosen_columns)}.png'\n", + " plt.savefig(feature_png)\n", + " print(f\"Feature image is saved {feature_png}\")" + ] + }, + { + "cell_type": "markdown", + "id": "58397641", + "metadata": {}, + "source": [ + "The post-processing method generates a feature importance plot, which visually represents how much each feature contributed to the predictions. This is crucial for model interpretation, allowing researchers to understand which environmental factors most significantly impact SWE predictions." + ] + }, + { + "cell_type": "markdown", + "id": "bc8491b0", + "metadata": {}, + "source": [ + "## 7.3 Training\n", + "In the final stage of the training process, multiple models, including the ETHole, are trained and validated to determine the best performer. This process is encapsulated in a script that orchestrates the training and evaluation of several models, ensuring a comprehensive approach to model selection." + ] + }, + { + "cell_type": "markdown", + "id": "dc1f7599", + "metadata": {}, + "source": [ + "\n", + "The `main()` function in model_train_validate script serves as the entry point for handling the model training pipeline.\n", + "By coordinating various model types, including ETHole, the script ensures that each model is thoroughly trained and evaluated under consistent conditions. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "935a91f3", + "metadata": {}, + "outputs": [], + "source": [ + "def main():\n", + " print(\"Train Models\")\n", + "\n", + " worm_holes = [ETHole()]\n", + "\n", + " for hole in worm_holes:\n", + " hole.preprocessing()\n", + " print(hole.train_x.shape)\n", + " print(hole.train_y.shape)\n", + " \n", + " hole.train()\n", + " hole.test()\n", + " hole.evaluate()\n", + " hole.save()\n", + "\n", + " print(\"Finished training and validating all the models.\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "e002ca98", + "metadata": {}, + "source": [ + "Each model created in this function is an instance of the ETHole class. One of the key strengths of this script is its modularity.\n", + "By simply adjusting the list of models (`worm_holes`), you can train and validate different algorithms without modifying the core workflow." 
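For illustration only (this snippet is not part of the repository code), the same pipeline could drive several wrappers in one run, assuming RandomForestHole and XGBoostHole follow the same BaseHole interface as ETHole:

```python
# Hypothetical extension of main(): every wrapper exposes the same BaseHole API,
# so adding a model only means appending it to the list.
worm_holes = [ETHole(), RandomForestHole(), XGBoostHole()]

for hole in worm_holes:
    hole.preprocessing()  # load the ready CSV and split into train/test sets
    hole.train()          # fit the underlying regressor
    hole.test()           # predict on the held-out test split
    hole.evaluate()       # report the evaluation metrics
    hole.save()           # persist the trained model for the prediction script
```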
+ ] + }, + { + "cell_type": "markdown", + "id": "a1bcf597", + "metadata": {}, + "source": [ + "**This script provides a streamlined way to manage the training process, enabling efficient experimentation with different models and configurations.**" + ] } ], "metadata": { @@ -23,14 +421,14 @@ "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.11.4" } }, "nbformat": 4, diff --git a/book/chapters/validation.ipynb b/book/chapters/validation.ipynb index 00bd4ac..970db26 100644 --- a/book/chapters/validation.ipynb +++ b/book/chapters/validation.ipynb @@ -11,52 +11,20 @@ "The goal of predicting Snow Water Equivalent exemplifies the integration of Machine learning with environmental science. This chapter delved into the testing and Evaluation part of this project." ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "IFvX25Khr2dW" - }, - "source": [ - "To begin with, it is essential to grasp the function of the BaseHole class. This class represents the complete lifecycle of the project, guiding it from initial development through to its final deployment." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rkO0EdSFr4q1" - }, - "source": [ - "BaseHole class is a meticulously crafted blueprint for constructing models capable of predicting SWE. It offers a structured approach to handling data, training models, and making predictions with unparalleled precision." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Y-PZlyQKr48X" - }, - "source": [ - "Let’s briefly discuss what is happening in this BaseHole class--\n", - "\n", - "\n", - "\n", - "* Preprocessing: The model begins with preprocessing, a critical phase where raw data is transformed into a refined form suitable for training. The BaseHole class adeptly navigates this phase, loading data, cleaning it, and splitting it into training and testing sets. This preparatory step ensures that the models are fed data that is both digestible and informative, setting the stage for accurate predictions.\n", - "* Training: This is the center of Learning, with the data primed, the BaseHole class now moves on to the training phase. This is where the coalition of machine learning takes place as the class utilizes the power of its classifiers to learn from the training data. The model, through this process, uncovers patterns and insights hidden within the data, providing itself with the knowledge needed to predict SWE with confidence.\n" - ] - }, { "cell_type": "markdown", "metadata": { "id": "2B9qmiO7r5N9" }, "source": [ - "Now comes the part that is one of the main focus of this chapter—**Testing**\n", "\n", "Within the extensive array of functionalities provided by the BaseHole class, the testing process is akin to a rigorous examination.\n", - "Unveiling the test function:\n", "\n", - "So, what is a test?\n", + "## 8.1 Testing\n", "\n", - "The test function operates on a simple yet profound principle: it utilizes the model to predict outcomes based on the test dataset. 
By invoking the classifier's prediction method, the BaseHole class utilizes the trained model on the test data to forecast SWE values with precision.\n" + "### 8.1.1 So, what is a test?\n", + "\n", + "The test function operates on a simple yet profound principle: it utilizes the model to predict outcomes based on the test dataset, which the model has not seen during training. This method is fundamental to understanding how well the model generalizes to new, unseen data." ] }, { @@ -82,9 +50,9 @@ "id": "qpzGEyDQsFaZ" }, "source": [ - "**The Mechanics of Testing**\n", + "### 8.1.2 The Mechanics of Testing\n", "\n", - "At its core, the test function embodies the essence of machine learning validation. It executes the trained model's prediction method on the test_x dataset—a collection of features that the model has not encountered during its training phase. The function then returns the predicted SWE values, encapsulated within test_y_results, offering a glimpse into the model's predictive accuracy and reliability." + "The `test()` method uses the trained model to make predictions on test_x, a dataset that was not part of the training process. The output, test_y_results, provides a preview of the model’s performance, offering insights into its predictive capabilities." ] }, { @@ -93,7 +61,7 @@ "id": "-0nRqBFQsICP" }, "source": [ - "# Validation/Evaluation" + "## 8.2 Validation/Evaluation" ] }, { @@ -102,7 +70,7 @@ "id": "toh-cEggsKZ0" }, "source": [ - "So now we have made a model, trained the model, and made predictions on a test dataset, but how to evaluate all of this? For this, we use multiple Evaluation metrics. A model needs to go through a rigorous validation process that assesses its effectiveness and accuracy. Evaluation is a testament to the model’s commitment to precision, ensuring that the predictions made are not only reliable but also meaningful." + "### 8.2.1 Importance of Evaluation\n", + "So now we have made a model, trained the model, and made predictions on a test dataset, but how to evaluate all of this?
\n", + "For this, we use multiple Evaluation metrics. A model needs to go through a rigorous validation process that assesses its effectiveness and accuracy. Evaluation is a testament to the model’s commitment to precision, ensuring that the predictions made are not only reliable but also meaningful." ] }, { @@ -120,7 +90,7 @@ "id": "dpbM36bgsRCB" }, "source": [ - "**Insights**\n", + "### 8.2.2 Insights\n", "\n", "Upon invoking the evaluation method, the class starts a detailed analysis of the model's predictions. By comparing these predictions against actual values from the test dataset, the method illuminates the model's strengths and areas for improvement.\n", "\n", @@ -151,7 +121,7 @@ "id": "PdYjCmmbsVzW" }, "source": [ - "# The Evaluation Process" + "### 8.2.3 The Evaluation Process" ] }, { @@ -160,7 +130,7 @@ "id": "IlRUU0lzsW4r" }, "source": [ - "Upon invocation, the evaluate method undertakes the task of computing these metrics, using the predictions generated by the RandomForestHole model (self.test_y_results) and comparing them against the actual values (self.test_y) from the test dataset. This comparison is the crux of the evaluation, offering a window into the model's predictive capabilities." + "The `evaluate()` method in the model classes is responsible for computing the above metrics, using the predictions generated by the model and comparing them against actual values from the test dataset.\n" ] }, { @@ -230,11 +200,11 @@ "id": "RjgBbKVVsbKF" }, "source": [ - "**Computing the Metrics:** Leveraging the metrics module from scikit-learn, the function calculates MAE, MSE, R2, and RMSE. Each of these calculations provides a different lens through which to view the model's performance, from average error rates (MAE, RMSE) to the model's explanatory power (R2) and the variance of its predictions (MSE).\n", + "- **Computing the Metrics:** Leveraging the metrics module from scikit-learn, the function calculates MAE, MSE, R2, and RMSE. Each of these calculations provides a different lens through which to view the model's performance, from average error rates (MAE, RMSE) to the model's explanatory power (R2) and the variance of its predictions (MSE).\n", "\n", - "**Interpreting the Results:** The function not only computes these metrics but also prints them out, offering immediate insight into the model's efficacy. This step is vital for iterative model improvement, allowing data scientists to diagnose and address specific areas where the model may fall short.\n", + "- **Interpreting the Results:** The function not only computes these metrics but also prints them out, offering immediate insight into the model's efficacy. This step is vital for iterative model improvement, allowing data scientists to diagnose and address specific areas where the model may fall short.\n", "\n", - "**Returning the Metrics:** Finally, the function encapsulates these metrics in a dictionary and returns it. This encapsulation allows for the metrics to be easily accessed, shared, and utilized in further analyses or reports, facilitating a deeper understanding of the model's impact and areas for enhancement.\n" + "- **Returning the Metrics:** Finally, the function encapsulates these metrics in a dictionary and returns it. 
This encapsulation allows for the metrics to be easily accessed, shared, and utilized in further analyses or reports, facilitating a deeper understanding of the model's impact and areas for enhancement.\n" ] }, { @@ -248,39 +218,36 @@ }, { "cell_type": "markdown", - "metadata": { - "id": "C1pxcIYgsvKD" - }, + "metadata": {}, + "source": [ + "### 8.2.4 Practical Example of Evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ - "All the necessary functions are called in the model_train_validate process" + "The script provides practical examples by loading pre-trained models and evaluating them using the test dataset:" ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "D2hKsntLszCu" - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ - "def main():\n", - " print(\"Train Models\")\n", - " # Choose the machine learning models to train (e.g., RandomForestHole, XGBoostHole, ETHole)\n", - " worm_holes = [ETHole()]\n", - " for hole in worm_holes:\n", - " # Perform preprocessing for the selected model\n", - " hole.preprocessing()\n", - " print(hole.train_x.shape)\n", - " print(hole.train_y.shape)\n", - " # Train the machine learning model\n", - " hole.train()\n", - " # Test the trained model\n", - " hole.test()\n", - " # Evaluate the model's performance\n", - " hole.evaluate()\n", - " # Save the trained model\n", - " hole.save()\n", - " print(\"Finished training and validating all the models.\")" + "base_model = joblib.load(f\"{homedir}/Documents/GitHub/snowcast_trained_model/model/wormhole_random_forest_basic.joblib\")\n", + "basic_predicted_values = evaluate(base_model, all_features, all_labels, \"Base Model\")\n", + "\n", + "best_random = joblib.load(f\"{homedir}/Documents/GitHub/snowcast_trained_model/model/wormhole_random_forest.joblib\")\n", + "random_predicted_values = evaluate(best_random, all_features, all_labels, \"Optimized\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, the script loads two models—a base model and an optimized model—and evaluates their performance on the same test dataset. This side-by-side comparison allows for an assessment of how model optimization impacts predictive accuracy and overall performance." ] }, { @@ -289,7 +256,8 @@ "id": "JJDaq-uls1AR" }, "source": [ - "In conclusion, testing and validation form the bedrock of predictive excellence in the SnowCast project. They are not merely steps in the machine learning workflow but are the very processes that ensure the models we build are not just algorithms but are reliable interpreters of the natural world." + "Testing and validation form the bedrock of predictive excellence in the SnowCast project. They are not merely steps in the machine learning workflow but are essential processes that ensure the models we build are reliable interpreters of environmental data. By rigorously testing and evaluating models, we can trust that their predictions will be both accurate and meaningful in real-world applications.\n", + "\n" ] } ],
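To make the metric discussion in Section 8.2 concrete, here is a minimal, illustrative sketch (not the project's own `evaluate()` implementation) of how the four reported metrics can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def summarize_metrics(y_true, y_pred):
    # Compute the four metrics discussed above for a set of SWE predictions.
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  R2: {r2:.3f}  RMSE: {rmse:.3f}")
    return {"MAE": mae, "MSE": mse, "R2": r2, "RMSE": rmse}
```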