model Training

geo-smart · Aug 12, 2024 · 450de62 · 450de62
1 parent 3fb9afa
commit 450de62
Showing 1 changed file with 270 additions and 10 deletions.
diff --git a/book/chapters/model_training.ipynb b/book/chapters/model_training.ipynb
@@ -2,16 +2,276 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "source": [
-    "# Model Training\n",
-    "\n",
-    "Detailed description of the model training process\n",
-    "Selection of parameters and training datasets\n"
-   ],
+   "id": "d1597c52f583ca0c",
    "metadata": {
     "collapsed": false
    },
-   "id": "d1597c52f583ca0c"
+   "source": [
+    "# Model Training\n",
+    "\n",
+    "In the field of Snow Water Equivalent (SWE) prediction, training models that accurately represent the complexities of environmental data is a critical task. This chapter delves into the intricacies of model training, focusing on the foundational BaseHole class, its extensions, and the specific machine learning models that utilize this structure.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb399ad0",
+   "metadata": {},
+   "source": [
+    "## 7.1 Base Hole Class\n",
+    "\n",
+    "BaseHole class is a meticulously crafted blueprint for constructing models capable of predicting SWE. It offers a structured approach to handling data, training models, and making predictions with unparalleled precision.\n",
+    "\n",
+    "### 7.1.1 Overview\n",
+    "The BaseHole class is a meticulously crafted blueprint for building SWE predictors. It encapsulates a structured approach to data handling, from preprocessing to model training, testing, and prediction. The class is designed to be extended by specific predictor classes, allowing for customization of the machine learning model and operational methods.\n",
+    "\n",
+    "**Key Attributes**:\n",
+    "* all_ready_file: A path to the CSV file containing pre-processed data ready for training.\n",
+    "* classifier: The machine learning model used for prediction.\n",
+    "* holename: The name of the wormhole class, which is derived from the class name itself.\n",
+    "* train_x, train_y: Training input and target data, respectively.\n",
+    "* test_x, test_y: Testing input and target data, respectively.\n",
+    "* test_y_results: The predicted results on the test data.\n",
+    "* save_file: Path to save the trained model.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c13e761",
+   "metadata": {},
+   "source": [
+    "### 7.1.2 Core Functions\n",
+    "\n",
+    "**Preprocessing**: The model begins with preprocessing, a critical phase where raw data is transformed into a refined form suitable for training. The BaseHole class adeptly navigates this phase, loading data, cleaning it, and splitting it into training and testing sets. This preparatory step ensures that the models are fed data that is both digestible and informative, setting the stage for accurate predictions.\n",
+    "\n",
+    "```\n",
+    "def preprocessing(self):\n",
+    "        '''\n",
+    "        Preprocesses the data for training and testing.\n",
+    "\n",
+    "        Returns:\n",
+    "            None\n",
+    "        '''\n",
+    "        all_ready_pd = pd.read_csv(self.all_ready_file, header=0, index_col=0)\n",
+    "        print(\"all columns: \", all_ready_pd.columns)\n",
+    "        all_ready_pd = all_ready_pd[all_cols]\n",
+    "        all_ready_pd = all_ready_pd.dropna()\n",
+    "        train, test = train_test_split(all_ready_pd, test_size=0.2)\n",
+    "        self.train_x, self.train_y = train[input_columns].to_numpy().astype('float'), train[['swe_value']].to_numpy().astype('float')\n",
+    "        self.test_x, self.test_y = test[input_columns].to_numpy().astype('float'), test[['swe_value']].to_numpy().astype('float')\n",
+    "```\n",
+    "\n",
+    "**Train**: The train function is responsible for training the machine learning model using the preprocessed data. This function prepares the model to make accurate predictions by learning patterns from the training data.\n",
+    "```\n",
+    "def train(self):\n",
+    "        '''\n",
+    "        Trains the machine learning model.\n",
+    "\n",
+    "        Returns:\n",
+    "            None\n",
+    "        '''\n",
+    "        self.classifier.fit(self.train_x, self.train_y)\n",
+    "```\n",
+    "\n",
+    "**Test**: The test function evaluates the model's performance on a separate testing dataset, allowing for the assessment of its predictive accuracy.\n",
+    "```\n",
+    "def test(self):\n",
+    "        '''\n",
+    "        Tests the machine learning model on the testing data.\n",
+    "\n",
+    "        Returns:\n",
+    "            numpy.ndarray: The predicted results on the testing data.\n",
+    "        '''\n",
+    "        self.test_y_results = self.classifier.predict(self.test_x)\n",
+    "        return self.test_y_results\n",
+    "```\n",
+    "\n",
+    "**Predict**: The predict function leverages the trained model to make predictions on new, unseen data, providing valuable insights into potential outcomes.\n",
+    "```\n",
+    "    def predict(self, input_x):\n",
+    "        '''\n",
+    "        Makes predictions using the trained model on new input data.\n",
+    "\n",
+    "        Args:\n",
+    "            input_x (numpy.ndarray): The input data for prediction.\n",
+    "\n",
+    "        Returns:\n",
+    "            numpy.ndarray: The predicted results.\n",
+    "        '''\n",
+    "        return self.classifier.predict(input_x)\n",
+    "```\n",
+    "\n",
+    "**Evaluate**: The evaluate function, designed to be overridden, is where the performance metrics of the model are calculated and analyzed. This function is crucial for understanding the model's strengths and weaknesses.\n",
+    "```\n",
+    "   def evaluate(self):\n",
+    "        '''\n",
+    "        Evaluates the performance of the machine learning model.\n",
+    "\n",
+    "        Returns:\n",
+    "            None\n",
+    "        '''\n",
+    "        pass\n",
+    "```\n",
+    "\n",
+    "**Get Model**: The get_model function, another overridable method, is responsible for returning the specific machine learning model object that will be used for training and prediction.\n",
+    "```\n",
+    "    def get_model(self):\n",
+    "        '''\n",
+    "        Get the machine learning model.\n",
+    "\n",
+    "        Returns:\n",
+    "            object: The machine learning model.\n",
+    "        '''\n",
+    "        pass\n",
+    "```\n",
+    "\n",
+    "**Post-processing**: The post_processing function handles the final steps after model predictions are made, such as generating visualizations, analyzing feature importance, and saving results.\n",
+    "```\n",
+    "    def post_processing(self):\n",
+    "        '''\n",
+    "        Perform post-processing on the model's predictions.\n",
+    "\n",
+    "        Returns:\n",
+    "            None\n",
+    "        '''\n",
+    "        pass\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "932903ce",
+   "metadata": {},
+   "source": [
+    "## 7.2 ETHole Class\n",
+    "\n",
+    "The ETHole class is an extension of the BaseHole class, specifically designed to train and evaluate an Extra Trees Regressor model for SWE prediction. This model is particularly well-suited for handling complex datasets with many features and is known for its robustness and accuracy.\n",
+    "\n",
+    "**Why Extra Trees Regressor?**\n",
+    "\n",
+    " - The Extra Trees Regressor stands out because of its robustness in handling varied data distributions and its ability to capture intricate patterns without overfitting. Unlike traditional decision trees, which split the data by selecting the best feature thresholds, Extra Trees introduces additional randomness by selecting thresholds at random. This randomness helps in reducing variance, making the model less prone to overfitting, especially in high-dimensional spaces like environmental data.\n",
+    "\n",
+    "### 7.2.1 Custom Features\n",
+    "\n",
+    "To maximize the predictive power of the Extra Trees model, the ETHole class introduces several custom features that tailor the training process to the specific needs of SWE prediction.\n",
+    "\n",
+    " - **Custom Loss Function:** The custom_loss function in the ETHole class is a specialized loss function that penalizes errors differently based on the true value of SWE. In typical regression tasks, the goal is to minimize the average error across all predictions. However, in SWE prediction it’s crucial to be more accurate in certain ranges, such as when SWE values are high, as these may correspond to critical environmental conditions.\n",
+    "    ```\n",
+    "    def custom_loss(y_true, y_pred):\n",
+    "        errors = np.abs(y_true - y_pred)\n",
+    "        return np.where(y_true > 10, 2 * errors, errors)\n",
+    "    ```\n",
+    "\n",
+    " - **Sample Weights:** Sample weights adjust the importance of each data point during the training process. The create_sample_weights method generates weights based on the SWE values, giving more importance to higher values, ensuring that the model focuses more on accurately predicting these critical instances.\n",
+    "    ```\n",
+    "    def create_sample_weights(self, X, y, scale_factor, columns):\n",
+    "        return (y - np.min(y)) / (np.max(y) - np.min(y)) * scale_factor\n",
+    "    ```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e57053a9",
+   "metadata": {},
+   "source": [
+    "### 7.2.2 Training and Evaluation\n",
+    "\n",
+    "The train method in the ETHole class is designed to take full advantage of the Extra Trees model's capabilities. By incorporating sample weights, the model becomes more attuned to the nuances of the data, particularly in ranges that are more impactful in the real world.\n",
+    "\n",
+    "```\n",
+    "def train(self):\n",
+    "    self.classifier.fit(self.train_x, self.train_y)\n",
+    "    predictions = self.classifier.predict(self.train_x)\n",
+    "    errors = np.abs(self.train_y - predictions)\n",
+    "    weights = compute_sample_weight('balanced', errors)\n",
+    "    self.classifier.fit(self.train_x, self.train_y, sample_weight=weights)\n",
+    "```\n",
+    "\n",
+    "\n",
+    "The training process is carried in two main phases:\n",
+    " - **Initial Training:** The model is first trained on the entire training dataset without any sample weights.\n",
+    "\n",
+    " - **Weighted Training:** After the initial training, the model's predictions are compared with actual values, and sample weights are computed based on the errors. The model is then retrained using these weights, making it more sensitive to critical prediction errors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b7ca9c0",
+   "metadata": {},
+   "source": [
+    "### 7.2.3 Post-Processing\n",
+    "After training and making predictions, the post_processing method plays a key role in analyzing the model's performance. One of the primary tasks is to assess feature importance, which helps in understanding which input features (e.g., temperature, precipitation) were most influential in the model’s predictions.\n",
+    "\n",
+    "\n",
+    "```\n",
+    "def post_processing(self, chosen_columns=None):\n",
+    "    feature_importances = self.classifier.feature_importances_\n",
+    "    feature_names = self.feature_names\n",
+    "    sorted_indices = np.argsort(feature_importances)[::-1]\n",
+    "    sorted_importances = feature_importances[sorted_indices]\n",
+    "    sorted_feature_names = feature_names[sorted_indices]\n",
+    "\n",
+    "    plt.figure(figsize=(10, 6))\n",
+    "    plt.bar(range(len(feature_names)), sorted_importances, tick_label=sorted_feature_names)\n",
+    "    plt.xticks(rotation=90)\n",
+    "    plt.xlabel('Feature')\n",
+    "    plt.ylabel('Feature Importance')\n",
+    "    plt.title('Feature Importance Plot (ET model)')\n",
+    "    plt.tight_layout()\n",
+    "    if chosen_columns == None:\n",
+    "        feature_png = f'{work_dir}/testing_output/et-model-feature-importance-latest.png'\n",
+    "    else:\n",
+    "        feature_png = f'{work_dir}/testing_output/et-model-feature-importance-{len(chosen_columns)}.png'\n",
+    "    plt.savefig(feature_png)\n",
+    "    print(f\"Feature image is saved {feature_png}\")\n",
+    "\n",
+    "```\n",
+    "\n",
+    "The post-processing method generates a feature importance plot, which visually represents how much each feature contributed to the predictions. This is crucial for model interpretation, allowing researchers to understand which environmental factors most significantly impact SWE predictions.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc8491b0",
+   "metadata": {},
+   "source": [
+    "## 7.3 Model Train and Validate\n",
+    "\n",
+    "In the final stage of the training process, multiple models, including the ETHole, are trained and validated to determine the best performer. This process is encapsulated in a script that orchestrates the training and evaluation of several models, ensuring a comprehensive approach to model selection."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc1f7599",
+   "metadata": {},
+   "source": [
+    "The `main function` serves as the entry point for this script, where different models are instantiated, trained, tested, and evaluated. This systematic approach ensures that all models are thoroughly vetted before deployment.\n",
+    "\n",
+    "```\n",
+    "def main():\n",
+    "    print(\"Train Models\")\n",
+    "\n",
+    "    worm_holes = [ETHole()]\n",
+    "\n",
+    "    for hole in worm_holes:\n",
+    "        hole.preprocessing()\n",
+    "        print(hole.train_x.shape)\n",
+    "        print(hole.train_y.shape)\n",
+    "        \n",
+    "        hole.train()\n",
+    "        hole.test()\n",
+    "        hole.evaluate()\n",
+    "        hole.save()\n",
+    "\n",
+    "    print(\"Finished training and validating all the models.\")\n",
+    "\n",
+    "```\n",
+    "\n",
+    "**This script provides a streamlined way to manage the training process, enabling efficient experimentation with different models and configurations.**"
+   ]
   }
  ],
  "metadata": {
@@ -23,14 +283,14 @@
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
-    "version": 2
+    "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
-   "version": "2.7.6"
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
   }
  },
  "nbformat": 4,