Table of Contents
  1. About The Project
  2. How to Clone the Source Code
  3. Package Installation
  4. How to Run Experiments
  5. Contributing
  6. License
  7. Contact
  8. Citation

This repository contains supplementary code for the paper "Proposal of an Automated Feature Engineering Pipeline for High-Dimensional Tabular Regression Data Using Reinforcement Learning". Author: Julian Müller [email protected], on behalf of MBition GmbH.

Provider Information

The source code has been tested solely for our own use cases, which might differ from yours. This project is actively maintained and contributions are welcome.

About The Project

'automotive_feature_engineering' is a Python package designed to automate the feature engineering process for large in-car communication datasets within the automotive industry. It simplifies the transformation of raw data into meaningful input features for machine learning models, improving efficiency and reducing computational overhead. It supports both static analysis and dynamic feature engineering through reinforcement learning techniques.

How to Clone the Source Code

To clone the source code of this repository to your local machine, follow these steps:

  1. Install Git: Make sure you have Git installed on your computer. If not, you can download it from git-scm.com.

  2. Open a Terminal/Command Prompt: Navigate to the directory where you want to clone the repository.

  3. Clone the Repository: Use the git clone command followed by the repository URL. Run the following command for HTTPS:

    git clone https://github.com/mercedes-benz/automotive_feature_engineering.git

    Or this one for SSH:

    git clone [email protected]:mercedes-benz/automotive_feature_engineering.git

Package Installation

    pip install dist/automotive_feature_engineering-0.1.0-py3-none-any.whl

How to Run Experiments

Method List

| Index | Method | Parameters | Description |
|-------|--------|------------|-------------|
| 0 | `` | - | Do nothing with features. |
| 1 | drop_correlated_features_09 | - | Drop highly correlated features with a correlation threshold of 0.9. |
| 2 | drop_correlated_features_095 | - | Drop highly correlated features with a correlation threshold of 0.95. |
| 3 | sns_handling_median_8 | - | Fill NaN values with the median for columns with more than 8 unique values. |
| 4 | sns_handling_median_32 | - | Fill NaN values with the median for columns with more than 32 unique values. |
| 5 | sns_handling_mean_8 | - | Fill NaN values with the mean for columns with more than 8 unique values. |
| 6 | sns_handling_mean_32 | - | Fill NaN values with the mean for columns with more than 32 unique values. |
| 7 | sns_handling_zero_8 | - | Fill NaN values with 0 for columns with more than 8 unique values. |
| 8 | sns_handling_zero_32 | - | Fill NaN values with 0 for columns with more than 32 unique values. |
| 9 | filter_by_variance | - | Remove columns with variance below 0.1 across datasets. |
| 10 | ohe | - | Apply one-hot encoding to categorical variables in datasets. |
| 11 | feature_importance_filter_00009999 | - | Filter out features that have an importance less than 0.00009999. |
| 12 | feature_importance_filter_00049999 | - | Filter out features that have an importance less than 0.00049999. |
| 13 | pca | - | Apply a Principal Component Analysis transformation to reduce dimensionality. |
| 14 | polynominal_features | - | Extend the feature set by creating polynomial terms. |
| 99 | filter_by_variance_0 | - | Remove columns with only one unique value across datasets. |
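
To make the descriptions in the table more concrete, the following is a minimal pandas sketch of what three of the listed steps could look like. It is not the package's implementation: the function names, the choice of Pearson correlation, the rule that the later column of a correlated pair is dropped, and the toy column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each column pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def fill_nan_median(df: pd.DataFrame, min_unique: int = 8) -> pd.DataFrame:
    """Fill NaN with the column median, but only for columns with more than min_unique values."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if out[col].nunique() > min_unique:  # nunique() ignores NaN by default
            out[col] = out[col].fillna(out[col].median())
    return out


def filter_by_variance(df: pd.DataFrame, min_variance: float = 0.1) -> pd.DataFrame:
    """Remove numeric columns whose variance is below min_variance."""
    variances = df.select_dtypes(include="number").var()
    return df.drop(columns=variances[variances < min_variance].index)


# Toy data (hypothetical): "b" duplicates "a" up to scale, "d" is constant.
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
    "d": [1.0] * 10,
})
cleaned = fill_nan_median(drop_correlated_features(filter_by_variance(frame)))
print(list(cleaned.columns))
```

The thresholds mirror the values named in the table (0.9, 8, 0.1); the order in which the three steps are chained here is arbitrary.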

Documentation for Using the Static and Manual Method

The static and manual methods in the automotive_feature_engineering package are designed to perform feature engineering on automotive datasets. The static method uses a predefined sequence of feature engineering steps, while the manual method allows users to specify their own sequence.

| Parameter | Type | Description | Default Value |
|-----------|------|-------------|---------------|
| df_train | pd.DataFrame | Training data. | Required |
| df_test | pd.DataFrame | Test data. | Required |
| model | str | Model to be used for feature selection. Options: etree, randomforest. | Required |
| target_names_list | List[str] | List of target names. | Required |
| import_joblib_path | str, optional | Path to import joblib file of previously exported feature engineering methods. | None |
| alt_docu_path | str, optional | Alternative documentation path. | None |
| alt_config | Dict, optional | Alternative configuration dictionary. | None |
| unrelated_cols | List[str], optional | List of columns that are not considered in feature engineering. | None |
| model_export | bool | Whether to export the model. | False |
| fe_export_joblib | bool | Whether to export the feature engineering methods used. | False |
| explainable | bool | If set to True, a pipeline without polynomial features is used. | False |

Prepare your training and testing datasets as pd.DataFrame.

With your data frames ready, you can now call the static method. You need to specify additional parameters such as the model type and target features list according to your specific needs. The static method does not require a method list as it uses a predefined sequence of methods.

    # Import function
    from automotive_feature_engineering import static

    # Execute the static method
    results = static(df_train, df_test, model, target_names_list)

If no method list is provided, the default pipeline will be used.
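
As a rough illustration of the inputs the static call expects, the sketch below builds two toy data frames and the remaining required arguments. The column names, the random data, and the 80/20 split by row order are hypothetical, and it is an assumption that the target columns live inside the same data frames as the features. The actual package call is left commented out so the snippet runs without the package installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four signal columns and one target column.
rng = np.random.default_rng(seed=0)
columns = [f"signal_{i}" for i in range(4)] + ["target"]
data = pd.DataFrame(rng.normal(size=(100, 5)), columns=columns)

# Simple 80/20 split by row order (an assumption; use whatever split fits your data).
split = int(len(data) * 0.8)
df_train = data.iloc[:split].reset_index(drop=True)
df_test = data.iloc[split:].reset_index(drop=True)

target_names_list = ["target"]
model = "etree"  # the other documented option is "randomforest"

# With the package installed, the call from the README would then be:
# from automotive_feature_engineering import static
# results = static(df_train, df_test, model, target_names_list)
```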

If you want to specify your own sequence of feature engineering steps, use the manual method. You need to provide a method list along with other parameters.

    # Import function
    from automotive_feature_engineering import manual

    # Execute the manual method
    results = manual(method_list, df_train, df_test, model, target_names_list)
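
The README does not state whether method_list holds the indices or the method names from the Method List table. The sketch below keeps both forms at hand: METHODS copies the table, and names_for is a hypothetical helper (not part of the package) that translates a sequence of indices into names and rejects unknown indices. The example sequence is an arbitrary choice.

```python
# Index-to-name lookup copied from the Method List table.
METHODS = {
    0: "",
    1: "drop_correlated_features_09",
    2: "drop_correlated_features_095",
    3: "sns_handling_median_8",
    4: "sns_handling_median_32",
    5: "sns_handling_mean_8",
    6: "sns_handling_mean_32",
    7: "sns_handling_zero_8",
    8: "sns_handling_zero_32",
    9: "filter_by_variance",
    10: "ohe",
    11: "feature_importance_filter_00009999",
    12: "feature_importance_filter_00049999",
    13: "pca",
    14: "polynominal_features",
    99: "filter_by_variance_0",
}


def names_for(indices):
    """Translate table indices into method names; raise on indices not in the table."""
    unknown = [i for i in indices if i not in METHODS]
    if unknown:
        raise ValueError(f"Unknown method indices: {unknown}")
    return [METHODS[i] for i in indices]


# Hypothetical sequence: remove constant columns, impute NaN values,
# then drop highly correlated features.
method_indices = [99, 3, 1]
method_names = names_for(method_indices)
print(method_names)
```

Depending on which form the manual method accepts, either method_indices or method_names would be passed as method_list.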

Documentation for Using the RL Method

The RL method in the automotive_feature_engineering package is designed to perform dynamic feature engineering on automotive datasets using reinforcement learning techniques. It processes input data frames to adaptively extract and engineer features that are essential for predictive modeling and further analysis.

| Parameter | Type | Description | Default Value |
|-----------|------|-------------|---------------|
| df_train | pd.DataFrame | Training data used in reinforcement learning. | Required |
| df_train_origin | pd.DataFrame | Train data. | Required |
| df_test_origin | pd.DataFrame | Test data. | Required |
| model | str | Model to be used for feature selection. Options: etree, randomforest. | Required |
| target_names_list | List[str] | List of target names. | Required |
| rl_raster | float | Sampling rate of input data. | Required |
| alt_docu | str, optional | Alternative documentation path. | None |
| alt_config | Dict, optional | Alternative configuration dictionary. | None |
| unrelated_cols | List[str], optional | List of columns that are not considered in feature engineering. | None |

Prepare your training and testing datasets as pd.DataFrame. In addition to the original training and testing datasets (df_train_origin, df_test_origin), create a separate training dataset (df_train) specifically for reinforcement learning.

Once your data frames are prepared, you can now call the RL method as well. You need to specify additional parameters such as the model type, target feature list, and other parameters tailored to your specific needs.

    # Import function
    from automotive_feature_engineering import rl

    # Execute the rl method
    results = rl(df_train, df_train_origin, df_test_origin, target_names_list, model, rl_raster, unrelated_cols, alt_config, alt_docu)

For more examples, please refer to the Documentation.

Contributing

The instructions on how to contribute can be found in the file CONTRIBUTING.md in this repository.

License

The code is published under the MIT license. Further information on that can be found in the LICENSE.md file in this repository.

Citation

@article{key2023, title={}, author={}, year={2023}, url={} }
