Syntax for feature engineering #314

Open · tlienart opened this issue Oct 3, 2020 · 4 comments

Comments

tlienart (Collaborator) commented Oct 3, 2020

I stumbled upon https://github.com/matthieugomez/PairsMacros.jl today and it seems to be close to what we discussed with @vollmersj with respect to defining new columns with a formula-like syntax.

@matthieugomez sorry to ping you here, but would you be interested in something like PairsMacros for general-purpose feature engineering that works with MLJ?
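(For anyone following along: the kind of "formula-like" column definition being discussed is, roughly, a shorthand for the source => function => destination pair minilanguage already used by DataFrames. A rough sketch, with made-up column names, of what such a macro would abbreviate:)

using DataFrames

df = DataFrame(height = [1.70, 1.80, 1.65], weight = [70.0, 85.0, 60.0])

# the pair minilanguage: source columns => function => name of the new column
transform(df, [:weight, :height] => ByRow((w, h) -> w / h^2) => :bmi)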

AriMKatz commented Dec 4, 2020

There's also this: https://github.com/joshday/Telperion.jl

ablaom (Member) commented Oct 6, 2022

Continuing the discussion started by @indymnv at JuliaAI/MLJ.jl#970:

Existing MLJ transformers are documented here with the exception of InteractionTransformer, which was recently added to MLJModels, but is not documented or re-exported yet by MLJ.jl. Here's the list:

julia> using MLJModels

julia> models() do m
       m.package_name == "MLJModels" &&
       !m.is_supervised
       end
11-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = InteractionTransformer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )
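(For anyone landing on this thread: all of these unsupervised transformers follow the same machine/fit!/transform workflow. A minimal sketch, using OneHotEncoder and a made-up two-column table:)

using MLJ

X = (grade = categorical(["A", "B", "A", "C"]),
     score = [1.0, 2.0, 3.0, 4.0])

hot  = OneHotEncoder()
mach = machine(hot, X) |> fit!
W    = transform(mach, X)   # grade is replaced by indicator columns such as grade__A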

A "fancier" version of InteractionTransformer, based on R type "formulas", has been planned, but no-one has really found the time to work on it.

There is a project in progress to roll out a feature_importance method for models that support it, with the idea of then adding feature selection tools, such as recursive feature elimination.

TableTransforms.jl, referenced by @juliohm, is very active but not yet integrated with MLJ, although we are working towards doing so (likely at least several months off). I think that is a good place to contribute generic table transformers, such as encoders. Some feature engineering tools, such as RFE, will probably not make sense there, as they require a supervised learner.

@indymnv It would be helpful if you could identify specific encoders or other tools you use frequently that are missing from MLJ (or TableTransforms.jl), so they can be prioritised.

indymnv commented Oct 6, 2022

@ablaom Thanks for all the information. In general, in my work with ML I use the following encoders a lot:

  1. For categorical variables:
  • Ordinal Encoding: replaces categories by numbers, arbitrarily or ordered by the target. @ablaom says: done - use coerce from ScientificTypes.jl (see the sketch after this list)
  • Frequency Encoder: replaces categories by the observation count or percentage
  • One-Hot Encoder: done.
  • Grouped tail encoder: groups infrequent categories
  2. For dates and other cyclic variables:
  • Cyclical encoder: creates variables using sine and cosine
  3. For some numerical variables:
  • Equal Frequency Discretiser: sorts variables into equal-frequency intervals. @ablaom says: done - UnivariateDiscretizer (see the sketch after this list)
  • Equal Width Discretiser: sorts variables into equal-width intervals.
  4. Transformations:
  • Logarithm. @ablaom says: done - any kind of ordinary function can be inserted in a pipeline or used in the TransformedTargetModel wrapper (see the sketch at the end of this comment)
  • Box-Cox. @ablaom says: done (with learned exponent) - UnivariateBoxCoxTransformer
  • Yeo-Johnson
  5. Standardization and Normalization. @ablaom says: done - Standardizer
  6. Feature Selection:
  • I use the feature selection built into ML models from scikit-learn, or Boruta.
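(For reference, rough sketches of how a couple of the items above map to Julia code: coerce-based ordinal encoding, equal-frequency discretisation with UnivariateDiscretizer, and a hand-rolled cyclical encoding, which MLJ does not currently provide. The column names, level order and number of classes are made up for illustration.)

using MLJ

# ordinal encoding: coerce the column to OrderedFactor, then take the integer codes
X  = (size = ["small", "large", "medium", "small"],)
Xc = coerce(X, :size => OrderedFactor)
int.(Xc.size)   # 3, 1, 2, 3 as unsigned integer codes; levels default to alphabetical
                # order - use CategoricalArrays.levels! to impose a different one

# equal-frequency discretisation of a numeric vector
x    = rand(1000)
disc = UnivariateDiscretizer(n_classes=4)
mach = machine(disc, x) |> fit!
w    = transform(mach, x)   # ordered categorical with 4 quantile-based classes

# cyclical encoding (not in MLJ): sine/cosine features for, e.g., hour of day
hour   = 0:23
sin_hr = sin.(2π .* hour ./ 24)
cos_hr = cos.(2π .* hour ./ 24)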

For now, in Julia I have only used the One-Hot Encoder; I have not checked the transformations.

[Edit]: For context, I frequently work with linear/logistic regression models, Decision Trees, Random Forests and GBMs.
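(For reference, a rough sketch of the "ordinary function in a pipeline" route mentioned under Transformations above. The column names are made up, and the TransformedTargetModel keyword names shown in the comment are an assumption to be checked against its docstring.)

using MLJ

# a log transform inserted as a plain function at the start of a pipeline
pipe = Pipeline(X -> (log_x1 = log.(X.x1), x2 = X.x2), Standardizer())

# to log-transform the *target* of a supervised model instead (and invert predictions),
# wrap it in TransformedTargetModel; some_regressor is a placeholder and the keyword
# names should be checked against the docstring:
# model = TransformedTargetModel(some_regressor; transformer = y -> log.(y), inverse = y -> exp.(y))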

ablaom (Member) commented Oct 10, 2022

Thanks @indymnv. That's most helpful. PRs for missing items welcome 😉
