Syntax for feature engineering #314

Open · tlienart opened this issue Oct 3, 2020 · 4 comments

Comments

tlienart (Collaborator) commented Oct 3, 2020

I stumbled upon https://github.com/matthieugomez/PairsMacros.jl today and it seems to be close to what we discussed with @vollmersj with respect to defining new columns with a formula-like syntax.

@matthieugomez sorry to ping you here, but would you be interested in something like PairsMacros for general-purpose feature engineering that works with MLJ?
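(For anyone following along: the kind of "formula-like" column definition being discussed is, roughly, a shorthand for the source => function => destination pair minilanguage already used by DataFrames. A rough sketch, with made-up column names, of what such a macro would abbreviate:)

using DataFrames

df = DataFrame(height = [1.70, 1.80, 1.65], weight = [70.0, 85.0, 60.0])

# the pair minilanguage: source columns => function => name of the new column
transform(df, [:weight, :height] => ByRow((w, h) -> w / h^2) => :bmi)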

AriMKatz commented Dec 4, 2020

There's also this: https://github.com/joshday/Telperion.jl

ablaom (Member) commented Oct 6, 2022

Continuing the discussion started by @indymnv at JuliaAI/MLJ.jl#970:

Existing MLJ transformers are documented here with the exception of InteractionTransformer, which was recently added to MLJModels, but is not documented or re-exported yet by MLJ.jl. Here's the list:

julia> using MLJModels

julia> models() do m
       m.package_name == "MLJModels" &&
       !m.is_supervised
       end
11-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = InteractionTransformer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )
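(For anyone landing on this thread: all of these unsupervised transformers follow the same machine/fit!/transform workflow. A minimal sketch, using OneHotEncoder and a made-up two-column table:)

using MLJ

X = (grade = categorical(["A", "B", "A", "C"]),
     score = [1.0, 2.0, 3.0, 4.0])

hot  = OneHotEncoder()
mach = machine(hot, X) |> fit!
W    = transform(mach, X)   # grade is replaced by indicator columns such as grade__A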

A "fancier" version of InteractionTransformer, based on R type "formulas", has been planned, but no-one has really found the time to work on it.

There is a project in progress to roll out a feature_importance method for models that support it, with the idea of then adding feature selection tools, such as recursive feature elimination.

TableTransforms.jl, referenced by @juliohm, is very active but not yet integrated with MLJ, although we are working towards doing so (likely at least several months off). I think that is a good place to contribute generic table transformers, such as encoders. Some feature engineering tools, such as RFE, will probably not make sense there, as they require a supervised learner.

@indymnv It would be helpful if you could identify specific encoders or other tools you use frequently that are missing from MLJ (or TableTransforms.jl), so they can be prioritised.

indymnv commented Oct 6, 2022

@ablaom Thanks for all the information. In general, in my work with ML I use the following encoders a lot:

  1. For categorical variables:
  • Ordinal Encoding: replaces categories by numbers, arbitrarily or ordered by the target. @ablaom says: done - use coerce from ScientificTypes.jl (see the sketch after this list)
  • Frequency Encoder: replaces categories by the observation count or percentage
  • One-Hot Encoder: done.
  • Grouped tail encoder: groups infrequent categories
  2. For dates and other cyclic variables:
  • Cyclical encoder: creates variables using sine and cosine
  3. For some numerical variables:
  • Equal Frequency Discretiser: sorts variables into equal-frequency intervals. @ablaom says: done - UnivariateDiscretizer (see the sketch after this list)
  • Equal Width Discretiser: sorts variables into equal-width intervals.
  4. Transformations:
  • Logarithm. @ablaom says: done - any kind of ordinary function can be inserted in a pipeline or used in the TransformedTargetModel wrapper (see the sketch at the end of this comment)
  • Box-Cox. @ablaom says: done (with learned exponent) - UnivariateBoxCoxTransformer
  • Yeo-Johnson
  5. Standardization and Normalization. @ablaom says: done - Standardizer
  6. Feature Selection:
  • I use the feature selection built into ML models from scikit-learn, or Boruta.
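(For reference, rough sketches of how a couple of the items above map to Julia code: coerce-based ordinal encoding, equal-frequency discretisation with UnivariateDiscretizer, and a hand-rolled cyclical encoding, which MLJ does not currently provide. The column names, level order and number of classes are made up for illustration.)

using MLJ

# ordinal encoding: coerce the column to OrderedFactor, then take the integer codes
X  = (size = ["small", "large", "medium", "small"],)
Xc = coerce(X, :size => OrderedFactor)
int.(Xc.size)   # 3, 1, 2, 3 as unsigned integer codes; levels default to alphabetical
                # order - use CategoricalArrays.levels! to impose a different one

# equal-frequency discretisation of a numeric vector
x    = rand(1000)
disc = UnivariateDiscretizer(n_classes=4)
mach = machine(disc, x) |> fit!
w    = transform(mach, x)   # ordered categorical with 4 quantile-based classes

# cyclical encoding (not in MLJ): sine/cosine features for, e.g., hour of day
hour   = 0:23
sin_hr = sin.(2π .* hour ./ 24)
cos_hr = cos.(2π .* hour ./ 24)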

For now, in Julia I have only used the One-Hot Encoder; I have not checked the transformations.

[Edit]: For context, I frequently work with linear/logistic regression models, Decision Trees, Random Forests and GBMs.
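(For reference, a rough sketch of the "ordinary function in a pipeline" route mentioned under Transformations above. The column names are made up, and the TransformedTargetModel keyword names shown in the comment are an assumption to be checked against its docstring.)

using MLJ

# a log transform inserted as a plain function at the start of a pipeline
pipe = Pipeline(X -> (log_x1 = log.(X.x1), x2 = X.x2), Standardizer())

# to log-transform the *target* of a supervised model instead (and invert predictions),
# wrap it in TransformedTargetModel; some_regressor is a placeholder and the keyword
# names should be checked against the docstring:
# model = TransformedTargetModel(some_regressor; transformer = y -> log.(y), inverse = y -> exp.(y))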

ablaom (Member) commented Oct 10, 2022

Thanks @indymnv. That's most helpful. PRs for missing items welcome 😉
