Enhance treatment of missing value in one-hot encoder #458

Open
ablaom opened this issue May 4, 2022 · 10 comments
@ablaom (Member) commented May 4, 2022

There is now missing-value handling in OneHotEncoder, but this simply propagates the missing values. It might be nice to offer some other popular options for handling missing values that would be complicated to handle in a post-processing step. See also the discussion here.

@Chandu-4444 @Frank-III @olivierlabayle

@Chandu-4444 (Contributor) commented May 5, 2022

The current implementation corresponds to the all-missing case, which I'd say is the easiest and most straightforward. Other cases, like all-zero and category, can also be implemented, and I guess I can reuse part of my previous commit (link) to incorporate these. A simple modification to it, combined with the current missing-value handling in OneHotEncoder, can enable all the above-mentioned methods.

Any other ideas would be most welcome.

@ablaom (Member, author) commented May 5, 2022

all-zero looks like the simplest. One question for category is how to handle missing values that appear for a feature that did not have missing values in training (fit). Here's a proposal for this:

We introduce a new hyper-parameter features_with_missing, which can be: (i) a vector of feature names, (ii) the symbol :all, or (iii) the symbol :auto. When specified as a vector, the listed features always get the extra missing category, regardless of whether missing values appear in the input to transform. If features_with_missing == :auto, then the actual list used is inferred from the training data: a feature is on the list if missing appears for that feature in the training data. If features_with_missing == :all, then every feature gets the extra missing category.

In transform, if missing appears for a feature not on the list, then an informative error is thrown, explaining that the problem can be corrected by retraining with features_with_missing explicitly and appropriately specified.

The default could be :all or :auto. Maybe :auto is okay. It might lead to a surprise for the user who never reads documentation, but the error message explains what to do.
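The :auto / :all / vector resolution described above could be sketched roughly as follows in plain Julia, with named tuples of vectors standing in for tables; resolve_features_with_missing is a hypothetical helper, not existing MLJModels code:

```julia
# Hypothetical sketch of resolving the proposed `features_with_missing`
# hyper-parameter against training data. Not existing MLJModels API.
function resolve_features_with_missing(features_with_missing, X)
    names = collect(keys(X))
    if features_with_missing === :all
        return names                       # every feature gets the extra category
    elseif features_with_missing === :auto
        # a feature is on the list iff `missing` appears in the training data
        return [n for n in names if any(ismissing, X[n])]
    else
        return collect(features_with_missing)  # user-specified vector of names
    end
end

X = (a = ["x", missing, "y"], b = ["p", "q", "p"])
resolve_features_with_missing(:auto, X)   # → [:a]
resolve_features_with_missing(:all, X)    # → [:a, :b]
```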

We will also need a hyper-parameter to specify the kind of missing-value handling: :propagate, :all_zero, or :category. Name suggestion: handle_missing (for consistency with sk-learn). Default: :propagate. If handle_missing is not :category, and features_with_missing is not its default value, then clean! should issue a warning that features_with_missing is being ignored. Or we could combine the two new hyper-parameters into one somehow, although I'm not sure how to do that without creating cognitive dissonance.
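The consistency check just proposed might look something like this, written as a pure function for clarity; the real clean! in MLJModels mutates the model and returns a warning string, so check_hyperparameters is only a hypothetical stand-in:

```julia
# Hypothetical sketch of the proposed clean!-style check: if the missing
# handling is not :category, a non-default `features_with_missing` is
# ignored (reset to the assumed default :auto) with a warning.
function check_hyperparameters(handle_missing, features_with_missing)
    warning = ""
    if handle_missing != :category && features_with_missing !== :auto
        warning = "Ignoring `features_with_missing`, which is relevant only " *
                  "when `handle_missing == :category`. Resetting to `:auto`. "
        features_with_missing = :auto   # assumed default value
    end
    return handle_missing, features_with_missing, warning
end
```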


I wonder how this is handled elsewhere. Of course, one-hot encoding is often implemented as a "static" transformer (no separate training step), in which case this doesn't come up. That is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output each time transform is called. That is, by training just once, you can arrange that the number of spawned features does not depend on whether there are, or are not, missing values in a particular field to be transformed. Otherwise, downstream operations expecting a certain number of features might fail unexpectedly.

Anyone have a different suggestion?


Probably good to introduce the two options in separate PRs, starting with the easier all-zero case.

@Chandu-4444 (Contributor) commented May 6, 2022

This page relates to a few of the things said by @ablaom.

@Chandu-4444 (Contributor) commented:

Is this how the output should look for the minimal all-zero case?

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(missing_handling = "all-zero")

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 name__missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

@ablaom (Member, author) commented May 8, 2022

No, rather it's the same as the current behaviour, except that instead of missings we use zeros. You don't need to spawn an extra column in this case:

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(handle_missing = :all_zero)

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
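The behaviour just described can be sketched in a few lines of plain Julia (no MLJ, no CategoricalArrays, no fit/transform split; onehot_all_zero is a hypothetical helper, with levels taken directly from the data):

```julia
# Minimal sketch of :all_zero one-hot encoding: a missing entry contributes
# a zero to every spawned column, and no extra column is created.
function onehot_all_zero(v)
    levels = sort(unique(skipmissing(v)))
    return Dict(l => [ismissing(x) ? 0.0 : Float64(x == l) for x in v]
                for l in levels)
end

X = ["a", "b", "c", "a", "b", missing]
cols = onehot_all_zero(X)
cols["a"]  # → [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
cols["c"]  # → [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```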

However, note that this means we cannot have drop_last=true in this case, because then we can't distinguish missing from the last class. So clean! needs to check this: if handle_missing == :all_zero and drop_last == true, I suggest clean! change drop_last to false and issue a warning.
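A tiny illustration of the clash, with plain named tuples standing in for encoded rows (levels a, b, c, last column dropped):

```julia
# With drop_last=true, the level "c" is encoded as all zeros in the
# remaining columns, which is exactly what :all_zero would emit for missing.
row_c       = (a = 0.0, b = 0.0)   # the level "c" under drop_last=true
row_missing = (a = 0.0, b = 0.0)   # a missing entry under :all_zero
row_c == row_missing               # → true: the two cases are indistinguishable
```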

Also:

  • let's use the name handle_missing for consistency with sk-learn
  • let's use symbols for its values, not strings

@olivierlabayle (Collaborator) commented:

Thank you for adding support for propagating missing values! I think I have identified a bug when the first value in a vector is missing:

using MLJModels, CategoricalArrays, MLJBase
X = (x=categorical([missing, 1, 2, 1]),)
t = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)

This is due to this line. I think replacing it with classes(col) should work?

@ablaom (Member, author) commented Aug 1, 2022

Yes, great catch, that's a bug: #467

Are you willing and able to make a PR with a test?

@olivierlabayle (Collaborator) commented:

I can give it a try if it's as easy as my suggestion. Can you grant me access to the repo?

@ablaom (Member, author) commented Aug 2, 2022

Done. You have an invitation to accept.

@olivierlabayle (Collaborator) commented:

#468
