Enhance treatment of missing value in one-hot encoder #458

Open
ablaom opened this issue May 4, 2022 · 10 comments
@ablaom (Member) commented May 4, 2022

There is now missing-value handling in OneHotEncoder, but this simply propagates the missing values. It might be nice to offer some other popular options for handling missing values that would be complicated to handle in a post-processing step. See also the discussion here.

@Chandu-4444 @Frank-III @olivierlabayle

@Chandu-4444 (Contributor) commented May 5, 2022

The current implementation corresponds to the all-missing case, which I'd say is the easiest and most straightforward. Other cases, like all-zero and category, can also be implemented, and I guess I can reuse part of my previous commit (link) to incorporate these. A simple modification to it, combined with the current missing-value handling in OneHotEncoder, can enable all the above-mentioned methods.

Any other ideas would be most welcome.

@ablaom (Member, author) commented May 5, 2022

all-zero looks like the simplest. One question for category is how to handle missing values that appear for a feature that did not have missing values in training (fit). Here's a proposal for this:

We introduce a new hyper-parameter features_with_missing, which can be: (i) a vector of feature names, (ii) the symbol :all, or (iii) the symbol :auto. When specified as a vector, the listed features always get the extra missing category, regardless of whether missing values appear in the input to transform. If features_with_missing == :auto, then the actual list used is inferred from the training data: a feature is on the list if missing appears for that feature in the training data. If features_with_missing == :all, then every feature gets the extra missing category.

In transform, if missing appears for a feature not on the list, then an informative error is thrown, explaining that the problem can be corrected by retraining with features_with_missing explicitly and appropriately specified.

The default could be :all or :auto. Maybe :auto is okay. It might lead to a surprise for the user who never reads documentation, but the error message explains what to do.
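The :auto / :all / vector resolution described above could be sketched roughly as follows in plain Julia, with named tuples of vectors standing in for tables; resolve_features_with_missing is a hypothetical helper, not existing MLJModels code:

```julia
# Hypothetical sketch of resolving the proposed `features_with_missing`
# hyper-parameter against training data. Not existing MLJModels API.
function resolve_features_with_missing(features_with_missing, X)
    names = collect(keys(X))
    if features_with_missing === :all
        return names                       # every feature gets the extra category
    elseif features_with_missing === :auto
        # a feature is on the list iff `missing` appears in the training data
        return [n for n in names if any(ismissing, X[n])]
    else
        return collect(features_with_missing)  # user-specified vector of names
    end
end

X = (a = ["x", missing, "y"], b = ["p", "q", "p"])
resolve_features_with_missing(:auto, X)   # → [:a]
resolve_features_with_missing(:all, X)    # → [:a, :b]
```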

We will also need a hyper-parameter to specify the kind of missing-value handling: :propagate, :all_zero, or :category. Name suggestion: handle_missing (for consistency with sk-learn). Default: :propagate. If handle_missing is not :category, and features_with_missing is not its default value, then clean! should issue a warning that features_with_missing is being ignored. Or we could combine the two new hyper-parameters into one somehow, although I'm not sure how to do that without creating cognitive dissonance.
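The consistency check just proposed might look something like this, written as a pure function for clarity; the real clean! in MLJModels mutates the model and returns a warning string, so check_hyperparameters is only a hypothetical stand-in:

```julia
# Hypothetical sketch of the proposed clean!-style check: if the missing
# handling is not :category, a non-default `features_with_missing` is
# ignored (reset to the assumed default :auto) with a warning.
function check_hyperparameters(handle_missing, features_with_missing)
    warning = ""
    if handle_missing != :category && features_with_missing !== :auto
        warning = "Ignoring `features_with_missing`, which is relevant only " *
                  "when `handle_missing == :category`. Resetting to `:auto`. "
        features_with_missing = :auto   # assumed default value
    end
    return handle_missing, features_with_missing, warning
end
```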


I wonder how this is handled elsewhere. Of course, one-hot encoding is often implemented as a "static" transformer (no separate training step), in which case this doesn't come up. That is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output each time transform is called. That is, by training just once, you can arrange that the number of spawned features does not depend on whether there are, or are not, missing values in a particular field to be transformed. Otherwise, downstream operations expecting a certain number of features might fail unexpectedly.

Anyone have a different suggestion?


Probably good to introduce the two options in separate PRs, starting with the easier all-zero case.

@Chandu-4444 (Contributor) commented May 6, 2022

This page relates to a few of the things said by @ablaom.

@Chandu-4444 (Contributor) commented:

Is this how the output should look for the minimal all-zero case?

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(missing_handling = "all-zero")

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 name__missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

@ablaom (Member, author) commented May 8, 2022

No, rather it's the same as the current behaviour, except that instead of missings we use zeros. You don't need to spawn an extra column in this case:

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(handle_missing = :all_zero)

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
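The behaviour just described can be sketched in a few lines of plain Julia (no MLJ, no CategoricalArrays, no fit/transform split; onehot_all_zero is a hypothetical helper, with levels taken directly from the data):

```julia
# Minimal sketch of :all_zero one-hot encoding: a missing entry contributes
# a zero to every spawned column, and no extra column is created.
function onehot_all_zero(v)
    levels = sort(unique(skipmissing(v)))
    return Dict(l => [ismissing(x) ? 0.0 : Float64(x == l) for x in v]
                for l in levels)
end

X = ["a", "b", "c", "a", "b", missing]
cols = onehot_all_zero(X)
cols["a"]  # → [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
cols["c"]  # → [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```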

However, note that this means we cannot have drop_last=true in this case, because then we can't distinguish missing from the last class. So clean! needs to check this: if handle_missing == :all_zero and drop_last == true, I suggest clean! change drop_last to false and issue a warning.
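A tiny illustration of the clash, with plain named tuples standing in for encoded rows (levels a, b, c, last column dropped):

```julia
# With drop_last=true, the level "c" is encoded as all zeros in the
# remaining columns, which is exactly what :all_zero would emit for missing.
row_c       = (a = 0.0, b = 0.0)   # the level "c" under drop_last=true
row_missing = (a = 0.0, b = 0.0)   # a missing entry under :all_zero
row_c == row_missing               # → true: the two cases are indistinguishable
```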

Also:

  • let's use the name handle_missing for consistency with sk-learn
  • let's use symbols for its values, not strings

@olivierlabayle (Collaborator) commented:

Thank you for adding support for propagating missing values! I think I have identified a bug when the first value in a vector is missing:

using MLJModels, CategoricalArrays, MLJBase
X = (x=categorical([missing, 1, 2, 1]),)
t = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)

This is due to this line. I think replacing it with classes(col) should work?

@ablaom (Member, author) commented Aug 1, 2022

Yes, great catch, that's a bug: #467

Are you willing and able to make a PR with a test?

@olivierlabayle (Collaborator) commented:

I can give it a try if it's as easy as my suggestion. Can you grant me access to the repo?

@ablaom (Member, author) commented Aug 2, 2022

Done. You have an invitation to accept.

@olivierlabayle (Collaborator) commented:

#468
