Enhance treatment of missing value in one-hot encoder #458
Comments
The current implementation comes under the Any other ideas would be most welcome. |
We introduce a new hyper-parameter In transform, if The default could be We will also need a hyper-parameter to specify the kind of missing handling - I wonder how this is handled elsewhere. Of course, one-hot encoding is often implemented as a "static" transformer (no separate training step) and this doesn't come up. This is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output, each time Anyone have a different suggestion? Probably good to introduce the two options in separate PRs, starting with the easiest |
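The point about a consistent number of spawned features can be illustrated with a small sketch in Base Julia. It is not the MLJModels implementation: `FittedOneHot`, `fit_onehot` and `transform_onehot` are hypothetical names, and the column naming only imitates the `name__a` style seen elsewhere in this thread. The levels are recorded in a separate training step, so the output always has one column per recorded level, even when new data lacks some of them.

```julia
# Hypothetical names throughout; Base Julia only, not the MLJModels code.
struct FittedOneHot
    name::Symbol
    levels::Vector{String}
end

# "Training" step: record the levels seen, ignoring missing values.
fit_onehot(name::Symbol, col) =
    FittedOneHot(name, sort(unique(skipmissing(col))))

# Transform step: always emit one column per recorded level.
# Missing entries are propagated, as the thread describes for the current encoder.
function transform_onehot(enc::FittedOneHot, col)
    colnames = Tuple(Symbol(enc.name, "__", l) for l in enc.levels)
    columns = Tuple(
        [ismissing(x) ? missing : Float64(x == l) for x in col]
        for l in enc.levels
    )
    return NamedTuple{colnames}(columns)
end

train = ["a", "b", "c", "a", missing]
enc = fit_onehot(:name, train)

# New data without the level "c" still yields three columns.
newdata = ["b", "b", "a"]
out = transform_onehot(enc, newdata)
```

A "static" encoder would instead derive the columns from `newdata` alone and produce only two of them here, which is the inconsistency the comment argues against.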
Is this how the output should be for the minimal
julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)
julia> enc = OneHotEncoder(missing_handling = "all-zero")
# After some steps ...
(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
name_missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
|
No, rather it's the same as the current behaviour, except instead of
However, note that this means we cannot have Also:
|
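On one reading of the reply above, the "all-zero" option spawns the same columns as the propagating behaviour and differs only in what is written for a missing entry. The following Base Julia sketch shows that reading side by side. `onehot_sketch` is a hypothetical helper, the `"propagate"` label is an assumption, and `missing_handling = "all-zero"` is taken from the question above rather than from any released API.

```julia
# Hypothetical helper, Base Julia only; not the MLJModels implementation.
function onehot_sketch(name::Symbol, col::AbstractVector, lvls::AbstractVector;
                       missing_handling::AbstractString = "propagate")
    missing_handling in ("propagate", "all-zero") ||
        throw(ArgumentError("unsupported missing_handling: $missing_handling"))
    # Value written into every spawned column for a missing entry.
    fill_value = missing_handling == "all-zero" ? 0.0 : missing
    colnames = Tuple(Symbol(name, "__", l) for l in lvls)
    columns = Tuple(
        [ismissing(x) ? fill_value : Float64(x == l) for x in col]
        for l in lvls
    )
    return NamedTuple{colnames}(columns)
end

col = ["a", "b", "c", "a", "b", missing]
lvls = ["a", "b", "c"]  # assumed to have been fixed at fit time

zeroed = onehot_sketch(:name, col, lvls; missing_handling = "all-zero")
propagated = onehot_sketch(:name, col, lvls)
```

Under this reading neither option adds a separate column for missing values; the last row is all zeros in `zeroed` and all `missing` in `propagated`.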
Thank you for adding the support for propagating missing values! I think I have identified a bug if the first value in a vector is missing:
using MLJModels, CategoricalArrays, MLJBase
X = (x=categorical([missing, 1, 2, 1]),)
t = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)
This is due to this line. I think replacing by |
Yes, great catch, that's a bug: #467 Are you willing and able to make a PR with a test? |
I can give it a try if it's as easy as my suggestion. Can you grant me access to the repo? |
Done. You have an invitation to accept. |
There is now missing value handling in OneHotEncoder, but this simply propagates the missing values. I guess it might be nice to offer some other popular options for handling missing values which might be complicated to handle in a post-processing step. See also the discussion here. @Chandu-4444 @Frank-III @olivierlabayle