MultinomialNBClassifier not available. #753

Closed · f0lie opened this issue Mar 21, 2021 · 6 comments

f0lie commented Mar 21, 2021

Describe the bug
Some classifiers are not showing up for some reason.

To Reproduce
The car data is from here: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

using MLJ  # provides `coerce` and `schema` (re-exported from ScientificTypes)

# `cars` is the UCI Car Evaluation data loaded as a table (e.g. a DataFrame)
cars_data = coerce(cars, :buying=>OrderedFactor,
                         :maint=>OrderedFactor,
                         :doors=>OrderedFactor,
                         :persons=>OrderedFactor,
                         :lug_boot=>OrderedFactor,
                         :safety=>OrderedFactor,
                         :class=>OrderedFactor)
schema(cars_data)


┌──────────┬─────────────────────────────────┬──────────────────┐
│ _.names  │ _.types                         │ _.scitypes       │
├──────────┼─────────────────────────────────┼──────────────────┤
│ buying   │ CategoricalValue{String,UInt32} │ OrderedFactor{4} │
│ maint    │ CategoricalValue{String,UInt32} │ OrderedFactor{4} │
│ doors    │ CategoricalValue{String,UInt32} │ OrderedFactor{4} │
│ persons  │ CategoricalValue{String,UInt32} │ OrderedFactor{3} │
│ lug_boot │ CategoricalValue{String,UInt32} │ OrderedFactor{3} │
│ safety   │ CategoricalValue{String,UInt32} │ OrderedFactor{3} │
│ class    │ CategoricalValue{String,UInt32} │ OrderedFactor{4} │
└──────────┴─────────────────────────────────┴──────────────────┘
models(matching(cars_data[:, Not(:class)], cars_data[:,:class]))


7-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeClassifier, package_name = DecisionTree, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )
 (name = RandomForestClassifier, package_name = DecisionTree, ... )

Expected behavior
Clearly MultinomialNBClassifier is supposed to be here. There is probably some way to use OneHotEncoder or something to transform the data so that it works, but it's impossible for me to figure that out from the documentation.

Versions
The latest version in the pkg.


f0lie commented Mar 21, 2021

I figured it out: MultinomialNBClassifier needs integers passed to it, so the solution is to convert everything with int.(data).
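
For anyone else hitting this, a minimal sketch of what I mean, using the cars_data table from above (details are my reconstruction, not something tested in this thread):

# int (exported by MLJ) maps a CategoricalValue to its integer code, so the
# features get the Count scitype that MultinomialNBClassifier asks for,
# while the target is left as an OrderedFactor.
X = int.(cars_data[:, Not(:class)])   # broadcasts over the DataFrame
y = cars_data[:, :class]
models(matching(X, y))                # MultinomialNBClassifier should now show up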


f0lie commented Mar 22, 2021

It would be nice to have documentation that spells out how to deal with typing issues like this. I am not sure whether this calls for a documentation improvement or a clearer warning message. Querying models is very useful, but it's challenging to figure out why something isn't hooking up as it should.

f0lie closed this as completed Mar 22, 2021

ablaom commented Mar 22, 2021

Thanks for reporting anyway. Yes, your data needs to have the Count scitype, which you can see from here:

julia> info("MultinomialNBClassifier", pkg="ScikitLearn").input_scitype
Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:Count)

Do you have a specific suggestion for how to improve the documentation? The MLJ documentation already has this section. The packages providing the models (ScikitLearn.jl and NaiveBayes.jl) have their own documentation, but we don't really have any control over that.

If you try to use the model with data of the wrong type, you do get an informative message:

using MLJ  # assumes MLJ and the ScikitLearn model interface are installed

X, y = make_moons()
model = (@load MultinomialNBClassifier pkg=ScikitLearn)()

julia> machine(model, X, y)
┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=MultinomialNBClassifier @121`:
│ scitype(X) = Table{AbstractArray{Continuous,1}}
│ input_scitype(model) = Table{var"#s45"} where var"#s45"<:(AbstractArray{var"#s13",1} where var"#s13"<:Count).
└ @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/machines.jl:91
Machine{MultinomialNBClassifier,} @789 trained 0 times; caches data
  args: 
    1:  Source @615 `Table{AbstractArray{Continuous,1}}`
    2:  Source @897 `AbstractArray{Multiclass{2},1}`

Can you think of a way to improve this?


f0lie commented Mar 24, 2021

Can you think of a way to improve this?

Documentation is a hard thing to get right, because you have to consider the skill level and familiarity of the user to pitch the information at the right level. At the start I struggled to understand the error messages, but once I realized what they were trying to tell me they did make sense. This is my first time using Julia as well as machine learning, so I had to learn to read Julia type errors and machine learning concepts at the same time.

┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=MultinomialNBClassifier @121`:
│ scitype(X) = Table{AbstractArray{Continuous,1}}
│ input_scitype(model) = Table{var"#s45"} where var"#s45"<:(AbstractArray{var"#s13",1} where var"#s13"<:Count).
└ @ MLJBase ~/.julia/packages/ML

This error message is informative if you know how to read it, but that's not something a new user like myself can do quickly without some pain. Take `input_scitype(model) = Table{var"#s45"} where var"#s45"<:(AbstractArray{var"#s13",1} where var"#s13"<:Count)` for example. Now I know it's telling me the model expects the Count scitype, but there is a lot of extra information in there, like var"#s45". It's the same problem as with C++ templates: you get a lot of output that's technically correct but takes time to read through.
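
One thing that helped me cut through the noise (a small sketch of my own, not something the docs showed me) was comparing the two scitypes directly instead of reading the whole warning:

julia> scitype(X)                           # what my table actually is
julia> input_scitype(model)                 # what the model wants
julia> scitype(X) <: input_scitype(model)   # false here, hence the warning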

For more established ecosystems like Python and sklearn, you can google an error and find a nice StackOverflow answer explaining it and how to fix it. But MLJ and Julia are much newer and there aren't many questions yet.

https://stackoverflow.com/search?q=MLJ+Julia
https://stackoverflow.com/search?q=Python+Sklearn

For "Julia MLJ" I get 4 results. For "Python Sklearn", I get 4,300 results. StackOverflowing error messages make a huge difference for beginners. I am not sure if there anything that Julia authors can do other than write documentation and hope for the best though.

But one thing that can be done is a clearer overview of typing and why it's important to the project. The thing that really helped me figure out how to use MLJ was querying models. It would be nice to have a more complete version of "working with categorical data" that takes a dataset and transforms it for use with different models. Right now the individual pieces are described, but there is no end-to-end guide, and it's hard to figure out how to string the functionality together when you aren't familiar with the library. For example, if you want to use a neural network with categorical data, you have to transform the data into continuous types, and the documentation doesn't make it clear how or why to do that.
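
To make that concrete, here is the kind of snippet I was hoping to find in the guide (a sketch only; ContinuousEncoder is my choice of transformer and may not be what the guide would recommend):

# ContinuousEncoder one-hot encodes Multiclass features and converts
# OrderedFactor/Count features to Continuous, which is the input most
# neural-network models (e.g. from MLJFlux) expect.
encoder = machine(ContinuousEncoder(), cars_data[:, Not(:class)])
fit!(encoder)
X_continuous = transform(encoder, cars_data[:, Not(:class)])
schema(X_continuous)   # columns should now all be Continuous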

A second major issue is that measures are basically undocumented. There is no explanation of how to use them or of what to do when things aren't working. For example, with something like a neural network you can't compute accuracy directly, because it's a probabilistic model; you get a very cryptic error message telling you that, but the documentation says nothing about how to fix it.

After combing through the documentation and guides, I eventually found a little piece of code showing that you can just call predict_mode and pass the result to the evaluation.
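
For reference, the fix as I understand it (a sketch with placeholder model and data; the operation keyword is my reading of the evaluate API, not something quoted in this thread):

# `model`, `X`, `y` stand for any probabilistic classifier and its data.
mach = machine(model, X, y)
fit!(mach)
yhat = predict_mode(mach, X)    # point predictions instead of distributions
accuracy(yhat, y)

# or have evaluate apply predict_mode for you:
evaluate(model, X, y, resampling=CV(nfolds=5),
         measure=accuracy, operation=predict_mode)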

There isn't much guidance on how to do the same kind of querying for measures instead of models. The documentation has the nice matching function to find all the models that work with your data, but I couldn't find its counterpart for measures. Maybe there is a feature where you pass a machine to measures and it spits out which measures you can use, but I didn't find it in the documentation. I got around this by looking at things manually with info and piecing the scitypes together.
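
Roughly, that manual workaround looked like this (a sketch; the trait names are from memory, so treat them as assumptions):

julia> info(accuracy)   # shows target_scitype, prediction_type, ...
julia> scitype(y)       # compare against the measure's target_scitype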

Overall, I think a page explaining the design and how MLJ uses scitypes to wire everything together would clear a lot up. Everything is designed to plug into everything else, but there isn't much help when things don't, or when you don't understand why they don't. That page could include more robust examples of using queries to find the right models and of transforming data to fit other models.


ablaom commented Mar 25, 2021

@f0lie thanks indeed for taking the time to give such detailed feedback. Very much appreciated. Creating a link to it here.

Documentation is a hard thing to get right.

Indeed. And one would often want to devote finite resources elsewhere 😄

There isn't much guidance on how to do the same querying on measures instead of models.

Good point.

Maybe there is a feature to pass a machine to measures

Good idea! JuliaAI/MLJBase.jl#529

FYI, I think some of the early "data ingestion" stuff is covered in this workshop: https://github.com/ablaom/MachineLearningInJulia2020 and at https://alan-turing-institute.github.io/DataScienceTutorials.jl/. But I will think about how to include this better in the manual.

Again many thanks.


ablaom commented Mar 25, 2021

And generally a lot of Julia questions get posted and answered on Julia Discourse: https://discourse.julialang.org
