Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samples no longer work #33

Open
alunap opened this issue Sep 17, 2024 · 5 comments
Open

Samples no longer work #33

alunap opened this issue Sep 17, 2024 · 5 comments

Comments

@alunap
Copy link

alunap commented Sep 17, 2024

If I run the example code (any of them) I get a failure.

using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, tokenize.(docs))
MLJ.fit!(mach)

tfidf_mat = transform(mach, tokenize.(docs))

will fail:

ERROR: MethodError: no method matching tokenize(::String)

Closest candidates are:
  tokenize(::Type{S}, ::T) where {S<:Languages.Language, T<:AbstractString}
   @ TextAnalysis ~/.julia/packages/TextAnalysis/bdhyq/src/deprecations.jl:4
  tokenize(::S, ::T) where {S<:Languages.Language, T<:AbstractString}
   @ TextAnalysis ~/.julia/packages/TextAnalysis/bdhyq/src/tokenizer.jl:19

Stacktrace:
 [1] _broadcast_getindex_evalf
   @ ./broadcast.jl:709 [inlined]
 [2] _broadcast_getindex
   @ ./broadcast.jl:682 [inlined]
 [3] getindex
   @ ./broadcast.jl:636 [inlined]
 [4] copy
   @ ./broadcast.jl:942 [inlined]
 [5] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(tokenize), Tuple{Vector{String}}})
   @ Base.Broadcast ./broadcast.jl:903
 [6] top-level scope
   @ REPL[11]:1
@ablaom
Copy link
Member

ablaom commented Sep 18, 2024

Thanks for posting.

The tokenize method is owned by TextAnalysis, and I see that TextAnalysis 0.8 no longer supports the version dispatching on strings. One workaround is to use TextAnalysis 0.7.5.

Posted a query here about what the correct method is now.

@pazzo83
Copy link
Collaborator

pazzo83 commented Sep 18, 2024

It looks like it requires a "language" to be set, but it's not actually used:
https://github.com/JuliaText/TextAnalysis.jl/blame/master/src/tokenizer.jl

I think one thing we can do is just use the tokenization from WordTokenzers directly. I can look at making this change.

@ablaom
Copy link
Member

ablaom commented Sep 18, 2024

Thanks @pazzo83 for looking into this. This doesn't appear to affect MLJText tests, which are still passing (the compat for TextAnaysis includes the latest 0.8).

So maybe all that's needed is to update the docstrings.

@ablaom
Copy link
Member

ablaom commented Oct 8, 2024

@pazzo83 Any chance of a PR to fix this?

@pazzo83
Copy link
Collaborator

pazzo83 commented Oct 26, 2024

Sorry for the delay! I realized that we only need to update the Readme (since this is the code that is being referred to). I will be doing that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants