Keep multi-words in sparknlp.annotator.Tokenizer together #9021
a-kliuieva asked this question in Q&A (unanswered)
I want to extract keywords using `sparknlp.annotator.YakeKeywordExtraction`, but first I need to tokenize my text. My Spark df looks something like this:
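For illustration, a minimal stand-in for the data (the text itself is made up; only the shape matters):

```python
import sparknlp

spark = sparknlp.start()

# Stand-in dataframe: one text column containing multi-word phrases
# ('solar system', 'cosmic rays', 'milky way') that should survive
# tokenization as single tokens.
df = spark.createDataFrame(
    [(1, "Cosmic rays from the milky way reach the solar system."),
     (2, "The solar system orbits the center of the milky way.")],
    ["doc_id", "text"],
)
```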
After applying `sparknlp.annotator.Tokenizer` I need to keep all multi-words (like 'solar system', 'cosmic rays', 'milky way', etc.) together as a single token. If I use the following pipeline, my multi-words are broken into separate tokens:
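A minimal sketch of the pipeline, assuming the standard DocumentAssembler → Tokenizer chain (column names are illustrative):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
tokenized = pipeline.fit(df).transform(df)
# 'solar system' comes out as two tokens, 'solar' and 'system'.
```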
If I add `.setExceptions([" "])` to the `Tokenizer()`, then I get my entire string as one token (that is also wrong). So I tried a different approach: I modified my dataframe so that each phrase is a new row:
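Roughly like this (a sketch; the original document id is kept so the tokens can be grouped back later):

```python
# One phrase per row, tagged with the id of the document it came from.
phrases_df = spark.createDataFrame(
    [(1, "cosmic rays"), (1, "milky way"), (1, "solar system"),
     (2, "solar system"), (2, "milky way")],
    ["doc_id", "text"],
)
```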
Then I applied the following pipeline:
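A minimal sketch of the idea (this is a reconstruction, not the exact code: since each row holds exactly one phrase, one TOKEN-shaped annotation is built per row by hand to match Spark NLP's Annotation schema, then the rows are grouped back per document):

```python
from pyspark.sql import functions as F

# Build one TOKEN-shaped annotation per row (each row is a single
# phrase), matching Spark NLP's Annotation struct fields.
one_token = phrases_df.select(
    "doc_id",
    F.expr(
        "array(struct("
        "'token' AS annotatorType, "
        "0 AS begin, "
        "length(text) - 1 AS `end`, "
        "text AS result, "
        "map('sentence', '0') AS metadata, "
        "array(cast(0.0 AS float)) AS embeddings))"
    ).alias("token"),
)

# Collect the per-phrase annotations back into one row per document.
grouped = one_token.groupBy("doc_id") \
    .agg(F.flatten(F.collect_list("token")).alias("token"))
```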
In this case, the multi-words are not split during tokenization and stay together. However, when I apply `YakeKeywordExtraction`, I get the following error:

```
IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in YakeKeywordExtraction_02d8c88211de.
```
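For reference, a sketch of how the Yake stage is wired up, assuming the usual column names from above (Yake consumes TOKEN annotations):

```python
from sparknlp.annotator import YakeKeywordExtraction

yake = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords")

# Works on the output of the first pipeline...
keywords = Pipeline(stages=[document_assembler, tokenizer, yake]) \
    .fit(df).transform(df)

# ...but fails with the error above when given the hand-grouped
# 'token' column, even though the rows look structurally identical.
```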
I compared the structure of the output after the usual tokenization (first approach) with the one I get by grouping separate tokens: they are completely identical except for the `begin` and `end` values, so I don't understand what is wrong. If there is a way to keep multi-words together during tokenization, I'll be very grateful for any recommendations!