Keep multi-words in sparknlp.annotator.Tokenizer together #9021
a-kliuieva asked this question in Q&A (unanswered)
I want to extract keywords using `sparknlp.annotator.YakeKeywordExtraction`, but first I need to tokenize my text. My Spark df looks something like this:
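For illustration, a minimal stand-in for the data (the text itself is made up; only the shape matters):

```python
import sparknlp

spark = sparknlp.start()

# Stand-in dataframe: one text column containing multi-word phrases
# ('solar system', 'cosmic rays', 'milky way') that should survive
# tokenization as single tokens.
df = spark.createDataFrame(
    [(1, "Cosmic rays from the milky way reach the solar system."),
     (2, "The solar system orbits the center of the milky way.")],
    ["doc_id", "text"],
)
```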
After applying `sparknlp.annotator.Tokenizer` I need to keep all multi-words (like 'solar system', 'cosmic rays', 'milky way', etc.) together as a single token. If I use the following pipeline, my multi-words are broken into separate tokens:
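A minimal sketch of the pipeline, assuming the standard DocumentAssembler → Tokenizer chain (column names are illustrative):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
tokenized = pipeline.fit(df).transform(df)
# 'solar system' comes out as two tokens, 'solar' and 'system'.
```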
If I add `.setExceptions([" "])` to the `Tokenizer()`, then I get my entire string as one token (that is also wrong). So I tried a different approach: I modified my dataframe so that each phrase is a new row:
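Roughly like this (a sketch; the original document id is kept so the tokens can be grouped back later):

```python
# One phrase per row, tagged with the id of the document it came from.
phrases_df = spark.createDataFrame(
    [(1, "cosmic rays"), (1, "milky way"), (1, "solar system"),
     (2, "solar system"), (2, "milky way")],
    ["doc_id", "text"],
)
```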
Then I applied the following pipeline:
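A minimal sketch of the idea (this is a reconstruction, not the exact code: since each row holds exactly one phrase, one TOKEN-shaped annotation is built per row by hand to match Spark NLP's Annotation schema, then the rows are grouped back per document):

```python
from pyspark.sql import functions as F

# Build one TOKEN-shaped annotation per row (each row is a single
# phrase), matching Spark NLP's Annotation struct fields.
one_token = phrases_df.select(
    "doc_id",
    F.expr(
        "array(struct("
        "'token' AS annotatorType, "
        "0 AS begin, "
        "length(text) - 1 AS `end`, "
        "text AS result, "
        "map('sentence', '0') AS metadata, "
        "array(cast(0.0 AS float)) AS embeddings))"
    ).alias("token"),
)

# Collect the per-phrase annotations back into one row per document.
grouped = one_token.groupBy("doc_id") \
    .agg(F.flatten(F.collect_list("token")).alias("token"))
```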
In this case, the multi-words are not split during tokenization and stay together. However, when I apply `YakeKeywordExtraction`, I get the following error:

```
IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in YakeKeywordExtraction_02d8c88211de.
```
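For reference, a sketch of how the Yake stage is wired up, assuming the usual column names from above (Yake consumes TOKEN annotations):

```python
from sparknlp.annotator import YakeKeywordExtraction

yake = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords")

# Works on the output of the first pipeline...
keywords = Pipeline(stages=[document_assembler, tokenizer, yake]) \
    .fit(df).transform(df)

# ...but fails with the error above when given the hand-grouped
# 'token' column, even though the rows look structurally identical.
```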
I compared the structure of the output after the usual tokenization (first approach) with the one I get by grouping separate tokens: they are completely identical except for the `begin` and `end` values, so I don't understand what is wrong. If there is a way to keep multi-words together during tokenization, I'll be very grateful for any recommendations!