Encoding issue with non-English text #47

omid-jf · 2020-12-09T04:17:37Z

Passing a non-English Unicode string to preprocessor.clean with the preprocessor.OPT.EMOJI option set returns random, meaningless characters.
This happens only in version 0.6.0.

The cause of this issue seems to be line 50 of preprocess.py.

To reproduce:
import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)
print(p.clean("внесла предложение призвать всех избегать применять незаконные"))

Smolky · 2020-12-13T16:46:44Z

I have found similar behavior in Spanish. Characters with diacritics are removed, but no error is raised.

For example, the text:

# Note the second letter starting from the left
Sí, efectivamente, el Servicio de Vigilancia

It is transformed into:

S, efectivamente, el Servicio de Vigilancia

In my case, this is the code I used:

df['tweet'] = df['tweet'].apply(lambda x: p.clean(x))

The dataframe is read from a CSV file encoded in UTF-8 (without BOM).

cayorodriguez · 2021-01-04T13:22:42Z

I think the offending code, which destroys every character that is not ASCII (that is, anything beyond plain English letters), is:
def preprocess_emojis(self, tweet_string, repl):
    processed = Patterns.EMOJIS_PATTERN.sub(repl, tweet_string)
    return processed.encode('ascii', 'ignore').decode('ascii')
There should be a better way to clean emojis.
And there is:
https://github.com/carpedm20/emoji
Maybe this library should be in charge of demoji-fying, although it stubbornly adds aliases, like: :flexed_biceps:
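
To illustrate the point above, here is a minimal sketch of why the ASCII round-trip removes accented and non-Latin text, next to an approach that only substitutes characters matched by an emoji pattern. The names EMOJI_SKETCH, strip_non_ascii, and remove_emojis_only are hypothetical, and the pattern is a deliberately small assumption covering a few common emoji blocks; it is not the library's Patterns.EMOJIS_PATTERN and not a complete emoji definition.

```python
import re

# Hypothetical, intentionally incomplete emoji pattern for illustration only.
EMOJI_SKETCH = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def strip_non_ascii(text):
    """Mimic the encode/decode round-trip quoted above: all non-ASCII is dropped."""
    return text.encode("ascii", "ignore").decode("ascii")


def remove_emojis_only(text, repl=""):
    """Replace only characters matched by the sketch pattern; other text is kept."""
    return EMOJI_SKETCH.sub(repl, text)


sample = "S\u00ed, efectivamente \U0001F600"  # "Sí, efectivamente" plus a grinning face
print(repr(strip_non_ascii(sample)))      # 'S, efectivamente '
print(repr(remove_emojis_only(sample)))   # 'Sí, efectivamente '
```

With the round-trip removed, the accented letter survives and only the emoji is replaced. Whether a regex like this or a dedicated emoji library is the better fix is a separate design question.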

ashaheedq · 2021-03-14T21:47:23Z

The same issue appears with Arabic text on Windows 10.
If the emoji option is on, it deletes all characters.

sara-02 · 2022-02-18T06:38:14Z

Can confirm the same for Hindi and code-mixed Hindi-English as well.

tsainez · 2022-06-17T01:49:08Z

Also removes all Korean text!

omid-jf added the bug label Dec 9, 2020

omid-jf linked a pull request Mar 29, 2021 that will close this issue

Fix support for non-English texts #49

Open

Sdsmetamask added this to @Sdsmetamask's untitled project Apr 8, 2024

GuilhermeFiorinn added this to Atividade avaliativa 2 (sala de aula) Apr 18, 2024

GuilhermeFiorinn moved this to Backlog in Atividade avaliativa 2 (sala de aula) Apr 18, 2024

GuilhermeFiorinn removed this from Atividade avaliativa 2 (sala de aula) Apr 18, 2024
