Encoding issue with non-English text #47

Open
omid-jf opened this issue Dec 9, 2020 · 5 comments · May be fixed by #49

omid-jf commented Dec 9, 2020

Passing a non-English Unicode string to preprocessor.clean with the preprocessor.OPT.EMOJI option enabled returns random, meaningless characters.
This happens only on version 0.6.0.

The cause seems to be line 50 of preprocess.py.

To reproduce:
import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)
print(p.clean("внесла предложение призвать всех избегать применять незаконные"))

omid-jf added the bug label Dec 9, 2020

Smolky commented Dec 13, 2020

I have found a similar behavior in Spanish. The problem here is that characters with diacritics are removed, but no error is triggered.

For example, the text:

# Note the second letter starting from the left
Sí, efectivamente, el Servicio de Vigilancia

It is transformed into:

S, efectivamente, el Servicio de Vigilancia

In my case, this is the code I used:

df['tweet'] = df['tweet'].apply(lambda x: p.clean(x))

The dataframe is read from a CSV file encoded as UTF-8 (without BOM).


cayorodriguez commented Jan 4, 2021

I think the offending option that destroys everything that is not an English character is:
def preprocess_emojis(self, tweet_string, repl):
    processed = Patterns.EMOJIS_PATTERN.sub(repl, tweet_string)
    return processed.encode('ascii', 'ignore').decode('ascii')
There should be a better way to clean emojis
And there is:
https://github.com/carpedm20/emoji
Maybe this library should be in charge of demoji-fying, although it stubbornly adds aliases, like: :flexed_biceps:
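
To illustrate the point about the ASCII round trip, here is a minimal standard-library sketch. It is not the library's code and not the fix proposed in the linked pull request: `EMOJI_RE`, `ascii_round_trip`, and `strip_emoji_only` are hypothetical names, and the pattern covers only a few common emoji blocks (it misses flags, for example). It only shows that `encode('ascii', 'ignore')` drops every non-ASCII character, while a targeted pattern can leave accented and non-Latin text alone.

```python
import re

# Hypothetical pattern for illustration only; it is NOT the library's
# Patterns.EMOJIS_PATTERN and it is not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "]+"
)


def ascii_round_trip(text):
    """Mimic the last line of preprocess_emojis: drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")


def strip_emoji_only(text, repl=""):
    """Replace only code points matched by EMOJI_RE; other text is untouched."""
    return EMOJI_RE.sub(repl, text)


if __name__ == "__main__":
    sample = "Sí, efectivamente 💪"
    print(repr(ascii_round_trip(sample)))   # the accented letter is lost too
    print(repr(strip_emoji_only(sample)))   # only the emoji is removed
```

Whether a hand-written range like this or a dedicated emoji package is the better choice depends on how complete the emoji coverage needs to be.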


ashaheedq commented Mar 14, 2021

The same issue appears with Arabic text on Windows 10.
If the emoji option is on, it deletes all characters.

omid-jf linked a pull request Mar 29, 2021 that will close this issue

sara-02 commented Feb 18, 2022

Can confirm the same for Hindi and code-mixed Hindi-English as well.


tsainez commented Jun 17, 2022

Also removes all Korean text!
