Encoding issue with non-English text #47
I have found a similar behavior in Spanish. The problem here is that characters with diacritics are removed, but no error is triggered. For example, the text:
It is transformed into:
In my case, this is the code I have used:
The dataframe is read from a CSV file in UTF-8 (without BOM).
I think the offending option that destroys everything that is not an English character is:
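For illustration only, a filter along these lines could produce exactly this symptom: anything outside the ASCII range is deleted silently, with no error raised. This is a hypothetical sketch of the kind of rule that causes the behavior, not a claim about the library's actual source.

import re

# Hypothetical example of an over-broad "cleanup" rule: drop every character
# outside the ASCII range. Diacritics, Cyrillic, Arabic, Devanagari, and
# Hangul all vanish without any error being raised.
ascii_only = re.compile(r"[^\x00-\x7F]+")

print(ascii_only.sub("", "canción"))             # -> "cancin"  (diacritic lost)
print(ascii_only.sub("", "внесла предложение"))  # -> " "       (Cyrillic removed)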
The same issue appears with Arabic text on Windows 10.
Can confirm the same for Hindi and code-mixed Hindi-English as well.
It also removes all Korean text!
A non-English Unicode string passed to preprocessor.clean with the preprocessor.OPT.EMOJI option set returns random, meaningless characters.
This happens only on version 0.6.0.
The cause of this issue seems to be line 50 of preprocess.py.
To reproduce:
import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)
print(p.clean("внесла предложение призвать всех избегать применять незаконные"))
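A possible workaround, assuming the URL and SMILEY options leave non-ASCII text intact (as the reports above suggest): skip the built-in EMOJI option and strip emoji separately with a regex restricted to emoji code-point blocks. The ranges and the clean_keep_unicode helper below are an approximation for illustration, not part of the library.

import re
import preprocessor as p

# Use only the options that do not mangle non-English text.
p.set_options(p.OPT.URL, p.OPT.SMILEY)

# Approximate emoji ranges; these do not overlap Cyrillic, Arabic,
# Devanagari, or Hangul, so non-English text is preserved.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"   # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"   # regional indicator (flag) pairs
    "]+"
)

def clean_keep_unicode(text):
    # Hypothetical helper: run the library's clean first, then remove emoji.
    return EMOJI_RE.sub("", p.clean(text))

print(clean_keep_unicode("внесла предложение призвать всех избегать применять незаконные"))

Because the problematic EMOJI pass is never run, the Cyrillic text survives. Since the problem is reported only on 0.6.0, downgrading to an earlier release is another possible workaround.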