Checking this repo against the Quote-500K Dataset #185
TomLucidor started this conversation in General
I am working with a dataset of quotes (Quote-500K) that is useful for language models, and I need to pick out only the English quotes to keep the data clean. Many of the quotes are in other languages (Spanish, French, Italian, Hindi, Chinese, etc.), and I also noticed that some are buggy (e.g. bad characters, numerals, formatting), which I would like to try cleaning out as well. Here are the ones flagged as non-English (some of them, however, are actually English):
expected_foreign.txt
Questions:
Note: I will drop the "false negative" cases once the data has been handled carefully, as the original file is 98 KB.
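For anyone trying the same cleanup, here is a minimal sketch of the kind of pre-filter described above, using only the Python standard library. It is not this repo's API: the function name `classify_quote` and both thresholds are hypothetical and chosen only for illustration. It separates obviously buggy lines (bad characters, mostly numerals) and non-Latin-script lines (e.g. Hindi, Chinese) from the rest. Latin-script languages such as Spanish, French, or Italian pass straight through, so a real language detector is still needed for those.

```python
import unicodedata


def classify_quote(text: str, max_non_latin: float = 0.2, max_digits: float = 0.3) -> str:
    """Return 'buggy', 'non_latin', or 'candidate_english' for one quote.

    Hypothetical helper; the thresholds are illustrative guesses, not tuned values.
    """
    stripped = text.strip()
    if not stripped:
        return "buggy"

    # Replacement characters, control characters, private-use and unassigned
    # code points usually indicate encoding or formatting damage.
    for ch in stripped:
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            return "buggy"

    letters = [ch for ch in stripped if ch.isalpha()]
    digits = sum(ch.isdigit() for ch in stripped)

    # No letters at all, or mostly numerals: treat as buggy.
    if not letters or digits / len(stripped) > max_digits:
        return "buggy"

    # Share of letters whose Unicode name does not mention LATIN.
    non_latin = sum(1 for ch in letters if "LATIN" not in unicodedata.name(ch, ""))
    if non_latin / len(letters) > max_non_latin:
        return "non_latin"

    # Latin script only: may still be Spanish, French, Italian, etc.
    return "candidate_english"


if __name__ == "__main__":
    samples = [
        "To be, or not to be.",
        "La vida es sueño.",
        "知之为知之",
        "12345 67890",
        "Bad \ufffd char",
    ]
    for quote in samples:
        print(classify_quote(quote), "|", quote)
```

Only the quotes labelled `candidate_english` would then need to go through a language detector, which keeps the script-level and encoding-level noise out of the language-detection results.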