Wiki-based dataset cleaning #5
Thank you for gathering our findings in one place! I just have some questions to make sure that everything will be clear for the person implementing the fix:
Filtering of
The dataset has 8,607,928 entries. Most of the repetitions are of the form "Audio (XX) (file)", where the middle parentheses are optional and case can change, e.g.:
Less frequent repetitions, but still easy to yank out:
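A case-insensitive regex along these lines could catch the "Audio (XX) (file)" repetitions; this is only a sketch, since the actual example strings were linked above rather than quoted, and the pattern and function name are assumptions.

```python
import re

# Sketch: match "Audio (XX) (file)" where the middle parenthesised
# part is optional and case can vary. The exact pattern is an
# assumption based on the description in the thread.
AUDIO_RE = re.compile(r"audio\s*(\(.*?\)\s*)?\(file\)", re.IGNORECASE)


def strip_audio_boilerplate(text: str) -> str:
    """Remove repeated audio-file markers from an entry's text."""
    return AUDIO_RE.sub("", text).strip()
```

For example, `strip_audio_boilerplate("Audio (EN) (file) some text")` keeps only `"some text"`, and the all-caps or no-parentheses variants are caught the same way.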
Working to implement filters based on the
Filtering of
The dataset has 167,398 entries with some repeating boilerplate:

- 5,659 entries (3.4%) have a
- 80,455 entries (48% 😞) have a
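Once the boilerplate strings are identified, a predicate like the sketch below could drop the affected entries; the marker strings and the `text` field name here are placeholders, since the actual boilerplate snippets were linked above rather than quoted.

```python
# Placeholder markers: substitute the real boilerplate strings
# identified in the dataset analysis.
BOILERPLATE_MARKERS = [
    "boilerplate marker 1",
    "boilerplate marker 2",
]


def keep_entry(example: dict) -> bool:
    """Return True if the entry contains none of the known boilerplate."""
    text = example.get("text", "")
    return not any(marker in text for marker in BOILERPLATE_MARKERS)
```

If the corpus is loaded as a Hugging Face `datasets.Dataset`, a predicate like this plugs directly into `dataset.filter(keep_entry)`.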
@SaulLu Should all filters for all wikis go into `clean_helpers/filter_wiki_meta.py`?
It's up to you! 😄 I guess I would have put the other filtering specific to the Wiki dataset in the Python script as well.
Filtering of
The dataset has 83,023 entries. Some are very short. Similar issue as
Actually, all pathologies seem restricted to
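For the very short entries, a simple length threshold could work; the 15-character cutoff below is an assumption for illustration, not a value taken from the analysis.

```python
# Assumed cutoff; tune against the actual length distribution.
MIN_CHARS = 15


def is_long_enough(example: dict) -> bool:
    """Keep entries whose text, once stripped, meets the length threshold."""
    return len(example.get("text", "").strip()) >= MIN_CHARS
```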
Wiki-based datasets that are not Wikipedia have at least the following issues:
Some solutions already mentioned to deal with those:
- filtering on the `type` field in `meta`: keeping only `text` types seems very strong
- filtering on the `title` field in `meta`: allows removing user pages, for example

@SaulLu @cakiki
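The two meta-based filters mentioned above could be sketched as a single predicate; note that the `User:` title prefix for user pages is an assumed convention here, not something confirmed in the thread.

```python
def keep_wiki_entry(example: dict) -> bool:
    """Keep only plain-text entries that are not user pages.

    Assumes each entry carries a `meta` dict with `type` and `title`
    fields, as described above.
    """
    meta = example.get("meta", {})
    # keep only `text` types (this filter is very strong)
    if meta.get("type") != "text":
        return False
    # drop user pages by title prefix ("User:" is an assumed convention)
    if meta.get("title", "").startswith("User:"):
        return False
    return True
```

As with the other predicates, this would slot into `dataset.filter(keep_wiki_entry)` if the corpus is a Hugging Face `datasets.Dataset`.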