Wiki-based dataset cleaning #5

Open
TevenLeScao opened this issue Mar 1, 2022 · 7 comments

@TevenLeScao
Contributor

TevenLeScao commented Mar 1, 2022

Wiki-based datasets other than Wikipedia have at least the following issues:

  • template text being repeated
  • non-article pages (users, categories)
  • some Wikipedia formatting noise (possibly only in the non-article pages, though)

Some solutions already mentioned to deal with those:

  • Looking at the type field in meta: keeping only text types seems very strong
  • Deduplicating to remove templates
  • Looking at the title field in meta: this makes it possible to remove user pages, for example
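
A minimal sketch of the title-based filter, under several assumptions: that `meta` is stored as the string form of a Python dict (as the `eval(meta)` call later in this thread suggests), that it has a `title` key, and that non-article pages carry a namespace prefix in the title. The prefix list and the function names are hypothetical, and non-English wikis localize namespace names, so the list would need adapting per dataset.

```python
import ast

# Hypothetical list of namespace prefixes; adjust per wiki and language.
NON_ARTICLE_PREFIXES = ("User:", "User talk:", "Category:", "Template:", "Talk:")


def is_article(meta_str: str) -> bool:
    """Return True when the title does not start with a non-article prefix."""
    meta = ast.literal_eval(meta_str)  # parses a dict literal without running code
    title = meta.get("title", "")
    return not title.startswith(NON_ARTICLE_PREFIXES)


def keep_articles(examples):
    """Batched predicate: one boolean per entry of examples["meta"]."""
    return [is_article(meta_str) for meta_str in examples["meta"]]
```

The batched shape (a dict of columns in, a list of booleans out) mirrors the one-line filter quoted further down in this thread.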

@SaulLu @cakiki

@SaulLu
Collaborator

SaulLu commented Mar 2, 2022

Thank you for gathering our findings in one place!

I just have some questions to be sure that everything will be clear for the person implementing the fix:

  1. Can we add an example of "template text being repeated"? 😄 I just want to be sure I understand what it covers. Do examples like ← May 29, 2021 May 31, 2021 → May 30 fall into this category?

  2. For "template text being repeated" and "non-article pages (users, categories)" do we want to keep some or do we want to remove everything?

  3. Deduplicating to remove templates

    On this, I also have a question, to make sure we are aligned: what type of deduplication do you have in mind? Deduplication of identical documents (= examples)?

@cakiki
Member

cakiki commented Mar 2, 2022

Filtering of lm_en_wiktionary_filtered.

The dataset has 8,607,928 entries.

Most of the repetitions are of the form "Audio (XX) (file)" where the middle parentheses are optional and case can change.

e.g.: Audio (file) and Audio (AU) (file) and audio (file) etc. (Those occur around 350K times, ~4%)
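
The pattern described above could be matched with a single case-insensitive regular expression. This is only a sketch: the names `AUDIO_RE` and `strip_audio_boilerplate` are made up for illustration, and it assumes the middle parentheses never contain nested parentheses.

```python
import re

# "audio", an optional "(XX)" group, then "(file)"; case-insensitive.
AUDIO_RE = re.compile(r"\baudio\s*(?:\([^()]*\)\s*)?\(file\)", re.IGNORECASE)


def strip_audio_boilerplate(text: str) -> str:
    """Remove 'Audio (XX) (file)' style fragments and tidy leftover spaces."""
    cleaned = AUDIO_RE.sub("", text)
    # Collapse the runs of spaces left behind by the removal.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

Entries that become empty after stripping could then be dropped entirely.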

Less frequent repetitions but still easy to yank out:

  • This entry needs a photograph or drawing for illustration. Please try to find a suitable image on Wikimedia Commons or upload one there yourself!
  • This entry needs pronunciation information. If you are familiar with the IPA then please add some!
  • This entry needs audio files. If you are a native speaker with a microphone, please record some and upload them. (For audio required quickly, visit WT:APR.)
  • This entry is part of the phrasebook project, which presents criteria for inclusion based on utility, simplicity and commonality.
  • This entry needs pronunciation information. If you are familiar with the IPA or enPR then please add some!
  • A user has added this entry to requests for verification(+) If it cannot be verified that this term meets our attestation criteria, it will be deleted. Feel free to edit this entry as normal, but do not remove {{rfv}} until the request has been resolved.
  • A user has added this entry to requests for deletion(+). Please see that page for discussion and justifications. You may continue to edit this entry while the discussion proceeds, but please mention significant edits at the RFD discussion and ensure that the intention of votes already cast is not left unclear. Do not remove the {{rfd}} until the debate has finished.
  • This entry needs quotations to illustrate usage. If you come across any interesting, durably archived quotes then please add them!
  • This entry is part of the phrasebook project, which presents criteria for inclusion based on utility, simplicity and commonality.. Audio (file)
  • This entry needs pronunciation information. If you are familiar with the IPA then please add some!. This entry needs audio files. If you are a native speaker with a microphone, please record some and upload them. (For audio required quickly, visit WT:APR.)
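
One way to surface repeated strings like the ones listed above is to count exact-duplicate entry texts. The sketch below assumes the boilerplate shows up as whole entries (ignoring surrounding whitespace); the function name and its parameters are hypothetical, and this is not necessarily how the list above was produced.

```python
from collections import Counter


def most_repeated_texts(texts, top_k=20, min_count=2):
    """Return (text, count) pairs for the most frequent exact-duplicate entries."""
    counts = Counter(text.strip() for text in texts)
    return [(text, n) for text, n in counts.most_common(top_k) if n >= min_count]
```

The returned strings can be reviewed by hand and then used as a block list for exact-match removal.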

@SaulLu
Collaborator

SaulLu commented Mar 3, 2022

Working to implement filters based on the meta and title fields

@cakiki
Member

cakiki commented Mar 3, 2022

Filtering of lm_ar_wikisource_filtered:

The dataset has 167,398 entries with some repeating boilerplate:

  • | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | (there can be more or fewer numbers)
  • Similar to above but with section names (I don't think this is problematic, very few repetitions of that kind)
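
A sketch of how the numeric rows above could be detected, assuming the whole entry (or line) consists only of pipe-separated numbers. The names `NUMBER_ROW_RE` and `is_number_row` are hypothetical. The section-name variant is deliberately not matched.

```python
import re

# One or more "| <number> " cells followed by a closing pipe.
# For str patterns, \d also matches non-ASCII decimal digits.
NUMBER_ROW_RE = re.compile(r"^\s*(?:\|\s*\d+\s*)+\|\s*$")


def is_number_row(text: str) -> bool:
    """Return True when text is nothing but a row of pipe-separated numbers."""
    return bool(NUMBER_ROW_RE.match(text))
```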

5,659 entries (3.4%) have a content_model of proofread-page, meaning user transcriptions of un-OCR'able PDFs and images. These come with a bit of wiki markup that is sometimes problematic, e.g.:

  • repeated {{dhr}} which is used to transcribe "two or more text rows between paragraphs, or increase space above and below a graphical object", so we could replace every occurrence with \n.
  • styling commands such as {{وسط|{{xxx-larger|'''كربلاء'''}}}} (I'm not sure this is problematic)
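
The `{{dhr}}` replacement suggested above could look like the sketch below. The function name is hypothetical, and the handling of an optional argument (as in `{{dhr|2em}}`) is an assumption about how the template appears in the data, not something confirmed in this thread.

```python
import re

# Matches {{dhr}} and, as an assumption, a parameterized form such as {{dhr|2em}}.
DHR_RE = re.compile(r"\{\{dhr(?:\|[^{}]*)?\}\}")


def replace_dhr(text: str) -> str:
    """Replace every {{dhr}} occurrence with a newline."""
    return DHR_RE.sub("\n", text)
```

Consecutive occurrences produce consecutive newlines; collapsing them into one would be a separate decision.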

80,455 entries (48% 😞) have a type of auxiliary_text. -> This can probably all go; I can't find any good content in there.

@cakiki
Member

cakiki commented Mar 3, 2022

@SaulLu Should all filters for all wikis go into clean_helpers/filter_wiki_meta.py?

@SaulLu
Collaborator

SaulLu commented Mar 3, 2022

It's up to you! 😄

I guess I would also have put the other filtering specific to the wiki datasets in that Python script.

@cakiki
Member

cakiki commented Mar 3, 2022

Filtering of lm_ar_wiktionary_filtered.

The dataset has 83,023 entries. Some are very short.

Same issue as in lm_en_wiktionary_filtered above, here with "Audio (US) (ملف)", but it seems to be restricted to auxiliary_text, which we are already filtering out:

return [eval(meta)["type"] == "text" for meta in examples["meta"]]

Actually all pathologies seem restricted to auxiliary_text.
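
As a side note on the filter line quoted above, here is a hedged variant that avoids `eval` on dataset contents. It assumes `meta` is the string form of a Python dict literal, which is what the `eval(meta)` call implies; the function name is hypothetical. If the field were JSON instead, `json.loads` would be the appropriate parser.

```python
import ast


def filter_text_type(examples):
    """Batched predicate: keep only entries whose meta type is "text"."""
    return [
        # literal_eval parses literals only, so it cannot execute arbitrary code.
        ast.literal_eval(meta).get("type") == "text"
        for meta in examples["meta"]
    ]
```

Using `.get("type")` also means entries without a `type` key are dropped rather than raising a `KeyError`.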
