Wiki-based dataset cleaning #5

TevenLeScao · 2022-03-01T19:12:34Z

Wiki-based datasets other than Wikipedia have at least the following issues:

template text being repeated
non-article pages (users, categories)
some Wikipedia formatting noise (possibly only in the non-article pages)

Some solutions already mentioned to deal with those:

Looking at the type field in meta: keeping only text types seems very strong
Deduplicating to remove templates
Looking at the title field in meta: this lets us remove user pages, for example
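
A minimal sketch of what a title-based filter could look like. It assumes that `meta` is stored as the string form of a Python dict with a `title` key, and that non-article pages can be recognised by a namespace prefix in the title. The prefix list and the function names are hypothetical, not taken from the repository.

```python
import ast

# Hypothetical namespace prefixes; the thread only names user and category pages.
NON_ARTICLE_PREFIXES = ("User:", "User talk:", "Category:")


def is_article(meta_str: str) -> bool:
    """Return True when the title does not start with a non-article prefix."""
    # Assumption: meta is a string holding a Python dict literal.
    meta = ast.literal_eval(meta_str)
    title = meta.get("title", "")
    return not title.startswith(NON_ARTICLE_PREFIXES)


def filter_non_articles(examples: dict) -> list:
    """Batched predicate: one boolean per example, True means keep."""
    return [is_article(meta) for meta in examples["meta"]]
```

A function of this shape returns one boolean per row, so it can be used as a batched filter predicate. Other wikis and languages will likely need their own prefix lists.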

@SaulLu @cakiki


SaulLu · 2022-03-02T09:12:09Z

Thank you for gathering our findings in one place!

I just have some questions to be sure that everything will be clear for the person implementing the fix:

Can we add an example of "template text being repeated"? 😄 I just want to be sure I understand what it covers. Does an example like "← May 29, 2021 May 31, 2021 → May 30" fall into this category?
For "template text being repeated" and "non-article pages (users, categories)" do we want to keep some or do we want to remove everything?
Deduplicating to remove templates

On this point, I also have a question to make sure we are aligned: what type of deduplication do you have in mind? Deduplication of identical documents (i.e. examples)?

cakiki · 2022-03-02T21:47:15Z

Filtering of lm_en_wiktionary_filtered.

The dataset has 8,607,928 entries.

Most of the repetitions are of the form "Audio (XX) (file)" where the middle parentheses are optional and case can change.

e.g.: Audio (file) and Audio (AU) (file) and audio (file) etc. (Those occur around 350K times, ~4%)
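
One possible way to strip this pattern is a case-insensitive regular expression with an optional middle parenthesis. This is only a sketch under the description above; the names `AUDIO_RE` and `strip_audio_boilerplate` are made up for illustration.

```python
import re

# "audio", an optional "(XX)" group, then "(file)"; case-insensitive.
AUDIO_RE = re.compile(r"\baudio\s*(?:\([^()]*\)\s*)?\(file\)", re.IGNORECASE)


def strip_audio_boilerplate(text: str) -> str:
    """Remove 'Audio (XX) (file)' style fragments and tidy leftover spaces."""
    cleaned = AUDIO_RE.sub("", text)
    # Collapse runs of spaces or tabs left behind by the removal.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

The pattern only covers the English "(file)" suffix, so localised variants would need their own expression.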

Less frequent repetitions but still easy to yank out:

This entry needs a photograph or drawing for illustration. Please try to find a suitable image on Wikimedia Commons or upload one there yourself!
This entry needs pronunciation information. If you are familiar with the IPA then please add some!
This entry needs audio files. If you are a native speaker with a microphone, please record some and upload them. (For audio required quickly, visit WT:APR.)
This entry is part of the phrasebook project, which presents criteria for inclusion based on utility, simplicity and commonality.
This entry needs pronunciation information. If you are familiar with the IPA or enPR then please add some!
A user has added this entry to requests for verification(+) If it cannot be verified that this term meets our attestation criteria, it will be deleted. Feel free to edit this entry as normal, but do not remove {{rfv}} until the request has been resolved.
A user has added this entry to requests for deletion(+). Please see that page for discussion and justifications. You may continue to edit this entry while the discussion proceeds, but please mention significant edits at the RFD discussion and ensure that the intention of votes already cast is not left unclear. Do not remove the {{rfd}} until the debate has finished.
This entry needs quotations to illustrate usage. If you come across any interesting, durably archived quotes then please add them!
This entry is part of the phrasebook project, which presents criteria for inclusion based on utility, simplicity and commonality.. Audio (file)
This entry needs pronunciation information. If you are familiar with the IPA then please add some!. This entry needs audio files. If you are a native speaker with a microphone, please record some and upload them. (For audio required quickly, visit WT:APR.)
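
Since these notices are fixed strings, plain substring removal may be enough. The sketch below is illustrative only: the list holds two of the notices quoted above, and `remove_substrings` is a hypothetical helper rather than the repository's implementation.

```python
# Two of the notices listed above; a real list would contain all of them.
BOILERPLATE = [
    "This entry needs pronunciation information. "
    "If you are familiar with the IPA then please add some!",
    "This entry needs quotations to illustrate usage. "
    "If you come across any interesting, durably archived quotes then please add them!",
]


def remove_substrings(text: str, substrings: list) -> str:
    """Delete every occurrence of each known boilerplate string."""
    # Longest first, so a short notice never cuts a longer one in half.
    for sub in sorted(substrings, key=len, reverse=True):
        text = text.replace(sub, "")
    return text.strip()
```

Combined notices, such as the last two entries above, leave joining punctuation (". ") behind after removal, so they may need a separate clean-up pass or their own list entries.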

SaulLu · 2022-03-03T10:31:19Z

Working to implement filters based on the meta and title fields

cakiki · 2022-03-03T10:39:22Z

Filtering of lm_ar_wikisource_filtered:

The dataset has 167,398 entries with some repeating boilerplate:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | (there may be more or fewer numbers)
Similar to above but with section names (I don't think this is problematic, very few repetitions of that kind)

5,659 (3.4%) entries have a content_model of proofread-page, meaning user transcriptions of un-OCR'able PDFs and images. These come with a bit of wiki markup that is sometimes problematic, e.g.:

repeated {{dhr}} which is used to transcribe "two or more text rows between paragraphs, or increase space above and below a graphical object", so we could replace every occurrence with \n.
styling commands such as : {{وسط|{{xxx-larger|'''كربلاء'''}}}} (I'm not sure this is problematic)
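
A sketch of the `{{dhr}}` replacement suggested above. Two choices here are assumptions that go beyond the suggestion: the pattern also accepts a parameterised form such as `{{dhr|2}}`, and a run of consecutive templates collapses into a single newline instead of one newline each.

```python
import re

# One or more {{dhr}} templates, optionally parameterised, with trailing whitespace.
DHR_RE = re.compile(r"(?:\{\{dhr(?:\|[^{}]*)?\}\}\s*)+")


def replace_dhr(text: str) -> str:
    """Replace each run of {{dhr}} templates with a single newline."""
    return DHR_RE.sub("\n", text)
```

Nested styling templates like the second example are left untouched, since it is unclear whether they are a problem.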

80,455 entries (48% 😞) have a type of auxiliary_text. -> This can probably all go; I can't find any good content in there.

cakiki · 2022-03-03T15:22:28Z

@SaulLu Should all filters for all wikis go into clean_helpers/filter_wiki_meta.py?

SaulLu · 2022-03-03T16:03:43Z

It's up to you! 😄

I guess I would also have put other filtering specific to wiki datasets in that Python script.

cakiki · 2022-03-03T19:50:35Z

Filtering of lm_ar_wiktionary_filtered.

The dataset has 83,023 entries. Some are very short.

Similar issue to lm_en_wiktionary_filtered above, with "Audio (US) (ملف)", but this seems to be restricted to auxiliary_text, which we are filtering out:

catalogue_data/clean_helpers/filter_wiki_meta.py

Line 5 in 12ddc9a

return [eval(meta)["type"] == "text" for meta in examples["meta"]]

Actually all pathologies seem restricted to auxiliary_text.
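
For illustration, the same type check can be written with `ast.literal_eval`, which only parses Python literals, whereas `eval` executes arbitrary expressions. This is a sketch and not the code in the repository: it assumes `meta` is a string holding a dict literal, and `keep_text_type` is a hypothetical name.

```python
import ast


def keep_text_type(examples: dict) -> list:
    """Batched predicate: True for examples whose meta type is 'text'."""
    # Assumption: each meta value is a string holding a Python dict literal.
    return [
        ast.literal_eval(meta).get("type") == "text"
        for meta in examples["meta"]
    ]
```

Using `.get` means an entry without a `type` key is dropped rather than raising a `KeyError`. Whether that is the desired behaviour depends on the data.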

This was referenced Mar 3, 2022

Add wiki filter on "type" meta field #12

Merged

Add User filter on wiki page #13

Merged

cakiki mentioned this issue Mar 4, 2022

Add substring remover mapper #30

Merged
