Add wiki filter on "type" meta field #12

Merged
merged 2 commits into master on Mar 3, 2022

SaulLu (Collaborator) commented Mar 3, 2022

This PR adds a filter that removes every example whose "type" field inside its "meta" value is not equal to "text".

I've tested it on lm_en_wikinews_filtered; here are the logs:

03/03/2022 11:46:20 - INFO - __main__ - Applied filter: filter_wiki_non_text_type
03/03/2022 11:46:20 - INFO - __main__ -      Initial number of samples: 54387 samples
03/03/2022 11:46:20 - INFO - __main__ -      Removed samples: 24736 samples
03/03/2022 11:46:20 - INFO - __main__ -      Removed percentage: 45.48 %
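
As a quick sanity check on the figures in this log, the removed percentage follows directly from the two sample counts. The snippet below is only a sketch: `removal_stats` is a hypothetical helper introduced here for illustration and is not part of the PR.

```python
def removal_stats(initial, removed):
    """Return (remaining samples, removed percentage rounded to 2 decimals)."""
    remaining = initial - removed
    percentage = round(100 * removed / initial, 2)
    return remaining, percentage


# Counts taken from the log lines above.
remaining, percentage = removal_stats(54387, 24736)
print(remaining, percentage)  # prints "29651 45.48"
```

So roughly 29.6k of the 54k samples would remain after the filter is applied.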

Partially solves #5

HugoLaurencon (Contributor) commented Mar 3, 2022

Hey @SaulLu have you tested this code?

To me, if filter_wiki_non_text_type is a function we pass to the .filter method of datasets, it should take a single example as input and return a boolean, not a list of booleans.

Maybe your code works too, I don't know.

Review thread on clean_helpers/filter_wiki_meta.py (outdated, resolved)

lvwerra (Collaborator) commented Mar 3, 2022

@HugoLaurencon I think it should work if it is called as .filter(some_filter, batched=True), no?
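
To make the two calling conventions discussed here concrete, the sketch below contrasts a per-example predicate with a batched one. It is illustrative only: the function names `keep_text_example` and `keep_text_batch` are hypothetical, and it assumes "meta" is already a dict with a "type" key, which the PR description does not confirm (the real field may need parsing first).

```python
from typing import Any, Dict, List


def keep_text_example(example: Dict[str, Any]) -> bool:
    # Per-example form: one example in, one boolean out.
    return example["meta"]["type"] == "text"


def keep_text_batch(examples: Dict[str, List[Any]]) -> List[bool]:
    # Batched form: a dict of columns in, one boolean per row out.
    return [meta["type"] == "text" for meta in examples["meta"]]


# Toy data in column layout, as a batched function would receive it.
batch = {
    "text": ["a", "b", "c"],
    "meta": [{"type": "text"}, {"type": "table"}, {"type": "text"}],
}
# The same data in row layout, as a per-example function would receive it.
rows = [{"text": t, "meta": m} for t, m in zip(batch["text"], batch["meta"])]

per_example_mask = [keep_text_example(row) for row in rows]
batched_mask = keep_text_batch(batch)
print(per_example_mask == batched_mask)
```

With the datasets library, the first form would be passed as `dataset.filter(keep_text_example)` and the second as `dataset.filter(keep_text_batch, batched=True)`; both should keep the same rows.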

lvwerra (Collaborator) commented Mar 3, 2022

LGTM! 🚀

@SaulLu SaulLu merged commit b2d7d51 into master Mar 3, 2022
SaulLu (Collaborator, Author) commented Mar 3, 2022

Remaining todo: compile the list of datasets we want this filter applied to.

@thomasw21 thomasw21 deleted the LS/wiki_meta_title_filter branch April 25, 2022 11:51