Add User filter on wiki page #13

Merged
merged 4 commits into from
Mar 3, 2022

@SaulLu SaulLu commented Mar 3, 2022

This PR adds a filter that removes every example whose "title" field inside its "meta" value starts with "User ". The goal is to remove examples of this type:

"Users in this category indicate they have knowledge of language Nepali."
"Users in this category indicate they have skill level 1 for language Low Saxon."

I've tested it on lm_en_wikinews_filtered; here are the logs:

03/03/2022 12:33:09 - INFO - __main__ - Applied filter: filter_user_titles
03/03/2022 12:33:09 - INFO - __main__ -      Initial number of samples: 54387 samples
03/03/2022 12:33:09 - INFO - __main__ -      Removed samples: 345 samples
03/03/2022 12:33:09 - INFO - __main__ -      Removed percentage: 0.63 %
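
As a quick sanity check, the removed percentage can be recomputed from the two counts in the logs. This is only a sketch: `removed_percentage` is a hypothetical helper, not a function from this PR.

```python
def removed_percentage(initial: int, removed: int) -> float:
    """Share of removed samples, as a percentage of the initial count."""
    if initial <= 0:
        raise ValueError("initial must be a positive number of samples")
    return 100.0 * removed / initial


# Counts taken from the log lines above.
pct = removed_percentage(initial=54387, removed=345)
print(f"Removed percentage: {pct:.2f} %")  # prints "Removed percentage: 0.63 %"
```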

I've also checked all the filtered-out examples by hand, and all of them correspond to examples that we want to filter out.
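
For reference, here is a minimal sketch of what such a filter could look like. It assumes each example is a dict whose "meta" value is itself a dict with a "title" key. The name `filter_user_titles` comes from the logs, but the body, the sample records and their titles are hypothetical, and the implementation in this PR may differ.

```python
def filter_user_titles(example: dict) -> bool:
    """Return True to keep the example, False to drop it.

    Assumption: example["meta"] is a dict holding a "title" string.
    """
    meta = example.get("meta") or {}
    title = meta.get("title") or ""
    # The trailing space matters: "User ne" is dropped, "Users and usability" is kept.
    return not title.startswith("User ")


# Hypothetical records, for illustration only.
examples = [
    {
        "text": "Users in this category indicate they have knowledge of language Nepali.",
        "meta": {"title": "User ne"},
    },
    {"text": "Some news article.", "meta": {"title": "Example article"}},
    {"text": "An article about usability.", "meta": {"title": "Users and usability"}},
]

kept = [ex for ex in examples if filter_user_titles(ex)]
print(f"Removed samples: {len(examples) - len(kept)} samples")
```

A predicate of this shape (one example in, a boolean out) can also be passed to a dataset-level filter function, so it is straightforward to apply per dataset.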

Partially solves #5

Remaining TODO: decide on the list of datasets this filter should be applied to.

@lvwerra lvwerra left a comment

LGTM!

@SaulLu SaulLu merged commit 0d5b6b9 into master Mar 3, 2022
@thomasw21 thomasw21 deleted the LS/user_filter_wiki branch April 25, 2022 11:50