The repo mentions that the dataset is composed of only these languages: en, de, fr, and es. Is there any possibility of contamination with other languages in the dataset? I would greatly appreciate your response. Thank you in advance.
Thanks for your question. It is very likely that other languages are also present in the dataset. The language of each document is identified with a FastText classifier, and any document with a score >= 0.5 for a given language is treated as being in that language and kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to filter such instances out, you can apply a higher language-score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
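To make that concrete, here is a minimal sketch of such a filter. It assumes each document is a dict that carries the predicted language and the classifier score; the field names `language` and `language_score` and the helper `filter_by_language_score` are hypothetical, so adjust them to whatever your copy of the data actually exposes.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_language_score(
    records: Iterable[Dict[str, Any]],
    language: str,
    min_score: float = 0.6,
    lang_key: str = "language",         # hypothetical field name
    score_key: str = "language_score",  # hypothetical field name
) -> Iterator[Dict[str, Any]]:
    """Yield only records labeled `language` whose score is >= `min_score`."""
    for record in records:
        # Records without a score are treated as score 0.0 and dropped.
        score = record.get(score_key, 0.0)
        if record.get(lang_key) == language and score >= min_score:
            yield record


# Toy documents for illustration only; not taken from the dataset.
docs = [
    {"text": "The quick brown fox.", "language": "en", "language_score": 0.98},
    {"text": "Hello, wie geht's? Bien.", "language": "en", "language_score": 0.52},
    {"text": "Der schnelle braune Fuchs.", "language": "de", "language_score": 0.97},
]

strict_en = list(filter_by_language_score(docs, "en", min_score=0.9))
```

A stricter threshold trades recall for purity: it drops more mixed-language documents, but it can also discard short or noisy documents that really are in the target language.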