Invalid argument when running cc_net #82

Open
Practicinginhell opened this issue Nov 7, 2023 · 2 comments

Comments

@Practicinginhell

Practicinginhell commented Nov 7, 2023

Hi everyone, I am trying to run cc_net with this command: python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1. However, I get an invalid argument value error for a sequence type on the -l argument. Thank you in advance for any help.

@hicotton02

The -l flag sets the language. It was for an older version of cc_net. The original project has been archived, but you can remove the "-l en" part and edit the file here:
https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/cc/cc_net/cc_net/mine.py#L88C37-L88C37

and add the languages you want. For example, to keep only en, you would write:

lang_whitelist: Sequence[str] = [ "en" ]
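
To show where that line would live, here is a minimal sketch, not the actual mine.py. Only the lang_whitelist field and its type come from the line above; the Config class shape, the dump field, and the keep_language helper are hypothetical names for illustration, and the "empty list keeps everything" rule is an assumption of this sketch.

```python
from typing import NamedTuple, Sequence


class Config(NamedTuple):
    """Hypothetical, trimmed-down stand-in for the config defined in mine.py."""

    dump: str = "2023-06"
    # Languages to keep, hard-coded here instead of being passed with -l.
    lang_whitelist: Sequence[str] = ["en"]


def keep_language(conf: Config, lang: str) -> bool:
    """Return True if documents in `lang` should be kept.

    Assumption of this sketch: an empty whitelist means "keep every language".
    """
    return not conf.lang_whitelist or lang in conf.lang_whitelist


if __name__ == "__main__":
    conf = Config()
    for lang in ("en", "fr"):
        print(lang, keep_language(conf, lang))
```

With the default edited this way, the command from the question would be run without the "-l en" part: python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 --mine_num_processes 20 --hash_in_mem 1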

@Practicinginhell
Author

Practicinginhell commented Nov 7, 2023

Thank you! I fixed it the way you described above. But I wonder why the README in the cc_net module has not been updated. I think this issue is related to func_argparse not accepting subsequent arguments as a Sequence, because the error still occurs even when I use the original cc_net repo.
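
For comparison, this is a small sketch of how a sequence-valued flag is usually declared with the standard library argparse. It does not use func_argparse and says nothing about its internals; the parser, the --lang long option, and the parse_langs helper are hypothetical names chosen for this example.

```python
import argparse

# Stand-alone parser for illustration only; this is not cc_net's CLI.
parser = argparse.ArgumentParser(prog="demo")
# nargs="+" tells argparse to collect one or more values into a list.
parser.add_argument("-l", "--lang", nargs="+", default=[])


def parse_langs(argv):
    """Parse argv and return the list of languages given with -l."""
    return parser.parse_args(argv).lang


if __name__ == "__main__":
    print(parse_langs(["-l", "en"]))
    print(parse_langs(["-l", "en", "fr"]))
```

If the hypothesis is right, the failure would come from how the Sequence[str] annotation is turned into a command-line argument, not from the value "en" itself.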
