Spanish artifact building error #110

Open
hicotton02 opened this issue Mar 20, 2024 · 2 comments

@hicotton02

When running the 'run_prep_artifacts.sh' script for 'es', there is an error when fetching the Wikipedia dataset. Hugging Face does not have a prebuilt dataset for Spanish, so line 53 of wikipedia_downloader.py fails and falls into the exception handler, where "beam_runner" doesn't seem to be a valid parameter.

If I comment that out and put in a valid date (20230801 is no longer a valid date):

        except Exception as _:
            # if that fails, load from original huggingface dataset and process
            ds_iterator = load_dataset(
                "wikipedia", language=self._lang, date="20240320",
                cache_dir=self._cache_dir, #beam_runner="DirectRunner",
                split="train"
            )
            logger.info(f"{str(self)} Load {self._lang}-wiki from 20240320")
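(As far as I can tell, newer datasets releases dropped the Apache Beam integration entirely, which would explain why beam_runner is rejected here; I haven't verified this against the changelog.)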

I get an error like this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/http.py", line 419, in _info
    await _file_info(
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/http.py", line 832, in _file_info
    r.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1060, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://dumps.wikimedia.org/eswiki/20220301/dumpstatus.json')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/app/src/artifacts/downloaders/wikipedia_downloader.py", line 53, in run
    ds_iterator = load_dataset(
                  ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/load.py", line 2575, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/builder.py", line 1382, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/d41137e149b2ea90eead07e7e3f805119a8c22dd1d5b61651af8e3e3ee736001/wikipedia.py", line 977, in _split_generators
    with open(downloaded_files["info"], encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/download/streaming_download_manager.py", line 512, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/http.py", line 358, in _open
    size = size or self.info(path, **kwargs)["size"]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/http.py", line 432, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://dumps.wikimedia.org/eswiki/20220301/dumpstatus.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/app/src/prep_artifacts.py", line 186, in <module>
    main(artifacts_dir=args.artifacts_dir,
  File "/usr/app/src/prep_artifacts.py", line 122, in main
    wikipedia.run(logger=logger)
  File "/usr/app/src/artifacts/downloaders/wikipedia_downloader.py", line 60, in run
    ds_iterator = load_dataset(
                  ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.11/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.11/site-packages/datasets/builder.py", line 1078, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/d41137e149b2ea90eead07e7e3f805119a8c22dd1d5b61651af8e3e3ee736001/wikipedia.py", line 981, in _split_generators
    multistream_dump_info["status"] == "done"
AssertionError: Specified dump (https://dumps.wikimedia.org/eswiki/20240320/) multistream status is not 'done': waiting
@hicotton02
Author

I found that the dump isn't ready for 20240320 yet, so I went back one snapshot and am trying again.
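Before retrying, you can check whether a given snapshot is actually finished. A minimal sketch, using requests and assuming the dumpstatus.json layout (the "jobs" / "articlesmultistreamdump" keys) that the builder's assertion above implies:

    import requests

    def dump_is_done(lang: str, date: str) -> bool:
        """Return True if the Wikimedia dump for lang/date is fully built."""
        url = f"https://dumps.wikimedia.org/{lang}wiki/{date}/dumpstatus.json"
        resp = requests.get(url, timeout=30)
        if resp.status_code == 404:
            return False  # no dump published for that date at all
        resp.raise_for_status()
        # The wikipedia builder asserts on this multistream job reaching "done".
        jobs = resp.json()["jobs"]
        return jobs["articlesmultistreamdump"]["status"] == "done"

    print(dump_is_done("es", "20240301"))  # True once the snapshot is complete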

@hicotton02
Author

Changing the date to 20240301 and removing the beam_runner parameter seems to have worked.
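For reference, this is roughly what the patched call looks like (the cache path below is a stand-in for self._cache_dir):

    from datasets import load_dataset

    # What worked: a completed snapshot date, and no beam_runner argument.
    ds_iterator = load_dataset(
        "wikipedia", language="es", date="20240301",
        cache_dir="/path/to/cache",  # stand-in for self._cache_dir
        split="train",
    )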
