Dask on Dataproc GCP #85
Sounds great, thanks @anudeepbablu!
Seeing this error: …
Looks related to dask/dask-yarn#158, for which there is a workaround.
Once you have things working, could you write this up as a documentation page in rapidsai/deployment?
I have opened PR rapidsai/deployment#99 to update the Dataproc instructions.
Update: the PR is on hold awaiting this issue to be resolved -- the Google team needs to upgrade the Dask RAPIDS installation to …
@jacobtomlinson Blocked by this error, fails to load the data:

```
> File already exists. Ready to load at /rapids_hpo/data/airlines.parquet
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 df = prepare_dataset(use_full_dataset=True)

Cell In[6], line 34, in prepare_dataset(use_full_dataset)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/cudf/io/parquet.py:420, in read_parquet(filepath_or_buffer, engine, columns, filters, row_groups, strings_to_categorical, use_pandas_metadata, use_python_file_object, categorical_partitions, open_file_options, *args, **kwargs)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/cudf/io/parquet.py:243, in _process_dataset(paths, fs, filters, row_groups, categorical_partitions)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/pyarrow/dataset.py:749, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/pyarrow/dataset.py:451, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/pyarrow/_dataset.pyx:1885, in pyarrow._dataset.DatasetFactory.finish()

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/miniconda3/envs/dask-rapids/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Error creating dataset. Could not read schema from '/rapids_hpo/data/airlines.parquet': Could not open Parquet input source '/rapids_hpo/data/airlines.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
```
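One quick check for the `Parquet magic bytes not found` error: a valid Parquet file starts and ends with the 4-byte marker `PAR1`. A minimal sketch, using the path from the traceback above:

```python
# Sketch: verify the Parquet magic bytes. A valid Parquet file both starts
# and ends with b"PAR1"; anything else means the file is truncated,
# corrupted, or not Parquet at all (e.g. an HTML error page saved by a
# failed download).
path = "/rapids_hpo/data/airlines.parquet"  # path from the traceback above

with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)  # seek to 4 bytes before the end of the file
    tail = f.read(4)

print("header:", head, "footer:", tail)
if head != b"PAR1" or tail != b"PAR1":
    print("Not a valid Parquet file -- likely a truncated or failed download")
```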
Looks like there is something wrong with the dataset. Could you share a link to the full notebook?
In that case my guess from the error would be that either the dataset in GCS is corrupted, or the client/workers can't access the file correctly. It would help if you could share a complete example of what you ran to get the error. Happy to sync up if that's easier. Side note: as you're using the …
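To distinguish a corrupted file from an access problem, one option is to ask every worker whether it can see the file. A sketch, assuming a reachable cluster (the scheduler address is a placeholder):

```python
import os

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
path = "/rapids_hpo/data/airlines.parquet"

# Client.run executes a function on every worker, so this reports whether
# each worker can see the file and how large it appears there.
sizes = client.run(
    lambda: os.path.getsize(path) if os.path.exists(path) else None
)
print(sizes)  # {worker_address: size_in_bytes_or_None, ...}
```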
I loaded the data locally and it works, so this seems to be an issue on the Dataproc side. I was considering loading the parquet dataset into a BigQuery table, as this notebook example does. But I think we should still find time to sync.
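For reference, that load into BigQuery can be done with the google-cloud-bigquery client. A sketch, where the bucket and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default GCP credentials

# Hypothetical source URI and destination table for illustration
uri = "gs://my-bucket/airlines.parquet"
table_id = "my-project.airline_data.airlines"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # block until the load job completes

print(client.get_table(table_id).num_rows, "rows loaded")
```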
Update: for this issue, I will be testing bigquery_dataproc_dask_xgboost.ipynb, which shows how to use Dask on a Dataproc cluster to process a dataset from BigQuery via the Dask-BigQuery connector. hpo_demo.ipynb will instead be tested on an EC2 instance, as tracked by PR #243.
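The Dask-BigQuery connector side of that notebook boils down to a single read call. A sketch with hypothetical project/dataset/table names:

```python
import dask_bigquery

# Hypothetical identifiers; replace with the real project, dataset and table
ddf = dask_bigquery.read_gbq(
    project_id="my-project",
    dataset_id="airline_data",
    table_id="airlines",
)

# ddf is a lazy Dask DataFrame; head() pulls a small sample to verify access
print(ddf.head())
```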
@jacobtomlinson I am seeing this error when starting a YarnCluster:
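For context, the startup being attempted looks roughly like this (a sketch; the environment path and resource sizes are assumptions, not the exact code from the report):

```python
from dask.distributed import Client
from dask_yarn import YarnCluster

# Assumed conda environment path and resource values for illustration;
# dask-yarn accepts a "conda:///" specifier pointing at an existing env.
cluster = YarnCluster(
    environment="conda:///opt/conda/miniconda3/envs/dask-rapids",
    worker_vcores=4,
    worker_memory="8GiB",
)
cluster.scale(2)  # request two workers from YARN
client = Client(cluster)
```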
Looks related to this issue: dask/dask-yarn#155.
As a workaround, can you try …
@jacobtomlinson Blocked by this when attempting to read from a BigQuery table. I have tried following the instructions to enable authentication with a service account but with no success; we might need to pair on this?
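On the service-account side, the standard mechanism is the GOOGLE_APPLICATION_CREDENTIALS environment variable, and it has to be visible to the workers as well as the client, since the BigQuery reads execute on the workers. A sketch (key path and scheduler address are hypothetical):

```python
import os

from dask.distributed import Client

KEY_PATH = "/path/to/service-account-key.json"  # hypothetical key location

# Point the standard GCP credential variable at the key on the client...
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = KEY_PATH

# ...and on every worker, since that is where the reads actually run.
client = Client("tcp://scheduler-address:8786")  # placeholder address
client.run(lambda: os.environ.update(GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH))
```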
The error message also says …
Are there any plans for a version release that doesn't require a workaround like this?
Closing in favour of GoogleCloudDataproc/initialization-actions#1137.
Will contribute notebooks to show a workflow for running Dask on GCP Dataproc.
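For anyone reproducing this setup, cluster creation with the Dask initialization action can be sketched with the Dataproc Python client. The project, region, and cluster names below are placeholders, and the init-action path follows the pattern used by the GoogleCloudDataproc/initialization-actions repo:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",  # placeholder project
    "cluster_name": "dask-rapids",  # placeholder cluster name
    "config": {
        # Initialization action that installs Dask on the cluster nodes
        "initialization_actions": [
            {
                "executable_file": (
                    f"gs://goog-dataproc-initialization-actions-{region}"
                    "/dask/dask.sh"
                )
            }
        ],
        # Run Dask under YARN rather than as a standalone scheduler
        "gce_cluster_config": {"metadata": {"dask-runtime": "yarn"}},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is provisioned
```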